In this project, we uncover high-level and granular insights that will inform the strategic decisions of publishers, namely publishing agents, financiers, and strategists. This exploratory data analysis (EDA) in SQLite and Python focuses on the primary sources and topical trends of publisher revenue.
- Top publishers, tight market: The five leading publishers (Penguin Group and Random House, followed by Amazon Digital, Hachette, and HarperCollins) are separated by narrow margins.
- Genre performance: Fiction has a significant (10x) lead over non-fiction. Children's books account for a tiny share of publisher revenue but at times post the highest revenue percentages (60-62%).
- Publishing year peaks: Books published between approximately 2009 and 2012 yield the highest volume of units sold but relatively low and erratic revenue percentages (~28-53%).
This dashboard presents a comparative analysis of publishers and a granular view of the primary sources of publisher revenue. Stakeholders can examine these trends and data points by using the filters (publisher, publishing year, and author) and selecting individual elements (for example, click the 2011 publishing year bubble, the HarperCollins bar, or a row in the author profile).
The Python and SQL exploratory data analysis (EDA) demonstrates the ETL process that led to the featured insights and offers additional points of analysis for stakeholders in RevOps, product, and/or marketing.
- The units sold for certain books and publishers, particularly Amazon Digital, are very high (10K+) but the gross sales and publisher revenue are relatively low. A number of these books are likely available as Kindle/Kindle Unlimited products as a part of the Amazon Prime membership; books that are selected and accessed by users through their Amazon account may be logged as units sold.
- The original dataset does not include dates beyond the publishing year of the books. For further analysis, particularly time-series and predictive models, it would be necessary and beneficial to integrate datasets with temporal sales transactions.
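The first caveat above can be checked with a simple filter. The following is a minimal sketch, not the project's actual query: the table name `books`, the snake_case column names, the sample rows, and both thresholds (`min_units`, `max_revenue_per_unit`) are assumptions chosen for illustration.

```python
import sqlite3

# Hypothetical rows: (book_name, publisher, units_sold, gross_sales, publisher_revenue).
# All values are invented for illustration and are not taken from the dataset.
SAMPLE_ROWS = [
    ("Title A", "Amazon Digital", 32000, 31680.0, 0.0),
    ("Title B", "Amazon Digital", 15000, 14850.0, 5940.0),
    ("Title C", "HarperCollins", 4200, 37758.0, 22654.8),
    ("Title D", "Penguin Group", 12000, 95880.0, 57528.0),
]


def flag_high_volume_low_revenue(rows, min_units=10_000, max_revenue_per_unit=1.0):
    """Return titles with many units sold but little publisher revenue per unit."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE books ("
        "book_name TEXT, publisher TEXT, units_sold INTEGER, "
        "gross_sales REAL, publisher_revenue REAL)"
    )
    conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?, ?)", rows)
    query = """
        SELECT book_name, publisher, units_sold,
               ROUND(publisher_revenue / units_sold, 2) AS revenue_per_unit
        FROM books
        WHERE units_sold >= ?
          AND publisher_revenue / units_sold <= ?
        ORDER BY units_sold DESC
    """
    flagged = conn.execute(query, (min_units, max_revenue_per_unit)).fetchall()
    conn.close()
    return flagged


flagged = flag_high_volume_low_revenue(SAMPLE_ROWS)
for row in flagged:
    print(row)
```

Flagged rows are candidates for manual review rather than confirmed subscription titles; the dataset itself does not record how a unit was sold.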
- Publishing Year: The year in which each book was published, ranging from 1308 to 2016; some values are null.
- Book Name: The title of each book.
- Author: The name of the author who wrote the book.
- Language_code: The code representing the language in which the book is written.
- Author_Rating: The rating assigned to the author based on their previous works.
- Book_average_rating: The average rating given to the book by readers.
- Book_ratings_count: The number of ratings given to the book by readers.
- Genre: The genre or category to which the book belongs.
- Gross sales: The total sales revenue generated by each book.
- Publisher revenue: The revenue earned by publishers from selling each book.
- Sale price: The price at which each copy of a book is sold.
- Sales rank: A numeric value indicating a book's rank based on its sales performance in comparison to other books within its category (genre).
- Units sold: Total number of copies sold for each specific title.
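
To show how the fields above might map onto a SQLite table, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `books`, the snake_case column names, the column types, and the sample rows are all assumptions, and the actual headers in the source file may differ. The sketch also assumes that "revenue percentage" means publisher revenue as a share of gross sales.

```python
import sqlite3

# Assumed schema: snake_case versions of the data dictionary fields.
# Column types are guesses; SQLite does not enforce them strictly.
SCHEMA = """
CREATE TABLE books (
    publishing_year     INTEGER,  -- may be NULL
    book_name           TEXT,
    author              TEXT,
    language_code       TEXT,
    author_rating       TEXT,     -- type assumed
    book_average_rating REAL,
    book_ratings_count  INTEGER,
    genre               TEXT,
    gross_sales         REAL,
    publisher_revenue   REAL,
    sale_price          REAL,
    sales_rank          INTEGER,
    units_sold          INTEGER
)
"""

# Invented rows for illustration only; none of these values come from the dataset.
SAMPLE_ROWS = [
    (2011, "Book A", "Author A", "eng", "A", 4.1, 1200, "fiction", 1000.0, 600.0, 5.0, 1, 200),
    (None, "Book B", "Author B", "eng", "B", 3.8, 300, "fiction", 500.0, 150.0, 5.0, 2, 100),
    (2009, "Book C", "Author C", "eng", "C", 4.4, 900, "nonfiction", 400.0, 240.0, 8.0, 3, 50),
]

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", SAMPLE_ROWS)

# Total publisher revenue per genre, plus revenue as a percentage of gross sales.
summary = conn.execute(
    """
    SELECT genre,
           SUM(publisher_revenue) AS total_revenue,
           ROUND(100.0 * SUM(publisher_revenue) / SUM(gross_sales), 1) AS revenue_pct
    FROM books
    GROUP BY genre
    ORDER BY total_revenue DESC
    """
).fetchall()

# Count rows with a missing publishing year before any year-based analysis.
null_years = conn.execute(
    "SELECT COUNT(*) FROM books WHERE publishing_year IS NULL"
).fetchone()[0]
conn.close()

print(summary)
print(null_years)
```

Counting null publishing years up front makes it clear how many rows would drop out of any year-based grouping.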