About the dataset: This dataset contains information about 550 top-selling books on Amazon, each classified as either fiction or non-fiction.
The data used in this data science project covers Amazon's top 50 bestselling books for each year from 2009 to 2019.
Import and read the Amazon bestselling books dataset and store it in a variable called `data`.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("bestsellers with categories.csv")
data.head()
Inspect the dataframe's columns, shape, variable types, unique values, etc.
r,c = data.shape
print(f"The dataset has {r} rows and {c} columns.")
data.dtypes
data.info()
data.nunique()
data.describe()
Note: The summary table above shows that the maximum price of a book is $105 and the maximum rating is 4.9.
Find the number of null values in each column and in each row.
data.isnull().sum()
data.isnull().sum(axis=1).sort_values(ascending=False)
data[data.duplicated()]
len(data[data.duplicated()])
Note: No duplicate rows were found in the dataset.
len(data.Name.unique())
Note: There are 351 unique books but 550 rows.
Looking at the dataframe explains why: some books appear among the top sellers in more than one year.
data[['Name','Author']].duplicated().sum()
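To make the difference between the two duplicate checks concrete, here is a minimal sketch on a tiny hand-made frame. The rows are invented for illustration and are not taken from the CSV. A book that charts in two years is not a full-row duplicate, because `Year` differs, but it is a duplicate on `Name` and `Author`.

```python
import pandas as pd

# Toy rows, invented for illustration only (not from the real dataset)
toy = pd.DataFrame({
    "Name":   ["Book A", "Book A", "Book B"],
    "Author": ["Author X", "Author X", "Author Y"],
    "Year":   [2018, 2019, 2019],
})

# Compares every column, so the two "Book A" rows differ on Year
full_row_dupes = toy.duplicated().sum()

# Compares only Name and Author, so the second "Book A" row counts as a repeat
title_dupes = toy[["Name", "Author"]].duplicated().sum()

# Number of distinct titles
unique_titles = toy["Name"].nunique()

print(full_row_dupes, title_dupes, unique_titles)  # prints: 0 1 2
```

On the real data the same pattern would explain how zero duplicate rows can coexist with far fewer unique names than rows.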
### Subtask 3.1: Find the book with the highest profit
Create a new column called `Profit`, which contains the product of the two columns `Reviews` and `Price`.
Find the top 10 books using the `Profit` column as reference and store them in `Top10`.
data['Profit'] = data['Reviews'] * data['Price']
data
Top10 = data.sort_values(by='Profit',ascending=False).head(10)
Top10
After finding the top 10 most profitable books, you may have noticed duplicate entries, which means the dataframe contains repeated books. Group the dataframe by name and repeat Subtask 3.1.
Top10 = data.groupby('Name')[['Profit']].max().sort_values(by='Profit',ascending=False).head(10)
Top10
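The effect of grouping by name before ranking can be sketched on a small invented frame (the titles and numbers below are placeholders, not real data). Sorting the raw rows lets a book that charted twice occupy two slots, while `groupby('Name').max()` keeps one row per title.

```python
import pandas as pd

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "Name":    ["Book A", "Book A", "Book B", "Book C"],
    "Year":    [2018, 2019, 2019, 2019],
    "Reviews": [1000, 1000, 500, 200],
    "Price":   [10, 10, 8, 30],
})
toy["Profit"] = toy["Reviews"] * toy["Price"]

# Ranking raw rows: "Book A" appears twice because it charted in two years
naive_top2 = toy.sort_values(by="Profit", ascending=False).head(2)

# Ranking one row per title: take each book's best Profit, then sort
dedup_top2 = (
    toy.groupby("Name")[["Profit"]].max()
       .sort_values(by="Profit", ascending=False)
       .head(2)
)

print(naive_top2["Name"].tolist())
print(dedup_top2)
```

Using `max()` is one choice among several; `first()` or `mean()` would also collapse the repeats, and they agree whenever a book's `Reviews` and `Price` are identical across years.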
sns.barplot(x=Top10.Profit,y=Top10.index.values)
plt.xlabel("Earned(millions)")
plt.ylabel("Books")
plt.title('Earning by Books')
data.Genre.unique()
data.Genre.value_counts()
sns.countplot(x="Genre", data=data)
plt.show()
plt.pie(data.Genre.value_counts(),labels=data.Genre.value_counts().index,autopct='%.0f%%');
Non Fiction books make up the majority of the top sellers.
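Pie chart labels need to follow the order of `value_counts()`, which sorts by frequency rather than alphabetically. The sketch below uses invented rows in which Fiction happens to be the larger group, to show how the labels and percentage shares can be read from the counts themselves instead of being typed by hand.

```python
import pandas as pd

# Toy genre column, invented for illustration; here Fiction is the majority
toy = pd.DataFrame(
    {"Genre": ["Fiction", "Non Fiction", "Non Fiction", "Fiction", "Fiction"]}
)

counts = toy["Genre"].value_counts()      # sorted from most to least frequent
labels = counts.index.tolist()            # labels in the same order as counts
shares = (counts / counts.sum() * 100).round(0)

print(labels)
print(shares)

# These could then be passed to matplotlib, for example:
# plt.pie(counts, labels=labels, autopct='%.0f%%')
```

Deriving the labels this way keeps each slice matched to its genre even if the majority genre changes, for instance after filtering the data by year.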
data.groupby('Genre')['User Rating'].mean()
sns.histplot(data=data['User Rating'],bins=10)
plt.xlabel("Ratings")
sns.lineplot(y=data['User Rating'],x=data['Year'],hue=data['Genre']);
plt.ylabel("Ratings")
plt.xlabel("Years");
It is clear from the graphs above that most of the books received ratings between 4.5 and 4.9.
sns.lmplot(y='User Rating',x='Price',data=data)
plt.ylabel('Ratings')
plt.xlabel('Price');
Note: As price increases, ratings tend to go down.
data.groupby('Genre')['User Rating'].mean()
The graph shows no strong relationship between price and ratings, although ratings drift slightly lower as price increases for both Fiction and Non Fiction.
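One way to put a number on the visual impression of a price and rating relationship is a Pearson correlation coefficient, overall and per genre. The sketch below runs on invented values, so the coefficients it prints say nothing about the real dataset; it only shows the calls.

```python
import pandas as pd

# Toy values, invented for illustration only
toy = pd.DataFrame({
    "Genre":       ["Fiction", "Fiction", "Fiction",
                    "Non Fiction", "Non Fiction", "Non Fiction"],
    "Price":       [5, 10, 20, 8, 15, 40],
    "User Rating": [4.8, 4.7, 4.6, 4.7, 4.6, 4.3],
})

# Overall Pearson correlation between price and rating (between -1 and 1)
overall_r = toy["Price"].corr(toy["User Rating"])

# The same coefficient computed separately within each genre
per_genre_r = (
    toy.groupby("Genre")[["Price", "User Rating"]]
       .apply(lambda g: g["Price"].corr(g["User Rating"]))
)

print(round(overall_r, 3))
print(per_genre_r.round(3))
```

A value near 0 would support "no significant relationship", while a clearly negative value would support "ratings fall as price rises". Running the same two calls on `data` would show which description fits better.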
data.groupby('Year').Profit.sum().sort_values(ascending=False)
data.groupby('Year').Profit.sum().plot(kind='bar')
plt.show()
data.groupby(['Year','Genre']).Profit.sum().sort_values(ascending=False)
sns.barplot(x=data['Year'],y=data['Profit'],hue=data['Genre'])
plt.xlabel('Years')
plt.ylabel("Earned(millions)")
plt.title('Money earned each year');
The graph above shows that the earnings of these top books are, in most cases, below 0.8 million.
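The axis label speaks of millions, while the `Profit` column holds raw values. A minimal sketch of making the unit explicit and of checking the "below 0.8 million" reading numerically is shown here on invented numbers; the column name `Profit (millions)` is a hypothetical helper, not part of the dataset.

```python
import pandas as pd

# Toy values, invented for illustration only
toy = pd.DataFrame({
    "Year":   [2018, 2018, 2019],
    "Profit": [250_000, 1_500_000, 800_000],
})

# Hypothetical helper column: Profit expressed in millions
toy["Profit (millions)"] = toy["Profit"] / 1_000_000

# Fraction of rows whose Profit is strictly below 0.8 million
share_below = (toy["Profit (millions)"] < 0.8).mean()

print(toy)
print(f"{share_below:.0%} of rows are below 0.8 million")
```

Plotting the scaled column keeps the axis ticks in the same unit as the label, and the computed share gives a direct check on the phrase "in most cases".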
genre_average=data.groupby(['Genre'])['Profit'].mean()
genre_average
sns.barplot(x=genre_average.index,y=genre_average);
data.groupby('Year')['Profit'].max()
data.groupby('Year')['Profit'].transform('max')
data[data.groupby('Year')['Profit'].transform('max') == data['Profit']]
most_earning_book_per_year=data[data.groupby('Year')['Profit'].transform('max') == data['Profit']]
most_earning_book_per_year
most_earning_book_per_year=most_earning_book_per_year.sort_values('Year').set_index('Year')
most_earning_book_per_year
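The per-year selection above relies on broadcasting each year's maximum back to every row and comparing. As a sketch under the assumption of a small invented frame, the same result can also be reached with `idxmax`, which returns the row label of the largest value in each group.

```python
import pandas as pd

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "Name":   ["Book A", "Book B", "Book C", "Book D"],
    "Year":   [2018, 2018, 2019, 2019],
    "Profit": [100, 300, 250, 50],
})

# Approach 1: broadcast each year's maximum to its rows, keep rows that match it
mask = toy.groupby("Year")["Profit"].transform("max") == toy["Profit"]
via_transform = toy[mask].sort_values("Year").set_index("Year")

# Approach 2: look up the index label of each year's maximum directly
via_idxmax = toy.loc[toy.groupby("Year")["Profit"].idxmax()].set_index("Year")

print(via_transform)
print(via_idxmax)
```

The two approaches differ on ties: the `transform` mask keeps every row that equals the yearly maximum, while `idxmax` keeps only the first such row.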
genres_per_year_mean=data.groupby(['Year','Genre'])[['Profit']].mean().round(2)
genres_per_year_mean
Earning_Graph=data.groupby('Year')['Profit'].sum()
Earning_Graph
sns.lineplot(data=Earning_Graph)
plt.xlabel('Year')
plt.ylabel("Earned")
plt.title("EARNING PER YEAR")
plt.figure(figsize=(12,12));
authors=data.groupby('Author')['Profit'].sum().sort_values(ascending=False).head(10)
authors
sns.barplot(y=authors.index,x=authors)
plt.title('The Money Makers')