Through this project, I practiced the complete process of data mining:
Loading and cleaning a real dataset,
Exploring both numerical and categorical variables,
Visualizing distributions and outliers,
Normalizing data for better comparison,
Connecting features to real insights.
Data Preparation
First, I loaded the CSV file into Pandas and checked its shape. The dataset contained some faulty rows, which I removed. Then I checked for missing values and confirmed there were none.
I listed the column names and their data types (integers, floats, and objects). After that, I used the describe() method to compute summary statistics such as the minimum, maximum, mean, and standard deviation.
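The preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the inline CSV, the column names other than ratings_count, and the use of on_bad_lines="skip" to drop faulty rows are all assumptions, since the text does not say how the faulty rows were removed.

```python
import io

import pandas as pd

# Hypothetical miniature stand-in for the real CSV file. The third data row
# has one field too many, which is one way a row can be "faulty".
raw = io.StringIO(
    "title,authors,average_rating,ratings_count\n"
    "Book A,Author X,4.10,1200\n"
    "Book B,Author Y,3.85,300\n"
    "Book C,Author Z,extra,4.50,50\n"
    "Book D,Author X,3.20,15\n"
)

# Assumption: rows with too many fields are skipped while reading.
df = pd.read_csv(raw, on_bad_lines="skip")

print(df.shape)                         # (rows, columns) after skipping bad rows
missing_per_column = df.isnull().sum()  # count of missing values per column
print(missing_per_column)
print(df.dtypes)                        # data type of each column

# describe() summarizes the numeric columns: count, mean, std, min,
# quartiles, and max.
summary = df.describe()
print(summary)
```

Note that on_bad_lines="skip" only drops rows with too many fields; rows with too few fields are kept and padded with missing values, which is one reason to check isnull() afterwards.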
Visualization
To better understand the data, I created histograms for the continuous variables and boxplots to detect outliers. For example, ratings_count contained a few very large values (on the order of 4–5 million), which stood out as strong outliers.
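With the usual default settings, a boxplot marks as outliers the points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied numerically as a cross-check on what the plot shows. The sketch below uses made-up ratings_count values, not the project's data.

```python
import pandas as pd

# Made-up values; only the last one imitates a multi-million outlier.
ratings_count = pd.Series(
    [10, 20, 30, 40, 50, 60, 70, 80, 4_500_000], name="ratings_count"
)

q1 = ratings_count.quantile(0.25)  # first quartile
q3 = ratings_count.quantile(0.75)  # third quartile
iqr = q3 - q1                      # interquartile range

# Fences used by the 1.5 * IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the values that fall outside the fences.
is_outlier = (ratings_count < lower_fence) | (ratings_count > upper_fence)
outliers = ratings_count[is_outlier]
print(outliers)

# With matplotlib installed, ratings_count.plot.box() would draw the
# corresponding boxplot and ratings_count.plot.hist() a histogram.
```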
To compare variables on the same scale, I applied normalization with min–max scaling. I plotted the normalized data again with boxplots, which made it easier to see patterns.
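Min–max scaling maps each column to the range [0, 1] using (x - min) / (max - min). A minimal pandas sketch, with made-up values and a hypothetical num_pages column alongside ratings_count:

```python
import pandas as pd

# Made-up values on very different scales (num_pages is an assumed column).
df = pd.DataFrame(
    {
        "num_pages": [100, 300, 500],
        "ratings_count": [10, 1000, 4_000_000],
    }
)

# Column-wise min-max scaling: the smallest value in each column becomes 0
# and the largest becomes 1.
col_min = df.min()
col_max = df.max()
normalized = (df - col_min) / (col_max - col_min)
print(normalized)
```

Because the minimum and maximum define the scale, a single extreme value such as the large ratings_count here compresses all other values of that column toward 0, which is worth keeping in mind when reading the normalized boxplots. A column whose values are all identical would produce a division by zero and needs separate handling.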
Categorical Analysis
I also explored the categorical variables:
The most frequently repeated book title was Salem’s Lot (Korku Ağı), which appeared 11 times.
Agatha Christie was the author with the most books (69 titles).
Most of the books were in English (10,594 out of 13,714).
I created a pie chart to show the top 7 languages: English, British English, Spanish, American English, German, French, and Japanese.
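Counts like the ones above can be obtained with value_counts(). The sketch below uses a tiny made-up table; the column names title, authors, and language_code and the language codes themselves are assumptions, and it keeps the top 2 languages where the project kept the top 7.

```python
import pandas as pd

# Made-up records standing in for the real dataset.
books = pd.DataFrame(
    {
        "title": ["T1", "T1", "T2", "T3", "T4", "T5", "T6", "T7"],
        "authors": ["A", "A", "A", "B", "B", "C", "C", "A"],
        "language_code": [
            "eng", "eng", "eng", "en-GB", "eng", "spa", "en-GB", "eng",
        ],
    }
)

# value_counts() sorts by frequency, so idxmax() gives the most common value.
most_repeated_title = books["title"].value_counts().idxmax()
most_frequent_author = books["authors"].value_counts().idxmax()

# Frequency of each language, most common first.
language_counts = books["language_code"].value_counts()
top_n = 2  # the project used 7
top_languages = language_counts.head(top_n)
shares = top_languages / len(books)  # fraction of all books per language

print(most_repeated_title, most_frequent_author)
print(top_languages)
print(shares)

# With matplotlib installed, top_languages.plot.pie() would draw the chart.
```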
Author and Rating Analysis
Next, I studied the authors and ratings:
I grouped the data by authors to find the top 10 authors with the most books and showed the results in a bar chart.
I compared the 10 authors with the highest average ratings against the 5 authors with the lowest average ratings.
Using scatterplots and jointplots, I checked the relationship between:
Number of text reviews and average rating → books with more reviews usually had higher ratings.
Number of ratings and average rating → books with more ratings (sometimes over 1 million) also tended to have higher ratings.
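The grouping and relationship checks described above could look roughly like this. It is a sketch on made-up data: the column names authors, average_rating, and text_reviews_count are assumptions, and a Pearson correlation coefficient is used here as a numeric companion to the scatterplots rather than as the method the project used. The toy values were chosen to increase together, so the positive sign says nothing about the real dataset.

```python
import pandas as pd

# Made-up records; one row per book.
books = pd.DataFrame(
    {
        "authors": ["A", "A", "A", "B", "B", "C"],
        "average_rating": [4.0, 4.2, 3.8, 3.0, 3.4, 4.6],
        "ratings_count": [500, 900, 200, 20, 60, 1500],
        "text_reviews_count": [40, 70, 15, 2, 5, 120],
    }
)

# One row per author: number of books and mean of the books' average ratings.
per_author = books.groupby("authors").agg(
    n_books=("average_rating", "count"),
    mean_rating=("average_rating", "mean"),
)

# The project used the top 10 and bottom 5; smaller numbers fit the toy data.
most_books = per_author.nlargest(2, "n_books")
highest_rated = per_author.nlargest(2, "mean_rating")
lowest_rated = per_author.nsmallest(1, "mean_rating")

# Pearson correlation between each count column and the average rating.
r_ratings = books["ratings_count"].corr(books["average_rating"])
r_reviews = books["text_reviews_count"].corr(books["average_rating"])

print(most_books)
print(highest_rated)
print(lowest_rated)
print(r_ratings, r_reviews)

# With matplotlib installed, most_books["n_books"].plot.bar() would draw the
# bar chart, and books.plot.scatter(x="ratings_count", y="average_rating")
# a scatterplot.
```

An author with a single highly rated book can top the mean_rating ranking, as author C does here, so filtering on n_books before ranking may give a fairer comparison.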
Key Findings
From this project, I found that:
Harry Potter books were among the top 5 most read.
Agatha Christie is the most prolific author in the dataset.
The dataset is strongly dominated by English-language books.
Both review count and rating count are positively associated with average rating.
This study gave me valuable experience in using Python for data science and helped me understand how to find meaningful patterns in large datasets.