How to build a book recommendation system in Python

Key takeaways:

  • A book recommendation system leverages machine learning to suggest books based on user preferences.

  • It starts with defining a dataset that includes book IDs, titles, authors, and ratings.

  • Essential Python libraries include pandas for data manipulation, plotly for visualization, and scikit-learn for vectorization and similarity calculations.

  • Visualization techniques, like histograms and bar graphs, help analyze book ratings and author contributions.

  • The combined title and author text of each book is vectorized using TF-IDF to enable similarity calculations.

  • Cosine similarity measures the similarity between books, providing personalized recommendations.

  • Users can input a title to receive a list of similar books for better discovery.

A book recommendation system is software designed to help readers get book recommendations according to their reading history and interests. It uses different data science and machine learning algorithms to generate more customized results based on user preferences, thus enhancing the reading experience for the users.

In this Answer, we will build a content-based recommender. It compares the titles and authors of books and, given a book the user likes, returns the most similar books in the dataset.

Book recommendation process

The book recommendation process consists of a series of steps: defining the dataset, plotting the data, vectorizing the data, and calculating the similarity between books. Here is a step-by-step breakdown of the process:

Defining the dataset

The dataset used for this process contains a series of books along with their IDs, titles, authors, and average ratings. We will store the dataset in the books_data.csv file. We will use the titles and authors of the books to generate suggestions, and the average ratings to explore the data.
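
As a minimal sketch of what such a file might look like, the snippet below parses a few made-up rows and checks that the columns used later in this Answer are present. The bookID column name, the placeholder titles, and the rating values are assumptions for illustration, not the actual dataset.

```python
import csv
import io

# Made-up sample rows; only the title, authors, and average_rating
# column names are taken from the code in this Answer.
SAMPLE_CSV = """bookID,title,authors,average_rating
1,Book A,Author One,4.10
2,Book B,Author One,3.70
3,Book C,Author Two,3.95
"""

def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per book."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_rows(SAMPLE_CSV)

# Columns that the later steps read from the dataset
required = {"title", "authors", "average_rating"}
missing = required - set(rows[0].keys())
print("Missing columns:", missing or "none")
```

If the check reports a missing column, the later steps that read that column would fail with a KeyError.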

Installing the dependencies

To perform the book recommendation process, we need a few dependencies. In Python 3, we use pip3 to install them. For this process, we require pandas, plotly, and scikit-learn; the executable example at the end also imports nltk. To install the dependencies, we use the following commands:

pip3 install pandas
pip3 install plotly
pip3 install scikit-learn
pip3 install nltk

Importing the libraries

The next step is to import the libraries in the Python code. We need TfidfVectorizer and linear_kernel from sklearn, along with pandas and plotly.express. To import the dependencies, use the following statements:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import plotly.express as px

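In the code:

  • Line 1: Import pandas as pd to load and manipulate the dataset.

  • Line 2: Import TfidfVectorizer to convert text into TF-IDF vectors.

  • Line 3: Import linear_kernel to compute the similarity between the vectors.

  • Line 4: Import plotly.express as px for plotting.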

Data loading

It is important to read the data and get information about its content. To do that, we use the following code:

book_data = pd.read_csv("books_data.csv")
book_data.info()

In the code:

  • Line 1: Read the dataset from the .csv file, books_data.csv.

  • Line 2: Get information on the columns of the data.
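
Beyond info(), it can help to count missing or unparseable values before moving on. The following is a small sketch under assumed data: the three rows are made up, and one of them deliberately has an empty author and another a non-numeric rating.

```python
import io
import pandas as pd

# Made-up rows with one missing author and one non-numeric rating
csv_text = """title,authors,average_rating
Book A,Author One,4.1
Book B,,3.7
Book C,Author Two,not rated
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Count missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Count ratings that cannot be parsed as numbers
numeric_ratings = pd.to_numeric(sample["average_rating"], errors="coerce")
unparseable = int(numeric_ratings.isna().sum())
print("Unparseable ratings:", unparseable)
```

Rows with a missing title or author matter later, because concatenating a missing value with a string yields a missing value rather than text.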

Plotting

We use two kinds of plots in this process. The first is a frequency distribution of average_rating. To create it, we use the following code:

freq_dist = px.histogram(book_data, x='average_rating', nbins=3,
    title='Average Ratings Frequency Distribution')
freq_dist.update_xaxes(title_text='Average Rating')
freq_dist.update_yaxes(title_text='Frequency')
freq_dist.show()

In the code:

  • Lines 1–2: Create a histogram with average_rating from the dataset.

  • Lines 3–4: Define the axis titles for the histogram.

  • Line 5: Display the histogram.

The second plot shows the number of books per author. To keep the chart readable, we first select the 15 authors with the most books in the dataset. To do this, we use the following code:

top_recc_authors = book_data['authors'].value_counts().head(15)
book_count = px.bar(top_recc_authors, x=top_recc_authors.values, y=top_recc_authors.index, orientation='h',
    labels={'x': 'Book Count', 'y': 'Author'},
    title='Books Count by Author')
book_count.show()

In the code:

  • Line 1: Extract the top 15 authors from the dataset.

  • Lines 2–4: Create a bar graph indicating the number of books per author.

  • Line 5: Display the bar graph.
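
To see what the bar graph is built from, the sketch below runs value_counts().head() on a small, made-up authors column without plotting anything. The sample values are assumptions for illustration.

```python
import pandas as pd

# Made-up author column with repeated names
authors = pd.Series(
    ["Homer", "William Shakespeare", "Homer", "Jane Austen",
     "William Shakespeare", "William Shakespeare"],
    name="authors",
)

# value_counts() sorts by count in descending order; head(2) keeps the top two
top_two = authors.value_counts().head(2)
print(top_two)
```

The index of the result holds the author names and the values hold the book counts, which is why the plotting code passes the index and the values to the two axes.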

Vectorize the data

First, convert the average_rating column from object type to numeric type so it can be treated as a number. Then, combine each book's title and authors into one text field and convert that text to TF-IDF vectors, which are used in calculating cosine similarity. To do this, we use the following code:

book_data['average_rating'] = pd.to_numeric(book_data['average_rating'], errors='coerce')
book_data['book_content'] = book_data['title'] + ' ' + book_data['authors']

# Use Term Frequency-Inverse Document Frequency (TF-IDF) to vectorize the combined title and author text
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(book_data['book_content'])

In the code:

  • Line 1: Convert the average_rating column of the book from object data type to numeric data type.

  • Line 2: Create a new column named book_content by combining the title and authors columns.

  • Lines 5–6: Use Term Frequency-Inverse Document Frequency (TF-IDF) to vectorize the text in the book_content column. The average_rating column is not part of these vectors.

Calculate cosine similarity

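We calculate the cosine similarity between books and use it to generate a list of recommended books. Because TfidfVectorizer scales each vector to unit length by default, the dot product computed by linear_kernel is equal to the cosine similarity. To do this, we use the following code: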

# Calculate similarity between the computed vectors
vector_similarity = linear_kernel(tfidf_matrix, tfidf_matrix)

# Recommendation function
def generate_reccomendation(book_title, vector_similarity=vector_similarity):
    # Extract index of the book to find
    book_index = book_data[book_data['title'] == book_title].index[0]

    # Get the similarity score of all books with the given book
    similarity_score = list(enumerate(vector_similarity[book_index]))
    similarity_score = sorted(similarity_score, key=lambda x: x[1], reverse=True)

    # Get the top 15 recommended books, skipping the first entry (normally the book itself)
    similarity_score = similarity_score[1:16]
    recc_index = [i[0] for i in similarity_score]
    return book_data['title'].iloc[recc_index]

In the code:

  • Lines 1–2: Calculate the similarity between books present in the dataset and save it in vector_similarity.

  • Line 5: Define a function generate_reccomendation with book_title and vector_similarity as parameters.

  • Line 7: Find the index of the book from the dataset as book_index.

  • Lines 10–11: Find the similarity_score of all the books in the vector with the one specified book as book_title.

  • Lines 14–16: Extract the top 15 books with the most similarity, excluding itself, and return the results.
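
The sketch below checks, on three made-up strings, that linear_kernel and cosine_similarity give the same result for TF-IDF vectors produced with the default settings. The sample strings are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up "title + author" strings
docs = [
    "Hamlet William Shakespeare",
    "Macbeth William Shakespeare",
    "The Odyssey Homer",
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Plain dot products between rows
dot_products = linear_kernel(tfidf, tfidf)

# Dot products divided by the vector lengths
cosines = cosine_similarity(tfidf, tfidf)

# The rows already have unit length, so both matrices agree
same = bool(np.allclose(dot_products, cosines))
print("linear_kernel matches cosine_similarity:", same)

# Books that share an author score higher than books with no terms in common
print(dot_products[0, 1] > dot_products[0, 2])
```

Using linear_kernel skips the normalization that cosine_similarity performs, which is redundant when the vectors are already unit length.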

Testing example

We can then print the recommended books returned by the recommendation function, using a specific book title as an example. The function looks up the similarity scores between that title and all the books in the dataset and returns the top 15 matches. The title must match an entry in the title column exactly; otherwise, the index lookup raises an IndexError. Here's how you can use it:

book = "The New York Trilogy"
extract_books = generate_reccomendation(book)
print(extract_books)

In the code:

  • Line 1: Define a book name and assign it to the variable book.

  • Line 2: Call the function generate_reccomendation and save the result in extract_books.

  • Line 3: Print the extracted books.
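
A title that is not in the dataset makes the index lookup fail. One possible way to handle this is sketched below. The safe_recommend helper, its top_n parameter, and the four-book DataFrame are hypothetical and exist only for this illustration; they are not part of the code above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up dataset with the same columns the Answer uses
books = pd.DataFrame({
    "title": ["Hamlet", "Macbeth", "The Odyssey", "The Iliad"],
    "authors": ["William Shakespeare", "William Shakespeare", "Homer", "Homer"],
})
content = books["title"] + " " + books["authors"]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(content)
similarity = linear_kernel(tfidf, tfidf)

def safe_recommend(title, top_n=2):
    """Hypothetical helper: return up to top_n similar titles, or [] if the title is unknown."""
    # Case-insensitive match that ignores surrounding whitespace
    matches = books.index[books["title"].str.lower() == title.strip().lower()]
    if len(matches) == 0:
        return []
    position = books.index.get_loc(matches[0])
    # Sort all books by similarity to the matched book, highest first
    scores = sorted(enumerate(similarity[position]), key=lambda pair: pair[1], reverse=True)
    # Skip the matched book by position instead of assuming it sorts first
    picks = [i for i, _ in scores if i != position][:top_n]
    return books["title"].iloc[picks].tolist()

print(safe_recommend("hamlet"))
print(safe_recommend("Unknown Book"))
```

Returning an empty list for an unknown title is one design choice among several; raising a descriptive error would also be reasonable.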

Executable example

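A complete, runnable version of the steps above is shown below.

Note: Instead of loading a local dataset, this example creates a DataFrame from a list of books. This small dataset has no average_rating column, so the histogram step is omitted, and the author column is named author rather than authors.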

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import plotly.express as px
import nltk
from nltk.corpus import gutenberg


# Create a hard-coded list of classic book titles and authors
books = [
    {"title": "Moby Dick", "author": "Herman Melville"},
    {"title": "Alice in Wonderland", "author": "Lewis Carroll"},
    {"title": "Dracula", "author": "Bram Stoker"},
    {"title": "Frankenstein", "author": "Mary Shelley"},
    {"title": "Pride and Prejudice", "author": "Jane Austen"},
    {"title": "The Odyssey", "author": "Homer"},
    {"title": "Macbeth", "author": "William Shakespeare"},
    {"title": "Hamlet", "author": "William Shakespeare"},
    {"title": "Romeo and Juliet", "author": "William Shakespeare"},
    {"title": "The Iliad", "author": "Homer"},
    {"title": "Great Expectations", "author": "Charles Dickens"},
    {"title": "A Tale of Two Cities", "author": "Charles Dickens"},
    {"title": "Jane Eyre", "author": "Charlotte Brontë"},
    {"title": "Wuthering Heights", "author": "Emily Brontë"},
    {"title": "The Scarlet Letter", "author": "Nathaniel Hawthorne"}
]

# Convert the list of books to a pandas DataFrame
book_data = pd.DataFrame(books)

# Display basic info about the dataset
print("Dataset Information:")
book_data.info()

# Plot the number of books by author
top_recc_authors = book_data['author'].value_counts().head(15)
book_count = px.bar(top_recc_authors, x=top_recc_authors.values, y=top_recc_authors.index, orientation='h',
    labels={'x': 'Book Count', 'y': 'Author'},
    title='Books Count by Author')
book_count.show()

# Combine the title and author into a single string for each book
book_data['book_content'] = book_data['title'] + ' ' + book_data['author']

# Use Term Frequency-Inverse Document Frequency to vectorize the text data
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(book_data['book_content'])

# Calculate similarity between the computed vectors
vector_similarity = linear_kernel(tfidf_matrix, tfidf_matrix)

# Recommendation function for top three books
def generate_top_three_recommendations(book_title, vector_similarity=vector_similarity):
    # Extract index of the book to find
    book_index = book_data[book_data['title'] == book_title].index[0]

    # Get the similarity score of all books with the specified book
    similarity_scores = list(enumerate(vector_similarity[book_index]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

    # Get top three recommended books (excluding the book itself)
    top_three_scores = similarity_scores[1:4]
    top_three_indices = [i[0] for i in top_three_scores]

    return book_data['title'].iloc[top_three_indices]

# Example book title for top three recommendations
book = "Romeo and Juliet"
top_three_recommendations = generate_top_three_recommendations(book)
print(f"\nTop Three Books Recommended for '{book}':")
print(top_three_recommendations)

Code Explanation:

  • Lines 1–6 (Importing libraries):

    1. Imports necessary libraries:

      1. pandas (pd): Used for data manipulation and analysis.

      2. TfidfVectorizer from sklearn.feature_extraction.text: Used to convert text data into numerical form using TF-IDF (Term Frequency-Inverse Document Frequency).

      3. linear_kernel from sklearn.metrics.pairwise: Computes linear kernel similarity between vectors.

      4. plotly.express (px): Used for interactive visualization.

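      5. nltk (Natural Language Toolkit): Provides text processing libraries. It is imported here but not used by the rest of the code.

      6. gutenberg corpus from nltk.corpus: Gives access to the Gutenberg sample texts. It is also imported but not used, because the book list is defined directly in the code.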

  • Lines 10–26 (Creating book data):

    1. Defines a list books containing dictionaries, where each dictionary represents a book with title and author.

    2. These books are classic literature titles from various authors.

  • Line 29 (Creating DataFrame):

    1. Converts the list of dictionaries (books) into a pandas DataFrame book_data.

    2. Each dictionary key (title, author) becomes a column in the DataFrame.

  • Line 33 (Displaying dataset information):

    1. Prints basic information about the book_data DataFrame using book_data.info().

    2. This includes the DataFrame class, range of indices, column names, non-null counts, and memory usage.

  • Lines 36–40 (Plotting books count by author):

    1. Uses Plotly (px.bar) to create a horizontal bar chart (orientation='h').

    2. Displays the count of books by each author (top_recc_authors) using their names on the y-axis and book counts on the x-axis.

    3. Sets labels and title for the chart.

  • Lines 46–47 (Vectorizing text data):

    1. Combines the title and author columns into a new column book_content in book_data.

    2. Uses TfidfVectorizer to convert book_content into a TF-IDF matrix (tfidf_matrix).

    3. This matrix represents each book as a numerical vector based on its textual content.

  • Line 50 (Calculating Similarity):

    1. Computes the similarity between all pairs of books using linear_kernel on tfidf_matrix.

    2. vector_similarity stores the similarity scores between each pair of books.

  • Lines 53–65 (Generating top three recommendations):

    1. Defines a function generate_top_three_recommendations(book_title, vector_similarity) to recommend the top three books similar to a given book_title.

    2. Extracts the index of the specified book_title from book_data.

    3. Retrieves similarity scores of all books relative to the specified book, sorts them in descending order, and excludes the book itself.

    4. Returns the titles of the top three recommended books.

  • Lines 68–71 (Example usage):

    1. Sets an example book title (book = "Romeo and Juliet").

    2. Calls generate_top_three_recommendations(book) to obtain and print the top three recommended books similar to “Romeo and Juliet”.

Conclusion

Building a book recommendation system in Python shows how machine learning techniques can improve book discovery.

By comparing the titles and authors of books, the system can suggest books similar to one the reader already likes. This process involves various steps, including dataset preparation, data visualization, vectorization, and similarity calculation, illustrating the practical application of data science in real-world scenarios.

Overall, implementing such a system demonstrates how technology can transform the way readers find and engage with literature.

Frequently asked questions


What algorithms are used in a book recommendation system?

Common algorithms include:

  • Collaborative filtering: Recommends based on user-item interactions.
  • Content-based filtering: Suggests similar items based on past user preferences.
  • Matrix factorization: Reduces dimensionality to uncover user preference patterns.
  • Hybrid models: Combines multiple approaches for improved accuracy.

What technology is used in a book recommendation system?

Technologies used include:

  • Machine learning libraries: Such as TensorFlow and scikit-learn.
  • Databases: SQL and NoSQL databases for data storage.
  • Web technologies: APIs for integration into applications.
  • Big data technologies: Tools like Apache Spark for handling large datasets.

Is a recommendation system AI or machine learning?

Recommendation systems incorporate both:

  • AI: Simulates human decision-making.
  • ML: Uses machine learning techniques to improve recommendations based on data patterns.

Copyright ©2025 Educative, Inc. All rights reserved