Item recommendation algorithms can be broadly classified into two categories:

Collaborative filtering: Recommends items based on the preferences of users with similar tastes.
Content-based filtering: Recommends items based on the properties of the items themselves.
In this Answer, we'll focus on content-based recommendation. This method recommends articles based on their content. Since the content here is natural language text, we'll use natural language processing to perform the task at hand.
Transformers were introduced in 2017 by Vaswani et al. at Google, and models inspired by them have since redefined how natural language processing is done. Transformer-based models produce numerical representations of any given text by running it through stacked encoders/decoders. These representations are called word embeddings, and they can be used to search for related items using measures such as Euclidean distance or cosine similarity.
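To make this concrete, here's a minimal sketch of the idea, assuming the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint (both choices are illustrative, not the lesson's own setup): we encode a few sentences and compare their embeddings with cosine similarity.

```python
# A minimal sketch: embed three sentences and compare them with cosine
# similarity. The model name is an assumption; any SentenceTransformer
# checkpoint would behave the same way.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
    "Deep learning methods for image recognition",
    "Neural networks applied to computer vision",
    "A traditional recipe for sourdough bread",
]
embeddings = model.encode(sentences)

# Related sentences score noticeably higher than unrelated ones.
print(util.cos_sim(embeddings[0], embeddings[1]))  # high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))  # low similarity
```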
Now that we understand which recommendation technique to use and how word embeddings can help find related items, let's move on to learning how to perform text-based recommendations using word embeddings.
We'll use the arXiv paper summary dataset to recommend research papers based on a selected paper. We'll do this by creating word embeddings for each paper's summary, building a search index on these embeddings, and finally searching this index with the embedding of a given paper to find similar items. The main idea is that readers of a given paper are likely to be interested in similar research papers.
The dataset contains the following features:

id: The ID number of the paper
title: The title of the paper
summary: The abstract of the paper
year: The year of publication
month: The month of publication
day: The day of publication

In the following code, we use embeddings of the summaries that have already been generated, for the sake of simplicity. We select an article and then search for similar articles based on the Euclidean distance.
```python
import pandas as pd
import pickle
from IPython.display import display
from sentence_transformers import SentenceTransformer
from sklearn import preprocessing
import faiss
import numpy as np


def id2info(df, I, column):
    '''
    returns the paper info given the paper id
    '''
    return [list(df[df.id == idx][column]) for idx in I[0]]

# Load data
data = pd.read_json('arxivData.json')
df = data.drop(columns=["author", "link", 'tag'])
display(df[['id', 'title']])
print("number of Machine Learning papers: ", df.id.unique().shape[0])

# Encode IDs
le = preprocessing.LabelEncoder()
df['id'] = le.fit_transform(df.id)

# Load or create embeddings
# model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
# Create embeddings: Convert abstracts to vectors
# embeddings = model.encode(df.summary.to_list(), show_progress_bar=True)
# with open('embeddings.pickle', 'wb') as pkl:
#     pickle.dump(embeddings, pkl)
with open('embeddings.pickle', 'rb') as pkl:
    embeddings = pickle.load(pkl)


# Create FAISS index
embeddings = np.array([embedding for embedding in embeddings]).astype("float32")
L2index = faiss.IndexFlatL2(embeddings.shape[1])
L2index = faiss.IndexIDMap(L2index)
L2index.add_with_ids(embeddings, df.id.values)
print("Number of embeddings in the Faiss index: ", L2index.ntotal)
print('Shape of the one embedding: ', embeddings[0].shape)


# Search for similar articles
target_id = 5415
D, I = L2index.search(np.array([embeddings[target_id]]), k=10)

# Display results
target_title = df[df.id == target_id]['title'].values[0]
results = pd.DataFrame({'L2 distance': D.flatten().tolist(), 'Titles': id2info(df, I, 'title')})
display(pd.DataFrame({'Target Title': [target_title]}))
display(results)
```
Lines 10–14: Define a helper function, id2info, that retrieves paper details from the DataFrame. For each ID returned by the search, it selects the matching rows and collects the values of the requested column.
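As a hypothetical illustration, here's how id2info behaves on a tiny DataFrame, where I has the shape FAISS returns (one row of neighbor IDs per query):

```python
import pandas as pd

# A made-up three-paper DataFrame, standing in for the arXiv data
df = pd.DataFrame({'id': [0, 1, 2],
                   'title': ['Paper A', 'Paper B', 'Paper C']})

def id2info(df, I, column):
    '''returns the paper info given the paper id'''
    return [list(df[df.id == idx][column]) for idx in I[0]]

I = [[2, 0]]  # pretend these are the IDs of the two nearest neighbors
print(id2info(df, I, 'title'))  # [['Paper C'], ['Paper A']]
```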
Line 17: Load the dataset.
Line 18: Drop irrelevant columns from the DataFrame.
Lines 23–24: The id column contains string values. Since FAISS only works with integer IDs, we use LabelEncoder from the sklearn package to convert the string IDs to integers.
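Here's a small, self-contained illustration of this step (the ID strings are made-up, arXiv-style examples):

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
string_ids = ['1802.00209', '1603.08983', '1511.06931']  # made-up string IDs

int_ids = le.fit_transform(string_ids)
print(int_ids)                        # [2 1 0] -- classes are sorted, then indexed
print(le.inverse_transform(int_ids))  # recovers the original strings
```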
Lines 27–31: Load the model and generate embeddings. This section is commented out because precomputed embeddings are already available.
Lines 32–33: Load the saved embeddings from a file named embeddings.pickle in the current directory.
Lines 37–40: Set up the FAISS index:

Convert the embeddings list to a NumPy array of type float32.
Create a FAISS index that uses L2 (Euclidean) distance.
Wrap the index with an ID map to connect embeddings with their IDs.
Add the embeddings and their IDs from the DataFrame to the index.
Line 47: Search the index using a selected embedding.
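To see the FAISS calls in isolation, here's a toy version of the index-and-search flow with random vectors standing in for the embeddings (the dimensions and IDs are arbitrary):

```python
import faiss
import numpy as np

# 100 fake "embeddings" of dimension 64, plus an integer ID for each one
vectors = np.random.random((100, 64)).astype("float32")
ids = np.arange(100).astype("int64")

index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2-distance index
index = faiss.IndexIDMap(index)              # lets us attach our own IDs
index.add_with_ids(vectors, ids)
print(index.ntotal)  # 100

# search() expects a 2-D array of queries, so we wrap the single vector
D, I = index.search(np.array([vectors[0]]), k=3)
print(I)  # IDs of the 3 nearest vectors; the query itself comes first
print(D)  # their squared L2 distances, in increasing order
```

Note that IndexFlatL2 performs exact, brute-force search and returns squared L2 distances. FAISS also offers approximate indexes for much larger collections, but exact search is perfectly adequate at this dataset's scale.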
Lines 50–53: Display the search results as recommendations.