Item recommendation algorithms can be broadly classified into two categories:

Collaborative filtering: Recommends items based on the preferences of users with similar tastes.
Content-based filtering: Recommends items based on the properties of the items themselves.
In this Answer, we'll focus on content-based recommendation. This method recommends articles based on their content. Since the content here is natural language text, we'll use natural language processing to perform the task at hand.
Transformers were introduced in 2017 by Vaswani et al. at Google, and models inspired by them have since redefined how natural language processing is done. Transformer-based models produce numerical representations of any given text by running it through stacked encoders/decoders. These representations are called word embeddings, and they can be used to search for related items using measures such as Euclidean distance or cosine similarity.
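To make this concrete, here's a minimal sketch of the idea, assuming the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint (both choices are illustrative, not the lesson's own setup): we encode a few sentences and compare their embeddings with cosine similarity.

```python
# A minimal sketch: embed three sentences and compare them with cosine
# similarity. The model name is an assumption; any SentenceTransformer
# checkpoint would behave the same way.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
    "Deep learning methods for image recognition",
    "Neural networks applied to computer vision",
    "A traditional recipe for sourdough bread",
]
embeddings = model.encode(sentences)

# Related sentences score noticeably higher than unrelated ones.
print(util.cos_sim(embeddings[0], embeddings[1]))  # high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))  # low similarity
```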
Now that we understand which recommendation technique to use and how word embeddings can help find related items, let's move on to learning how to perform text-based recommendations using word embeddings.
We'll use the arXiv paper summary dataset to recommend research papers based on a selected paper. We'll do this by creating word embeddings for each paper's summary, building a search index on these embeddings, and finally searching this index with the embedding of a given paper to find similar items. The main idea is that readers of a given paper are likely to be interested in similar research papers.
The dataset contains the following features:

id: The ID number of the paper
title: The title of the paper
summary: The abstract of the paper
year: The year of publication
month: The month of publication
day: The day of publication

In the following code, we use embeddings of the summaries that have already been generated, for the sake of simplicity. We select an article and then search for similar articles based on the Euclidean distance.
```python
import pandas as pd
import pickle
from IPython.display import display
from sentence_transformers import SentenceTransformer
from sklearn import preprocessing
import faiss
import numpy as np


def id2info(df, I, column):
    '''
    returns the paper info given the paper id
    '''
    return [list(df[df.id == idx][column]) for idx in I[0]]

# Load data
data = pd.read_json('arxivData.json')
df = data.drop(columns=["author", "link", 'tag'])
display(df[['id', 'title']])
print("number of Machine Learning papers: ", df.id.unique().shape[0])

# Encode IDs
le = preprocessing.LabelEncoder()
df['id'] = le.fit_transform(df.id)

# Load or create embeddings
# model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
# Create embeddings: Convert abstracts to vectors
# embeddings = model.encode(df.summary.to_list(), show_progress_bar=True)
# with open('embeddings.pickle', 'wb') as pkl:
#     pickle.dump(embeddings, pkl)
with open('embeddings.pickle', 'rb') as pkl:
    embeddings = pickle.load(pkl)


# Create FAISS index
embeddings = np.array([embedding for embedding in embeddings]).astype("float32")
L2index = faiss.IndexFlatL2(embeddings.shape[1])
L2index = faiss.IndexIDMap(L2index)
L2index.add_with_ids(embeddings, df.id.values)
print("Number of embeddings in the Faiss index: ", L2index.ntotal)
print('Shape of the one embedding: ', embeddings[0].shape)


# Search for similar articles
target_id = 5415
D, I = L2index.search(np.array([embeddings[target_id]]), k=10)

# Display results
target_title = df[df.id == target_id]['title'].values[0]
results = pd.DataFrame({'L2 distance': D.flatten().tolist(), 'Titles': id2info(df, I, 'title')})
display(pd.DataFrame({'Target Title': [target_title]}))
display(results)
```
Lines 10–14: Define a helper function, id2info, that retrieves paper details from the DataFrame. For each ID returned by the search, it selects the matching rows and collects the values of the requested column.
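As a hypothetical illustration, here's how id2info behaves on a tiny DataFrame, where I has the shape FAISS returns (one row of neighbor IDs per query):

```python
import pandas as pd

# A made-up three-paper DataFrame, standing in for the arXiv data
df = pd.DataFrame({'id': [0, 1, 2],
                   'title': ['Paper A', 'Paper B', 'Paper C']})

def id2info(df, I, column):
    '''returns the paper info given the paper id'''
    return [list(df[df.id == idx][column]) for idx in I[0]]

I = [[2, 0]]  # pretend these are the IDs of the two nearest neighbors
print(id2info(df, I, 'title'))  # [['Paper C'], ['Paper A']]
```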
Line 17: Load the dataset.
Line 18: Drop irrelevant columns from the DataFrame.
Lines 23–24: The id column contains string values. Since FAISS only works with integer IDs, we use LabelEncoder from the sklearn package to convert the string IDs to integers.
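Here's a small, self-contained illustration of this step (the ID strings are made-up, arXiv-style examples):

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
string_ids = ['1802.00209', '1603.08983', '1511.06931']  # made-up string IDs

int_ids = le.fit_transform(string_ids)
print(int_ids)                        # [2 1 0] -- classes are sorted, then indexed
print(le.inverse_transform(int_ids))  # recovers the original strings
```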
Lines 27–31: Load the model and generate embeddings. This section is commented out because precomputed embeddings are already available.
Lines 32–33: Load the saved embeddings from a file named embeddings.pickle in the current directory.
Lines 37–40: Set up the FAISS index:

Convert the embeddings list to a NumPy array of type float32.
Create a FAISS index that uses L2 (Euclidean) distance.
Wrap the index with an ID map to connect embeddings with their IDs.
Add the embeddings and their IDs from the DataFrame to the index.
Line 47: Search the index using a selected embedding.
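To see the FAISS calls in isolation, here's a toy version of the index-and-search flow with random vectors standing in for the embeddings (the dimensions and IDs are arbitrary):

```python
import faiss
import numpy as np

# 100 fake "embeddings" of dimension 64, plus an integer ID for each one
vectors = np.random.random((100, 64)).astype("float32")
ids = np.arange(100).astype("int64")

index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2-distance index
index = faiss.IndexIDMap(index)              # lets us attach our own IDs
index.add_with_ids(vectors, ids)
print(index.ntotal)  # 100

# search() expects a 2-D array of queries, so we wrap the single vector
D, I = index.search(np.array([vectors[0]]), k=3)
print(I)  # IDs of the 3 nearest vectors; the query itself comes first
print(D)  # their squared L2 distances, in increasing order
```

Note that IndexFlatL2 performs exact, brute-force search and returns squared L2 distances. FAISS also offers approximate indexes for much larger collections, but exact search is perfectly adequate at this dataset's scale.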
Lines 50–53: Display the search results as recommendations.