Gensim is a widely used Python library for natural language processing (NLP). Its MatrixSimilarity class measures how similar documents are based on their content.
The gensim.similarities.MatrixSimilarity class
The gensim.similarities.MatrixSimilarity class computes the similarity between documents using cosine similarity. It builds an index over a corpus and, for a query document, returns one similarity score per indexed document. This helps us quantify document similarity, identify related documents, and gain insight into the relationships between texts.
The syntax for creating a gensim.similarities.MatrixSimilarity index is given below:
similarity_matrix = MatrixSimilarity(corpus, num_features=num_features)
corpus is a required parameter: the documents to index, given as a list of vectors, for example bag-of-words lists of (word ID, count) pairs.
num_features is an optional parameter: the dimensionality of the feature space, typically the vocabulary size. If it is omitted, Gensim infers it by scanning the corpus.
Note: Make sure you have the Gensim library installed (you can install it using pip install gensim).
Let's use gensim.similarities.MatrixSimilarity in the code below:
from gensim.similarities import MatrixSimilarity
from gensim.corpora import Dictionary

texts = [['apple', 'banana', 'orange'], ['orange', 'kiwi', 'grape']]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
similarity_matrix = MatrixSimilarity(corpus)
similarity = similarity_matrix[corpus[0]]
print(similarity)
Lines 1–2: First, we import MatrixSimilarity from gensim.similarities and Dictionary from gensim.corpora.
Line 4: Next, we define texts, a list of two tokenized documents.
Line 5: Here, we create a Dictionary object, dictionary, which maps each unique word in the documents to an integer ID.
Line 6: Now, we convert each document in texts to a bag-of-words representation using dictionary.doc2bow(). Each document becomes a list of (word ID, frequency) tuples. We store the result in the corpus variable.
Lines 7–8: Moving on, we create similarity_matrix with MatrixSimilarity(corpus), which builds a similarity index over the given corpus. Then, we compute the similarity of the first document (corpus[0]) to every indexed document with similarity_matrix[corpus[0]].
Line 9: Finally, we print the similarity scores between the first document and each document in the corpus.
Upon execution, the code prints one score per document in the corpus, including the first document's score against itself.
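The scores are cosine similarities. A value of 1 means the two vectors point in the same direction (as identical documents do), and a value of 0 means they are orthogonal, that is, the documents share no words. For bag-of-words vectors, whose counts are never negative, the scores fall between 0 and 1, with higher values indicating more overlap.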
The output looks like this:
[0.99999994, 0.3333333]
The first value, 0.99999994, is the first document compared with itself. The two vectors are identical, so the score is 1 up to floating-point rounding (the index stores 32-bit floats by default). The second value, 0.3333333, is the similarity between the first and second documents. They share one word (orange) out of three words each, which gives a cosine similarity of 1/3: some overlap, but far from identical.
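The two scores can be checked by hand. The sketch below is a plain-Python cosine similarity, not Gensim's implementation; the cosine_similarity helper and the column ordering of the count vectors are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Columns (an ordering chosen for illustration):
# apple, banana, orange, kiwi, grape
doc1 = [1, 1, 1, 0, 0]
doc2 = [0, 0, 1, 1, 1]

score_self = cosine_similarity(doc1, doc1)   # document against itself
score_other = cosine_similarity(doc1, doc2)  # one shared word out of three
print(score_self, score_other)
```

The dot product of the two documents is 1 (only orange is shared) and each vector has length √3, so the second score is 1/3. The hand computation uses 64-bit floats, so its digits differ slightly from the 32-bit values Gensim reports.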
Overall, the MatrixSimilarity class in Gensim is a convenient tool for exploring document similarity. By indexing a corpus once and querying it with document vectors, NLP developers can compare documents and measure how similar they are.