Gensim is a robust Python NLP package that supports creating and manipulating Bag-of-Words (BoW) corpora.
The Bag-of-Words (BoW) approach is commonly used in NLP to represent text documents as numerical feature vectors. It ignores the order of words in a document and instead focuses exclusively on word frequencies.
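To make the idea concrete, here is a minimal pure-Python sketch, independent of Gensim, that counts words with collections.Counter. The helper name bag_of_words and the naive whitespace tokenization are assumptions made for illustration only.

```python
from collections import Counter


def bag_of_words(text):
    """Return word counts for a text, ignoring word order (hypothetical helper)."""
    # Naive tokenization: lowercase the text and split on whitespace.
    return Counter(text.lower().split())


a = bag_of_words("the cat chased the dog")
b = bag_of_words("the dog chased the cat")

print(a["the"])  # 2
print(a == b)    # True: same words and counts, only the order differs
```

Both sentences produce the same bag even though they mean different things, which is exactly the information a BoW representation discards.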
In Gensim, a BoW corpus stores, for each document, a list of (word ID, frequency) pairs. We can create a BoW corpus from a simple list of documents by passing each tokenized document to the Dictionary.doc2bow() method.
Creating a BoW corpus involves the following steps:
First, we import all the necessary packages from Gensim as follows:
import gensim
import pprint
from gensim import corpora
from gensim.utils import simple_preprocess
Now, we provide a simple list of documents containing sentences. We will be using two sentences for our convenience.
doc_list = ["Hello, Welcome to Educative Answers.","This is an answer on BoW corpus in Gensim."]
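Next, we preprocess the text by tokenizing each document with simple_preprocess(), which lowercases the text, strips punctuation, and by default discards tokens shorter than 2 or longer than 15 characters. Note that it does not remove stop words; that requires a separate filtering step.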
doc_tokenized = [simple_preprocess(doc) for doc in doc_list]
Now, we create an empty corpora.Dictionary object. Once it is populated from the preprocessed text, it maps each unique word to an integer ID.
dictionary = corpora.Dictionary()
Moving on, we iterate over the tokenized documents and convert each one into a BoW representation using the dictionary.doc2bow() method. Passing allow_update=True adds any new words to the dictionary as they are encountered.
BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]
Finally, we print the BoW corpus to the console:
print(BoW_corpus)
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1),(8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1)]]
These integer IDs are hard to read. To make the output easier to interpret, we can convert the IDs back to words using the dictionary.
id_words = [[(dictionary[id], count) for id, count in line] for line in BoW_corpus]
print(id_words)
The revised output looks like this:
[[('answers', 1), ('educative', 1), ('hello', 1), ('to', 1),('welcome', 1)], [('an', 1), ('answer', 1), ('bow', 1), ('corpus', 1),('gensim', 1), ('in', 1), ('is', 1), ('on', 1), ('this', 1)]]
The complete implementation of the BoW corpus is given below:
import gensim
import pprint
from gensim import corpora
from gensim.utils import simple_preprocess

doc_list = ["Hello, Welcome to Educative Answers.", "This is an answer on BoW corpus in Gensim."]
doc_tokenized = [simple_preprocess(doc) for doc in doc_list]
dictionary = corpora.Dictionary()
BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]
print(BoW_corpus)
print('\n')
id_words = [[(dictionary[id], count) for id, count in line] for line in BoW_corpus]
print(id_words)
Once the BoW corpus is created, it can be utilized for various NLP tasks:
We can apply topic modeling algorithms like LDA on the BoW corpus to extract underlying topics.
We can calculate document similarity using similarity measures like cosine similarity on the BoW vectors.
We can index and retrieve documents efficiently based on keyword queries using the BoW corpus.
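As one example of these tasks, cosine similarity can be computed directly on the sparse (word ID, count) pairs. The sketch below is a plain-Python illustration: the function name cosine_similarity and the query vector are assumptions, while the two document vectors are copied from the corpus printed earlier.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse BoW vectors of (id, count) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    # Only IDs present in both vectors contribute to the dot product.
    dot = sum(count * b.get(word_id, 0) for word_id, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]
doc2 = [(5, 1), (6, 1), (7, 1), (8, 1), (9, 1),
        (10, 1), (11, 1), (12, 1), (13, 1)]
query = [(8, 1), (9, 1)]  # hypothetical query: "corpus gensim"

sim_docs = cosine_similarity(doc1, doc2)
sim_query = cosine_similarity(query, doc2)
print(sim_docs)             # 0.0: the two sentences share no words
print(round(sim_query, 3))  # 0.471
```

Gensim also provides gensim.matutils.cossim() for this sparse format, so a hand-written function like this is mainly useful for seeing what the measure computes.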
In summary, a Bag-of-Words (BoW) corpus in Gensim provides a flexible way to represent text documents as numerical feature vectors. Once a BoW corpus is built, developers can use Gensim's features for topic modeling, document similarity analysis, and information retrieval.