How to create a BoW corpus in Gensim

Gensim is a robust Python NLP package that supports creating and manipulating Bag-of-Words (BoW) corpora.

What is Bag-of-Words (BoW)?

The Bag-of-Words (BoW) approach is commonly used in NLP to represent text documents as numerical feature vectors. It ignores the order of words in a document and focuses exclusively on word frequencies.
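
As a minimal sketch of the idea, independent of Gensim, the snippet below counts words with Python's standard collections.Counter. The two example sentences and the whitespace tokenization are assumptions made purely for illustration. They show that two documents containing the same words in a different order end up with the same BoW representation.

```python
from collections import Counter


def bag_of_words(text):
    """Return a word -> count mapping; word order is discarded."""
    # Naive tokenization (an assumption for this sketch): lowercase, split on whitespace.
    return Counter(text.lower().split())


doc_a = "the cat chased the dog"
doc_b = "the dog chased the cat"

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

print(bow_a["the"])    # how often "the" occurs in doc_a
print(bow_a == bow_b)  # compares the two bags, regardless of word order
```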

Creating a BoW Corpus in Gensim

In Gensim, a BoW corpus stores each document as a list of (word ID, frequency) pairs. We can create one from a simple list of documents by passing each tokenized document (a list of words) to the Dictionary.doc2bow() method.

Creating a BoW corpus involves the following steps:

  1. First, we import all the necessary packages from Gensim as follows:

from gensim import corpora
from gensim.utils import simple_preprocess
  2. Now, we provide a simple list of documents containing sentences. We will use two sentences for convenience.

doc_list = [
"Hello, Welcome to Educative Answers.",
"This is an answer on BoW corpus in Gensim."
]
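  3. Next, we preprocess the text data by tokenizing the documents. The simple_preprocess() function lowercases the text, strips punctuation, and discards very short or very long tokens. It does not remove stop words, which is why words such as "to" and "is" remain in the corpus.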

doc_tokenized = [simple_preprocess(doc) for doc in doc_list]
  4. Now, we create an empty Dictionary object from the corpora module. Once it is populated in the next step, it maps each word to a unique integer ID.

dictionary = corpora.Dictionary()
  5. Moving on, we iterate over the tokenized documents and convert each one into a BoW representation using the dictionary.doc2bow() method. Passing allow_update=True adds any new words to the dictionary as they are encountered.

BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]
  6. Finally, we print the BoW corpus to the console:

print(BoW_corpus)

Output

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1),
(8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1)]]

This output is hard to read because it shows only word IDs. To make it readable, we can convert the IDs back to words using the dictionary.

id_words = [[(dictionary[token_id], count) for token_id, count in line] for line in BoW_corpus]
print(id_words)

The revised output looks like this:

[[('answers', 1), ('educative', 1), ('hello', 1), ('to', 1),
('welcome', 1)], [('an', 1), ('answer', 1), ('bow', 1), ('corpus', 1),
('gensim', 1), ('in', 1), ('is', 1), ('on', 1), ('this', 1)]]
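
The words within each document above appear in alphabetical order, which is consistent with new words receiving the next free IDs in sorted order. The following is a simplified, pure-Python stand-in for that behavior, not Gensim's actual code. The function name doc2bow_sketch and the hand-written token lists are assumptions for illustration; the tokens match the words shown in the output above.

```python
from collections import Counter


def doc2bow_sketch(tokens, token2id):
    """Simplified stand-in for doc2bow(..., allow_update=True); not Gensim's code."""
    counts = Counter(tokens)
    # Words not seen before get the next free IDs, assigned in alphabetical order.
    for token in sorted(t for t in counts if t not in token2id):
        token2id[token] = len(token2id)
    # Return (token_id, count) pairs sorted by ID.
    return sorted((token2id[t], c) for t, c in counts.items())


token2id = {}  # shared word -> ID mapping, filled as documents are processed
docs = [
    ["hello", "welcome", "to", "educative", "answers"],
    ["this", "is", "an", "answer", "on", "bow", "corpus", "in", "gensim"],
]
corpus = [doc2bow_sketch(doc, token2id) for doc in docs]

print(corpus)
print(token2id["answers"], token2id["welcome"], token2id["an"])
```

Because the mapping is shared across calls, IDs assigned while processing the first document stay fixed when the second document is converted.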

Code

The complete implementation of the BoW corpus is given below:

from gensim import corpora
from gensim.utils import simple_preprocess

# Sample documents
doc_list = [
    "Hello, Welcome to Educative Answers.",
    "This is an answer on BoW corpus in Gensim."
]

# Tokenize and lowercase each document
doc_tokenized = [simple_preprocess(doc) for doc in doc_list]

# Start with an empty dictionary; doc2bow() fills it because allow_update=True
dictionary = corpora.Dictionary()
BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]
print(BoW_corpus)
print('\n')

# Replace each word ID with the word itself for readability
id_words = [[(dictionary[token_id], count) for token_id, count in line] for line in BoW_corpus]
print(id_words)

Usage

Once the BoW corpus is created, it can be utilized for various NLP tasks:

  • We can apply topic modeling algorithms like LDA on the BoW corpus to extract underlying topics.

  • We can calculate document similarity using similarity measures like cosine similarity on the BoW vectors.

  • We can index and retrieve documents efficiently based on keyword queries using the BoW corpus.
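
To make the document similarity point concrete, here is a minimal pure-Python sketch of cosine similarity over sparse BoW vectors in the (word ID, count) format shown earlier. The cosine_similarity helper and the query vector are assumptions for illustration. Gensim provides its own similarity classes in gensim.similarities, which would normally be preferred for larger corpora.

```python
import math


def cosine_similarity(bow_a, bow_b):
    """Cosine similarity between two sparse BoW vectors of (token_id, count) pairs."""
    vec_a, vec_b = dict(bow_a), dict(bow_b)
    # Only IDs present in both vectors contribute to the dot product.
    dot = sum(count * vec_b.get(token_id, 0) for token_id, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# The two BoW vectors from the corpus built earlier in this answer.
doc_1 = [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]
doc_2 = [(5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1)]

# Hypothetical query vector: word 0 twice and word 9 once.
query = [(0, 2), (9, 1)]

print(cosine_similarity(doc_1, query))
print(cosine_similarity(doc_2, query))
print(cosine_similarity(doc_1, doc_2))
```

The two original documents share no words, so their similarity is zero, while the query overlaps with each of them to a different degree.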

Conclusion

In summary, a Bag-of-Words (BoW) corpus in Gensim represents text documents as numerical feature vectors. Once a BoW corpus is built, developers can use Gensim's features for topic modeling, document similarity analysis, and information retrieval.

Copyright ©2025 Educative, Inc. All rights reserved