TextBlob is a widely used, beginner-friendly Python library for NLP tasks such as computing sentiment scores, filtering, and tokenization.
Before we move on, we need to install TextBlob. To do so, run the following commands in a command-line tool:
pip install -U textblob
python -m textblob.download_corpora
Before we proceed, it is important to understand the following term:
Tokenization is the process of splitting a text (corpus) into smaller units called tokens, such as sentences or words.
Suppose that the input text is “I love to eat fast food.”
After applying tokenization to this input text, the output contains the individual words of the sentence: [“I”, “love”, “to”, “eat”, “fast”, “food”].
We can also split a single word into character tokens. For instance, banana can be tokenized as b-a-n-a-n-a.
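The two examples above can be reproduced with a minimal sketch that uses only the Python standard library. The helper names word_tokens and char_tokens are hypothetical, and the regular expression is a rough stand-in chosen for illustration. It is not how TextBlob tokenizes internally.

```python
import re

def word_tokens(text):
    # Hypothetical helper: keep runs of letters, digits, and apostrophes,
    # so punctuation such as the final period is dropped.
    return re.findall(r"[A-Za-z0-9']+", text)

def char_tokens(word):
    # Hypothetical helper: a string is already a sequence of characters.
    return list(word)

print(word_tokens("I love to eat fast food."))
print("-".join(char_tokens("banana")))
```

A pattern this simple does not handle hyphenated words, contractions, or non-English text well, which is one reason to rely on a library tokenizer in practice.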
Let’s look at code for tokenizing text using TextBlob.
from textblob import TextBlob

corpus = "And all the men and women merely players. \
They have their exits and their entrances. \
And one man in his time plays many parts."

txt_obj = TextBlob(corpus)

print("Word tokenization : ", txt_obj.words)

print("Sentence tokenization: ", txt_obj.sentences)
In line 1, we import the required package.
In lines 3 to 5, we create a sample corpus of text.
In line 7, we create a TextBlob object and pass it the corpus we want to tokenize.
In line 9, we print the corpus tokenized into words. Punctuation is excluded from the word tokens.
In line 11, we print the corpus tokenized into sentences.
When we pre-process the text data, tokenization plays an important role. It divides the corpus into sentences, words, or even characters.
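To make the sentence-level case concrete without any installed packages, here is a naive sketch that splits the sample corpus on sentence-ending punctuation. The function name split_sentences is hypothetical, and the rule (split after ".", "!", or "?" followed by whitespace) is an assumption made for illustration. It is not TextBlob's sentence tokenizer.

```python
import re

def split_sentences(text):
    # Naive rule: split at whitespace that follows ., !, or ?
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings, for example when the input is empty.
    return [p for p in parts if p]

corpus = ("And all the men and women merely players. "
          "They have their exits and their entrances. "
          "And one man in his time plays many parts.")

for sentence in split_sentences(corpus):
    print(sentence)
```

This rule breaks on abbreviations such as "Dr. Smith", where the period does not end a sentence, so it should be treated only as an approximation of what a trained tokenizer does.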
TextBlob is a popular library for NLP. It offers a simple API that helps us perform common NLP tasks quickly.