Text normalization (i.e., preparing text, words, and documents for further processing) is one of the most fundamental tasks in the field of Natural Language Processing. Two common text normalization techniques are stemming and lemmatization. nltk.stem, a package of the NLTK library, is one of the most widely used Python tools for stemming and lemmatization.
Examples of stemming and lemmatization:
The words cars, car’s, CAR, Car, and cars’ are all derived from the word car. The goal of normalization with nltk.stem is to map all of these words to car.
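To make that mapping concrete, here is a minimal sketch in plain Python. It is not how nltk.stem works internally; `normalize_token` is a hypothetical helper written for illustration, and its plural rule is a deliberately naive assumption.

```python
def normalize_token(token: str) -> str:
    """Toy normalizer: lowercase, drop possessive markers, strip a plural 's'."""
    word = token.lower()
    # Treat the typographic apostrophe like the ASCII one.
    word = word.replace("\u2019", "'")
    # Drop possessive endings: "car's" -> "car", "cars'" -> "cars".
    if word.endswith("'s"):
        word = word[:-2]
    elif word.endswith("'"):
        word = word[:-1]
    # Naive plural rule (an assumption; it wrongly turns "lens" into "len").
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word


variants = ["cars", "car\u2019s", "CAR", "Car", "cars\u2019"]
normalized = [normalize_token(v) for v in variants]
print(normalized)
```

All five variants collapse to the same token here. Hand-written rules like these break quickly on real vocabulary, which is why a library such as NLTK is normally used instead.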
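Stemming and lemmatization have the same goal: reducing a word to a base form. Stemming does this by stripping suffixes according to fixed rules, so the result is not always a real word. Lemmatization is more accurate, because it uses a vocabulary and the word's context (its part of speech) so that words with the same meaning are grouped under one dictionary form. For example, a lemmatizer can map better to good, which suffix stripping alone cannot do.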
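The difference between the two approaches can be sketched with a toy example in plain Python. Neither function below is NLTK's algorithm: `toy_stem`, `toy_lemmatize`, the suffix list, and the small lemma table are all hypothetical and exist only to show rule-based stripping next to a vocabulary lookup that takes a part of speech.

```python
# Suffix rules for the toy stemmer, tried in order (illustrative, not Porter's rules).
SUFFIXES = ("ies", "ing", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix without consulting any vocabulary."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Tiny hand-made lemma table keyed by (word, part of speech).
LEMMAS = {
    ("better", "a"): "good",
    ("corpora", "n"): "corpus",
    ("studies", "n"): "study",
    ("rocks", "n"): "rock",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Look the word up for the given part of speech; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)


for word, pos in [("rocks", "n"), ("studies", "n"), ("corpora", "n"), ("better", "a")]:
    print(word, toy_stem(word), toy_lemmatize(word, pos))
```

In this sketch the stemmer turns studies into stud and leaves corpora and better unchanged, while the lookup returns study, corpus, and good. Calling `toy_lemmatize("better")` without a part of speech returns better, because the table only knows the adjective sense.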
Lemmatization is a widely used technique in search engines and other retrieval systems.
Moreover, lemmatization introduces a pos parameter, which specifies the part of speech of the word being lemmatized. By default, pos takes the value "n" (noun).
>>> # import libraries (the WordNet data must be installed once, e.g. with nltk.download("wordnet"))
>>> from nltk.stem import WordNetLemmatizer
>>> # create an instance of the class
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize("rocks")
'rock'
>>> lemmatizer.lemmatize("corpora")
'corpus'
>>> # the pos parameter is given; "a" stands for adjective
>>> lemmatizer.lemmatize("better", pos="a")
'good'