In Natural Language Processing (NLP), dealing with text data efficiently is essential for tasks like text classification, sentiment analysis, and information retrieval. One of the fundamental preprocessing steps in NLP is lemmatization. Lemmatization reduces different word forms to a common base form, simplifying text analysis. In this Answer, we’ll explore the concept of lemmatization, its importance, and how to perform it using the popular NLP library, spaCy.
A lemma is the base form of a token. For example, the lemma of the word eating is eat. The lemma represents the canonical or dictionary form of a word, which helps in grouping together words with similar meanings. It is commonly used in NLP tasks such as text classification, information retrieval, and language modeling.
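To make the idea concrete before bringing in spaCy, here is a deliberately tiny, dictionary-based sketch of what a lemmatizer does. The `TOY_LEMMAS` table and `toy_lemmatize` function are made up purely for illustration; real lemmatizers such as spaCy’s rely on large lookup tables and part-of-speech-aware rules rather than a hand-written dictionary:

```python
# A toy illustration of the idea behind lemmatization: map inflected
# word forms to a shared base form. This hand-made mapping is NOT
# spaCy's lemmatizer; it only demonstrates the concept.
TOY_LEMMAS = {
    "eating": "eat", "ate": "eat", "eats": "eat",
    "years": "year",
    "better": "good",
}

def toy_lemmatize(token: str) -> str:
    # Fall back to the (lowercased) token itself when no lemma is known.
    return TOY_LEMMAS.get(token.lower(), token.lower())

print(toy_lemmatize("Eating"))  # eat
print(toy_lemmatize("years"))   # year
print(toy_lemmatize("place"))   # place (unknown words pass through)
```

Grouping "eating," "ate," and "eats" under the single lemma "eat" is exactly what lets downstream tasks treat them as the same word.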
spaCy is a popular NLP library in Python that provides elegant solutions for various NLP and ML-related tasks, including lemmatization. For this task, we can use spaCy’s built-in lemmatizer. Let’s see how we can achieve this:
```python
import spacy
nlp = spacy.load("en_core_web_md", disable=["ner", "parser"])
doc = nlp("I have been working at this place for many years")
for token in doc:
    print(token.text, token.lemma_)
```
Let’s go over the code above.
Lines 1–2: We import the `spacy` library and load the `en_core_web_md` language model. The `en_core_web_md` model includes a comprehensive vocabulary, word vectors, POS tags, and syntactic dependencies learned from web text, making it suitable for general-purpose text processing. The `disable=["ner", "parser"]` parameter disables the named entity recognition (NER) and syntactic parser components of the spaCy pipeline since we are not using them, which reduces memory usage and speeds up processing.
Line 3: We create a `doc` object by passing our example sentence to the model.
Lines 4–5: We declare a loop that iterates through all the tokens in the `doc` and prints each token’s text and lemma.
Let’s look at another example where we’ll process a large text in smaller batches to avoid memory overload. We’ll use the `en_core_web_md` model and process the text in chunks.
```python
import spacy

# Function to process text in batches
def process_text_in_batches(text, batch_size=10):
    nlp = spacy.load("en_core_web_md")
    for i in range(0, len(text), batch_size):
        batch = text[i:i + batch_size]
        doc = nlp(batch)
        for token in doc:
            print(token.text, token.lemma_)

# Sample large text
text = "I have been working at this place for many years. " * 1000

# Process the text in batches of 100 characters
process_text_in_batches(text, batch_size=100)
```
Here’s the line-by-line explanation of the code example above:
Line 1: We import the `spacy` library.
Line 4: We define a function named `process_text_in_batches`.
Line 5: We load the `en_core_web_md` language model.
Line 6: We declare a loop that iterates through the text in steps of `batch_size`.
Line 7: We create a batch of the text with the specified batch size.
Line 8: We process the batch with the spaCy model to create a `doc` object.
Line 9: We declare a loop that iterates through all the tokens in the `doc`.
Line 10: We print each token’s text and lemma.
Line 13: We create a sample large text by repeating a sentence 1000 times.
Line 16: We call the `process_text_in_batches` function with the sample text and a batch size of 100 characters.
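One caveat with slicing text into fixed-size character chunks is that a batch boundary can fall in the middle of a word, so the two halves get tokenized (and lemmatized) as separate, broken tokens. The model-free sketch below isolates the slicing logic to make this visible; `chunk_text` and `chunk_text_on_words` are hypothetical helpers written for this illustration, not part of spaCy:

```python
def chunk_text(text, batch_size):
    # Slice the text into fixed-size character chunks, exactly as the
    # range-based loop in the example above does.
    return [text[i:i + batch_size] for i in range(0, len(text), batch_size)]

def chunk_text_on_words(text, batch_size):
    # A boundary-aware alternative: accumulate whole words until adding
    # one more would exceed batch_size. A single word longer than
    # batch_size is kept whole rather than split.
    chunks, current = [], ""
    for word in text.split(" "):
        candidate = word if not current else current + " " + word
        if len(candidate) > batch_size and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "I have been working at this place for many years."
print(chunk_text(sample, 20))
# ['I have been working ', 'at this place for ma', 'ny years.']
# Note the word "many" is split across the last two chunks.

print(chunk_text_on_words(sample, 20))
# ['I have been working', 'at this place for', 'many years.']
```

In practice, batching on sentence or word boundaries (or passing a list of texts to spaCy) keeps tokens intact, whereas raw character slicing can corrupt the tokens near every boundary.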