Preprocessing steps in Natural Language Processing (NLP)

Overview

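Natural Language Processing (NLP) is the field concerned with enabling machines to read, write, understand, and derive meaning from human language. Before a model can work with raw text, the text usually goes through a series of preprocessing steps.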

Steps in NLP

Tokenization
Stemming
Lemmatization
Part-of-speech (POS) tagging
Named entity recognition
Chunking

Let’s try to understand them in more detail.

Tokenization: We break the text down into smaller units called tokens, typically words and punctuation marks. Check the example below to see how this is done.

Text: The cat sat on the bed.
Tokens: The, cat, sat, on, the, bed
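
As a rough illustration of the idea, the sketch below splits the example sentence with a regular expression. The function name `simple_tokenize` and the pattern are assumptions made for this illustration. Library tokenizers such as NLTK's `word_tokenize` handle many more cases and usually keep punctuation marks as separate tokens.

```python
import re

def simple_tokenize(text, keep_punctuation=False):
    """Split text into word tokens; optionally keep punctuation marks as tokens."""
    # \w+ matches runs of letters, digits, and underscores.
    # [^\w\s] matches a single character that is neither a word character nor whitespace.
    pattern = r"\w+|[^\w\s]" if keep_punctuation else r"\w+"
    return re.findall(pattern, text)

tokens = simple_tokenize("The cat sat on the bed.")
print(tokens)  # ['The', 'cat', 'sat', 'on', 'the', 'bed']
```

Calling `simple_tokenize(text, keep_punctuation=True)` also returns the final period as its own token.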

Stemming: We strip affixes (mostly suffixes) from a word using fixed rules to obtain its stem, or root form. The stem does not have to be a valid dictionary word. Check the example below to see how it’s done.

List of words: Affection, Affects, Affecting, Affected
Root word: Affect
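
To make the rule-based nature of stemming concrete, here is a minimal sketch of a suffix stripper. It is a toy, not the Porter algorithm: the `SUFFIXES` rule set and the `toy_stem` function are hypothetical and only cover the words in the example above.

```python
# Hypothetical, deliberately tiny rule set; real stemmers use many more rules.
SUFFIXES = ("ing", "ion", "ed", "s")

def toy_stem(word):
    """Lowercase the word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least three characters remain, so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["Affection", "Affects", "Affecting", "Affected"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['affect', 'affect', 'affect', 'affect']
```

Because the rules only look at the spelling, a stemmer like this cannot map an irregular form such as "went" to "go".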

Lemmatization: We map the different inflected forms of a word to a single base form called the lemma. Unlike a stem, a lemma is always a valid dictionary word. Check the example below to see how it’s done.

List of words: going, gone, went
Lemma: go
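
The example above can be sketched as a vocabulary lookup, which is the key difference from rule-based stemming. The `LEMMA_TABLE` dictionary and the `toy_lemmatize` function below are hypothetical stand-ins. A real lemmatizer relies on a full vocabulary such as WordNet rather than a hand-written table.

```python
# Hypothetical, hand-written lookup table covering only the forms of "go".
LEMMA_TABLE = {"going": "go", "gone": "go", "went": "go", "goes": "go"}

def toy_lemmatize(word):
    """Return the lemma from the table, or the lowercased word itself if it is unknown."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

lemmas = [toy_lemmatize(w) for w in ["going", "gone", "went"]]
print(lemmas)  # ['go', 'go', 'go']
```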

POS tagging: We identify the part of speech of each token. Check the example below to see how it’s done.

Sentence: The dog killed the bat.
Parts of speech: Definite article, noun, verb, definite article, noun.

Named entity recognition: We classify named entities mentioned in the text into categories such as “People,” “Locations,” “Organizations,” and so on. Check the example below to see how it’s done.

Text: Google CEO Sundar Pichai resides in New York.
Named entity recognition:
Google — Organization
Sundar Pichai — Person
New York — Location
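
The classification above can be imitated with a dictionary lookup, sometimes called a gazetteer. This is only a simplified stand-in: the `GAZETTEER` table and the `toy_ner` function are hypothetical, and practical NER systems use trained models that can also recognize names they have never seen.

```python
# Hypothetical gazetteer: a hand-made mapping from known names to entity categories.
GAZETTEER = {
    "Google": "Organization",
    "Sundar Pichai": "Person",
    "New York": "Location",
}

def toy_ner(text):
    """Return (entity, category) pairs for gazetteer names found in the text.

    Only the first occurrence of each name is reported, in order of appearance.
    """
    found = []
    for name, category in GAZETTEER.items():
        position = text.find(name)
        if position != -1:
            found.append((position, name, category))
    # Sort by the position where each entity starts in the text.
    return [(name, category) for _, name, category in sorted(found)]

entities = toy_ner("Google CEO Sundar Pichai resides in New York.")
print(entities)
# [('Google', 'Organization'), ('Sundar Pichai', 'Person'), ('New York', 'Location')]
```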

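Chunking: We group individual tokens into larger, meaningful units such as noun phrases, usually based on their POS tags. For example, in the sentence “The dog killed the bat,” the tokens “The” and “dog” form the noun phrase “The dog.”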
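
A minimal sketch of chunking with NLTK's `RegexpParser` is shown below. To keep it independent of any downloaded NLTK data, the sentence is tagged by hand with Penn Treebank tags. The grammar is an assumption chosen for this illustration: it treats an optional determiner, any number of adjectives, and a noun as one noun phrase (NP).

```python
from nltk import RegexpParser

# Hand-tagged tokens for "The dog killed the bat" (Penn Treebank tags).
tagged_tokens = [
    ("The", "DT"), ("dog", "NN"), ("killed", "VBD"),
    ("the", "DT"), ("bat", "NN"),
]

# Chunk grammar: optional determiner, any adjectives, then a singular noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = RegexpParser(grammar)
tree = parser.parse(tagged_tokens)

# Collect the words of every subtree labeled NP.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]
print(noun_phrases)  # ['The dog', 'the bat']
```

The verb "killed" stays outside any chunk because no rule in the grammar matches its tag.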

Example

import nltk
from nltk import pos_tag
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the NLTK data packages (tokenizer models, WordNet, POS tagger) used below
nltk.download('all-nltk')
print("\n")

# Tokenization: split the text into word and punctuation tokens
print("Creating tokens from the text:")
text = "My name is Adithya Challa I wrote this shot!"
tokens = word_tokenize(text)
print(tokens)
print("\n")

# Stemming: strip suffixes to reduce each word to its stem
print("Stemming:")
words = ["light", "lighting", "lights"]
ps = PorterStemmer()
for w in words:
    print(ps.stem(w))
print("\n")

# Lemmatization: convert an inflected form into its dictionary form (lemma).
# pos="v" marks the word as a verb; the default (noun) would return "playing" unchanged.
print("Lemmatization:")
lem = WordNetLemmatizer()
print(lem.lemmatize("playing", pos="v"))
print("\n")

# POS tagging: label each token with its part of speech
print("POS tags:")
print(pos_tag(tokens))

Implementation of NLP preprocessing steps

License: Creative Commons-Attribution-ShareAlike 4.0 (CC-BY-SA 4.0)