Preprocessing steps in Natural Language Processing (NLP)

Overview

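Natural Language Processing (NLP) is the field concerned with enabling machines to read, write, understand, and derive meaning from human language. Before text can be analyzed, it usually goes through a series of preprocessing steps.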

Steps in NLP

  • Tokenization
  • Stemming
  • Lemmatization
  • Part-of-speech (POS) tagging
  • Named entity recognition
  • Chunking

Let’s try to understand them in more detail.

  1. Tokenization: We break the text down into smaller units called tokens, typically words and punctuation marks. The example below shows how this is done.

Text: The cat sat on the bed.
Tokens: The, cat, sat, on, the, bed
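As a rough illustration of the idea, the following sketch splits the example sentence with a regular expression. The function name simple_tokenize and the pattern \w+ are assumptions chosen for simplicity: the pattern keeps runs of letters and digits and drops punctuation, which is cruder than what a library tokenizer such as NLTK's word_tokenize does.

```python
import re


def simple_tokenize(text):
    """Return the word tokens in text, ignoring punctuation (toy tokenizer)."""
    # \w+ matches runs of letters, digits, and underscores
    return re.findall(r"\w+", text)


tokens = simple_tokenize("The cat sat on the bed.")
print(tokens)  # ['The', 'cat', 'sat', 'on', 'the', 'bed']
```

A real tokenizer would normally keep the final period as its own token rather than discard it.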

  2. Stemming: We strip affixes (mostly suffixes) from a word to obtain its stem. The stem is produced by fixed rules, so it is not always a valid dictionary word. The example below shows how it’s done.

List of words: Affection, Affects, Affecting, Affected
Root word: Affect
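The word list above can be run through NLTK's PorterStemmer, which needs no extra data downloads. This is a minimal sketch, assuming NLTK is installed. The Porter algorithm is only one of several stemmers, and it lowercases its input by default, so the stems come back in lowercase.

```python
from nltk.stem import PorterStemmer

words = ["Affection", "Affects", "Affecting", "Affected"]
stemmer = PorterStemmer()  # rule-based suffix stripping; lowercases input by default

stems = [stemmer.stem(w) for w in words]
for word, stem in zip(words, stems):
    print(word, "->", stem)
```

All four words should reduce to the same stem, which is what makes stemming useful for grouping related word forms.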

  3. Lemmatization: We map the different inflected forms of a word to a single base form called the lemma. Unlike a stem, a lemma is always a valid dictionary word. The example below shows how it’s done.

List of words: going, gone, went
Lemma: go
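Irregular forms such as "went" cannot be reduced to "go" by stripping suffixes, so lemmatizers rely on a vocabulary. The sketch below imitates that with a tiny hand-written table. LEMMA_TABLE and lookup_lemma are hypothetical names used only for illustration. A real lemmatizer, such as NLTK's WordNetLemmatizer, uses a full dictionary and the word's part of speech.

```python
# Hypothetical, hand-written table; a real lemmatizer consults a full dictionary.
LEMMA_TABLE = {"going": "go", "gone": "go", "went": "go", "goes": "go"}


def lookup_lemma(word):
    """Return the lemma from the table, or the lowercased word if it is unknown."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


for form in ["going", "gone", "went"]:
    print(form, "->", lookup_lemma(form))  # each form maps to "go"
```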

  4. POS tagging: We identify the part of speech of each token. The example below shows how it’s done.

Sentence: The dog killed the bat.
Parts of speech: Definite article, noun, verb, definite article, noun.
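To make the idea concrete, here is a minimal lookup-based tagger for the sentence above, built with NLTK's UnigramTagger and DefaultTagger. It is a sketch, not how a trained tagger works: the lexicon dictionary is a hypothetical, hand-written mapping, the tags follow the Penn Treebank convention, and any unknown word is simply assumed to be a noun.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Hypothetical mini-lexicon (DT = determiner, VBD = past-tense verb)
lexicon = {"The": "DT", "the": "DT", "killed": "VBD", ".": "."}

# Words missing from the lexicon fall back to NN (singular noun)
tagger = UnigramTagger(model=lexicon, backoff=DefaultTagger("NN"))

tokens = ["The", "dog", "killed", "the", "bat", "."]
tagged = tagger.tag(tokens)
print(tagged)
# [('The', 'DT'), ('dog', 'NN'), ('killed', 'VBD'), ('the', 'DT'), ('bat', 'NN'), ('.', '.')]
```

A pretrained tagger such as nltk.pos_tag decides each tag from context instead of a fixed table, so it can handle words that act as more than one part of speech.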

  5. Named entity recognition: We classify named entities mentioned in the text into categories such as “People,” “Locations,” and “Organizations.” The example below shows how it’s done.

Text: Google CEO Sundar Pichai resides in New York.
Named entity recognition:
Google — Organization
Sundar Pichai — Person
New York — Location
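The simplest way to approximate this step is to look names up in a predefined list (a gazetteer). The sketch below does that for the example sentence. GAZETTEER and find_entities are hypothetical names, and the approach is a deliberate simplification: practical NER systems use statistical or neural models so they can recognize names they have never seen.

```python
# Hypothetical gazetteer: entity name -> category
GAZETTEER = {
    "Google": "Organization",
    "Sundar Pichai": "Person",
    "New York": "Location",
}


def find_entities(text):
    """Return (entity, category) pairs for gazetteer names in text, by first appearance."""
    found = [
        (text.find(name), name, label)
        for name, label in GAZETTEER.items()
        if name in text
    ]
    return [(name, label) for _, name, label in sorted(found)]


entities = find_entities("Google CEO Sundar Pichai resides in New York.")
for name, label in entities:
    print(name, "-", label)
```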

  6. Chunking: We group individual tokens into larger meaningful units (chunks), such as noun phrases, usually based on their part-of-speech tags.
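Chunking can be sketched with NLTK's RegexpParser, which groups tagged tokens according to a pattern over their tags. In the example below, the tagged sentence is written by hand so that no tagger model is needed. The one-rule noun-phrase grammar is an assumption made for illustration, not a complete grammar.

```python
from nltk.chunk import RegexpParser

# Tagged tokens written by hand (Penn Treebank tags)
tagged = [("The", "DT"), ("dog", "NN"), ("killed", "VBD"), ("the", "DT"), ("bat", "NN")]

# NP = optional determiner, any number of adjectives, then a noun
parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = parser.parse(tagged)

# Collect the words of every NP chunk found in the sentence
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)  # ['The dog', 'the bat']
```

The verb "killed" matches no rule, so it stays outside any chunk.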

Example

import nltk
nltk.download('all-nltk')
print("\n")

# Tokenization: split the text into word and punctuation tokens
print("Creating tokens from the text:")
from nltk.tokenize import word_tokenize
text = "My name is Adithya Challa I wrote this shot!"
tokenize_word = word_tokenize(text)
print(tokenize_word)
print("\n")

# Stemming: reduce each word to its stem with the Porter algorithm
print("Stemming:")
from nltk.stem import PorterStemmer
words = ["light", "lighting", "lights"]
ps = PorterStemmer()
for w in words:
    rootword = ps.stem(w)
    print(rootword)
print("\n")

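# Lemmatization: convert an inflected verb form into its base form (lemma)
print("Lemmatization: convert an inflected verb form into its base form:")
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()
print(lem.lemmatize("playing", pos="v"))  # pos="v" treats the word as a verb; the default is noun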
print("\n")

# POS tagging: label each token with its part of speech
print("POS tagging:")
from nltk import word_tokenize, pos_tag
text = "My name is Adithya Challa I wrote this shot!"
print(pos_tag(word_tokenize(text)))

Implementation of NLP preprocessing steps

Explanation

  • Lines 1 and 2: We import the nltk package and download the data resources (corpora and models) that the later steps need.

  • Lines 7–9: We import word_tokenize from nltk.tokenize and use it to split the string into tokens.

  • Lines 15–20: We import PorterStemmer from nltk.stem and use it to strip suffixes from each word to obtain its stem.

  • Lines 25–27: We import WordNetLemmatizer from nltk.stem, create a lemmatizer, and use it to reduce a word to its lemma.

  • Lines 32–34: We import word_tokenize and pos_tag from nltk, tokenize the text, and tag each token with its part of speech.
