NLTK (Natural Language Toolkit) is a Python library widely used in natural language processing (NLP) research, designed for tasks such as text processing, tokenization, and parsing.
Tokenization in NLTK means breaking text down into individual tokens, such as words, which can then be processed further for tasks like text analysis and language modeling.
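As a quick illustration, the following minimal sketch tokenizes one of the example phrases used later in this answer (it assumes the punkt resource is available for download):

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

# Split the text into individual word and punctuation tokens
tokens = word_tokenize("Bought a new laptop.")
print(tokens)  # ['Bought', 'a', 'new', 'laptop', '.']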
Chunking is the process of grouping words into chunks based on their syntactic role. It recognizes and extracts specified patterns or phrases from a sentence or text document.
NLTK supports flexible chunking, allowing users to construct patterns using regular expressions or predefined grammars, such as noun phrase chunking and verb phrase chunking.
A noun phrase is built around a noun, e.g., “A new laptop”.
A verb phrase is built around a verb, e.g., “Bought a new laptop”.
Chunking uses part-of-speech (POS) tags to group words and applies chunk tags to those groups. Because chunks cannot overlap, a single occurrence of a word can appear in only one chunk at a time.
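Before chunking, each token needs a POS tag. Here is a small sketch of tagging with pos_tag (exact tags can vary slightly across tagger versions):

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Assign a part-of-speech tag to each token,
# e.g., DT (determiner), JJ (adjective), NN (noun)
tokens = word_tokenize("Bought a new laptop.")
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)
# Expected output (tags may vary by tagger version):
# [('Bought', 'VBD'), ('a', 'DT'), ('new', 'JJ'), ('laptop', 'NN'), ('.', '.')]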
Chunk patterns
We will be using these chunking patterns in our coding example.
chunk_patterns = r"""
NP: {<DT>?<JJ>*<NN>}
VP: {<VB.*><NP|PP>}
"""
NP represents the noun phrase chunk pattern. It uses an optional determiner (DT), followed by zero or more adjectives (JJ), and a noun (NN).
VP represents the verb phrase chunk pattern. It uses a verb (VB followed by any suffix, .*), followed by a noun phrase (NP) or a prepositional phrase (PP).
Note: A chunk grammar is a set of rules that defines how sentences should be chunked. It often uses regular expressions, or regexes.
Furthermore, NLTK provides a module called nltk.chunk that includes the RegexpParser class, which allows you to define chunk patterns using regular expressions.
You can find a basic example of how to perform chunking using NLTK below:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

from nltk.chunk import RegexpParser
from nltk.tokenize import word_tokenize

# Example sentence
sentence = "Educative Answers is a free web encyclopedia written by devs for devs."

# Tokenization
tokens = word_tokenize(sentence)

# POS tagging
pos_tags = nltk.pos_tag(tokens)

# Chunking patterns
chunk_patterns = r"""
NP: {<DT>?<JJ>*<NN>}   # Chunk noun phrases
VP: {<VB.*><NP|PP>}    # Chunk verb phrases
"""

# Create a chunk parser
chunk_parser = RegexpParser(chunk_patterns)

# Perform chunking
result = chunk_parser.parse(pos_tags)

# Print the chunked result
print(result)
Line 1: Firstly, we import the nltk module.
Lines 2–3: Then, we download the required resources for tokenization and POS tagging.
Lines 5–6: In these lines, we import the necessary names from NLTK: RegexpParser and word_tokenize.
Line 9: We put the example sentence for chunking in a variable.
Line 12: We use the tokens variable to store the resulting tokens from the word_tokenize() function, which tokenizes the example sentence.
Line 15: Now, we use the pos_tag() function, which assigns a part-of-speech tag to each token in tokens. The pos_tags variable stores the result.
Lines 18–21: In these lines, we use the chunk_patterns variable to hold the two chunking patterns.
Line 24: We use the RegexpParser class to create a chunk parser, passing chunk_patterns as an argument.
Line 27: Now, we call the parse() method of the chunk parser with pos_tags as the argument.
Line 30: Finally, we print the result to display the extracted chunks from the sentence based on the defined patterns.
The output shows the sentence “Educative Answers is a free web encyclopedia written by devs for devs.” chunked into noun phrases (NP) and verb phrases (VP) based on the defined patterns.
Let's understand what happens in the output: the parse() method returns an nltk.Tree whose root is labeled S. Each sequence of tags that matches a pattern appears as a labeled NP or VP subtree, while tokens that do not match any pattern remain directly under the root.
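If you want to work with the chunks programmatically rather than just printing the tree, one option is to walk its subtrees. Below is a minimal sketch (not part of the original example) that collects the words of each NP chunk:

# result is the tree returned by chunk_parser.parse(pos_tags) above
for subtree in result.subtrees():
    if subtree.label() == 'NP':
        # Join the words of this chunk, dropping the POS tags
        phrase = ' '.join(word for word, tag in subtree.leaves())
        print(phrase)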
To conclude, chunking in NLTK is a powerful tool for analyzing sentences and extracting meaningful noun and verb phrases by grouping words based on their grammatical roles. It is essential to many NLP applications, enabling them to discover useful information and perform more complex language processing.