NLTK (Natural Language Toolkit) is a Python library widely used in natural language processing (NLP) research, designed for tasks such as text processing, tokenization, and parsing.
Tokenization in NLTK means breaking text down into individual tokens, such as words, which can then be processed further for tasks like text analysis and language modeling.
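As a quick illustration, the following minimal sketch tokenizes one of the example phrases used later in this answer (it assumes the punkt resource is available for download):

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

# Split the text into individual word and punctuation tokens
tokens = word_tokenize("Bought a new laptop.")
print(tokens)  # ['Bought', 'a', 'new', 'laptop', '.']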
Chunking is the process of grouping words into chunks based on their syntactic role. It recognizes and extracts specified patterns or phrases from a sentence or text document.
NLTK supports flexible chunking, allowing users to construct patterns using regular expressions or predefined grammars, such as noun phrase chunking and verb phrase chunking.
A noun phrase is built around a noun, e.g., “A new laptop”.
A verb phrase is built around a verb, e.g., “Bought a new laptop”.
Chunking uses part-of-speech (POS) tags to group words and applies chunk tags to those groups. Because chunks cannot overlap, a single occurrence of a word can appear in only one chunk at a time.
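Before chunking, each token needs a POS tag. Here is a small sketch of tagging with pos_tag (exact tags can vary slightly across tagger versions):

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Assign a part-of-speech tag to each token,
# e.g., DT (determiner), JJ (adjective), NN (noun)
tokens = word_tokenize("Bought a new laptop.")
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)
# Expected output (tags may vary by tagger version):
# [('Bought', 'VBD'), ('a', 'DT'), ('new', 'JJ'), ('laptop', 'NN'), ('.', '.')]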
Chunk patterns
We will be using these chunking patterns in our coding example.
chunk_patterns = r"""
NP: {<DT>?<JJ>*<NN>}
VP: {<VB.*><NP|PP>}
"""
NP represents the noun phrase chunk pattern. It uses an optional determiner (DT), followed by zero or more adjectives (JJ), and a noun (NN).
VP represents the verb phrase chunk pattern. It uses a verb (VB followed by any suffix, .*), followed by a noun phrase (NP) or a prepositional phrase (PP).
Note: A chunk grammar is a set of rules that defines how sentences should be chunked. It often uses regular expressions, or regexes.
Furthermore, NLTK provides a module called nltk.chunk that includes the RegexpParser class, which allows you to define chunk patterns using regular expressions.
You can find a basic example of how to perform chunking using NLTK below:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

from nltk.chunk import RegexpParser
from nltk.tokenize import word_tokenize

# Example sentence
sentence = "Educative Answers is a free web encyclopedia written by devs for devs."

# Tokenization
tokens = word_tokenize(sentence)

# POS tagging
pos_tags = nltk.pos_tag(tokens)

# Chunking patterns
chunk_patterns = r"""
NP: {<DT>?<JJ>*<NN>}   # Chunk noun phrases
VP: {<VB.*><NP|PP>}    # Chunk verb phrases
"""

# Create a chunk parser
chunk_parser = RegexpParser(chunk_patterns)

# Perform chunking
result = chunk_parser.parse(pos_tags)

# Print the chunked result
print(result)
Line 1: Firstly, we import the nltk module.
Lines 2–3: Then, we download the required resources for tokenization and POS tagging.
Lines 5–6: In these lines, we import the necessary names from NLTK: RegexpParser and word_tokenize.
Line 9: We put the example sentence for chunking in a variable.
Line 12: We use the tokens variable to store the resulting tokens from the word_tokenize() function, which tokenizes the example sentence.
Line 15: Now, we use the pos_tag() function, which assigns a part-of-speech tag to each token in tokens. The pos_tags variable stores the result.
Lines 18–21: In these lines, we use the chunk_patterns variable to hold the two chunking patterns.
Line 24: We use the RegexpParser class to create a chunk parser, passing chunk_patterns as an argument.
Line 27: Now, we call the parse() method of the chunk parser with pos_tags as the argument.
Line 30: Finally, we print the result to display the extracted chunks from the sentence based on the defined patterns.
The output shows the sentence “Educative Answers is a free web encyclopedia written by devs for devs.” chunked into noun phrases (NP) and verb phrases (VP) based on the defined patterns.
Let's understand what happens in the output: the parse() method returns an nltk.Tree whose root is labeled S. Each sequence of tags that matches a pattern appears as a labeled NP or VP subtree, while tokens that do not match any pattern remain directly under the root.
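If you want to work with the chunks programmatically rather than just printing the tree, one option is to walk its subtrees. Below is a minimal sketch (not part of the original example) that collects the words of each NP chunk:

# result is the tree returned by chunk_parser.parse(pos_tags) above
for subtree in result.subtrees():
    if subtree.label() == 'NP':
        # Join the words of this chunk, dropping the POS tags
        phrase = ' '.join(word for word, tag in subtree.leaves())
        print(phrase)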
To conclude, chunking in NLTK is a powerful tool for analyzing sentences and extracting meaningful noun and verb phrases by grouping words based on their grammatical roles. It is essential to many NLP applications, enabling them to discover useful information and perform more complex language processing.