Text summarization in spaCy and NLTK

Text summarization, a subfield of NLP, condenses an enormous set of documents into their most important points. Among the many popular NLP libraries, two widely used ones are spaCy and NLTK. Before writing any code, let's first understand the general approach of a text summarizer and the logic we will follow throughout.

While performing text summarization, the first step is text cleaning, more generally called text preprocessing. In this step, we perform tasks like removing punctuation, converting the text to lowercase, and handling special characters. Next, we split the text into sentences and then further into words. We then compute a word-frequency count, rank the sentences by the frequencies of the words they contain, and include the highest-ranked sentences in the final summary.
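To make this concrete before bringing in any library, here is a minimal sketch of the same pipeline in plain Python. The summarize helper, its regular expressions, and its scoring are simplifying assumptions for illustration, not what spaCy or NLTK do internally.

from collections import Counter
import re

def summarize(text, num_sentences=2):
    # Split into sentences on terminal punctuation (a crude assumption)
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Word-frequency count over lowercased words
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    # Score each sentence by the summed frequency of its words
    scores = {s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())) for s in sentences}
    # Keep the highest-scoring sentences as the summary
    return " ".join(sorted(sentences, key=scores.get, reverse=True)[:num_sentences])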

Now, let's learn to code a text summarizer in spaCy and NLTK.

Text summarizer in spaCy

Importing libraries

Firstly, we will import spaCy, a popular Python library for natural language processing, along with other necessary modules.

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

Code explanation

  • Line 1: Imports the spaCy library.

  • Line 2–3: Imports the stop words from spaCy and the punctuation characters from the string module, respectively.

Preprocessing the corpus

In the second step, we prepare the corpus. With spaCy, this means lowercasing the text and tokenizing it; the stop words and punctuation are filtered out in the next step, when we count word frequencies.

# Sample input text
text = """In an attempt to build an AI-ready workforce, Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills. Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services. As part of the program, the Redmond giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses. The company will provide AI development tools and Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning. According to Manish Prakash, Country General Manager-PS, Health and Education, Microsoft India, said, With AI being the defining technology of our time, it is transforming lives and industry and the jobs of tomorrow will require a different skillset. This will require more collaborations and training and working with AI. That’s why it has become more critical than ever for educational institutions to integrate new cloud and AI technologies. The program is an attempt to ramp up the institutional set-up and build capabilities among the educators to educate the workforce of tomorrow. The program aims to build up the cognitive skills and in-depth understanding of developing intelligent cloud connected solutions for applications across industry. Earlier in April this year, the company announced Microsoft Professional Program In AI as a learning track open to the public. The program was developed to provide job ready skills to programmers who wanted to hone their skills in AI and data science with a series of online courses which featured hands-on labs and expert instructors as well. This program also included developer-focused AI school that provided a bunch of assets to help build AI skills."""
# Create a blank English pipeline and add a sentencizer for sentence boundaries
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
# Convert the text to lowercase and tokenize it
doc = nlp(text.lower())

Code explanation

  • Line 1–2: This is our sample text for summarization.

  • Line 4–5: Creates a blank English pipeline with spacy.blank("en") and adds a sentencizer component. A blank pipeline contains no pre-trained components, so without the sentencizer, iterating over doc.sents later would raise an error because no sentence boundaries are set.

  • Line 7: Processes the text variable with the nlp pipeline. The text.lower() call converts the input text to lowercase before tokenization.
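As a quick sanity check, you can inspect the first few tokens of the resulting Doc object; the slice size here is an arbitrary choice.

# Peek at the first ten tokens produced by the pipeline
print([token.text for token in doc[:10]])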

Word-frequency count

# Calculate word frequencies
word_frequencies = {}
for token in doc:
    if token.text not in STOP_WORDS and token.text not in punctuation:
        if token.text not in word_frequencies:
            word_frequencies[token.text] = 1
        else:
            word_frequencies[token.text] += 1

Code explanation

  • Line 2: Initializes an empty dictionary to store word frequencies.

  • Line 3–8: The loop iterates through each token in the doc object. The condition if token.text not in STOP_WORDS and token.text not in punctuation: checks that the token is neither a stop word (a common word like "the" or "and") nor a punctuation mark (e.g., a period or comma). For each token that passes this check, the code looks up the token's text in the word_frequencies dictionary: if it is not there yet, its count is set to 1; otherwise, its existing count is incremented.
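As an aside, the same count can be written more compactly with collections.Counter from the standard library; this is an equivalent sketch, not part of the original snippet.

from collections import Counter

# Count every token that is neither a stop word nor punctuation
word_frequencies = Counter(
    token.text for token in doc
    if token.text not in STOP_WORDS and token.text not in punctuation
)

Since Counter is a dict subclass, the rest of the code works unchanged.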

Summarized text

# Rank sentences by the summed frequencies of their words
sorted_sentences = sorted(
    doc.sents,
    key=lambda sent: sum(word_frequencies[token.text]
                         for token in sent
                         if token.text in word_frequencies),
    reverse=True
)
# Join the three highest-ranked sentences into the summary
summary = " ".join(sent.text for sent in sorted_sentences[:3])

Code explanation

The code ranks the sentences by summing the frequencies of the significant words each one contains, then selects the three highest-scoring sentences to form the summary, providing a concise representation of the original text that highlights its key points.
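One caveat: sorting by score reorders the sentences. If you would rather keep the selected sentences in their original document order, a small variation (assuming the sorted_sentences list from above) does the trick:

# Pick the three highest-scoring sentences, then restore document order
top_sents = sorted(sorted_sentences[:3], key=lambda sent: sent.start)
summary = " ".join(sent.text for sent in top_sents)
print(summary)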

Text summarizer in NLTK

Importing libraries

Firstly, we will import NLTK, a popular Python library for natural language processing, along with other necessary modules.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from string import punctuation
from heapq import nlargest

# Download the required tokenizer models and stop word list (only needed once)
nltk.download("punkt")
nltk.download("stopwords")

Code explanation

  • Line 1–5: Imports the NLTK library, the stop words, the sentence and word tokenizers, the punctuation characters from the string module, and nlargest from the heapq module.

  • Line 8–9: Downloads the tokenizer models and the stop word list that NLTK needs at runtime; this only has to be done once per environment.

Preprocessing the corpus

In the second step, we process the corpus, that is, we remove the stopwords and punctuation from the text.

text = """In an attempt to build an AI-ready workforce, Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills. Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services. As part of the program, the Redmond giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses. The company will provide AI development tools and Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning. According to Manish Prakash, Country General Manager-PS, Health and Education, Microsoft India, said, With AI being the defining technology of our time, it is transforming lives and industry and the jobs of tomorrow will require a different skillset. This will require more collaborations and training and working with AI. That’s why it has become more critical than ever for educational institutions to integrate new cloud and AI technologies. The program is an attempt to ramp up the institutional set-up and build capabilities among the educators to educate the workforce of tomorrow. The program aims to build up the cognitive skills and in-depth understanding of developing intelligent cloud connected solutions for applications across industry. Earlier in April this year, the company announced Microsoft Professional Program In AI as a learning track open to the public. The program was developed to provide job ready skills to programmers who wanted to hone their skills in AI and data science with a series of online courses which featured hands-on labs and expert instructors as well. This program also included developer-focused AI school that provided a bunch of assets to help build AI skills."""
# Convert text to lowercase
text = text.lower()
# Tokenize the text into sentences
sentences = sent_tokenize(text)
# Tokenize the sentences into words
words = [word_tokenize(sentence) for sentence in sentences]
# Flatten the list of words
words = [word for sublist in words for word in sublist]
# Remove stopwords and punctuation
stop_words = set(stopwords.words("english") + list(punctuation))
words = [word for word in words if word not in stop_words]

Code explanation

  • Line 3: The code uses text.lower() to convert the entire text to lowercase, so that all further operations are case-insensitive.

  • Line 5–9: The code uses sent_tokenize(text) from NLTK to split the text into individual sentences, producing the sentences list. A list comprehension then tokenizes each sentence into words with word_tokenize(sentence), giving a list of lists, which is flattened so that words holds every word of the text in a single flat list.

  • Line 11–12: The code builds the filter set with set(stopwords.words("english") + list(punctuation)), combining the English stop words with the punctuation characters. A list comprehension then removes these from words, leaving only meaningful words.

Word-frequency count

# Calculate word frequencies
word_frequencies = nltk.FreqDist(words)
# Calculate sentence scores based on word frequencies
sentence_scores = {}
for sentence in sentences:
    for word in word_tokenize(sentence):
        if word in word_frequencies:
            if sentence not in sentence_scores:
                sentence_scores[sentence] = word_frequencies[word]
            else:
                sentence_scores[sentence] += word_frequencies[word]

Code explanation

The code computes word frequencies with nltk.FreqDist() and then scores each sentence based on the presence and frequency of significant words it contains. The resulting sentence_scores dictionary is used to rank the sentences and select the most important ones for the summary.
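For reference, the same scoring can be condensed into a single dictionary comprehension; this is an equivalent sketch rather than the tutorial's version.

# Score each sentence by summing the frequencies of its significant words
sentence_scores = {
    sentence: sum(word_frequencies[word]
                  for word in word_tokenize(sentence)
                  if word in word_frequencies)
    for sentence in sentences
}

Unlike the loop, this also assigns a score of 0 to sentences that contain no significant words, which does not change the top-three selection here.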

Summarized text

# Get the top 3 sentences with highest scores as the summary
summary_sentences = nlargest(3, sentence_scores, key=sentence_scores.get)
# Generate summary
summary = " ".join(summary_sentences)

Code explanation

The code extracts the three sentences with the highest scores from the sentence_scores dictionary and joins them to create the final summary, a concise and informative representation of the text's key points.
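Putting the NLTK steps together, here is a hypothetical summarize_text helper that wraps the whole flow into one reusable function. The function name and the num_sentences parameter are illustrative choices, and tokenizing sentences before lowercasing keeps the original casing in the output.

def summarize_text(text, num_sentences=3):
    # Split into sentences first so the summary keeps its original casing
    sentences = sent_tokenize(text)
    # Count significant words across the lowercased text
    stop_words = set(stopwords.words("english") + list(punctuation))
    word_frequencies = nltk.FreqDist(
        w for w in word_tokenize(text.lower()) if w not in stop_words
    )
    # Score each sentence by the summed frequency of its significant words
    sentence_scores = {
        s: sum(word_frequencies[w] for w in word_tokenize(s.lower()) if w in word_frequencies)
        for s in sentences
    }
    # Join the highest-scoring sentences into the summary
    return " ".join(nlargest(num_sentences, sentence_scores, key=sentence_scores.get))

print(summarize_text(text))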

Conclusion

In conclusion, extractive text summarization with spaCy and NLTK follows the same simple recipe: preprocess the text, count word frequencies, score each sentence by the words it contains, and keep the highest-scoring sentences. Both libraries make every step straightforward, so a handful of Python lines is enough to condense a long document into its key points, and the scoring can easily be adapted to your own needs.
