Text summarization, a subdomain of NLP, offers a shortcut to reading an enormous set of documents. Among the many popular NLP libraries, two widely used ones are spaCy and NLTK. Before writing any code, let's first understand the general approach of a text summarizer and the logic we will follow throughout.
While performing text summarization, the initial step is text cleaning which is more generally called the text preprocessing step. In this cleaning step, we perform tasks like removing punctuation, converting to lowercase, and handling special characters. The next step is to split our text into sentences and then further into words. After that, we obtain a word-frequency count on whose basis the sentences are ranked. The most important sentences are identified, which are then included in the final summary.
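Before introducing either library, the steps above can be sketched with nothing but the Python standard library. The sentence splitting and tokenization here are deliberately naive stand-ins for what spaCy and NLTK will do properly, and the toy text and stop-word set are made up for illustration:

```python
from collections import Counter

text = ("Cats are small. Cats are friendly animals. "
        "Dogs are loyal. Cats and dogs are popular pets.")

# Step 1: naive sentence split (real tokenizers handle abbreviations, etc.)
sentences = [s.strip() for s in text.split(".") if s.strip()]

# Step 2: word-frequency count over lowercased words, skipping stop words
words = text.lower().replace(".", "").split()
stop_words = {"are", "and"}
frequencies = Counter(w for w in words if w not in stop_words)

# Step 3: score each sentence by the frequencies of its words
scores = {s: sum(frequencies.get(w, 0) for w in s.lower().split())
          for s in sentences}

# Step 4: the highest-scoring sentence becomes a one-line summary
summary = max(scores, key=scores.get)
print(summary)  # Cats and dogs are popular pets
```

The two library-based versions that follow implement exactly this pipeline, replacing the naive splitting and counting with proper tokenizers and stop-word lists.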
Now, let's learn to code a text summarizer in spaCy and NLTK.
Firstly, we will import spaCy, a popular Python library for natural language processing tasks, along with the other necessary modules.
```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
```
Line 1: Imports the spaCy library.
Line 2–3: Imports the stop words from spaCy and the punctuation characters from the string module, respectively.
In the second step, we process the corpus, that is, we remove the stop words and punctuation from the text.
```python
# Sample input text
text = """In an attempt to build an AI-ready workforce, Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills. Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services. As part of the program, the Redmond giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses. The company will provide AI development tools and Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning. According to Manish Prakash, Country General Manager-PS, Health and Education, Microsoft India, said, With AI being the defining technology of our time, it is transforming lives and industry and the jobs of tomorrow will require a different skillset. This will require more collaborations and training and working with AI. That’s why it has become more critical than ever for educational institutions to integrate new cloud and AI technologies. The program is an attempt to ramp up the institutional set-up and build capabilities among the educators to educate the workforce of tomorrow. The program aims to build up the cognitive skills and in-depth understanding of developing intelligent cloud connected solutions for applications across industry. Earlier in April this year, the company announced Microsoft Professional Program In AI as a learning track open to the public. The program was developed to provide job ready skills to programmers who wanted to hone their skills in AI and data science with a series of online courses which featured hands-on labs and expert instructors as well. This program also included developer-focused AI school that provided a bunch of assets to help build AI skills."""
# Load the English language model
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # a blank pipeline has no components, so add one to detect sentence boundaries
# Convert text to lowercase and tokenize without removing stop words and punctuation
doc = nlp(text.lower())
```
Line 1–2: This is our sample text for summarization.
Line 4–5: These lines set up the spaCy pipeline. The blank("en") method creates an empty English pipeline without any pre-existing components, so we add a sentencizer component to mark sentence boundaries, which the ranking step later relies on through doc.sents.
Line 7: This line processes the text variable using the spaCy language model nlp. The text.lower() call converts the input text to lowercase before tokenization.
```python
# Calculate word frequencies
word_frequencies = {}
for token in doc:
    if token.text not in STOP_WORDS and token.text not in punctuation:
        if token.text not in word_frequencies:
            word_frequencies[token.text] = 1
        else:
            word_frequencies[token.text] += 1
```
Line 2: Initializes an empty dictionary to store word frequencies.
Line 3–8: The code loops over each token in the doc object. The condition `if token.text not in STOP_WORDS and token.text not in punctuation:` checks that the token is neither a stop word (common words like "the," "and," etc.) nor a punctuation mark (e.g., period, comma). For each token that passes this check, the code looks up the token's text in the word_frequencies dictionary: if the word is not yet present, its count is initialized to 1; otherwise, its count is incremented.
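This manual counting loop is equivalent to Python's built-in collections.Counter. A small library-free check of the same logic, with a toy token list standing in for the spaCy doc and a made-up stop-word set:

```python
from collections import Counter
from string import punctuation

# Toy stand-ins for the spaCy doc tokens and STOP_WORDS
tokens = ["microsoft", "announced", "a", "program", ",", "a", "program", "for", "ai", "."]
stop_words = {"a", "for"}

# Manual loop, as in the tutorial
word_frequencies = {}
for token in tokens:
    if token not in stop_words and token not in punctuation:
        word_frequencies[token] = word_frequencies.get(token, 0) + 1

# Counter one-liner producing the same counts
counted = Counter(t for t in tokens if t not in stop_words and t not in punctuation)

print(word_frequencies)  # {'microsoft': 1, 'announced': 1, 'program': 2, 'ai': 1}
```

Either form works; the explicit loop simply makes the counting logic visible step by step.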
```python
# Rank the sentences by the total frequency of the words they contain
sorted_sentences = sorted(
    doc.sents,
    key=lambda sent: sum(word_frequencies[token.text] for token in sent if token.text in word_frequencies),
    reverse=True,
)

# Join the three highest-ranked sentences into the summary
summary = " ".join(sent.text for sent in sorted_sentences[:3])
```
The code calculates the word frequencies for each word in the text, then ranks the sentences based on the importance of the words they contain. The top three most important sentences are selected to form the summary, providing a concise representation of the original text while highlighting its key points.
Firstly, we will import NLTK, a popular Python library for natural language processing tasks, along with the other necessary modules.
```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from string import punctuation
from heapq import nlargest
```
Line 1–5: Imports the NLTK library, the stop words, the sentence and word tokenizers, the punctuation characters from the string module, and the nlargest function from the heapq module.
In the second step, we process the corpus, that is, we remove the stopwords and punctuation from the text.
```python
text = """In an attempt to build an AI-ready workforce, Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills. Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services. As part of the program, the Redmond giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses. The company will provide AI development tools and Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning.
According to Manish Prakash, Country General Manager-PS, Health and Education, Microsoft India, said, With AI being the defining technology of our time, it is transforming lives and industry and the jobs of tomorrow will require a different skillset. This will require more collaborations and training and working with AI. That’s why it has become more critical than ever for educational institutions to integrate new cloud and AI technologies. The program is an attempt to ramp up the institutional set-up and build capabilities among the educators to educate the workforce of tomorrow. The program aims to build up the cognitive skills and in-depth understanding of developing intelligent cloud connected solutions for applications across industry. Earlier in April this year, the company announced Microsoft Professional Program In AI as a learning track open to the public. The program was developed to provide job ready skills to programmers who wanted to hone their skills in AI and data science with a series of online courses which featured hands-on labs and expert instructors as well.
This program also included developer-focused AI school that provided a bunch of assets to help build AI skills."""

# Convert text to lowercase
text = text.lower()

# Tokenize the text into sentences
sentences = sent_tokenize(text)

# Tokenize the sentences into words
words = [word_tokenize(sentence) for sentence in sentences]

# Flatten the list of words
words = [word for sublist in words for word in sublist]

# Remove stopwords and punctuation
stop_words = set(stopwords.words("english") + list(punctuation))
words = [word for word in words if word not in stop_words]
```
Line 6: The code uses text.lower() to convert the entire text to lowercase. This step ensures that further operations are case-insensitive.
Line 7–12: The code uses sent_tokenize(text) from NLTK to tokenize the text into individual sentences; the resulting sentences variable is a list. A list comprehension then tokenizes each sentence in sentences into words using word_tokenize(sentence). The resulting words variable is a list of lists, where each inner list contains the words of one sentence.
Line 14–19: The code first flattens the list of lists into a new list called words, where each element is an individual word from the original sentences. It then builds the set of English stop words and punctuation characters with set(stopwords.words("english") + list(punctuation)) and filters them out of the word list using a list comprehension. The resulting words list contains only meaningful words, without stop words and punctuation.
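The flattening and filtering idioms are easiest to see on a tiny input. Here is a library-free check with toy data standing in for the real tokenizer output:

```python
# Toy output of word_tokenize applied to two sentences
words = [["the", "cat", "sat", "."], ["the", "dog", "ran", "."]]

# Flatten the list of lists into a single list of words
flat = [word for sublist in words for word in sublist]

# Filter out stop words and punctuation
stop_words = {"the", "."}
meaningful = [word for word in flat if word not in stop_words]

print(meaningful)  # ['cat', 'sat', 'dog', 'ran']
```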
```python
# Calculate word frequencies
word_frequencies = nltk.FreqDist(words)

# Calculate sentence scores based on word frequencies
sentence_scores = {}
for sentence in sentences:
    for word in word_tokenize(sentence):
        if word in word_frequencies:
            if sentence not in sentence_scores:
                sentence_scores[sentence] = word_frequencies[word]
            else:
                sentence_scores[sentence] += word_frequencies[word]
```
The code calculates word frequencies using nltk.FreqDist() and then computes sentence scores based on the presence and frequency of significant words in each sentence. The resulting sentence_scores dictionary is used to rank the sentences and select the most important ones for the summary.
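nltk.FreqDist behaves much like the standard library's collections.Counter, so the scoring loop can be checked without NLTK by using Counter and a whitespace split in place of word_tokenize. The sentences and word list here are toy data:

```python
from collections import Counter

# Toy stand-ins for the tokenized sentences and filtered word list
sentences = ["ai is transforming industry", "the program supports students", "ai skills matter"]
words = ["ai", "transforming", "industry", "program", "supports", "students", "ai", "skills"]

word_frequencies = Counter(words)  # stands in for nltk.FreqDist(words)

# Score each sentence by summing the frequencies of its significant words
sentence_scores = {}
for sentence in sentences:
    for word in sentence.split():  # stands in for word_tokenize(sentence)
        if word in word_frequencies:
            sentence_scores[sentence] = sentence_scores.get(sentence, 0) + word_frequencies[word]

print(sentence_scores)
```

Note that words absent from the frequency table (stop words like "is" and "the" here) contribute nothing to a sentence's score, which is exactly how the stop-word removal influences the ranking.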
```python
# Get the top 3 sentences with the highest scores as the summary
summary_sentences = nlargest(3, sentence_scores, key=sentence_scores.get)

# Generate the summary
summary = " ".join(summary_sentences)
```
The code extracts the top three sentences with the highest scores from the sentence_scores dictionary and then joins them to create the final summary. The summary represents the most important information extracted from the original text in a concise, informative form.
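Since heapq.nlargest is part of the standard library, the selection step is easy to verify on its own with a made-up score dictionary:

```python
from heapq import nlargest

# Hypothetical sentence scores for illustration
sentence_scores = {
    "AI is transforming industry.": 9,
    "The program supports students.": 6,
    "Skills matter.": 3,
    "Weather was mild.": 1,
}

# Pick the two highest-scoring sentences, returned in descending score order
summary_sentences = nlargest(2, sentence_scores, key=sentence_scores.get)
summary = " ".join(summary_sentences)
print(summary)  # AI is transforming industry. The program supports students.
```

Passing a dictionary iterates over its keys, and the key function maps each sentence to its score, so nlargest does the ranking and truncation in one call.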
In conclusion, both spaCy and NLTK make it straightforward to build a frequency-based extractive summarizer: clean and tokenize the text, count word frequencies, score each sentence by the words it contains, and keep the top-ranked sentences. The same pipeline generalizes to other corpora, and the number of selected sentences can be tuned to control the summary's length.