Text summarization with sumy

Key takeaways

  • Text summarization condenses large amounts of information into concise summaries, helping users quickly grasp main points.

  • Sumy is a Python library that provides various algorithms for automatic text summarization, including LSA, LexRank, and Luhn.

  • Sumy can be installed via pip; it also depends on the NLTK library, so NLTK and its tokenizer data must be installed as well.

  • Summarizing with LSA analyzes relationships between terms and documents; good for topic distribution but computationally intensive.

  • Summarizing with LexRank is a graph-based approach similar to PageRank; effective for identifying central sentences but may vary in performance based on text structure.

  • Summarizing with Luhn is simple and fast; focuses on word frequency but may miss broader context in longer texts.

  • Each algorithm has its strengths and weaknesses, which should be considered based on the specific text and summarization needs.

In today’s world, we’re overwhelmed by more text than we can read—from news articles and research papers to blog posts and social media updates. Text summarization helps by condensing large amounts of text into short, relevant summaries, making it easier to quickly understand the main points. Sumy is a helpful Python library that simplifies this process, offering an easy and effective way to automatically summarize text.

Text summarization process

In this Answer, we will explore the capabilities of sumy, demonstrate how to install and use it, provide executable examples, and compare different summarization algorithms to help us choose the right one for our needs.

What is sumy?

Sumy is a Python library that provides various algorithms for text summarization. It supports several methods, including:

  • LSA (latent semantic analysis)

  • LexRank

  • Luhn

  • Edmundson

  • TextRank

  • KLSum

  • Reduction

Each algorithm has its own approach to extracting the most important sentences from a given text. We’ll demonstrate the first three, which are among the most widely used.

Installing sumy on a local machine

Before we can use sumy, we need to install it. Sumy can be installed using pip:

pip install sumy

Additionally, you may need to install the NLTK library and its datasets, which sumy relies on:

pip install nltk

After installing the necessary packages, open a Python interpreter or script and download the NLTK data:

import nltk
nltk.download('punkt')
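Note: Some newer NLTK releases (3.8.2 and later) package the punkt sentence tokenizer data separately as punkt_tab. If tokenization later fails with a resource lookup error, downloading it as well should fix the problem:

import nltk
nltk.download('punkt_tab')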

Using sumy for text summarization

Let’s start by summarizing a sample text using different algorithms provided by sumy. We’ll use a simple text for demonstration purposes.

Basic setup

We need to import the necessary modules. After that, we define a text variable containing the example text and set up the parser.

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
# Example text for better demonstration
text = """
The quick brown fox jumps over the lazy dog. The dog, being very lazy, doesn't move.
The fox, enjoying its agility, continues to jump over the dog. Meanwhile, the dog just yawns and stretches.
This pattern continues for a while, making the fox quite happy and the dog remain lazy.
On the other side of the forest, a squirrel and a rabbit play together. They chase each other around the trees,
occasionally stopping to nibble on nuts and grass. The squirrel shows off its acrobatic skills, jumping from branch to branch,
while the rabbit bounds gracefully across the clearing.
As evening falls, all the animals settle down. The fox curls up under a tree, the squirrel finds a cozy spot in the branches,
and the rabbit digs a shallow burrow for the night. The lazy dog simply stretches out on the cool grass, content and peaceful.
"""
# Create a parser
parser = PlaintextParser.from_string(text, Tokenizer("english"))
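As a quick sanity check, we can confirm that the parser split the text as expected; sumy's parsed document exposes its sentences as a sequence:

# The parsed document holds a tuple of Sentence objects
print(len(parser.document.sentences), "sentences parsed")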

Summarizing with LSA

Latent semantic analysis (LSA) is a widely used NLP technique. It analyzes the relationships between a collection of documents and the terms they contain, using word frequency and co-occurrence to uncover underlying meanings and patterns. It is one of the algorithms sumy supports. Here’s how you can use it:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
# Example text for better demonstration
text = """
The quick brown fox jumps over the lazy dog. The dog, being very lazy, doesn't move.
The fox, enjoying its agility, continues to jump over the dog. Meanwhile, the dog just yawns and stretches.
This pattern continues for a while, making the fox quite happy and the dog remain lazy.
On the other side of the forest, a squirrel and a rabbit play together. They chase each other around the trees,
occasionally stopping to nibble on nuts and grass. The squirrel shows off its acrobatic skills, jumping from branch to branch,
while the rabbit bounds gracefully across the clearing.
As evening falls, all the animals settle down. The fox curls up under a tree, the squirrel finds a cozy spot in the branches,
and the rabbit digs a shallow burrow for the night. The lazy dog simply stretches out on the cool grass, content and peaceful.
"""
# Create a parser
parser = PlaintextParser.from_string(text, Tokenizer("english"))
# Create an LSA Summarizer
lsa_summarizer = LsaSummarizer()
# Summarize the text
lsa_summary = lsa_summarizer(parser.document, 4) # Summarize to 4 sentences
# Print the summary
print("Summary:")
for sentence in lsa_summary:
    print(sentence)

Code explanation

Here is the explanation of the code above.

  • Line 3: This line imports the LSA summarizer class for performing text summarization.

  • Line 18: This line creates an instance of the LSA summarizer.

  • Line 20: This line summarizes the parsed text into four sentences using the LSA summarizer.

  • Lines 22–24: These lines print each sentence of the generated summary.
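By default, the summarizer works directly on raw tokens. Sumy also ships a Stemmer class and a get_stop_words helper, which usually sharpen LSA’s term analysis by collapsing related word forms and ignoring function words. Here’s the same setup with both enabled:

from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.lsa import LsaSummarizer
# Stem tokens so related word forms ("jump", "jumping") count as one term
stemmer = Stemmer("english")
lsa_summarizer = LsaSummarizer(stemmer)
# Skip common function words when building the term-sentence matrix
lsa_summarizer.stop_words = get_stop_words("english")
lsa_summary = lsa_summarizer(parser.document, 4)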

Summarizing with LexRank

LexRank is a graph-based algorithm for text summarization. It represents the sentences of a document as nodes in a graph, connects them with edges weighted by sentence similarity, and ranks each sentence’s importance from this graph, much like PageRank ranks web pages. Here’s how to use it:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
# Example text for better demonstration
text = """
The quick brown fox jumps over the lazy dog. The dog, being very lazy, doesn't move.
The fox, enjoying its agility, continues to jump over the dog. Meanwhile, the dog just yawns and stretches.
This pattern continues for a while, making the fox quite happy and the dog remain lazy.
On the other side of the forest, a squirrel and a rabbit play together. They chase each other around the trees,
occasionally stopping to nibble on nuts and grass. The squirrel shows off its acrobatic skills, jumping from branch to branch,
while the rabbit bounds gracefully across the clearing.
As evening falls, all the animals settle down. The fox curls up under a tree, the squirrel finds a cozy spot in the branches,
and the rabbit digs a shallow burrow for the night. The lazy dog simply stretches out on the cool grass, content and peaceful.
"""
# Create a parser
parser = PlaintextParser.from_string(text, Tokenizer("english"))
# Create a LexRank Summarizer
lex_rank_summarizer = LexRankSummarizer()
# Summarize the text
lex_rank_summary = lex_rank_summarizer(parser.document, 4) # Summarize to 4 sentences
# Print the summary
print("Summary:")
for sentence in lex_rank_summary:
    print(sentence)

Code explanation

Here is the explanation of the code above.

  • Line 3: This line imports the LexRank summarizer class for performing text summarization.

  • Line 18: This line creates an instance of the LexRank summarizer.

  • Line 20: This line summarizes the parsed text into four sentences using the LexRank summarizer.

  • Lines 22–24: These lines print each sentence of the generated summary.
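To make the graph-based idea concrete, here is a minimal, self-contained sketch of LexRank’s core mechanics (bag-of-words cosine similarity plus PageRank-style power iteration). It illustrates the concept only; sumy’s actual implementation is more refined:

from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog, being very lazy, doesn't move.",
    "A squirrel and a rabbit play together.",
]
bags = [Counter(s.lower().split()) for s in sentences]
n = len(bags)
# Build the similarity graph: one node per sentence, edges weighted by similarity
sim = [[cosine(bags[i], bags[j]) for j in range(n)] for i in range(n)]
# Power iteration: repeatedly pass scores along edges, as PageRank does
scores = [1.0 / n] * n
for _ in range(20):
    scores = [sum(sim[j][i] * scores[j] for j in range(n)) for i in range(n)]
    total = sum(scores)
    scores = [s / total for s in scores]
# The highest-scoring sentence is the most "central" one
best_score, best_sentence = max(zip(scores, sentences))
print(best_sentence)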

Summarizing with Luhn

Luhn is one of the earliest algorithms for automatic text summarization. It identifies the most important sentences by analyzing how often significant words occur in the text and how closely they cluster together within each sentence, on the premise that sentences dense with informative words best represent the main content. Here’s how you can use it:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer
# Example text for better demonstration
text = """
The quick brown fox jumps over the lazy dog. The dog, being very lazy, doesn't move.
The fox, enjoying its agility, continues to jump over the dog. Meanwhile, the dog just yawns and stretches.
This pattern continues for a while, making the fox quite happy and the dog remain lazy.
On the other side of the forest, a squirrel and a rabbit play together. They chase each other around the trees,
occasionally stopping to nibble on nuts and grass. The squirrel shows off its acrobatic skills, jumping from branch to branch,
while the rabbit bounds gracefully across the clearing.
As evening falls, all the animals settle down. The fox curls up under a tree, the squirrel finds a cozy spot in the branches,
and the rabbit digs a shallow burrow for the night. The lazy dog simply stretches out on the cool grass, content and peaceful.
"""
# Create a parser
parser = PlaintextParser.from_string(text, Tokenizer("english"))
# Create a Luhn Summarizer
luhn_summarizer = LuhnSummarizer()
# Summarize the text
luhn_summary = luhn_summarizer(parser.document, 4) # Summarize to 4 sentences
# Print the summary
print("Summary:")
for sentence in luhn_summary:
    print(sentence)

Code explanation

Here is the explanation of the code above.

  • Line 3: This line imports the Luhn summarizer class for performing text summarization.

  • Line 18: This line creates an instance of the Luhn summarizer.

  • Line 20: This line summarizes the parsed text into four sentences using the Luhn summarizer.

  • Lines 22–24: These lines print each sentence of the generated summary.
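For intuition, here is a simplified sketch of Luhn’s sentence-scoring idea: the score grows with the number of significant (high-frequency, non-trivial) words in a sentence and shrinks as those words spread apart. This is a toy version for illustration, not sumy’s implementation:

from collections import Counter

def luhn_score(sentence_words, significant):
    # Positions of significant words within the sentence
    positions = [i for i, w in enumerate(sentence_words) if w in significant]
    if not positions:
        return 0.0
    # Luhn's measure: (number of significant words)^2 / span they cover
    span = positions[-1] - positions[0] + 1
    return len(positions) ** 2 / span

# Word frequencies over the whole (toy) document decide which words are significant
doc_words = "fox dog fox dog squirrel rabbit fox dog lazy".split()
freq = Counter(doc_words)
significant = {w for w, c in freq.items() if c >= 3}  # here: {"fox", "dog"}
sentence = "the fox jumps over the lazy dog".split()
print(luhn_score(sentence, significant))  # 2 significant words over a span of 6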

Comparing algorithms

Now that we’ve seen how to use different algorithms, let’s compare them.

  • LSA

    Pros: Captures the underlying structure of the text; good for understanding the general topic distribution.

    Cons: Can be computationally intensive; may not always pick the most representative sentences.

  • LexRank

    Pros: Uses a graph-based approach similar to Google's PageRank; effective for identifying central sentences in the text.

    Cons: May require more computation due to graph construction; performance can vary based on the text structure.

  • Luhn

    Pros: Simple and fast; effective for texts where significant words are indicative of the main content.

    Cons: Can be less effective for longer texts with varied vocabulary; might miss the broader context.
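Since all three summarizers share the same call signature, we can run them side by side on the same parsed document and compare which sentences each one picks:

from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
# Reuse the parser created earlier; each summarizer takes (document, sentence_count)
summarizers = {
    "LSA": LsaSummarizer(),
    "LexRank": LexRankSummarizer(),
    "Luhn": LuhnSummarizer(),
}
for name, summarizer in summarizers.items():
    print(f"--- {name} ---")
    for sentence in summarizer(parser.document, 2):  # two-sentence summaries
        print(sentence)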
