Sentence segmentation is the process of dividing a chunk of text or a paragraph into individual sentences. This task requires us to identify the boundaries that separate one sentence from another. It is a fundamental task in natural language processing (NLP) and is often an essential preprocessing step for NLP applications as it makes parsing and analysis easier.
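To see why finding sentence boundaries is non-trivial, consider a naive approach that simply splits after sentence-ending punctuation. A minimal sketch in plain Python (the `naive_split` helper and the sample texts are hypothetical, for illustration only):

```python
import re

def naive_split(text):
    # Split after '.', '!', or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Works on simple prose:
print(naive_split("Hello there. How are you?"))
# -> ['Hello there.', 'How are you?']

# But breaks on abbreviations, since "Dr." does not end a sentence:
print(naive_split("Dr. Smith arrived. He was late."))
# -> ['Dr.', 'Smith arrived.', 'He was late.']
```

Handling cases like abbreviations, decimals, and ellipses is what makes a linguistically informed approach, such as spaCy's, worthwhile.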
The `spaCy` library offers a simple way to perform sentence segmentation through the `sents` property of the built-in `Doc` class. Under the hood, spaCy derives sentence boundaries from its dependency parser, a more sophisticated approach than the purely rule-based splitting used by many other libraries. spaCy also lets us segment text in different languages by loading different language models.
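When no trained language model is installed, spaCy also provides a rule-based `sentencizer` pipeline component that can be added to a blank pipeline. A minimal English sketch (the sample sentence is made up; no model download is required):

```python
import spacy

# A blank English pipeline with spaCy's rule-based sentencizer;
# unlike the parser-based approach, it splits on punctuation rules only.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is the first sentence. Here is the second one!")
for sent in doc.sents:
    print(sent.text)
```

This is handy for quick experiments, but the parser-based segmentation of a trained model generally handles harder cases better.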
For our example, we will be using the Spanish and French language models. Let's start with the Spanish example.
```python
import spacy
nlp = spacy.load("es_core_news_sm")

text = "¿Querías saber cuánto durará esto? Hasta la muerte"
doc = nlp(text)

for sent in doc.sents:
    print(sent.text)
```
Let's go over the code:
Line 1: We import the `spacy` library.
Line 2: We load the Spanish language model.
Line 4–5: We store the Spanish text in a variable called `text`, pass it to the `nlp` object, and store the resulting `Doc` in `doc`.
Line 7–8: We use the `sents` property of the `Doc` object to loop through the text and print the sentences.
Now let's look at the French example.
```python
import spacy
nlp = spacy.load("fr_core_news_sm")

text = "Frère Jacques Frère Jacques Dormez vous? Dormez vous? Sonnez les matines Sonnez les matines Ding ding dong Ding ding dong"
doc = nlp(text)

for sent in doc.sents:
    print(sent.text)
```
The code is largely the same except for two differences:
Line 2: We load a French language model.
Line 4: We provide the French text that we want to segment.