What is the spaCy library?

Natural Language Processing (NLP) has many applications, which has created a pressing need for robust libraries. One of these is spaCy, an open-source library for NLP in Python.

Features

Some of the salient features of spaCy are as follows:

  • Comprehensive support: While many NLP libraries support only a handful of natural languages, spaCy provides tokenization support for more than 70 languages.

  • Trained pipelines: Similar to Hugging Face, spaCy also features several trained pipelines in many languages.

  • Integration: Developers often find it hard to wire libraries like NLTK into deep learning frameworks such as PyTorch or TensorFlow. spaCy, by contrast, integrates smoothly with these frameworks and supports custom models.

  • Diverse tasks: spaCy allows us to perform several tasks like named entity recognition (NER), part-of-speech (PoS) tagging, text classification, and others.

Pipeline

Similar to Hugging Face, spaCy supports pipelines for easy training and deployment. Each pipeline is driven by a central configuration file that contains all the hyperparameters and other metadata, such as the choice of optimizer.

The configuration file allows us to share hyperparameters and other settings across training and testing [image credits: spaCy documentation]

The configuration file offers several advantages, such as structured sections (as shown in the example below), reproducibility, and even automated validation checks.

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
seed = 0
gpu_allocator = null

[nlp]
lang = null
pipeline = []

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 1e-8
learn_rate = 0.001

As the sample file above shows (trimmed for readability), we can declare the entire pipeline's configuration, including the hyperparameters, in a straightforward way. At testing time, reproducing the same settings is just a matter of reusing the same file.
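To see this workflow end to end, we can generate a starter config with spaCy's built-in CLI and point the train command at it. The commands below use spaCy's actual CLI, but the data paths are placeholders for your own files:

# Generate a starter config for an English NER pipeline
python -m spacy init config config.cfg --lang en --pipeline ner

# Train using the config; train.spacy and dev.spacy are placeholder paths
python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy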

Installation

We can install it using pip as follows:

pip install -U spacy
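Although not specific to spaCy, it is generally a good idea to install the library inside a virtual environment so that its dependencies stay isolated:

python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
pip install -U spacy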

Downloading a trained pipeline

The installation is usually followed by downloading a trained pipeline as follows:

python -m spacy download en_core_web_sm
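Once the download finishes, we can check that the installed pipelines are compatible with our spaCy version:

python -m spacy validate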

Usage

Importing spaCy is a simple process, as follows:

import spacy as sp

Let's get familiarized with the library by trying some tasks as well.

Tokenization

The library provides built-in tokenization for every supported language. For example, we can load the small English pipeline, whose first component is the tokenizer, as follows:

tokenizer = sp.load("en_core_web_sm")
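Strictly speaking, load returns a full Language pipeline rather than a bare tokenizer; tokenization is simply its first step. We can confirm this by listing the pipeline's components:

# List the processing components in the loaded pipeline
print(tokenizer.pipe_names)
# Typically something like:
# ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']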

Let's apply it to one of the most iconic opening lines from Herman Melville's classic Moby Dick (1851).

tokenizedText = tokenizer("Call me Ishmael. \
Some years ago—never mind how long precisely—having little \
or no money in my purse, and nothing particular to interest me on shore, \
I thought I would sail about a little and see the watery part of the world. \
It is a way I have of driving off the spleen, and regulating the circulation.")
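The call above returns a spaCy Doc object, and iterating over it yields the individual tokens. As a quick sketch (tokensList is just our own variable name), we can inspect the first few tokens:

# Each item in the Doc is a Token; .text gives its raw string
tokensList = [token.text for token in tokenizedText]
print(tokensList[:10])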

Let's analyze this tokenized text.

Analysis

We can pluck out specific grammatical units. For example, we extract the noun chunks (base noun phrases) below:

nounsList = [chunk.text for chunk in tokenizedText.noun_chunks]

Next, we'll take out verbs:

verbsList = [token.lemma_ for token in tokenizedText if token.pos_ == "VERB"]
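Following the same pattern, we can also print the part-of-speech tag the pipeline assigns to every token; pos_ holds the coarse-grained universal tag, while tag_ holds the fine-grained one:

# Print each token alongside its coarse and fine-grained PoS tags
for token in tokenizedText:
    print(token.text, token.pos_, token.tag_)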

Named entity recognition (NER)

We can simply apply NER by looping through the entities as follows:

for entity in tokenizedText.ents:
    print(entity.text, entity.label_)

We can pull the entities from any processed text through its .ents attribute and then loop over them, printing each entity's text along with its label.
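spaCy also bundles the displaCy visualizer, which can highlight these entities. As a minimal sketch: render displays the result inline in a Jupyter notebook, while serve starts a local web server for plain scripts:

from spacy import displacy

# In a notebook, this renders the entities with colored highlights
displacy.render(tokenizedText, style="ent")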

Consolidated example

We can test the examples above as a single consolidated piece of code.

import spacy as sp
tokenizer = sp.load("en_core_web_sm")
tokenizedText = tokenizer("Call me Ishmael. \
Some years ago—never mind how long precisely—having little \
or no money in my purse, and nothing particular to interest me on shore, \
I thought I would sail about a little and see the watery part of the world. \
It is a way I have of driving off the spleen, and regulating the circulation.")
nounsList = [chunk.text for chunk in tokenizedText.noun_chunks]
print(nounsList)

As we can see, noun_chunks allows us to fetch all the base noun phrases from the processed paragraph.
