Natural Language Processing (NLP) has many applications, which creates a pressing need for capable libraries. One of these libraries is spaCy, an open-source library for NLP in Python.
Some of the salient features of spaCy are as follows:
Comprehensive language support: Most libraries support only English or a handful of other natural languages; spaCy lets us work with text in a whopping 72+ languages.
Trained pipelines: Similar to Hugging Face, spaCy also features several trained pipelines in many languages.
Integration: Developers often find it hard to integrate classic libraries like NLTK with deep learning frameworks such as PyTorch or TensorFlow. spaCy integrates smoothly with these frameworks and supports custom models.
Diverse tasks: spaCy allows us to perform several tasks like named entity recognition (NER), part-of-speech (PoS) tagging, text classification, and others.
Similar to Hugging Face, spaCy provides pipelines for easy training and deployment. Each pipeline is driven by a central configuration file containing all the hyperparameters and other metadata, such as the choice of optimizer.
The configuration file has many advantages, like structured sections (as shown in the example below), reproducibility, and even automated validation checks.
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
seed = 0
gpu_allocator = null

[nlp]
lang = null
pipeline = []

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 1e-8
learn_rate = 0.001
As we can see in the sample file above (trimmed for readability), we can declare the whole pipeline's configuration, including the hyperparameters, in a simple, declarative format. At test time, inferring under the same settings is just a matter of reusing the same file.
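spaCy's command-line interface can both generate and consume these config files. As a minimal sketch (the output directory and the .spacy corpus paths below are placeholders, not part of the example above):

# Generate a starter config for an English NER pipeline
python -m spacy init config config.cfg --lang en --pipeline ner
# Train using that config; the corpus paths here are hypothetical
python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy

Here, init config writes a starter configuration, and train reads the same file back, so the entire training run is described by config.cfg.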
We can install it using pip as follows:
pip install -U spacy
Usually, the installation is followed by downloading a trained pipeline, such as the small English one:
python -m spacy download en_core_web_sm
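To verify that the download succeeded and that the installed pipelines are compatible with our spaCy version, we can run spaCy's built-in validate command:

python -m spacy validate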
Importing spaCy is a simple process, as follows:
import spacy as sp
Let's get familiar with the library by trying out some of these tasks.
The library features built-in tokenization for all supported languages. For example, we can load the small English pipeline (whose first component is the tokenizer) as follows:
tokenizer = sp.load("en_core_web_sm")
Let's apply it to one of the most iconic opening lines from Herman Melville's classic Moby Dick (1851).
tokenizedText = tokenizer("Call me Ishmael. \
Some years ago—never mind how long precisely—having little \
or no money in my purse, and nothing particular to interest me on shore, \
I thought I would sail about a little and see the watery part of the world. \
It is a way I have of driving off the spleen, and regulating the circulation.")
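Before extracting specific attributes, we can sanity-check the tokenization by iterating over the returned Doc object; a minimal sketch:

# Each element of the Doc is a Token object; print the raw token texts
print([token.text for token in tokenizedText])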
Let's analyze this tokenized text.
We can pluck out specific parts of speech. For example, let's extract the noun chunks below:
nounsList = [chunk.text for chunk in tokenizedText.noun_chunks]
Next, we'll extract the lemmas of the verbs:
verbsList = [token.lemma_ for token in tokenizedText if token.pos_ == "VERB"]
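These token attributes generalize to full part-of-speech tagging. As a small sketch, we can print every token alongside its coarse-grained PoS tag and lemma:

# Inspect each token's text, coarse PoS tag (token.pos_), and lemma (token.lemma_)
for token in tokenizedText:
    print(token.text, token.pos_, token.lemma_)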
We can simply apply NER by looping through the entities as follows:
for entity in tokenizedText.ents:
    print(entity.text, entity.label_)
We can pull the entities from any processed text through its .ents attribute, and then print each entity's text and label by looping over them.
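spaCy also bundles the displaCy visualizer, which can highlight these entities directly in a Jupyter notebook; a minimal sketch (outside a notebook, displacy.serve can be used instead):

from spacy import displacy

# Render the recognized entities with inline highlighting (Jupyter)
displacy.render(tokenizedText, style="ent")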
We can test the examples above as one consolidated piece of code.
import spacy as sp

tokenizer = sp.load("en_core_web_sm")
tokenizedText = tokenizer("Call me Ishmael. \
Some years ago—never mind how long precisely—having little \
or no money in my purse, and nothing particular to interest me on shore, \
I thought I would sail about a little and see the watery part of the world. \
It is a way I have of driving off the spleen, and regulating the circulation.")

nounsList = [chunk.text for chunk in tokenizedText.noun_chunks]
print(nounsList)
As we can see, noun_chunks allows us to fetch all the noun phrases (not just individual nouns) from the processed paragraph.