What is constituency grammar in NLP?

Every sentence is made up of syntactic units, called constituents, that combine to convey meaning.

Constituents are groups of words that behave as single units. They can be phrases, words, or morphemes, and common phrase types include:

| Symbol | Type | Example |
|---|---|---|
| NP | Noun Phrase | "he," "the boy," "the man with the old black shoes" |
| VP | Verb Phrase | "walked," "sit down and be quiet" |
| PP | Prepositional Phrase | "on the floor," "with the paper," "apart from everything said before" |

The inventory of constituents plays a key role in developing a grammar for a language. In NLP, a grammar governs how a sentence is composed and defines the linear order in which words must occur for the sentence to be considered syntactically correct.
Two types of grammar are commonly used:

  • Context-free grammar (CFG), also known as constituency grammar or phrase structure grammar.

  • Dependency grammar.

What is constituency grammar?

A constituency grammar is a context-free grammar, and the languages such grammars generate are called context-free languages (CFLs). It consists of a set of rules, or productions, stating how a constituent can be broken down into smaller constituents, down to the level of individual words.

Constituency grammar is defined by four parameters:

  • A finite set of non-terminals (aka variables), each denoting a set of strings.

  • A finite set of terminal symbols (lexicon), constituting the alphabet of the language considered.

  • A non-terminal starting symbol.

  • A list of rules called productions that recursively define the structure of the language. Each rule has the form A → s, where:
    1) "A" is a non-terminal (variable) symbol on the left-hand side of the rule.
    2) "s" is a sequence of terminals and non-terminals that might be empty.
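
The four components above can be written down directly as a data structure. The following is a minimal sketch in plain Python, not a library API: the `Grammar` class, its field names, and the `is_well_formed` check are all hypothetical names chosen for illustration, and the toy grammar is an assumed example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    """A context-free grammar as four components (hypothetical helper class)."""
    non_terminals: frozenset  # variables such as S, NP, VP
    terminals: frozenset      # the lexicon (words)
    start: str                # the starting non-terminal
    productions: tuple        # pairs (A, s): A is a non-terminal, s a tuple of symbols

    def is_well_formed(self) -> bool:
        """Check that every rule has the form A -> s described above."""
        symbols = self.non_terminals | self.terminals
        return (
            self.start in self.non_terminals
            and self.non_terminals.isdisjoint(self.terminals)
            and all(
                lhs in self.non_terminals and all(sym in symbols for sym in rhs)
                for lhs, rhs in self.productions
            )
        )


# A toy grammar (assumed for illustration) that generates the sentence "I sleep".
toy = Grammar(
    non_terminals=frozenset({"S", "NP", "VP"}),
    terminals=frozenset({"I", "sleep"}),
    start="S",
    productions=(
        ("S", ("NP", "VP")),
        ("NP", ("I",)),
        ("VP", ("sleep",)),
    ),
)
print(toy.is_well_formed())  # True
```

An empty right-hand side is simply the empty tuple `()`, which matches the remark that "s" might be empty.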

Let's go through some grammar rules to make the concept concrete:

Grammar

| Rule / Production | Description | Example |
|---|---|---|
| S → NP VP | A sentence can be composed of a Noun Phrase followed by a Verb Phrase | I + want a vacation |
| NP → Pronoun | A Noun Phrase can be composed of a pronoun | I |
| NP → Proper-Noun | A Noun Phrase can be composed of a proper noun | Las Vegas |
| NP → Det Nominal | A Noun Phrase can be composed of a determiner (Det) followed by a Nominal | a + knight |
| Nominal → Noun | A Nominal can be a single Noun | flight |
| Nominal → Nominal Noun | A Nominal can be a Nominal followed by a Noun, so a Nominal may consist of one or more Nouns | morning + flight |
| PP → Preposition NP | A Prepositional Phrase can be composed of a preposition followed by a Noun Phrase | from + Las Vegas |
| VP → Verb | A Verb Phrase can be composed of a single verb | make |
| VP → Verb NP | A Verb Phrase can be composed of a verb followed by a Noun Phrase | book + a flight |
| VP → Verb NP PP | A Verb Phrase can be composed of a verb followed by a Noun Phrase and a Prepositional Phrase | book + a flight + in the evening |
| VP → Verb PP | A Verb Phrase can be composed of a verb followed by a Prepositional Phrase | eating + in the evening |
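
To see how such rules are applied one after another, here is a small sketch in plain Python that performs a leftmost derivation. The `RULES` dictionary and the `leftmost_derivation` function are hypothetical names for this illustration; `Nominal → Noun` is included so that the recursive Nominal rule can bottom out, and part-of-speech symbols such as Pronoun or Verb are left unexpanded because no lexicon is assumed here.

```python
# Phrase-level rules: left-hand side -> list of possible right-hand sides.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["Pronoun"], ["Proper-Noun"], ["Det", "Nominal"]],
    "Nominal": [["Nominal", "Noun"], ["Noun"]],
    "PP": [["Preposition", "NP"]],
    "VP": [["Verb"], ["Verb", "NP"], ["Verb", "NP", "PP"], ["Verb", "PP"]],
}


def leftmost_derivation(start, choices):
    """Rewrite the leftmost expandable symbol once per entry in `choices`.

    Each entry is the index of the right-hand side to use for that symbol.
    Returns every intermediate form, beginning with [start].
    """
    form = [start]
    steps = [list(form)]
    for choice in choices:
        # Leftmost symbol that still has a rule in RULES.
        pos = next(i for i, sym in enumerate(form) if sym in RULES)
        form = form[:pos] + RULES[form[pos]][choice] + form[pos + 1:]
        steps.append(list(form))
    return steps


# Derive the shape of "I want a vacation": Pronoun Verb Det Noun.
for step in leftmost_derivation("S", [0, 0, 1, 2, 1]):
    print(" ".join(step))
# S
# NP VP
# Pronoun VP
# Pronoun Verb NP
# Pronoun Verb Det Nominal
# Pronoun Verb Det Noun
```

The sketch raises `StopIteration` if more choices are supplied than there are symbols left to expand; a fuller version would handle that case explicitly.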

Context-free grammars are also widely used in compilers to describe programming-language syntax. In NLP, a constituency grammar is used to build phrase structure trees, also called parse trees.


Let us consider the following sentence: "The man eats the apple."

    (S
      (NP (Det The) (Nominal (Noun man)))
      (VP (Verb eats)
          (NP (Det the) (Nominal (Noun apple)))))

Parse tree

Because the tree above corresponds to a derivation (a sequence of rule applications) that starts at S and ends in the words of the sentence, we may conclude that the sentence is valid, that is, accepted by the grammar.
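
The same check can be done mechanically. The sketch below uses NLTK's `CFG.fromstring` and `ChartParser` with only the rules that appear in the parse tree. The lexicon entries and the `parses` helper are assumptions made for this example, and lowercasing the input is a simplification so that "The" and "the" share one lexicon entry.

```python
import nltk

# Only the rules used in the parse tree, plus a tiny assumed lexicon.
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det Nominal
    Nominal -> Noun
    VP -> Verb NP
    Det -> 'the'
    Noun -> 'man' | 'apple'
    Verb -> 'eats'
""")

parser = nltk.ChartParser(grammar)


def parses(sentence):
    """Return every parse tree the grammar assigns to the sentence (hypothetical helper)."""
    tokens = sentence.lower().split()
    return list(parser.parse(tokens))


trees = parses("The man eats the apple")
print(len(trees))  # 1: the sentence is accepted, with a single parse
print(trees[0])    # bracketed form of the parse tree

print(parses("eats the man"))  # []: no derivation from S exists
```

Note that `ChartParser.parse` raises a `ValueError` when the sentence contains a word missing from the grammar's lexicon, so an empty list only signals a structural failure.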

Implementation

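import nltk

# Models needed for tokenizing and part-of-speech tagging.
# Assumption: recent NLTK releases name these 'punkt_tab' and
# 'averaged_perceptron_tagger_eng'; download those if a LookupError asks for them.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def extract_constituents(sentence):
    """Chunk the noun phrases of a sentence with a regular-expression grammar."""
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)  # list of (word, POS tag) pairs
    # NP: optional determiner, any number of adjectives, then a singular noun.
    grammar = "NP: {<DT>?<JJ>*<NN>}"
    cp = nltk.RegexpParser(grammar)
    # Shallow parse: only NP chunks are grouped; other tokens stay at the top level.
    parse_tree = cp.parse(tagged)
    return parse_tree

sentence = "The man eats the apple"
constituents = extract_constituents(sentence)

# Print the tree as bracketed text.
constituents.pprint()

# Open the tree in a GUI window (requires Tkinter and a display).
constituents.draw()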
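
Once a chunk tree is available, the noun phrases can be read off it with `Tree.subtrees`. The following sketch assumes hand-written Penn Treebank tags in place of the output of `nltk.pos_tag`, so it runs without downloading any models; the variable names are illustrative.

```python
import nltk

# Hand-written tags (an assumption) standing in for the output of nltk.pos_tag.
tagged = [("The", "DT"), ("man", "NN"), ("eats", "VBZ"), ("the", "DT"), ("apple", "NN")]

chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = chunker.parse(tagged)

# Collect the words under every NP node of the chunk tree.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)  # ['The man', 'the apple']
```

The verb "eats" does not appear in the result because the grammar defines only NP chunks, which is what makes this a shallow parse and not a full constituency parse.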
