Every sentence is made up of syntactic units, or constituents, that combine to convey meaning.
Constituents are groups of words that behave as single units. They may be phrases, words, or morphemes. Common phrase types include:
Symbol | Type | Example |
NP | Noun Phrases | "he," "the boy," "the man with the old black shoes" |
VP | Verb Phrases | "walked," "sit down and be quiet" |
PP | Prepositional Phrases | "on the floor," "with the paper," "apart from everything said before" |
The inventory of constituents plays a key role in developing a grammar for a language. In NLP, a grammar governs how a sentence is composed and defines the linear order in which words must occur for the sentence to be considered syntactically correct.
Two types of grammar are commonly used:
Context-free grammar (CFG), also known as constituency grammar or phrase structure grammar.
Dependency grammar.
Constituency grammar generates a class of languages called context-free languages (CFLs). It consists of a set of rules, or productions, stating how a constituent can be broken into smaller constituents, down to the level of individual words.
Constituency grammar is defined by four parameters:
A finite set of non-terminals (also called variables), each denoting a set of strings.
A finite set of terminal symbols (lexicon), constituting the alphabet of the language considered.
A non-terminal starting symbol.
A list of rules called productions that recursively define the structure of the language. Each rule has the form A → s, where:
1) "A" is a non-terminal (variable) symbol on the left-hand side of the rule.
2) "s" is a sequence of terminals and non-terminals that might be empty.
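The four parameters above can be sketched as a small data structure. This is only an illustration: the `Grammar` class, its field names, and the toy grammar are assumptions made for this example and do not come from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    """Hypothetical container for the four parts of a constituency grammar."""
    non_terminals: frozenset   # variables such as S, NP, VP
    terminals: frozenset       # the lexicon (alphabet of the language)
    start: str                 # the starting non-terminal
    productions: tuple         # pairs (A, s); s is a tuple of symbols, possibly empty

    def validate(self):
        """Check that the four parts are consistent with the definition."""
        if self.start not in self.non_terminals:
            raise ValueError("the start symbol must be a non-terminal")
        if self.non_terminals & self.terminals:
            raise ValueError("a symbol cannot be both terminal and non-terminal")
        symbols = self.non_terminals | self.terminals
        for lhs, rhs in self.productions:
            if lhs not in self.non_terminals:
                raise ValueError(f"left-hand side {lhs!r} is not a non-terminal")
            for sym in rhs:
                if sym not in symbols:
                    raise ValueError(f"unknown symbol {sym!r} in rule for {lhs!r}")
        return True


# Toy grammar (an assumption for illustration): S -> NP VP, NP -> I, VP -> sleep
toy = Grammar(
    non_terminals=frozenset({"S", "NP", "VP"}),
    terminals=frozenset({"I", "sleep"}),
    start="S",
    productions=(
        ("S", ("NP", "VP")),
        ("NP", ("I",)),
        ("VP", ("sleep",)),
    ),
)

print(toy.validate())
```

Storing each right-hand side as a tuple keeps the "sequence of terminals and non-terminals" reading literal, and an empty tuple would represent an empty right-hand side.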
Let's go through some grammar rules to better assimilate the concept:
Grammar Rule / Production | Description | Example |
S → NP VP | A sentence can be composed of a Noun Phrase followed by a Verb Phrase | I + want a vacation |
NP → Pronoun | A Noun Phrase can be composed of a Pronoun | I |
NP → Proper-Noun | A Noun Phrase can be composed of a proper noun | Las Vegas |
NP → Det Nominal | A Noun Phrase can be composed of a determiner (Det) followed by a Nominal | a + knight |
Nominal → Nominal Noun | A Nominal may consist of one or more Nouns | morning + flight |
PP → Preposition NP | A Prepositional Phrase can be composed of a preposition followed by a Noun Phrase | from + Las Vegas |
VP → Verb | A Verb Phrase can be composed of a single verb | make |
VP → Verb NP | A Verb Phrase can be composed of a verb and a Noun Phrase | book + a flight |
VP → Verb NP PP | A Verb Phrase can be composed of a verb followed by a Noun Phrase and a Prepositional Phrase | book + a flight + in the evening |
VP → Verb PP | A Verb Phrase can be composed of a verb and a Prepositional Phrase | eating + in the evening |
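To see how these rules work together, here is a minimal recognizer sketch in plain Python. It encodes the productions from the table as a dictionary and checks whether a symbol can derive a token sequence. Several things here are assumptions rather than part of the table: the `RULES`, `LEXICON`, and `derives` names, the small lexicon built from the example words, the treatment of "Las Vegas" as a single token, and the extra base rule Nominal → Noun, which the recursive Nominal rule needs in order to terminate.

```python
from functools import lru_cache

# Productions from the table; each right-hand side is a tuple of symbols.
RULES = {
    "S": [("NP", "VP")],
    "NP": [("Pronoun",), ("Proper-Noun",), ("Det", "Nominal")],
    "Nominal": [("Nominal", "Noun"), ("Noun",)],  # ("Noun",) is an assumed base case
    "PP": [("Preposition", "NP")],
    "VP": [("Verb",), ("Verb", "NP"), ("Verb", "NP", "PP"), ("Verb", "PP")],
}

# Assumed lexicon: word classes mapped to the example words used in the table.
LEXICON = {
    "Pronoun": {"I"},
    "Proper-Noun": {"Las Vegas"},  # treated as one token for simplicity
    "Det": {"a", "the"},
    "Noun": {"vacation", "flight", "morning", "evening", "knight"},
    "Verb": {"want", "book", "make", "eating"},
    "Preposition": {"in", "from"},
}


def derives(symbol, tokens):
    """Return True if `symbol` can derive exactly the given token sequence."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def span(sym, i, j):
        # Word classes match exactly one token from the lexicon.
        if sym in LEXICON:
            return j - i == 1 and tokens[i] in LEXICON[sym]
        return any(seq(rhs, i, j) for rhs in RULES.get(sym, ()))

    @lru_cache(maxsize=None)
    def seq(rhs, i, j):
        if len(rhs) == 1:
            return span(rhs[0], i, j)
        # The first symbol covers tokens[i:k]; every remaining symbol
        # must still get at least one token.
        return any(
            span(rhs[0], i, k) and seq(rhs[1:], k, j)
            for k in range(i + 1, j - len(rhs) + 2)
        )

    return span(symbol, 0, len(tokens))


print(derives("S", "I want a vacation".split()))               # True
print(derives("VP", "book a flight in the evening".split()))   # True
print(derives("Nominal", ["morning", "flight"]))               # True
print(derives("S", "want I a vacation".split()))               # False
```

Because no rule has an empty right-hand side, every symbol covers at least one token, so the left-recursive Nominal rule always recurses on a strictly shorter span and the search terminates.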
Context-free grammars are also widely used in compilers. In NLP, a constituency grammar is used to build phrase structure trees, also called parse trees.
Let us consider the following sentence: "The man eats the apple."
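One possible leftmost derivation of this sentence can be traced with a short sketch. The rule sequence in `STEPS` is written by hand for this example and rests on assumptions that are not in the rule table: lexical rules for "the", "man", "eats", and "apple", a lowercase "the", and the base rule Nominal → Noun. The function name `leftmost_derivation` is likewise hypothetical.

```python
# Symbols that can still be expanded (everything else is a word).
NON_TERMINALS = {"S", "NP", "VP", "Det", "Nominal", "Noun", "Verb"}

# Hand-written sequence of rule applications (lhs, rhs), in leftmost order.
STEPS = [
    ("S", ["NP", "VP"]),
    ("NP", ["Det", "Nominal"]),
    ("Det", ["the"]),
    ("Nominal", ["Noun"]),       # assumed base rule
    ("Noun", ["man"]),
    ("VP", ["Verb", "NP"]),
    ("Verb", ["eats"]),
    ("NP", ["Det", "Nominal"]),
    ("Det", ["the"]),
    ("Nominal", ["Noun"]),
    ("Noun", ["apple"]),
]


def leftmost_derivation(start, steps):
    """Apply each rule to the leftmost non-terminal and record every form."""
    form = [start]
    forms = [" ".join(form)]
    for lhs, rhs in steps:
        # Position of the leftmost symbol that can still be expanded.
        i = next((k for k, sym in enumerate(form) if sym in NON_TERMINALS), None)
        if i is None or form[i] != lhs:
            raise ValueError(f"rule for {lhs!r} does not apply to {' '.join(form)!r}")
        form[i:i + 1] = rhs
        forms.append(" ".join(form))
    return forms


forms = leftmost_derivation("S", STEPS)
for line in forms:
    print(line)
```

The last line printed is "the man eats the apple", so the start symbol S has been rewritten into the sentence using only rule applications. The sketch checks that each step expands the leftmost non-terminal; it does not check that each rule belongs to a particular grammar.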
Based on the above derivation (the sequence of rule applications), we can conclude that the grammar accepts this sentence as valid.
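The following NLTK snippet tags the same sentence and groups its noun phrases:

import nltk

# Data for the tokenizer and the part-of-speech tagger used below.
# Newer NLTK releases may instead ask for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng'.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')


def extract_constituents(sentence):
    # Split the sentence into words and tag each word with its part of speech.
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    # Chunk pattern over POS tags, not a full CFG:
    # an NP is an optional determiner, any number of adjectives, then a noun.
    grammar = "NP: {<DT>?<JJ>*<NN>}"
    cp = nltk.RegexpParser(grammar)
    parse_tree = cp.parse(tagged)
    return parse_tree


sentence = "The man eats the apple"
constituents = extract_constituents(sentence)
constituents.pprint()   # print the tree as bracketed text
constituents.draw()     # open a window with the tree (needs a graphical display)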