POS Tagging

Part-of-speech (POS) tagging reads text in a language and assigns a part-of-speech tag, such as noun, verb, or adjective, to each word.
e.g.
Input: Everything to permit us.
Output: [('Everything', 'NN'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP')]
Steps Involved:
- Tokenize the text (word_tokenize)
- Apply pos_tag to the resulting tokens, that is, nltk.pos_tag(tokenize_text)
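The two steps can be sketched as follows. With the NLTK models downloaded, the usual calls are nltk.word_tokenize(text) followed by nltk.pos_tag(tokens). So that this sketch runs without any downloads, it swaps in TreebankWordTokenizer and a hand-written lookup tagger; the toy_model dictionary is an assumption made for illustration and is not the tagger NLTK normally uses.

```python
from nltk.tag import DefaultTagger, UnigramTagger
from nltk.tokenize import TreebankWordTokenizer

text = "Everything to permit us."

# Step 1: tokenize. TreebankWordTokenizer needs no downloaded data;
# nltk.word_tokenize(text) is the usual call once its models are available.
tokens = TreebankWordTokenizer().tokenize(text)

# Step 2: tag. toy_model is a hand-written lookup table (an assumption for
# this sketch); nltk.pos_tag(tokens) is the usual call once the tagger
# model has been downloaded. Unknown words fall back to "NN".
toy_model = {"Everything": "NN", "to": "TO", "permit": "VB", "us": "PRP", ".": "."}
tagger = UnigramTagger(model=toy_model, backoff=DefaultTagger("NN"))
tagged = tagger.tag(tokens)

print(tokens)
print(tagged)
```

Note that the tokenizer also returns the final period as its own token, so a tagger assigns a tag to punctuation as well.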
NLTK's default English tagger uses the Penn Treebank tag set. The most common tags are listed below:
Abbreviation | Meaning |
CC | coordinating conjunction |
CD | cardinal number |
DT | determiner |
EX | existential there |
FW | foreign word |
IN | preposition/subordinating conjunction |
JJ | adjective (large) |
JJR | adjective, comparative (larger) |
JJS | adjective, superlative (largest) |
LS | list marker |
MD | modal (could, will) |
NN | noun, singular (cat, tree) |
NNS | noun, plural (desks) |
NNP | proper noun, singular (Sarah) |
NNPS | proper noun, plural (Indians, Americans) |
PDT | predeterminer (all, both, half) |
POS | possessive ending (parent's) |
PRP | personal pronoun (hers, herself, him, himself) |
PRP$ | possessive pronoun (her, his, mine, my, our) |
RB | adverb (occasionally, swiftly) |
RBR | adverb, comparative (greater) |
RBS | adverb, superlative (biggest) |
RP | particle (about) |
TO | infinitive marker (to) |
UH | interjection (goodbye) |
VB | verb (ask) |
VBG | verb gerund (judging) |
VBD | verb past tense (pleaded) |
VBN | verb past participle (reunified) |
VBP | verb, present tense, not 3rd person singular (wrap) |
VBZ | verb, present tense, 3rd person singular (bases) |
WDT | wh-determiner (that, what) |
WP | wh-pronoun (who) |
WRB | wh-adverb (how) |
A POS tagger assigns grammatical information to each word of a sentence. This tutorial assumes that NLTK is already installed and imported and that its data packages have been downloaded.
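When reading tagger output, it can help to translate tags back into their meanings. The sketch below does this with a small dictionary copied from a few rows of the table above; TAG_MEANINGS and describe are hypothetical names introduced for illustration, not part of NLTK.

```python
# A small subset of the tag table above; extend it as needed.
TAG_MEANINGS = {
    "NN": "noun, singular",
    "TO": "infinitive marker",
    "VB": "verb",
    "PRP": "personal pronoun",
}


def describe(tagged_tokens):
    """Return (word, tag, meaning) triples; unknown tags map to 'unknown'."""
    return [(word, tag, TAG_MEANINGS.get(tag, "unknown")) for word, tag in tagged_tokens]


# Tagged tokens taken from the example at the top of this section.
tagged = [("Everything", "NN"), ("to", "TO"), ("permit", "VB"), ("us", "PRP")]
for word, tag, meaning in describe(tagged):
    print(f"{word:12} {tag:4} {meaning}")
```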
Chunking
Chunking adds more structure to a sentence by grouping words after part-of-speech (POS) tagging. It is also known as shallow parsing or light parsing, and the resulting groups of words are called "chunks." In shallow parsing there is at most one level between the root and the leaves, whereas deep parsing comprises more than one level.
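The difference in depth can be made concrete by comparing tree heights. The sketch below builds a chunker-style tree and a deep parse tree by hand from bracketed strings; the deep tree is an invented illustration and is not the output of any parser.

```python
from nltk import Tree

# Chunker-style (shallow) tree: chunks sit directly under the root S.
shallow = Tree.fromstring("(S (NP learn/JJ php/NN) from/IN (NP guru99/NN))")

# Hand-written deep parse for illustration (not produced by a parser here).
deep = Tree.fromstring(
    "(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
)

# Tree.height() counts the levels from the root down to the leaves.
print("shallow height:", shallow.height())
print("deep height:", deep.height())
```

The shallow tree has a single chunk level between the root and the words, while the deep tree nests phrases inside other phrases.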
The primary use of chunking is to group words into "noun phrases." To do so, part-of-speech tags are combined with regular expressions.
Rules for Chunking:
There are no predefined rules; you combine tag patterns according to your needs.
For example, suppose you need to chunk nouns, verbs (past tense), adjectives, and coordinating conjunctions from a sentence. You can use the rule below:
chunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}
The following table shows what the various symbols mean:
Symbol | Description |
. | Any character except new line |
* | Match 0 or more repetitions |
? | Match 0 or 1 repetitions |
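To see how these symbols behave on individual tags, the sketch below applies the piece NN.? to a few tags with Python's re module. This is only an approximation: NLTK translates tag patterns into regular expressions internally, and the details of that translation may differ from plain re.

```python
import re

# "NN.?" means: the letters NN, then at most one more character.
noun_pattern = re.compile(r"NN.?")

for tag in ["NN", "NNS", "NNP", "NNPS", "VB"]:
    matched = noun_pattern.fullmatch(tag) is not None
    print(f"{tag:5} matches NN.? -> {matched}")
```

In this sketch the four-letter tag NNPS is not matched, because . followed by ? allows at most one extra character after NN.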
Now let us write some code to understand the rule better.
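from nltk import pos_tag
from nltk import RegexpParser

# Split the sentence into words on whitespace
text = "learn php from guru99 and make study easy".split()
print("After Split:", text)

# Assign a POS tag to every word
tokens_tag = pos_tag(text)
print("After Token:", tokens_tag)

# Chunk rule: nouns, past-tense verbs, adjectives, optional conjunction
patterns = """mychunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"""
chunker = RegexpParser(patterns)
print("After Regex:", chunker)

# Apply the rule to the tagged words
output = chunker.parse(tokens_tag)
print("After Chunking", output)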
Output
After Split: ['learn', 'php', 'from', 'guru99', 'and', 'make', 'study', 'easy']
After Token: [('learn', 'JJ'), ('php', 'NN'), ('from', 'IN'), ('guru99', 'NN'), ('and', 'CC'), ('make', 'VB'), ('study', 'NN'), ('easy', 'JJ')]
After Regex: chunk.RegexpParser with 1 stages:
RegexpChunkParser with 1 rules:
       <ChunkRule: '<NN.?>*<VBD.?>*<JJ.?>*<CC>?'>
After Chunking (S
  (mychunk learn/JJ)
  (mychunk php/NN)
  from/IN
  (mychunk guru99/NN and/CC)
  make/VB
  (mychunk study/NN easy/JJ))
The conclusion from the above example: "make" is tagged VB, a tag that is not included in the rule, so it is not tagged as mychunk. The same holds for "from", which is tagged IN.
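To check this conclusion, the sketch below widens <VBD.?> to <VB.?> in the rule and reuses the tags from the output above as hand-typed input, so no tagger model is needed. Treat it as an experiment with the rule rather than a recommended grammar.

```python
from nltk import RegexpParser

# Tags copied from the output above, so no tagger model is needed.
tokens_tag = [
    ("learn", "JJ"), ("php", "NN"), ("from", "IN"), ("guru99", "NN"),
    ("and", "CC"), ("make", "VB"), ("study", "NN"), ("easy", "JJ"),
]

# Same rule as before, except <VBD.?> is widened to <VB.?> so that the
# base form VB (as well as VBD, VBG, ...) can also be part of a chunk.
chunker = RegexpParser("mychunk: {<NN.?>*<VB.?>*<JJ.?>*<CC>?}")
tree = chunker.parse(tokens_tag)

# Collect the words of every mychunk subtree.
chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in tree.subtrees(lambda t: t.label() == "mychunk")
]
print(chunks)
```

With the widened rule, "make" ends up in a chunk of its own, while "from" is still left outside every chunk.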
Use Case of Chunking
Chunking is used for entity detection. An entity is the part of a sentence from which a machine gets the value for an intention.
Example: Temperature of New York. Here Temperature is the intention and New York is the entity.
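A minimal sketch of this idea chunks runs of proper nouns as entities. The tags below are typed by hand, which is an assumption since a real tagger may assign different tags, and ENTITY is a hypothetical chunk label chosen for this example.

```python
from nltk import RegexpParser

# Hand-tagged input (an assumption; a real tagger may assign other tags).
tagged = [("Temperature", "NN"), ("of", "IN"), ("New", "NNP"), ("York", "NNP")]

# ENTITY is a hypothetical chunk label: one or more consecutive proper nouns.
entity_chunker = RegexpParser("ENTITY: {<NNP>+}")
tree = entity_chunker.parse(tagged)

# Join the words of each ENTITY chunk back into a phrase.
entities = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "ENTITY")
]
print(entities)
```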
More generally, chunking selects subsets of tokens. The code below shows how chunking is used to select tokens. It also draws a graph that corresponds to the noun phrase chunks.
Code to Demonstrate Use Case
import nltk

text = "learn php from guru99"
tokens = nltk.word_tokenize(text)
print(tokens)

tag = nltk.pos_tag(tokens)
print(tag)

# NP: an optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(tag)
print(result)

result.draw()  # It will draw the pattern graphically which can be seen in Noun Phrase chunking
Output:
['learn', 'php', 'from', 'guru99']  -- These are the tokens
[('learn', 'JJ'), ('php', 'NN'), ('from', 'IN'), ('guru99', 'NN')]  -- These are the pos_tag
(S (NP learn/JJ php/NN) from/IN (NP guru99/NN))  -- Noun Phrase Chunking
Graph
[Figure: Noun Phrase chunking Graph]
From the graph, we can conclude that "learn php" and "guru99" form two separate chunks, both categorized as Noun Phrase (NP), whereas the token "from" does not belong to any Noun Phrase.
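The same information can be read from the tree in code, which is useful where result.draw() cannot open a window. The sketch below reuses the tags from the output above as hand-typed input; noun_phrases and outside are names introduced for this example.

```python
import nltk

# Tags copied from the output above, so no tagger model is needed.
tag = [("learn", "JJ"), ("php", "NN"), ("from", "IN"), ("guru99", "NN")]

cp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
result = cp.parse(tag)

# Words grouped into NP chunks, one string per chunk.
noun_phrases = [
    " ".join(word for word, pos in subtree.leaves())
    for subtree in result.subtrees(lambda t: t.label() == "NP")
]

# Top-level children that are plain (word, tag) pairs lie outside any chunk.
outside = [node[0] for node in result if not isinstance(node, nltk.Tree)]

print(noun_phrases)
print(outside)
```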
Chunking groups different tokens into the same chunk. The result depends on the grammar that has been selected. Chunking is also used to tag patterns and to explore text corpora.