About Me

If the idea of hacking as a career excites you, you will benefit greatly from completing this training here on Hacking Adda. You will learn how to exploit networks in the manner of an attacker, in order to find out how protect the system from them. Those interested in earning their Certified Ethical Hacker (CEH) will want to start by taking this course for FREE.

POS (Part-Of-Speech) Tagging & Chunking with NLTK

POS Tagging

Image result for Stemming and Lemmatization with Python NLTK
Parts of speech Tagging is responsible for reading the text in a language and assigning some specific token (Parts of Speech) to each word.
e.g.
Input: Everything to permit us.
Output: [('Everything', NN),('to', TO), ('permit', VB), ('us', PRP)]
Steps Involved:
  • Tokenize text (word_tokenize)
  • apply pos_tag to above step that is nltk.pos_tag(tokenize_text)
Some examples are as below:
AbbreviationMeaning
CCcoordinating conjunction
CDcardinal digit
DTdeterminer
EXexistential there
FWforeign word
INpreposition/subordinating conjunction
JJadjective (large)
JJRadjective, comparative (larger)
JJSadjective, superlative (largest)
LSlist market
MDmodal (could, will)
NNnoun, singular (cat, tree)
NNSnoun plural (desks)
NNPproper noun, singular (sarah)
NNPSproper noun, plural (indians or americans)
PDTpredeterminer (all, both, half)
POSpossessive ending (parent\ 's)
PRPpersonal pronoun (hers, herself, him,himself)
PRP$possessive pronoun (her, his, mine, my, our )
RBadverb (occasionally, swiftly)
RBRadverb, comparative (greater)
RBSadverb, superlative (biggest)
RPparticle (about)
TOinfinite marker (to)
UHinterjection (goodbye)
VBverb (ask)
VBGverb gerund (judging)
VBDverb past tense (pleaded)
VBNverb past participle (reunified)
VBPverb, present tense not 3rd person singular(wrap)
VBZverb, present tense with 3rd person singular (bases)
WDTwh-determiner (that, what)
WPwh- pronoun (who)
WRBwh- adverb (how)
POS tagger is used to assign grammatical information of each word of the sentence. Installing, Importing and downloading all the packages of NLTK is complete.

Chunking

Chunking is used to add more structure to the sentence by following parts of speech (POS) tagging. It is also known as shallow parsing. The resulted group of words is called "chunks." In shallow parsing, there is maximum one level between roots and leaves while deep parsing comprises of more than one level. Shallow Parsing is also called light parsing or chunking.
The primary usage of chunking is to make a group of "noun phrases." The parts of speech are combined with regular expressions.
Rules for Chunking:
There are no pre-defined rules, but you can combine them according to need and requirement.
For example, you need to tag Noun, verb (past tense), adjective, and coordinating junction from the sentence. You can use the rule as below
chunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}
Following table shows what the various symbol means:
Name of symbolDescription
.Any character except new line
*Match 0 or more repetitions
?Match 0 or 1 repetitions
Now Let us write the code to understand rule better
from nltk import pos_tag
from nltk import RegexpParser
text ="learn php from guru99 and make study easy".split()
print("After Split:",text)
tokens_tag = pos_tag(text)
print("After Token:",tokens_tag)
patterns= """mychunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"""
chunker = RegexpParser(patterns)
print("After Regex:",chunker)
output = chunker.parse(tokens_tag)
print("After Chunking",output)
Output
After Split: ['learn', 'php', 'from', 'guru99', 'and', 'make', 'study', 'easy']
After Token: [('learn', 'JJ'), ('php', 'NN'), ('from', 'IN'), ('guru99', 'NN'), ('and', 'CC'), ('make', 'VB'), ('study', 'NN'), ('easy', 'JJ')]
After Regex: chunk.RegexpParser with 1 stages:
RegexpChunkParser with 1 rules:
       <ChunkRule: '<NN.?>*<VBD.?>*<JJ.?>*<CC>?'>
After Chunking (S
  (mychunk learn/JJ)
  (mychunk php/NN)
  from/IN
  (mychunk guru99/NN and/CC)
  make/VB
  (mychunk study/NN easy/JJ))
The conclusion from the above example: "make" is a verb which is not included in the rule, so it is not tagged as mychunk

Use Case of Chunking

Chunking is used for entity detection. An entity is that part of the sentence by which machine get the value for any intention
Example: 
Temperature of New York. 
Here Temperature is the intention and New York is an entity. 
In other words, chunking is used as selecting the subsets of tokens. Please follow the below code to understand how chunking is used to select the tokens. In this example, you will see the graph which will correspond to a chunk of a noun phrase. We will write the code and draw the graph for better understanding.

Code to Demonstrate Use Case

 import nltk
text = "learn php from guru99"
tokens = nltk.word_tokenize(text)
print(tokens)
tag = nltk.pos_tag(tokens)
print(tag)
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp  =nltk.RegexpParser(grammar)
result = cp.parse(tag)
print(result)
result.draw()    # It will draw the pattern graphically which can be seen in Noun Phrase chunking 
Output:
['learn', 'php', 'from', 'guru99']  -- These are the tokens
[('learn', 'JJ'), ('php', 'NN'), ('from', 'IN'), ('guru99', 'NN')]   -- These are the pos_tag
(S (NP learn/JJ php/NN) from/IN (NP guru99/NN))        -- Noun Phrase Chunking
Graph
Noun Phrase chunking Graph
From the graph, we can conclude that "learn" and "guru99" are two different tokens but are categorized as Noun Phrase whereas token "from" does not belong to Noun Phrase.
Chunking is used to categorize different tokens into the same chunk. The result will depend on grammar which has been selected. Further chunking is used to tag patterns and to explore text corpora.

Post a Comment

0 Comments