NLP Examples¶
Today, natural language processing (NLP) technology is widely used. Here are some common applications of NLP:
Information Retrieval & Web Search
- Google, Yahoo, Bing, and other search engines base their machine translation technology on NLP deep learning models, which allow algorithms to read the text on a webpage, interpret its meaning, and translate it into another language.
Grammar Correction:
- NLP techniques are widely used by word processors such as MS Word for spelling correction and grammar checking.
Text analytics has many applications in today's online world. By analyzing tweets on Twitter, we can find trending news and people's reactions to a particular event. Amazon can understand user feedback or reviews of a specific product, BookMyShow can discover people's opinions about a movie, and YouTube can analyze and understand viewers' opinions on a video.
Tokenize Words and Sentences with NLTK¶
What is Tokenization?
Tokenization is the process by which a large quantity of text is divided into smaller parts called tokens.
Natural language processing is used to build applications such as text classification, intelligent chatbots, sentiment analysis, language translation, and more. To achieve these goals, it is vital to understand the patterns in the text. Tokens are very useful for finding such patterns, and tokenization is considered the base step for stemming and lemmatization.
- Token – each “entity” that is a part of whatever was split up based on rules. For example, each word is a token when a sentence is “tokenized” into words, and each sentence can be a token if you tokenize the sentences out of a paragraph.
The Natural Language Toolkit has a very important module, tokenize, which further comprises the sub-modules
- word_tokenize
- sent_tokenize
Why is sentence tokenization needed when we already have word tokenization? Imagine you need to count the average number of words per sentence. How would you calculate it? To accomplish such a task, you need both sentence tokenization and word tokenization to compute the ratio. Such output serves as an important feature for machine training, since the answer is numeric.
NLTK will aid you with everything from splitting sentences from paragraphs, splitting up words, recognizing the part of speech of those words, highlighting the main subjects, and then even with helping your machine to understand what the text is all about.
#!pip install nltk
import nltk
nltk.download()
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
True
Choose to download “all” for all packages, and then click ‘download.’ This will give you all of the tokenizers, chunkers, other algorithms, and all of the corpora. If space is an issue, you can instead selectively download packages manually. The NLTK module itself takes up about 7 MB, while the entire nltk_data directory takes up about 1.8 GB, which includes your chunkers, parsers, and the corpora.
from nltk.tokenize import word_tokenize, sent_tokenize
sample_sentence="Hi I am Learning NLP with Itronix Solutions"
a=word_tokenize(sample_sentence)
print(a)
print(sample_sentence.split())
sample_sentence2="Hi I am Learning NLP with Mr. Karan Arora"
b=sent_tokenize(sample_sentence2)
print(b)
['Hi', 'I', 'am', 'Learning', 'NLP', 'with', 'Itronix', 'Solutions']
['Hi', 'I', 'am', 'Learning', 'NLP', 'with', 'Itronix', 'Solutions']
['Hi I am Learning NLP with Mr. Karan Arora']
from nltk.tokenize import sent_tokenize, word_tokenize
EXAMPLE_TEXT = "Hello Mr. Karan Arora, how are you doing today? The weather is great, and Python is awesome. The sky is blue. You shouldn't eat chicken."
print(sent_tokenize(EXAMPLE_TEXT))
['Hello Mr. Karan Arora, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is blue.', "You shouldn't eat chicken."]
At first, you may think tokenizing by things like words or sentences is a rather trivial exercise, and for many sentences it can be. The first step would likely be a simple .split('. '), splitting by a period followed by a space. Then maybe you would bring in some regular expressions to split by a period, a space, and then a capital letter. The problem is that things like “Mr. Karan Arora” would cause you trouble, among many other things. Splitting by word is also a challenge, especially when considering contractions, such as “we are” becoming “we’re”. NLTK saves you a ton of time with this seemingly simple, yet very complex, operation.
print(word_tokenize(EXAMPLE_TEXT))
['Hello', 'Mr.', 'Karan', 'Arora', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'blue', '.', 'You', 'should', "n't", 'eat', 'chicken', '.']
There are a few things to note here. First, notice that punctuation is treated as a separate token. Also, notice the separation of the word “shouldn’t” into “should” and “n’t.”
#C:\Users\Karan\AppData\Roaming\nltk_data\tokenizers\punkt
import nltk.data
path = "tokenizers/punkt/english.pickle" # nltk.data resource paths use forward slashes on all platforms
tokenizer = nltk.data.load(path)
EXAMPLE_TEXT = "Hello Mr. Karan Arora, how are you doing today? The weather is great, and Python is awesome. The sky is blue. You shouldn't eat chicken."
a=tokenizer.tokenize(EXAMPLE_TEXT)
print(a)
['Hello Mr. Karan Arora, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is blue.', "You shouldn't eat chicken."]
from nltk.tokenize import word_tokenize
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize import WordPunctTokenizer
tok2=TreebankWordTokenizer() #use short form
tok3=WordPunctTokenizer()
sent="Hi my name is Karan"
print(word_tokenize(sent))
print(tok2.tokenize(sent)) #or TreebankWordTokenizer().tokenize(sent)
print(tok3.tokenize(sent))
sent2="I won't let you bring wine"
print(word_tokenize(sent2))
print(tok2.tokenize(sent2))
print(tok3.tokenize(sent2))
['Hi', 'my', 'name', 'is', 'Karan']
['Hi', 'my', 'name', 'is', 'Karan']
['Hi', 'my', 'name', 'is', 'Karan']
['I', 'wo', "n't", 'let', 'you', 'bring', 'wine']
['I', 'wo', "n't", 'let', 'you', 'bring', 'wine']
['I', 'won', "'", 't', 'let', 'you', 'bring', 'wine']
from nltk.tokenize import word_tokenize
from nltk.tokenize import regexp_tokenize
sent="I can't do this. I won't do that"
print(word_tokenize(sent))
print(regexp_tokenize(sent, r"[\w']+"))  # raw strings avoid invalid-escape warnings
print(regexp_tokenize(sent, r"[\w]+"))
print(regexp_tokenize(sent, r"[\w]"))
['I', 'ca', "n't", 'do', 'this', '.', 'I', 'wo', "n't", 'do', 'that']
['I', "can't", 'do', 'this', 'I', "won't", 'do', 'that']
['I', 'can', 't', 'do', 'this', 'I', 'won', 't', 'do', 'that']
['I', 'c', 'a', 'n', 't', 'd', 'o', 't', 'h', 'i', 's', 'I', 'w', 'o', 'n', 't', 'd', 'o', 't', 'h', 'a', 't']
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")
tokenizer.tokenize(sent)
['I', "can't", 'do', 'this', 'I', "won't", 'do', 'that']
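RegexpTokenizer can also match the separators instead of the tokens — a small sketch using its gaps=True parameter to split on runs of whitespace:

```python
from nltk.tokenize import RegexpTokenizer

sent = "I can't do this. I won't do that"

# gaps=True means the pattern matches the *separators*, not the tokens,
# so this behaves like a whitespace split that keeps contractions intact
tokenizer = RegexpTokenizer(r"\s+", gaps=True)
print(tokenizer.tokenize(sent))
```

Notice that, unlike word_tokenize, this keeps "can't" whole but leaves the period attached to "this.".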
POS (Part-Of-Speech) Tagging & Chunking with NLTK¶
POS Tagging¶
Part-of-speech tagging is responsible for reading text in a language and assigning a part-of-speech tag to each word.
e.g.
- Input: Everything to permit us.
- Output: [(‘Everything’, NN),(‘to’, TO), (‘permit’, VB), (‘us’, PRP)]
Steps Involved:
- Tokenize the text (word_tokenize)
- Apply pos_tag to the tokens from the step above, i.e. nltk.pos_tag(tokens)
Some examples are as below:¶
Abbreviation Meaning
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there
FW foreign word
IN preposition/subordinating conjunction
JJ adjective (large)
JJR adjective, comparative (larger)
JJS adjective, superlative (largest)
LS list marker
MD modal (could, will)
NN noun, singular (cat, tree)
NNS noun plural (desks)
NNP proper noun, singular (Sarah)
NNPS proper noun, plural (Indians or Americans)
PDT predeterminer (all, both, half)
POS possessive ending (parent's)
PRP personal pronoun (hers, herself, him,himself)
PRP$ possessive pronoun (her, his, mine, my, our )
RB adverb (occasionally, swiftly)
RBR adverb, comparative (greater)
RBS adverb, superlative (biggest)
RP particle (about)
TO infinitive marker (to)
UH interjection (goodbye)
VB verb (ask)
VBG verb gerund (judging)
VBD verb past tense (pleaded)
VBN verb past participle (reunified)
VBP verb, present tense not 3rd person singular(wrap)
VBZ verb, present tense with 3rd person singular (bases)
WDT wh-determiner (that, what)
WP wh- pronoun (who)
WRB wh- adverb (how)
s = "Everything to permit us"
l = word_tokenize(s)
print(l)
nltk.pos_tag(l)
['Everything', 'to', 'permit', 'us']
[('Everything', 'NN'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP')]