NLP Examples¶
Today, natural language processing (NLP) technology is widely used. Here are some common applications of NLP:
Information Retrieval & Web Search
- Google, Yahoo, Bing, and other search engines base their machine translation technology on NLP deep learning models, which allow algorithms to read the text on a webpage, interpret its meaning, and translate it into another language.
Grammar Correction:
- NLP techniques are widely used by word processors such as MS Word for spelling correction and grammar checking.
Text analytics has many applications in today's online world. By analyzing tweets on Twitter, we can find trending news and people's reactions to a particular event. Amazon can understand user feedback or reviews of a specific product, BookMyShow can discover people's opinions about a movie, and YouTube can analyze and understand viewers' opinions on a video.
Tokenize Words and Sentences with NLTK¶
What is Tokenization?
Tokenization is the process by which a large quantity of text is divided into smaller parts called tokens.
Natural language processing is used to build applications such as text classification, intelligent chatbots, sentiment analysis, language translation, and more. To achieve these goals, it is vital to understand the patterns in the text. Tokens are very useful for finding such patterns, and tokenization is considered the base step for stemming and lemmatization.
- Token – each “entity” that is a part of whatever was split up based on rules. For example, each word is a token when a sentence is “tokenized” into words, and each sentence can be a token if you tokenize the sentences out of a paragraph.
The Natural Language Toolkit has a very important module, tokenize, which further comprises the sub-modules
- word_tokenize
- sent_tokenize
Why is sentence tokenization needed when we already have word tokenization? Imagine you need to count the average number of words per sentence. How would you calculate it? To accomplish such a task, you need both sentence tokenization and word tokenization to compute the ratio. Such output serves as an important feature for machine training, since the answer is numeric.
NLTK will aid you with everything from splitting sentences from paragraphs, splitting up words, recognizing the part of speech of those words, highlighting the main subjects, and then even with helping your machine to understand what the text is all about.
#!pip install nltk
import nltk
nltk.download()
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
True
Choose to download “all” for all packages, and then click ‘download.’ This will give you all of the tokenizers, chunkers, other algorithms, and all of the corpora. If space is an issue, you can instead selectively download packages manually. The NLTK module itself takes up about 7 MB, while the entire nltk_data directory takes up about 1.8 GB, which includes your chunkers, parsers, and the corpora.
from nltk.tokenize import word_tokenize, sent_tokenize
sample_sentence="Hi I am Learning NLP with Itronix Solutions"
a=word_tokenize(sample_sentence)
print(a)
print(sample_sentence.split())
sample_sentence2="Hi I am Learning NLP with Mr. Karan Arora"
b=sent_tokenize(sample_sentence2)
print(b)
['Hi', 'I', 'am', 'Learning', 'NLP', 'with', 'Itronix', 'Solutions']
['Hi', 'I', 'am', 'Learning', 'NLP', 'with', 'Itronix', 'Solutions']
['Hi I am Learning NLP with Mr. Karan Arora']
from nltk.tokenize import sent_tokenize, word_tokenize
EXAMPLE_TEXT = "Hello Mr. Karan Arora, how are you doing today? The weather is great, and Python is awesome. The sky is blue. You shouldn't eat chicken."
print(sent_tokenize(EXAMPLE_TEXT))
['Hello Mr. Karan Arora, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is blue.', "You shouldn't eat chicken."]
At first, you may think tokenizing by things like words or sentences is a rather trivial exercise, and for many sentences it can be. The first step would likely be a simple .split('. '), splitting by a period followed by a space. Then maybe you would bring in some regular expressions to split by a period, a space, and then a capital letter. The problem is that things like “Mr. Karan Arora” would cause you trouble, among many other things. Splitting by word is also a challenge, especially when considering contractions, such as “we are” becoming “we’re”. NLTK saves you a ton of time with this seemingly simple, yet very complex, operation.
print(word_tokenize(EXAMPLE_TEXT))
['Hello', 'Mr.', 'Karan', 'Arora', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'blue', '.', 'You', 'should', "n't", 'eat', 'chicken', '.']
There are a few things to note here. First, notice that punctuation is treated as a separate token. Also, notice the separation of the word “shouldn’t” into “should” and “n’t.”
#C:\Users\Karan\AppData\Roaming\nltk_data\tokenizers\punkt
import nltk.data
path = "tokenizers/punkt/english.pickle" # nltk.data resource paths use forward slashes on all platforms
tokenizer = nltk.data.load(path)
EXAMPLE_TEXT = "Hello Mr. Karan Arora, how are you doing today? The weather is great, and Python is awesome. The sky is blue. You shouldn't eat chicken."
a=tokenizer.tokenize(EXAMPLE_TEXT)
print(a)
['Hello Mr. Karan Arora, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is blue.', "You shouldn't eat chicken."]
from nltk.tokenize import word_tokenize
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize import WordPunctTokenizer
tok2=TreebankWordTokenizer() #use short form
tok3=WordPunctTokenizer()
sent="Hi my name is Karan"
print(word_tokenize(sent))
print(tok2.tokenize(sent)) #or TreebankWordTokenizer().tokenize(sent)
print(tok3.tokenize(sent))
sent2="I won't let you bring wine"
print(word_tokenize(sent2))
print(tok2.tokenize(sent2))
print(tok3.tokenize(sent2))
['Hi', 'my', 'name', 'is', 'Karan']
['Hi', 'my', 'name', 'is', 'Karan']
['Hi', 'my', 'name', 'is', 'Karan']
['I', 'wo', "n't", 'let', 'you', 'bring', 'wine']
['I', 'wo', "n't", 'let', 'you', 'bring', 'wine']
['I', 'won', "'", 't', 'let', 'you', 'bring', 'wine']
from nltk.tokenize import word_tokenize
from nltk.tokenize import regexp_tokenize
sent="I can't do this. I won't do that"
print(word_tokenize(sent))
print(regexp_tokenize(sent, r"[\w']+"))  # raw strings avoid invalid-escape warnings
print(regexp_tokenize(sent, r"[\w]+"))
print(regexp_tokenize(sent, r"[\w]"))
['I', 'ca', "n't", 'do', 'this', '.', 'I', 'wo', "n't", 'do', 'that']
['I', "can't", 'do', 'this', 'I', "won't", 'do', 'that']
['I', 'can', 't', 'do', 'this', 'I', 'won', 't', 'do', 'that']
['I', 'c', 'a', 'n', 't', 'd', 'o', 't', 'h', 'i', 's', 'I', 'w', 'o', 'n', 't', 'd', 'o', 't', 'h', 'a', 't']
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")
tokenizer.tokenize(sent)
['I', "can't", 'do', 'this', 'I', "won't", 'do', 'that']
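RegexpTokenizer can also match the separators instead of the tokens — a small sketch using its gaps=True parameter to split on runs of whitespace:

```python
from nltk.tokenize import RegexpTokenizer

sent = "I can't do this. I won't do that"

# gaps=True means the pattern matches the *separators*, not the tokens,
# so this behaves like a whitespace split that keeps contractions intact
tokenizer = RegexpTokenizer(r"\s+", gaps=True)
print(tokenizer.tokenize(sent))
```

Notice that, unlike word_tokenize, this keeps "can't" whole but leaves the period attached to "this.".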
POS (Part-Of-Speech) Tagging & Chunking with NLTK¶
POS Tagging¶
Part-of-speech tagging is responsible for reading text in a language and assigning a part-of-speech tag to each word.
e.g.
- Input: Everything to permit us.
- Output: [(‘Everything’, NN),(‘to’, TO), (‘permit’, VB), (‘us’, PRP)]
Steps Involved:
- Tokenize the text (word_tokenize)
- Apply pos_tag to the tokens from the step above, i.e. nltk.pos_tag(tokens)
Some examples are as below:¶
Abbreviation Meaning
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there
FW foreign word
IN preposition/subordinating conjunction
JJ adjective (large)
JJR adjective, comparative (larger)
JJS adjective, superlative (largest)
LS list marker
MD modal (could, will)
NN noun, singular (cat, tree)
NNS noun plural (desks)
NNP proper noun, singular (Sarah)
NNPS proper noun, plural (Indians or Americans)
PDT predeterminer (all, both, half)
POS possessive ending (parent's)
PRP personal pronoun (hers, herself, him,himself)
PRP$ possessive pronoun (her, his, mine, my, our )
RB adverb (occasionally, swiftly)
RBR adverb, comparative (greater)
RBS adverb, superlative (biggest)
RP particle (about)
TO infinitive marker (to)
UH interjection (goodbye)
VB verb (ask)
VBG verb gerund (judging)
VBD verb past tense (pleaded)
VBN verb past participle (reunified)
VBP verb, present tense not 3rd person singular(wrap)
VBZ verb, present tense with 3rd person singular (bases)
WDT wh-determiner (that, what)
WP wh- pronoun (who)
WRB wh- adverb (how)
s = "Everything to permit us"
l = word_tokenize(s)
print(l)
nltk.pos_tag(l)
['Everything', 'to', 'permit', 'us']
[('Everything', 'NN'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP')]