Page 384 - AI Ver 3.0 class 10_Flipbook
Techniques of Natural Language Processing
Many techniques are used in NLP for extracting information, but the three given below are the most commonly
used:
1. Bag of Words
2. Term Frequency and Inverse Document Frequency (TFIDF)
3. Natural Language Toolkit (NLTK)
Let us now study in detail how these techniques can be used for Textual Data Processing in NLP.
Bag of Words
After the process of text normalisation, the corpus is converted into a normalised corpus, which is a collection of
meaningful words with no sequence.
Bag of Words is a simple and important technique used in Natural Language Processing for extracting features
from textual data. It converts text sentences into numeric vectors by returning the unique words along with their
number of occurrences.
Bag of words (BoW)
Text:
Johny Johny, yes Papa
Eating sugar? No, Papa.
Telling lies? No, Papa.
Open your mouth! Ha, ha, ha!
Unique words with their number of occurrences:
Word       Occurrences
eating     1
ha         3
johny      2
lies       1
mouth      1
no         2
open       1
papa       3
sugar      1
telling    1
yes        1
your       1
This algorithm is named Bag of Words because it treats the meaningful words (also known as tokens) of a
dataset like words scattered in a bag, with no specific order. The Bag of Words algorithm returns:
• A vocabulary of unique words for the corpus.
• The frequency of these words i.e., the number of occurrences of each word.
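The two outputs listed above can be illustrated with a minimal Python sketch that uses the rhyme from the Bag of Words figure. The function name bag_of_words and the simplified normalisation (lowercasing and keeping only alphabetic tokens) are assumptions made for this illustration, and a full text normalisation process may include more steps.

```python
import re
from collections import Counter

text = """Johny Johny, yes Papa
Eating sugar? No, Papa.
Telling lies? No, Papa.
Open your mouth! Ha, ha, ha!"""

def bag_of_words(text):
    # Assumed, simplified normalisation: lowercase the text and keep
    # only runs of letters, which drops punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Counter maps each unique word to its number of occurrences.
    return Counter(tokens)

bow = bag_of_words(text)
vocabulary = sorted(bow)  # unique words of the corpus

for word in vocabulary:
    print(word, bow[word])
```

The sorted keys form the vocabulary, and the count stored for each key is the frequency of that word.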
The steps involved in the Bag of Words algorithm are:
1. Text Normalisation: The collection of data is processed to get a normalised corpus.
2. Create Dictionary: This step creates a list of all the unique words available in the normalised corpus.
3. Create Document Vectors: For a document in the corpus, write against each word of the dictionary the number
of times it occurs in that document.
4. Create Document Vectors for all the Documents: Repeat Step 3 for all documents in the corpus to create a
“Document Vector Table”.
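The steps above can be sketched in Python as follows. This is only an illustrative sketch: the two-sentence corpus is made up for the example, the function names normalise and document_vector_table are hypothetical, and the normalisation is reduced to lowercasing and removing punctuation.

```python
import re

def normalise(document):
    # Step 1 (simplified assumption): lowercase and keep alphabetic tokens.
    return re.findall(r"[a-z]+", document.lower())

def document_vector_table(corpus):
    tokenised = [normalise(doc) for doc in corpus]

    # Step 2: dictionary of unique words, in order of first appearance.
    dictionary = []
    for tokens in tokenised:
        for token in tokens:
            if token not in dictionary:
                dictionary.append(token)

    # Steps 3 and 4: one vector per document, holding the count of
    # every dictionary word in that document.
    table = [[tokens.count(word) for word in dictionary]
             for tokens in tokenised]
    return dictionary, table

# Made-up corpus used only for illustration.
corpus = ["The cat sat.", "The cat saw the dog."]
dictionary, table = document_vector_table(corpus)

print(dictionary)
for vector in table:
    print(vector)
```

Each row of the table is one document vector, and each column corresponds to one word of the dictionary, so a 0 means the word does not occur in that document.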
For example, consider a corpus with three documents:
• Document 1: Preeti said, “Preeti will go to the market.”
382 Touchpad Artificial Intelligence (Ver. 3.0)-X

