Page 295 - AI Ver 1.0 Class 10
Let us now study in detail how these techniques can be used for textual data processing in NLP.
Bag of Words
After text normalisation, the corpus becomes a normalised corpus: a collection of meaningful words in no
particular sequence.
Bag of Words is a simple and important technique used in Natural Language Processing for extracting features
from textual data. It converts text sentences into numeric vectors by returning the unique words along with their
number of occurrences.
Figure: Bag of words (BoW)

Input text:
Alan Turing was a brilliant British Mathematician, Biologist and Computer Scientist. His Turing Machine was one of the first, basic computers created. In 1950, Alan Turing published a ground breaking seminal paper “Computing Machinery and Intelligence” on the topic of the artificial intelligence. It introduced the concept of what is now known as Turing Test. The test is still a matter of standards today. It establishes that if a computer can have a simple dialogue with a person via a printer, then that itself is a proof that the machine is “thinking”. It was for this work that led him to be regarded as the Father of Theoretical Computer Science and Artificial Intelligence.

Bag of words (in no particular order):
Alan, Test, intelligence, in, breaking, known, as, artificial, and

Words with their number of occurrences:
('Alan', 2)
('Turing', 4)
('the', 6)
('is', 4)
('a', 7)
('Test', 2)
('intelligence', 3)
('of', 5)
('artificial', 2)
('known', 1)
('as', 2)
('and', 3)
('it', 3)
('in', 1)
('ground', 1)
('breaking', 1)
('mathematician', 1)
('machinery', 1)
.................
The algorithm is called Bag of Words because it holds meaningful words (also known as tokens) in no specific
order, like words scattered in a bag. The Bag of Words algorithm
returns:
• A vocabulary of words for the corpus.
• The frequency of these words i.e., the number of occurrences of each word.
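The two outputs listed above can be illustrated with a minimal Python sketch. It assumes the text has already been normalised into a list of tokens; the function name bag_of_words and the sample sentence are hypothetical and chosen only for illustration.

```python
from collections import Counter


def bag_of_words(tokens):
    """Return (vocabulary, frequencies) for a list of normalised tokens."""
    # Counter maps each distinct token to its number of occurrences.
    frequencies = Counter(tokens)
    # The vocabulary is the set of distinct tokens, sorted here only for readability.
    vocabulary = sorted(frequencies)
    return vocabulary, frequencies


# Hypothetical, already-normalised sample text (lowercase, no punctuation).
tokens = (
    "alan turing published a paper and the turing test "
    "is named after alan turing"
).split()

vocabulary, frequencies = bag_of_words(tokens)

for word in vocabulary:
    print(word, frequencies[word])
```

The order in which the words are printed carries no meaning; only the words and their counts matter.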
The steps involved in the Bag of Words algorithm are:
• Text Normalisation: The collection of data is processed to get the normalised corpus.
• Create Dictionary: This step creates a list of all unique words available in the normalised corpus.
• Create Document Vectors: For each document in the corpus, list each unique word from the dictionary with its
number of occurrences in that document.
• Create Document Vectors for all the Documents: Repeat the previous step for all documents in the corpus to create a
“Document Vector Table”.
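The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the normalise function is a simplified stand-in that lowercases the text and keeps runs of letters (it does not remove stopwords or perform stemming), and all function names and the two sample documents are hypothetical.

```python
import re


def normalise(text):
    # Simplified text normalisation (assumption): lowercase the text and
    # keep only runs of the letters a-z as tokens.
    return re.findall(r"[a-z]+", text.lower())


def build_dictionary(normalised_docs):
    # Collect every unique word across the corpus, in order of first appearance.
    dictionary = []
    for tokens in normalised_docs:
        for token in tokens:
            if token not in dictionary:
                dictionary.append(token)
    return dictionary


def document_vector(tokens, dictionary):
    # One count per dictionary word: how often it occurs in this document.
    return [tokens.count(word) for word in dictionary]


def document_vector_table(documents):
    normalised_docs = [normalise(doc) for doc in documents]
    dictionary = build_dictionary(normalised_docs)
    # Repeat the document-vector step for every document in the corpus.
    table = [document_vector(tokens, dictionary) for tokens in normalised_docs]
    return dictionary, table


# Hypothetical two-document corpus.
documents = [
    "Turing proposed the Turing Test.",
    "The test checks a machine.",
]

dictionary, table = document_vector_table(documents)

print(dictionary)
for row in table:
    print(row)
```

Each row of the table is the vector for one document, and each column corresponds to one dictionary word, so a 0 means that word does not occur in that document.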

