The steps involved in the Bag of Words algorithm are:
• Text Normalisation: The collection of data is processed to get the normalised corpus.
• Create Dictionary: Create a list of all the unique words available in the normalised corpus.
• Create Document Vectors: For each document in the corpus, list the unique words it contains along with their number of occurrences.
• Create Document Vectors for all the Documents: Repeat Step 3 for all the documents in the corpus to create a "Document Vector Table".
For example, consider the following corpus with three documents:
• Document 1: Preeti said, “Preeti will go to the market.”
• Document 2: Pooja will join her.
• Document 3: Sonia & Ajay will not go.
Step 1 Text Normalisation
• Document 1: [preeti, said, preeti, will, go, to, the, market]
• Document 2: [pooja, will, join, her]
• Document 3: [sonia, ajay, will, not, go]
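The following Python sketch shows one possible way to carry out this normalisation step; the helper name normalise and the simple regular expression used here are only for illustration and are not part of the textbook example.

```python
import re

corpus = [
    'Preeti said, "Preeti will go to the market."',
    "Pooja will join her.",
    "Sonia & Ajay will not go.",
]

def normalise(document):
    # Lowercase the text, replace punctuation and symbols with spaces,
    # and split the result into a list of word tokens.
    cleaned = re.sub(r"[^a-z\s]", " ", document.lower())
    return cleaned.split()

normalised_corpus = [normalise(doc) for doc in corpus]
print(normalised_corpus)
# [['preeti', 'said', 'preeti', 'will', 'go', 'to', 'the', 'market'],
#  ['pooja', 'will', 'join', 'her'],
#  ['sonia', 'ajay', 'will', 'not', 'go']]
```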
Step 2 Create Dictionary
• Here you will create a list of all the unique words in the corpus:
preeti, said, will, go, to, the, market, pooja, join, her, sonia, ajay, not
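A short Python sketch of this step is given below. It assumes the normalised corpus from Step 1 and relies on the fact that a Python dict preserves insertion order, so the words appear in order of first occurrence.

```python
normalised_corpus = [
    ["preeti", "said", "preeti", "will", "go", "to", "the", "market"],
    ["pooja", "will", "join", "her"],
    ["sonia", "ajay", "will", "not", "go"],
]

# Collect every word in the corpus, keeping only the first occurrence of each.
dictionary = list(dict.fromkeys(word for doc in normalised_corpus for word in doc))
print(dictionary)
# ['preeti', 'said', 'will', 'go', 'to', 'the', 'market',
#  'pooja', 'join', 'her', 'sonia', 'ajay', 'not']
```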
Step 3 Create a Document Vector
In this step, the list of words from the dictionary is written in the top row. Now, for each word in Document 1, if it matches a word in the dictionary, put a 1 under it. If the same word appears again, increment the previous value by 1. If a word does not occur in the document, put a 0 under it. For example, the vector for Document 1 will be:
| Document | preeti | said | will | go | to | the | market | pooja | join | her | sonia | ajay | not |
|----------|--------|------|------|----|----|-----|--------|-------|------|-----|-------|------|-----|
| Doc-1    | 2      | 1    | 1    | 1  | 1  | 1   | 1      | 0     | 0    | 0   | 0     | 0    | 0   |
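A sketch of this counting step in Python is shown below, using the dictionary and the normalised Document 1 from the earlier steps.

```python
dictionary = ["preeti", "said", "will", "go", "to", "the", "market",
              "pooja", "join", "her", "sonia", "ajay", "not"]
document_1 = ["preeti", "said", "preeti", "will", "go", "to", "the", "market"]

# For every word in the dictionary, count how many times it occurs in Document 1.
doc_1_vector = [document_1.count(word) for word in dictionary]
print(doc_1_vector)
# [2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```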
Step 4 Repeat the above Steps for all Documents
In the above normalised corpus, we have three documents. So, three rows will be created after this step, giving us the Document Vector Table shown below:
| Document | preeti | said | will | go | to | the | market | pooja | join | her | sonia | ajay | not |
|----------|--------|------|------|----|----|-----|--------|-------|------|-----|-------|------|-----|
| Doc-1    | 2      | 1    | 1    | 1  | 1  | 1   | 1      | 0     | 0    | 0   | 0     | 0    | 0   |
| Doc-2    | 0      | 0    | 1    | 0  | 0  | 0   | 0      | 1     | 1    | 1   | 0     | 0    | 0   |
| Doc-3    | 0      | 0    | 1    | 1  | 0  | 0   | 0      | 0     | 0    | 0   | 1     | 1    | 1   |
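Repeating the same counting for every document gives the complete table. Below is a minimal Python sketch, assuming the normalised corpus and dictionary built in the earlier steps.

```python
normalised_corpus = [
    ["preeti", "said", "preeti", "will", "go", "to", "the", "market"],
    ["pooja", "will", "join", "her"],
    ["sonia", "ajay", "will", "not", "go"],
]
dictionary = ["preeti", "said", "will", "go", "to", "the", "market",
              "pooja", "join", "her", "sonia", "ajay", "not"]

# Build one row of counts per document to form the Document Vector Table.
document_vector_table = [[doc.count(word) for word in dictionary]
                         for doc in normalised_corpus]
for row in document_vector_table:
    print(row)
# [2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# [0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
# [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1]
```

In practice, libraries such as scikit-learn can build this kind of table automatically through their CountVectorizer class.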
Term Frequency and Inverse Document Frequency (TFIDF)
The Bag of Words algorithm identifies the occurrence of words in each document within a given corpus. It tells us that if a word occurs frequently in a document, it holds more significance for that document. For instance, in a document about climate change, words like "climate" and "change" would appear frequently.
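To see this intuition with the example above, the sketch below ranks the words of Document 1 by their raw counts from the Document Vector Table; the raw count is used here only as a simple illustration of how often a term occurs.

```python
dictionary = ["preeti", "said", "will", "go", "to", "the", "market",
              "pooja", "join", "her", "sonia", "ajay", "not"]
doc_1_vector = [2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Pair each word with its count in Document 1 and sort by count, highest first.
ranked = sorted(zip(dictionary, doc_1_vector), key=lambda pair: pair[1], reverse=True)
print(ranked[:3])
# [('preeti', 2), ('said', 1), ('will', 1)]
```

Here "preeti" occurs most often, so by this simple measure it carries the most weight for Document 1.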