Page 385 - AI Ver 3.0 class 10_Flipbook
• Document 2: Pooja will join her.
• Document 3: Sonia & Ajay will not go.
Step 1 Text Normalisation
• Document 1: [preeti, said, will, go, to, the, market]
• Document 2: [pooja, will, join, her]
• Document 3: [sonia, ajay, will, not, go]
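The normalisation step can be sketched in Python as shown here. This is only a minimal illustration, assuming that normalisation means converting the text to lowercase, removing punctuation and special characters, and splitting the text into words. Every word is kept, so that the tokens agree with the counts in the Document Vector Table. The function name `normalise` is made up for this example.

```python
import re

def normalise(text):
    """Lowercase the text, replace non-letters with spaces, and split into words."""
    text = text.lower()
    # Assumption: only the letters a-z matter; '&', '.' and similar symbols are dropped.
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()

documents = {
    "Document 2": "Pooja will join her.",
    "Document 3": "Sonia & Ajay will not go.",
}

normalised = {name: normalise(text) for name, text in documents.items()}

for name, tokens in normalised.items():
    print(name, tokens)
# Document 2 ['pooja', 'will', 'join', 'her']
# Document 3 ['sonia', 'ajay', 'will', 'not', 'go']
```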
Step 2 Create Dictionary
• Here you will create a list of the unique words found across all the documents:
preeti, said, will, go, to, the, market, pooja, join, her, sonia, ajay, not
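Creating the dictionary can be sketched in Python as follows. This is a rough illustration, not the only way to do it. The token list for Document 1 is an assumption: "preeti" is written twice because the Document Vector Table gives it a count of 2. The function name `build_dictionary` is made up for this example.

```python
corpus = [
    # Assumption: "preeti" occurs twice in Document 1, as the count of 2 in the table suggests.
    ["preeti", "said", "preeti", "will", "go", "to", "the", "market"],
    ["pooja", "will", "join", "her"],
    ["sonia", "ajay", "will", "not", "go"],
]

def build_dictionary(corpus):
    """Collect each word once, in the order in which it is first seen."""
    dictionary = []
    for tokens in corpus:
        for word in tokens:
            if word not in dictionary:
                dictionary.append(word)
    return dictionary

dictionary = build_dictionary(corpus)
print(dictionary)
# ['preeti', 'said', 'will', 'go', 'to', 'the', 'market',
#  'pooja', 'join', 'her', 'sonia', 'ajay', 'not']
```

Words that repeat, such as "will" and "go", are added only the first time they are seen, so the dictionary has 13 words.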
Step 3 Create a Document Vector
In this step, the words from the dictionary are written in the top row. Then, for each word in document 1 that
matches a word in the dictionary, put a 1 under that word. If the same word appears again, increment the previous
value by 1. If a dictionary word does not occur in the document, put a 0 under it. For example, the vector for document 1 will be:
Document  preeti  said  will  go  to  the  market  pooja  join  her  sonia  ajay  not
Doc-1     2       1     1     1   1   1    1       0      0     0    0      0     0
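The counting rule described in this step can be sketched in Python as follows. This is a small illustration under the assumption that "preeti" occurs twice in Document 1, which is what the count of 2 suggests. The function name `document_vector` is made up for this example.

```python
dictionary = ["preeti", "said", "will", "go", "to", "the", "market",
              "pooja", "join", "her", "sonia", "ajay", "not"]

# Assumption: "preeti" occurs twice in Document 1, matching the count of 2 in the vector.
doc1 = ["preeti", "said", "preeti", "will", "go", "to", "the", "market"]

def document_vector(tokens, dictionary):
    """Start with 0 under every dictionary word, then add 1 each time a word occurs."""
    vector = [0] * len(dictionary)
    for word in tokens:
        if word in dictionary:
            vector[dictionary.index(word)] += 1
    return vector

doc1_vector = document_vector(doc1, dictionary)
print(doc1_vector)
# [2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```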
Step 4 Repeat the above Steps for all Documents
The normalised corpus above has three documents. So, this step produces three rows, one for each document, which
together form the Document Vector Table shown below:
Document  preeti  said  will  go  to  the  market  pooja  join  her  sonia  ajay  not
Doc-1     2       1     1     1   1   1    1       0      0     0    0      0     0
Doc-2     0       0     1     0   0   0    0       1      1     1    0      0     0
Doc-3     0       0     1     1   0   0    0       0      0     0    1      1     1
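The whole Document Vector Table can be reproduced with a short Python sketch. It uses `Counter` from Python's standard `collections` module to count the words. The token lists are assumptions chosen to agree with the counts in the table (for example, "preeti" is written twice in Doc-1), and the helper name `vectorise` is made up for this example.

```python
from collections import Counter

# Assumption: these token lists are chosen to agree with the counts in the table.
corpus = {
    "Doc-1": ["preeti", "said", "preeti", "will", "go", "to", "the", "market"],
    "Doc-2": ["pooja", "will", "join", "her"],
    "Doc-3": ["sonia", "ajay", "will", "not", "go"],
}

# Dictionary: every unique word, in the order in which it is first seen.
dictionary = list(dict.fromkeys(
    word for tokens in corpus.values() for word in tokens
))

def vectorise(tokens, dictionary):
    """Return the count of each dictionary word in one document."""
    counts = Counter(tokens)  # a Counter gives 0 for words that are missing
    return [counts[word] for word in dictionary]

table = {name: vectorise(tokens, dictionary) for name, tokens in corpus.items()}

print("Document", *dictionary)
for name, vector in table.items():
    print(name, *vector)
```

Each row of the output is one document vector, and each column is one word from the dictionary.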
21st Century Skills #Media Literacy
Video Session
Watch this video on Bag of Words (BoW) Intuition at the given link:
https://www.youtube.com/watch?v=uyEHxFu4XIo and answer the
following question:
How does Bag of Words work?
Term Frequency and Inverse Document Frequency (TFIDF)
The bag of words algorithm counts how many times each word occurs in each document of a given corpus. The idea
behind it is that a word which occurs frequently in a document is likely to be significant for that document. For
instance, in a document about climate change, words like "climate" and "change" would appear frequently.

