Page 385 - AI Ver 3.0 class 10_Flipbook

                    • Document 2: Pooja will join her.
                    • Document 3: Sonia & Ajay will not go.

                 Step 1  Text Normalisation
                    • Document 1: [preeti, said, preeti, will, go, to, the, market]
                    • Document 2: [pooja, will, join, her]
                    • Document 3: [sonia, ajay, will, not, go]
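
A minimal sketch of this normalisation step, assuming it consists of lowercasing the text, dropping punctuation and symbols, and splitting it into words, with no stop-word removal. The helper name `normalise` is hypothetical. Only Documents 2 and 3 are used, because the original sentence of Document 1 is not shown on this page.

```python
import re


def normalise(text):
    """Lowercase the text and return its alphabetic tokens (hypothetical helper)."""
    # Keep runs of letters only, so symbols such as '&' and '.' are dropped.
    return re.findall(r"[a-z]+", text.lower())


doc2 = normalise("Pooja will join her.")
doc3 = normalise("Sonia & Ajay will not go.")

print(doc2)  # ['pooja', 'will', 'join', 'her']
print(doc3)  # ['sonia', 'ajay', 'will', 'not', 'go']
```

Because no words are dropped in this sketch, "will" and "go" remain as tokens, which agrees with their counts in the Document Vector Table.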

                 Step 2  Create Dictionary
                    • Here you will create a list of the unique words that occur in the corpus

                  preeti         said              will              go               to                the
                  market         pooja             join              her              sonia             ajay
                  not
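
The dictionary can be built programmatically. The sketch below keeps each word the first time it is seen, so the order matches the list above. The function name `build_dictionary` is hypothetical, and the token lists are an assumption chosen to agree with the counts in the Document Vector Table (in particular, the position of the repeated "preeti" in Document 1 is a guess).

```python
def build_dictionary(documents):
    """Return the unique words of the corpus in order of first appearance."""
    vocabulary = []
    seen = set()
    for tokens in documents:
        for word in tokens:
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


# Assumed normalised corpus (token lists chosen to match the table counts).
corpus = [
    ["preeti", "said", "preeti", "will", "go", "to", "the", "market"],
    ["pooja", "will", "join", "her"],
    ["sonia", "ajay", "will", "not", "go"],
]

vocabulary = build_dictionary(corpus)
print(vocabulary)
print(len(vocabulary))  # 13
```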

                 Step 3  Create a Document Vector
                 In this step, the words from the dictionary are written in the top row. Then, for each word in Document 1, find the
                 matching word in the dictionary and put a 1 under it. If the same word appears again, increment that value by 1.
                 If a dictionary word does not occur in the document, put a 0 under it. For example, the vector for Document 1 will be:

                  Document       preeti  said   will  go   to   the   market    pooja    join  her    sonia  ajay   not
                  Doc-1          2       1      1     1    1    1     1         0        0     0      0      0      0
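
The counting rule described above can be sketched as follows. The function name `document_vector` is hypothetical, and the token list for Document 1 is an assumption: "preeti" is repeated so that its count is 2, as in the row above, but where the repeat occurs in the sentence is a guess.

```python
def document_vector(tokens, vocabulary):
    """Count how often each dictionary word occurs in one document."""
    position = {word: i for i, word in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)  # start with 0 under every word
    for word in tokens:
        if word in position:
            vector[position[word]] += 1  # first hit gives 1, repeats increment
    return vector


vocabulary = ["preeti", "said", "will", "go", "to", "the", "market",
              "pooja", "join", "her", "sonia", "ajay", "not"]

# Assumed token list for Document 1.
doc1 = ["preeti", "said", "preeti", "will", "go", "to", "the", "market"]

doc1_vector = document_vector(doc1, vocabulary)
print(doc1_vector)  # [2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```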


                 Step 4  Repeat the above Steps for all Documents
                 In the above normalised corpus, we have three documents. So, three rows will be created in this step to form
                 our Document Vector Table, as shown below:

                  Document      preeti  said   will   go   to   the   market    pooja   join  her    sonia   ajay  not
                  Doc-1         2       1      1      1    1    1     1         0       0     0      0       0     0
                  Doc-2         0       0      1      0    0    0     0         1       1     1      0       0     0
                  Doc-3         0       0      1      1    0    0     0         0       0     0      1       1     1
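
As a rough sketch, the whole table can be reproduced with `collections.Counter` from the Python standard library. The token lists are assumptions chosen to agree with the counts in the table, and the variable names are illustrative only.

```python
from collections import Counter

vocabulary = ["preeti", "said", "will", "go", "to", "the", "market",
              "pooja", "join", "her", "sonia", "ajay", "not"]

# Assumed normalised corpus (token lists chosen to match the table counts).
corpus = {
    "Doc-1": ["preeti", "said", "preeti", "will", "go", "to", "the", "market"],
    "Doc-2": ["pooja", "will", "join", "her"],
    "Doc-3": ["sonia", "ajay", "will", "not", "go"],
}

table = {}
for name, tokens in corpus.items():
    counts = Counter(tokens)  # a missing word is counted as 0
    table[name] = [counts[word] for word in vocabulary]

for name, row in table.items():
    print(name, row)
# Doc-1 [2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Doc-2 [0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
# Doc-3 [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1]
```

Each row has one entry per dictionary word, so all document vectors have the same length regardless of how long the original sentence was.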


                 21st Century Skills  #Media Literacy
                           Video Session

                       Watch this video on Bag of Words (BoW) intuition at the given link:

                       https://www.youtube.com/watch?v=uyEHxFu4XIo or scan the QR code and answer the
                       following question:
                       How does Bag of Words work?






                 Term Frequency and Inverse Document Frequency (TF-IDF)

                 The bag of words algorithm counts the occurrences of each word in every document of a given corpus. The idea
                 behind it is that a word which occurs frequently in a document is likely to be significant for that document. For
                 instance, in a document about climate change, words like "climate" and "change" would appear frequently.
