Page 247 - Ai_C10_Flipbook

The steps involved in the Bag of Words algorithm are:
                    • Text Normalisation: The collected data is processed to obtain a normalised corpus.
                    • Create Dictionary: Make a list of all the unique words that occur in the normalised corpus.
                    • Create Document Vectors: For each document in the corpus, count how many times each word in the
                   dictionary occurs in that document.
                    • Create Document Vectors for all the Documents: Repeat Step 3 for every document in the corpus to create a
                   “Document Vector Table”.
                 For example, consider the following corpus of three documents:

                    • Document 1: Preeti said, “Preeti will go to the market.”
                    • Document 2: Pooja will join her.
                    • Document 3: Sonia & Ajay will not go.
                 Step 1  Text Normalisation

                    • Document 1: [preeti, said, preeti, will, go, to, the, market]
                    • Document 2: [pooja, will, join, her]
                    • Document 3: [sonia, ajay, will, not, go]
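
The normalisation step can be sketched in Python as follows. This is a minimal sketch, not a full normalisation pipeline: it assumes that lowercasing the text and removing punctuation is enough for this small corpus, and the function name `normalise` is chosen only for illustration. Repeated words are kept, because the counts in the later steps depend on them.

```python
import re

def normalise(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    # [a-z]+ keeps runs of letters only, so quotes, commas, full stops and "&" are discarded.
    return re.findall(r"[a-z]+", text.lower())

corpus = [
    'Preeti said, "Preeti will go to the market."',
    "Pooja will join her.",
    "Sonia & Ajay will not go.",
]

normalised_corpus = [normalise(doc) for doc in corpus]
for number, tokens in enumerate(normalised_corpus, start=1):
    print(f"Document {number}: {tokens}")
```

Each document becomes a list of lowercase words, with "preeti" appearing twice in Document 1.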
                 Step 2  Create Dictionary

                    • Here you will create a list of the unique words in the normalised corpus:

                  preeti         said              will              go               to                the
                  market         pooja             join              her              sonia             ajay
                  not
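
One possible way to build this dictionary in Python is sketched below. It assumes the normalised corpus is stored as lists of words that keep every occurrence (so "preeti" appears twice in Document 1); the function name `create_dictionary` is hypothetical. Words are added in the order in which they are first seen, which gives the same order as the list above.

```python
normalised_corpus = [
    ["preeti", "said", "preeti", "will", "go", "to", "the", "market"],
    ["pooja", "will", "join", "her"],
    ["sonia", "ajay", "will", "not", "go"],
]

def create_dictionary(documents):
    """Return the unique words of the corpus in order of first appearance."""
    dictionary = []
    for tokens in documents:
        for word in tokens:
            # Add a word only the first time it is seen.
            if word not in dictionary:
                dictionary.append(word)
    return dictionary

dictionary = create_dictionary(normalised_corpus)
print(dictionary)
print(len(dictionary), "unique words")
```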

                 Step 3  Create a Document Vector
                 In this step, the words from the dictionary are written in the top row. Then, for each word in Document 1, find the
                 matching word in the dictionary and put a 1 under it. If the same word appears again, increase that value
                 by 1. If a dictionary word does not occur in the document, put a 0 under it. For example, the vector for Document 1 will be:

                  Document       preeti  said  will   go   to   the   market    pooja    join  her    sonia  ajay   not
                  Doc-1          2       1     1      1    1    1     1         0        0     0      0      0      0
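
The counting rule described in this step can be sketched in Python as below. The function name `create_document_vector` is only illustrative, and the dictionary and the words of Document 1 are typed in directly so that the example runs on its own.

```python
dictionary = ["preeti", "said", "will", "go", "to", "the", "market",
              "pooja", "join", "her", "sonia", "ajay", "not"]

document_1 = ["preeti", "said", "preeti", "will", "go", "to", "the", "market"]

def create_document_vector(tokens, dictionary):
    """Count how many times each dictionary word occurs in one document."""
    # Start with 0 under every dictionary word.
    vector = [0] * len(dictionary)
    for word in tokens:
        if word in dictionary:
            # The first match turns the 0 into 1; each repeat adds 1 more.
            vector[dictionary.index(word)] += 1
    return vector

doc_1_vector = create_document_vector(document_1, dictionary)
print(doc_1_vector)  # [2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```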

                 Step 4  Repeat the above Steps for all Documents
                 The normalised corpus above has three documents, so this step produces three rows, which together form
                 the Document Vector Table shown below:

                  Document      preeti  said   will  go    to   the   market    pooja   join  her    sonia   ajay  not

                  Doc-1         2       1      1     1     1    1     1         0       0     0      0       0     0
                  Doc-2         0       0      1     0     0    0     0         1       1     1      0       0     0
                  Doc-3         0       0      1     1     0    0     0         0       0     0      1       1     1
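
The whole Document Vector Table can be produced with a short Python sketch such as the one below. It uses `Counter` from Python's standard `collections` module to do the counting, which is one convenient option rather than the only way. The variable names are illustrative, and the normalised corpus is assumed to keep every occurrence of each word.

```python
from collections import Counter

dictionary = ["preeti", "said", "will", "go", "to", "the", "market",
              "pooja", "join", "her", "sonia", "ajay", "not"]

normalised_corpus = [
    ["preeti", "said", "preeti", "will", "go", "to", "the", "market"],
    ["pooja", "will", "join", "her"],
    ["sonia", "ajay", "will", "not", "go"],
]

table = []
for tokens in normalised_corpus:
    counts = Counter(tokens)  # word -> number of occurrences in this document
    # A Counter returns 0 for a word that is not in the document.
    table.append([counts[word] for word in dictionary])

# Print the dictionary as the top row, then one row per document.
print("Document " + " ".join(f"{word:>6}" for word in dictionary))
for number, row in enumerate(table, start=1):
    print(f"Doc-{number:<5}" + " ".join(f"{count:>6}" for count in row))
```

Each row of `table` is the document vector of one document, in the same column order as the dictionary.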

                 Term Frequency and Inverse Document Frequency (TFIDF)

                 The bag of words algorithm counts how often each word occurs in each document of a given corpus. The idea
                 behind it is that if a word occurs frequently in a document, it holds more significance for that document. For
                 instance, in a document about climate change, words like "climate" and "change" would appear frequently.
