Page 297 - AI Ver 1.0 Class 10

Step 4: Repeat the above Steps for all Documents

                 The normalised corpus above contains three documents, so repeating the steps produces three rows, one per
                 document. Together these rows form our Document Vector Table, as shown below:

                     I        like     oranges     also     bananas      and       are       good       for      health
                     1         1          1         0          0          0         0          0         0         0
                     1         1          0         1          1          0         0          0         0         0
                     0         0          1         0          1          1         1          1         1         1
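
The table above can be reproduced with a few lines of Python. This is only a minimal sketch: the three token lists in `documents` are an assumption, reconstructed from the table rather than quoted from the text, and the names `documents`, `vocabulary` and `document_vectors` are illustrative.

```python
# Assumed normalised corpus, reconstructed from the Document Vector Table
# (the original sentences are not quoted here).
documents = [
    ["i", "like", "oranges"],
    ["i", "also", "like", "bananas"],
    ["oranges", "and", "bananas", "are", "good", "for", "health"],
]

# Build the vocabulary in order of first appearance across the corpus.
vocabulary = []
for doc in documents:
    for word in doc:
        if word not in vocabulary:
            vocabulary.append(word)

# One row per document: how many times each vocabulary word occurs in it.
document_vectors = [[doc.count(word) for word in vocabulary] for doc in documents]

print(vocabulary)
for row in document_vectors:
    print(row)
```

Each printed row corresponds to one line of the Document Vector Table, with the columns in the same order as the vocabulary.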


                 Term Frequency and Inverse Document Frequency (TFIDF)
                 This method is considered better than the Bag of Words algorithm: BoW only records how many times each
                 word occurs in a document, whereas TFIDF assigns each word a numeric value that reflects its importance in the document.

                 TFIDF is a statistical measure of how important a word is in a document. Each word in a document is given
                 a numeric value, which is built from the quantities described below:


                 Term Frequency
                 Term Frequency is the number of times a word occurs in one document. It can be read directly from the document
                 vector table, where each row gives the term frequencies for one document, as in the above example:

                     I        like     oranges     also     bananas      and        are      good       for      health
                     1         1          1         0          0          0         0          0         0         0
                     1         1          0         1          1          0         0          0         0         0
                     0         0          1         0          1          1         1          1         1         1


                 Document Frequency
                 Document Frequency is the number of documents in which the word occurs irrespective of how many times it has
                 occurred in those documents. It is shown below using the above example:

                     I        like    oranges      also     bananas      and        are      good       for      health
                     2         2          2         1          2          1          1         1         1         1
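
The document frequencies above can be checked by counting, for each column of the document vector table, how many rows have a non-zero entry. A minimal sketch, with illustrative variable names:

```python
vocabulary = ["I", "like", "oranges", "also", "bananas",
              "and", "are", "good", "for", "health"]

# Document vector table from the example (one row per document).
document_vectors = [
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 1, 1, 1, 1, 1, 1],
]

# A document counts once for a word if the word occurs in it at all,
# no matter how many times it occurs.
document_frequency = {
    word: sum(1 for row in document_vectors if row[col] > 0)
    for col, word in enumerate(vocabulary)
}

print(document_frequency)
```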



                 Inverse Document Frequency
                 Inverse Document Frequency is the total number of documents divided by the document frequency of the word, so
                 the total number of documents is the numerator and the document frequency is the denominator. It is shown
                 below using the above example:

                      I        like    oranges     also     bananas      and       are       good       for      health
                    3/2        3/2       3/2       3/1        3/2        3/1       3/1        3/1       3/1       3/1
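
The same ratios can be computed in code. The sketch below uses Python's `fractions.Fraction` so that the values stay in the numerator/denominator form used in the table; the dictionary of document frequencies is copied from the example, and the variable names are illustrative.

```python
from fractions import Fraction

# Document frequencies from the example above.
document_frequency = {
    "I": 2, "like": 2, "oranges": 2, "also": 1, "bananas": 2,
    "and": 1, "are": 1, "good": 1, "for": 1, "health": 1,
}
total_documents = 3

# IDF(W) = total number of documents / document frequency of W
inverse_document_frequency = {
    word: Fraction(total_documents, df)
    for word, df in document_frequency.items()
}

for word, idf in inverse_document_frequency.items():
    print(word, idf)
```

Note that `Fraction` prints a whole number such as 3/1 simply as `3`.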


                 Therefore, the formula of TFIDF for any word W becomes:
                 TFIDF(W) = TF(W) * log(IDF(W))

                 Here, log is taken to the base 10.
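
To see the formula at work, the sketch below applies TF(W) * log10(IDF(W)) to every cell of the document vector table from the example. It is a minimal illustration with assumed variable names, not a library implementation; ready-made TFIDF tools often use slightly different variants of the formula.

```python
import math

vocabulary = ["I", "like", "oranges", "also", "bananas",
              "and", "are", "good", "for", "health"]

# Term frequencies: the document vector table from the example.
document_vectors = [
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 1, 1, 1, 1, 1, 1],
]
total_documents = len(document_vectors)

# Document frequency of each word (number of rows with a non-zero entry).
document_frequency = [
    sum(1 for row in document_vectors if row[col] > 0)
    for col in range(len(vocabulary))
]

# TFIDF(W) = TF(W) * log10(total documents / document frequency of W)
tfidf = [
    [tf * math.log10(total_documents / document_frequency[col])
     for col, tf in enumerate(row)]
    for row in document_vectors
]

for row in tfidf:
    print([round(value, 3) for value in row])
```

In the second document, for instance, "also" occurs in only one of the three documents and gets a value of about 0.477, while "I" occurs in two documents and gets about 0.176. A word that does not occur in a document gets 0 for that document.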






