
Document Frequency
                 Document Frequency is the number of documents in which a word occurs, irrespective of how many times it
                 occurs in each of those documents. It is shown below using the above example:

                               preeti  said   will   go    to    the    market   pooja   join  her    sonia  ajay   not
                               1       1      3      2     1     1      1        1       1     1      1      1      1

                 Here, you can see that the document frequency of ‘go’ is 2 because it occurs in two documents, whereas it is
                 3 for ‘will’ because it occurs in all three documents. Each of the remaining words occurs in just one document,
                 so its document frequency is 1.
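
                 The counting rule above can be sketched in Python. This is a minimal sketch, assuming the per-document
                 word counts of the running example (the `term_freq` table is transcribed from the term frequencies used
                 in the TF-IDF calculation on this page). The names `term_freq` and `document_frequency` are illustrative
                 and do not come from any library.

```python
# Word counts per document for the running example (assumed transcription
# of the term frequencies used in the TF-IDF calculation on this page).
term_freq = {
    "Doc-1": {"preeti": 2, "said": 1, "will": 1, "go": 1, "to": 1, "the": 1, "market": 1},
    "Doc-2": {"will": 1, "pooja": 1, "join": 1, "her": 1},
    "Doc-3": {"will": 1, "go": 1, "sonia": 1, "ajay": 1, "not": 1},
}


def document_frequency(term_freq):
    """Count, for each word, the number of documents that contain it."""
    doc_freq = {}
    for counts in term_freq.values():
        for word, count in counts.items():
            # A document contributes at most 1, however often the word repeats.
            if count > 0:
                doc_freq[word] = doc_freq.get(word, 0) + 1
    return doc_freq


doc_freq = document_frequency(term_freq)
print(doc_freq["will"], doc_freq["go"], doc_freq["preeti"])  # 3 2 1
```

                 Note that ‘preeti’ appears twice in Doc-1 but still has a document frequency of 1, because only the
                 number of documents matters.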

                 Inverse Document Frequency
                 Inverse Document Frequency is obtained by dividing the total number of documents by the document
                 frequency of a specific word. It is shown below using the above example:
                                 IDF = Total Number of Documents / Document Frequency

                 preeti  said    will    go      to      the     market  pooja   join    her     sonia   ajay    not
                 3/1     3/1     3/3     3/2     3/1     3/1     3/1     3/1     3/1     3/1     3/1     3/1     3/1
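
                 The ratio can be checked with a few lines of Python. This is a minimal sketch, assuming the document
                 frequencies listed earlier on this page; `total_documents`, `doc_freq` and `idf` are illustrative names.
                 `Fraction` from the standard library keeps each value as an exact ratio, as in the table.

```python
from fractions import Fraction

total_documents = 3

# Document frequencies of the running example.
doc_freq = {
    "preeti": 1, "said": 1, "will": 3, "go": 2, "to": 1, "the": 1, "market": 1,
    "pooja": 1, "join": 1, "her": 1, "sonia": 1, "ajay": 1, "not": 1,
}

# IDF = total number of documents / document frequency
idf = {word: Fraction(total_documents, df) for word, df in doc_freq.items()}

print(idf["preeti"], idf["will"], idf["go"])  # 3 1 3/2
```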
                 Term Frequency-Inverse Document Frequency (TF-IDF)
                 Term Frequency-Inverse Document Frequency (TF-IDF) is calculated by multiplying Term Frequency of a word with
                 the log of Inverse Document Frequency.
                 Therefore, the formula of TFIDF for any word W becomes:
                                                       TFIDF(W) = TF(W) * log(IDF(W))

                 Here, log is taken to the base 10.
                 Now, let us calculate the TFIDF(W) for the above example:

                 Document  preeti      said        will        go          to          the         market      pooja       join        her         sonia       ajay        not
                 Doc-1     2*log(3/1)  1*log(3/1)  1*log(3/3)  1*log(3/2)  1*log(3/1)  1*log(3/1)  1*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)
                 Doc-2     0*log(3/1)  0*log(3/1)  1*log(3/3)  0*log(3/2)  0*log(3/1)  0*log(3/1)  0*log(3/1)  1*log(3/1)  1*log(3/1)  1*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)
                 Doc-3     0*log(3/1)  0*log(3/1)  1*log(3/3)  1*log(3/2)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  1*log(3/1)  1*log(3/1)  1*log(3/1)
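                 Note that the value of log(1) is 0, so ‘will’, which occurs in all three documents (3/3 = 1), gets a TF-IDF
                 value of 0 in every document.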

                 After calculating all the values, we get:
                 Document  preeti  said    will    go      to      the     market  pooja   join    her     sonia   ajay    not
                 Doc-1     0.9542  0.4771  0       0.1761  0.4771  0.4771  0.4771  0       0       0       0       0       0
                 Doc-2     0       0       0       0       0       0       0       0.4771  0.4771  0.4771  0       0       0
                 Doc-3     0       0       0       0.1761  0       0       0       0       0       0       0.4771  0.4771  0.4771
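
                 The whole calculation, TFIDF(W) = TF(W) * log(IDF(W)) with log to the base 10, can be reproduced with a
                 short Python sketch. It assumes the term frequencies of the running example; `term_freq`, `vocabulary`
                 and `tf_idf_table` are illustrative names, and values are rounded to four decimal places to match the
                 table. Library implementations of TF-IDF often use a different logarithm base or add smoothing, so
                 their numbers may differ from these.

```python
import math

# Term frequencies of the running example (words absent from a document count as 0).
term_freq = {
    "Doc-1": {"preeti": 2, "said": 1, "will": 1, "go": 1, "to": 1, "the": 1, "market": 1},
    "Doc-2": {"will": 1, "pooja": 1, "join": 1, "her": 1},
    "Doc-3": {"will": 1, "go": 1, "sonia": 1, "ajay": 1, "not": 1},
}
vocabulary = ["preeti", "said", "will", "go", "to", "the", "market",
              "pooja", "join", "her", "sonia", "ajay", "not"]


def tf_idf_table(term_freq, vocabulary):
    """Return {document: {word: TF * log10(N / DF)}} rounded to 4 decimals."""
    total_documents = len(term_freq)
    # Document frequency: number of documents containing each word.
    # Assumes every vocabulary word occurs in at least one document.
    doc_freq = {
        word: sum(1 for counts in term_freq.values() if counts.get(word, 0) > 0)
        for word in vocabulary
    }
    table = {}
    for doc, counts in term_freq.items():
        table[doc] = {
            word: round(counts.get(word, 0) * math.log10(total_documents / doc_freq[word]), 4)
            for word in vocabulary
        }
    return table


table = tf_idf_table(term_freq, vocabulary)
for doc, row in table.items():
    print(doc, [row[word] for word in vocabulary])
```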
                 Finally, the words have been converted to numbers. Each number is the TF-IDF value of a word in a particular
                 document. From the above process, we can conclude the following:

                    • Stopwords generally have high term frequencies in all documents but tend to have lower TF-IDF values.
