Page 387 - AI Ver 3.0 class 10_Flipbook

Term Frequency

                 Term Frequency is the number of times a word occurs in one document. It can be read directly from the document
                 vector table of the above example:

                 Document   preeti  said    will    go      to      the     market  pooja   join    her     sonia   ajay    not
                 Doc-1      2       1       1       1       1       1       1       0       0       0       0       0       0
                 Doc-2      0       0       1       0       0       0       0       1       1       1       0       0       0
                 Doc-3      0       0       1       1       0       0       0       0       0       0       1       1       1
                 Document Frequency
                 Document Frequency is the number of documents in which the word occurs irrespective of how many times it has
                 occurred in those documents. It is shown below using the above example:

                               preeti  said   will   go    to    the    market   pooja   join  her    sonia  ajay   not
                               1       1      3      2     1     1      1        1       1     1      1      1      1
                 Here, you can see that the document frequency of ‘go’ is 2 because it occurs in two documents, while it is 3 for
                 ‘will’ because it occurs in all three documents. Every other word occurs in just one document, so its document
                 frequency is 1.
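
The counting rule above can be sketched in Python. This is only an illustrative sketch: the names `words`, `term_frequency` and `document_frequency` are assumptions made for this example, and the numbers are copied from the document vector table.

```python
# Document vector table from the example (term frequency of each word per document).
words = ["preeti", "said", "will", "go", "to", "the", "market",
         "pooja", "join", "her", "sonia", "ajay", "not"]
term_frequency = {
    "Doc-1": [2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "Doc-2": [0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    "Doc-3": [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1],
}


def document_frequency(words, term_frequency):
    """For each word, count the documents in which it occurs at least once."""
    df = {}
    for i, word in enumerate(words):
        # A document is counted once, no matter how often the word occurs in it.
        df[word] = sum(1 for counts in term_frequency.values() if counts[i] > 0)
    return df


df = document_frequency(words, term_frequency)
print(df["will"], df["go"], df["preeti"])  # 3 2 1
```

Note that ‘preeti’ occurs twice in Doc-1 but still has a document frequency of 1, because only the number of documents is counted.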
                 Inverse Document Frequency
                 Inverse Document Frequency is obtained by dividing the total number of documents by the document frequency
                 of a specific word. It is shown below using the above example:

                                        IDF = Total Number of Documents / Document Frequency

                            preeti  said    will    go      to      the     market  pooja   join    her     sonia   ajay    not
                            3/1     3/1     3/3     3/2     3/1     3/1     3/1     3/1     3/1     3/1     3/1     3/1     3/1
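
As a rough sketch, the same division can be written in Python. The names `total_documents`, `df_table` and `idf` are assumptions made for this example, and the document frequencies are copied from the table above.

```python
# Total number of documents in the example corpus.
total_documents = 3

# Document frequency of each word, copied from the Document Frequency table.
df_table = {
    "preeti": 1, "said": 1, "will": 3, "go": 2, "to": 1, "the": 1, "market": 1,
    "pooja": 1, "join": 1, "her": 1, "sonia": 1, "ajay": 1, "not": 1,
}

# IDF = Total Number of Documents / Document Frequency
idf = {word: total_documents / count for word, count in df_table.items()}

print(idf["will"], idf["go"], idf["preeti"])  # 1.0 1.5 3.0
```

A word that occurs in every document, such as ‘will’, gets the smallest possible value of 1, while a word found in only one document gets the largest value of 3.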

                 Term Frequency-Inverse Document Frequency (TF-IDF)
                 Term Frequency-Inverse Document Frequency (TF-IDF) is calculated by multiplying Term Frequency of a word with
                 the log of Inverse Document Frequency.
                 Therefore, the formula of TFIDF for any word W becomes:
                                                       TFIDF(W) = TF(W) * log(IDF(W))

                 Here, log is to the base of 10.
                 Now, let us calculate the TFIDF(W) for the above example:
                 Document   preeti      said        will        go          to          the         market      pooja       join        her         sonia       ajay        not
                 Doc-1      2*log(3/1)  1*log(3/1)  1*log(3/3)  1*log(3/2)  1*log(3/1)  1*log(3/1)  1*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)
                 Doc-2      0*log(3/1)  0*log(3/1)  1*log(3/3)  0*log(3/2)  0*log(3/1)  0*log(3/1)  0*log(3/1)  1*log(3/1)  1*log(3/1)  1*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)
                 Doc-3      0*log(3/1)  0*log(3/1)  1*log(3/3)  1*log(3/2)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  1*log(3/1)  1*log(3/1)  1*log(3/1)
                 Note that the value of log(1) is 0, so ‘will’, which occurs in all three documents, gets a TFIDF value of 0 in
                 every document.
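
The whole calculation can be sketched in Python using the formula TFIDF(W) = TF(W) * log(IDF(W)) with log to the base 10. This is a minimal sketch under stated assumptions, not a library implementation: the function name `tf_idf` and the variable names are made up for this example, and the counts are copied from the document vector table.

```python
import math

# Document vector table from the example (term frequency of each word per document).
words = ["preeti", "said", "will", "go", "to", "the", "market",
         "pooja", "join", "her", "sonia", "ajay", "not"]
term_frequency = {
    "Doc-1": [2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "Doc-2": [0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    "Doc-3": [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1],
}


def tf_idf(words, term_frequency):
    """Return TFIDF(W) = TF(W) * log10(IDF(W)) for every word in every document."""
    total_documents = len(term_frequency)
    # Document frequency: number of documents in which each word occurs.
    # Every word in this table occurs somewhere, so the division below is safe.
    df = [sum(1 for counts in term_frequency.values() if counts[i] > 0)
          for i in range(len(words))]
    scores = {}
    for doc, counts in term_frequency.items():
        scores[doc] = {
            word: tf * math.log10(total_documents / df[i])
            for i, (word, tf) in enumerate(zip(words, counts))
        }
    return scores


scores = tf_idf(words, term_frequency)
print(round(scores["Doc-1"]["preeti"], 3))  # 0.954
print(round(scores["Doc-3"]["go"], 3))      # 0.176
print(scores["Doc-2"]["will"])              # 0.0
```

Here ‘preeti’ scores highest in Doc-1 because it occurs twice there and in no other document, while ‘will’ scores 0 everywhere because log(3/3) = log(1) = 0.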




                                                                          Natural Language Processing (Theory)  385