Page 387 - AI Ver 3.0 class 10_Flipbook
Term Frequency
Term Frequency is the number of times a word occurs in a single document. It can be read directly from the document
vector table of the above example:
Document  preeti  said  will  go  to  the  market  pooja  join  her  sonia  ajay  not
Doc-1     2       1     1     1   1   1    1       0      0     0    0      0     0
Doc-2     0       0     1     0   0   0    0       1      1     1    0      0     0
Doc-3     0       0     1     1   0   0    0       0      0     0    1      1     1
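As an illustration, the counting behind one row of this table can be sketched in Python. The sentence used for Doc-1 is an assumption: it was chosen only because it reproduces the Doc-1 row, and the function name `term_frequency` is hypothetical.

```python
from collections import Counter

# Vocabulary, in the column order used by the table above.
VOCAB = ["preeti", "said", "will", "go", "to", "the", "market",
         "pooja", "join", "her", "sonia", "ajay", "not"]

# Assumed text: a sentence picked only because it matches the Doc-1 row.
doc1_text = "preeti said preeti will go to the market"


def term_frequency(text, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for words that do not occur in the text.
    return [counts[word] for word in vocab]


doc1_tf = term_frequency(doc1_text, VOCAB)
print(doc1_tf)  # [2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```

Splitting on whitespace is the simplest possible tokenisation; real text would first need punctuation removed.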
Document Frequency
Document Frequency is the number of documents in which the word occurs irrespective of how many times it has
occurred in those documents. It is shown below using the above example:
preeti  said  will  go  to  the  market  pooja  join  her  sonia  ajay  not
1       1     3     2   1   1    1       1      1     1    1      1     1
Here, you can see that the document frequency of ‘go’ is 2, as it occurs in two documents. For ‘will’ it is 3, as it
occurs in all three documents. Each of the remaining words occurs in just one document, so its document
frequency is 1.
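The same counts can be derived from the term frequency table. The following is a minimal sketch, assuming the table is stored as one list per document; the names `TF` and `document_frequency` are hypothetical.

```python
VOCAB = ["preeti", "said", "will", "go", "to", "the", "market",
         "pooja", "join", "her", "sonia", "ajay", "not"]

# Term frequency rows copied from the document vector table.
TF = {
    "Doc-1": [2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "Doc-2": [0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    "Doc-3": [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1],
}


def document_frequency(tf_table):
    """For each word, count the documents in which it occurs at least once."""
    rows = list(tf_table.values())
    # zip(*rows) walks the table column by column, i.e. word by word.
    return [sum(1 for count in column if count > 0) for column in zip(*rows)]


df = dict(zip(VOCAB, document_frequency(TF)))
print(df["go"], df["will"], df["preeti"])  # 2 3 1
```

Note that ‘preeti’ occurs twice in Doc-1 but still has a document frequency of 1, because only the number of documents matters.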
Inverse Document Frequency
Inverse Document Frequency is obtained by dividing the total number of documents by the document frequency of a
specific word. It is shown below using the above example:
IDF = Total Number of Documents / Document Frequency
preeti  said  will  go   to   the  market  pooja  join  her  sonia  ajay  not
3/1     3/1   3/3   3/2  3/1  3/1  3/1     3/1    3/1   3/1  3/1    3/1   3/1
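To make the fractions concrete, here is a small sketch that evaluates them. It assumes the document frequencies from the example are given as a list; `DF`, `N_DOCS` and `inverse_document_frequency` are hypothetical names.

```python
VOCAB = ["preeti", "said", "will", "go", "to", "the", "market",
         "pooja", "join", "her", "sonia", "ajay", "not"]

# Document frequencies from the example, in vocabulary order.
DF = [1, 1, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1]
N_DOCS = 3  # total number of documents in the example


def inverse_document_frequency(df_values, n_docs):
    """IDF = total number of documents / document frequency, per word."""
    return [n_docs / d for d in df_values]


idf = dict(zip(VOCAB, inverse_document_frequency(DF, N_DOCS)))
print(idf["preeti"], idf["go"], idf["will"])  # 3.0 1.5 1.0
```

The rarer a word is across the documents, the larger its IDF; a word that occurs in every document gets the smallest possible value, 1.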
Term Frequency-Inverse Document Frequency (TF-IDF)
Term Frequency-Inverse Document Frequency (TF-IDF) is calculated by multiplying Term Frequency of a word with
the log of Inverse Document Frequency.
Therefore, the formula of TFIDF for any word W becomes:
TFIDF(W) = TF(W) * log(IDF(W))
Here, log is taken to the base 10.
Now, let us calculate the TFIDF(W) for the above example:
Document  preeti      said        will        go          to          the         market      pooja       join        her         sonia       ajay        not
Doc-1     2*log(3/1)  1*log(3/1)  1*log(3/3)  1*log(3/2)  1*log(3/1)  1*log(3/1)  1*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)
Doc-2     0*log(3/1)  0*log(3/1)  1*log(3/3)  0*log(3/2)  0*log(3/1)  0*log(3/1)  0*log(3/1)  1*log(3/1)  1*log(3/1)  1*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)
Doc-3     0*log(3/1)  0*log(3/1)  1*log(3/3)  1*log(3/2)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  1*log(3/1)  1*log(3/1)  1*log(3/1)
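Note that the value of log(1) is 0. The word ‘will’ occurs in all three documents, so its IDF is 3/3 = 1 and its
TF-IDF is 0 in every document.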
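The whole calculation can be checked numerically. The sketch below assumes the term frequency and document frequency values from the example and applies TFIDF(W) = TF(W) * log(IDF(W)) with a base-10 log; the function name `tf_idf` and the rounding to three decimals are choices made here for illustration.

```python
import math

VOCAB = ["preeti", "said", "will", "go", "to", "the", "market",
         "pooja", "join", "her", "sonia", "ajay", "not"]

# Term frequency rows and document frequencies from the example.
TF = {
    "Doc-1": [2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "Doc-2": [0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    "Doc-3": [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1],
}
DF = [1, 1, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1]
N_DOCS = 3


def tf_idf(tf_row, df_values, n_docs):
    """TFIDF(W) = TF(W) * log10(total documents / document frequency of W)."""
    return [tf * math.log10(n_docs / d) for tf, d in zip(tf_row, df_values)]


scores = {doc: tf_idf(row, DF, N_DOCS) for doc, row in TF.items()}

# Print the non-zero scores of each document, rounded to three decimals.
for doc, row in scores.items():
    nonzero = {w: round(s, 3) for w, s in zip(VOCAB, row) if s != 0}
    print(doc, nonzero)
```

In Doc-1, ‘preeti’ scores 2 * log(3), about 0.954, while ‘go’ scores only log(3/2), about 0.176, because it also appears in Doc-3. The word ‘will’ never appears in the printed output, since its score is 0 everywhere.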

