Document Frequency
Document Frequency is the number of documents in which the word occurs irrespective of how many times it has
occurred in those documents. It is shown below using the above example:
Word                preeti  said  will  go  to  the  market  pooja  join  her  sonia  ajay  not
Document Frequency  1       1     3     2   1   1    1       1      1     1    1      1     1
Here, you can see that the document frequency of ‘go’ is 2, as it occurs in two documents. On the other hand, it is
3 for ‘will’, as it occurs in all three documents. The rest of the words occur in just one document each; hence, their
document frequency is 1.
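The counting rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the names VOCAB, TF and document_frequency are assumptions made for this example, and the counts are the term frequencies of the three documents used in this worked example.

```python
# Term-frequency counts per document, taken from the worked example.
# VOCAB, TF and document_frequency are illustrative names, not from the text.
VOCAB = ["preeti", "said", "will", "go", "to", "the", "market",
         "pooja", "join", "her", "sonia", "ajay", "not"]
TF = {
    "Doc-1": [2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "Doc-2": [0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    "Doc-3": [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1],
}


def document_frequency(tf_table, vocab):
    """For each word, count the documents in which it occurs at least once."""
    return {
        word: sum(1 for counts in tf_table.values() if counts[i] > 0)
        for i, word in enumerate(vocab)
    }


df = document_frequency(TF, VOCAB)
print(df["preeti"], df["go"], df["will"])  # 1 2 3
```

Note that ‘preeti’ occurs twice in Doc-1 but still has a document frequency of 1, because only the number of documents matters, not the number of occurrences.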
Inverse Document Frequency
Inverse Document Frequency is obtained by dividing the total number of documents by the document frequency of
a specific word. It is shown below using the above example:
IDF(W) = Total Number of Documents / Document Frequency of W
Word  preeti  said  will  go   to   the  market  pooja  join  her  sonia  ajay  not
IDF   3/1     3/1   3/3   3/2  3/1  3/1  3/1     3/1    3/1   3/1  3/1    3/1   3/1
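The same ratio can be computed with a short Python sketch. The names below (TOTAL_DOCUMENTS, DOCUMENT_FREQUENCY, inverse_document_frequency) are assumptions made for illustration; the document frequencies are the ones from the example. Fraction is used only so that values such as 3/2 stay exact.

```python
from fractions import Fraction

# Values from the worked example; the names are illustrative only.
TOTAL_DOCUMENTS = 3
DOCUMENT_FREQUENCY = {
    "preeti": 1, "said": 1, "will": 3, "go": 2, "to": 1, "the": 1,
    "market": 1, "pooja": 1, "join": 1, "her": 1, "sonia": 1,
    "ajay": 1, "not": 1,
}


def inverse_document_frequency(df, total_documents):
    """IDF = total number of documents / document frequency of the word."""
    return {word: Fraction(total_documents, count) for word, count in df.items()}


idf = inverse_document_frequency(DOCUMENT_FREQUENCY, TOTAL_DOCUMENTS)
print(idf["will"], idf["go"], idf["preeti"])  # 1 3/2 3
```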
Term Frequency-Inverse Document Frequency (TF-IDF)
Term Frequency-Inverse Document Frequency (TF-IDF) is calculated by multiplying the Term Frequency of a word by
the log of its Inverse Document Frequency.
Therefore, the formula of TFIDF for any word W becomes:
TFIDF(W) = TF(W) * log(IDF(W))
Here, log is to the base 10.
Now, let us calculate the TFIDF(W) for the above example:
Document  preeti      said        will        go          to          the         market      pooja       join        her         sonia       ajay        not
Doc-1     2*log(3/1)  1*log(3/1)  1*log(3/3)  1*log(3/2)  1*log(3/1)  1*log(3/1)  1*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)
Doc-2     0*log(3/1)  0*log(3/1)  1*log(3/3)  0*log(3/2)  0*log(3/1)  0*log(3/1)  0*log(3/1)  1*log(3/1)  1*log(3/1)  1*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)
Doc-3     0*log(3/1)  0*log(3/1)  1*log(3/3)  1*log(3/2)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)  1*log(3/1)  1*log(3/1)  1*log(3/1)
Note that the value of log(1) is 0, so a word that occurs in every document (here, ‘will’, with IDF = 3/3 = 1) gets a
TF-IDF value of 0.
After calculating all the values, we get:
Document  preeti  said    will  go      to      the     market  pooja   join    her     sonia   ajay    not
Doc-1     0.9542  0.4771  0     0.1761  0.4771  0.4771  0.4771  0       0       0       0       0       0
Doc-2     0       0       0     0       0       0       0       0.4771  0.4771  0.4771  0       0       0
Doc-3     0       0       0     0.1761  0       0       0       0       0       0       0.4771  0.4771  0.4771
Finally, the words have been converted to numbers. These numbers are the values of each word for each document.
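The whole calculation, TFIDF(W) = TF(W) * log(IDF(W)) with log to the base 10, can be checked with a small Python sketch. The names VOCAB, TF and tf_idf are assumptions made for illustration, and the term-frequency counts are those of the example. The sketch assumes every word in the vocabulary occurs in at least one document, so the document frequency is never 0.

```python
import math

# Term-frequency counts per document, taken from the worked example.
# VOCAB, TF and tf_idf are illustrative names, not from the text.
VOCAB = ["preeti", "said", "will", "go", "to", "the", "market",
         "pooja", "join", "her", "sonia", "ajay", "not"]
TF = {
    "Doc-1": [2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "Doc-2": [0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    "Doc-3": [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1],
}


def tf_idf(tf_table, vocab):
    """Return TF * log10(N / DF) for every word in every document."""
    n_docs = len(tf_table)
    # Document frequency: number of documents containing each word.
    df = [
        sum(1 for counts in tf_table.values() if counts[i] > 0)
        for i in range(len(vocab))
    ]
    return {
        doc: [round(tf * math.log10(n_docs / df[i]), 4)
              for i, tf in enumerate(counts)]
        for doc, counts in tf_table.items()
    }


scores = tf_idf(TF, VOCAB)
for doc, row in scores.items():
    print(doc, dict(zip(VOCAB, row)))
```

Here ‘will’ scores 0 in all three documents because it occurs in every one of them, while ‘preeti’ scores highest in Doc-1 because it occurs twice there and in no other document.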
Hence, at the end of the above process, we can observe the following:
• Stopwords generally have high term frequencies in all documents but tend to have lower TF-IDF values.

