Page 297 - AI Ver 1.0 Class 10
Step 4: Repeat the above Steps for all Documents
The normalised corpus above contains three documents, so this step produces three rows, one per document, giving us the Document Vector Table shown below:

          I   like   oranges   also   bananas   and   are   good   for   health
Doc 1     1    1        1        0       0       0     0      0     0       0
Doc 2     1    1        0        1       1       0     0      0     0       0
Doc 3     0    0        1        0       1       1     1      1     1       1
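The table above can be built with a few lines of Python. The three documents below are an assumption for illustration only (the normalised corpus is not reproduced verbatim on this page), but their vectors match the table:

```python
# Build a Document Vector Table (Bag of Words) for a small corpus.
# These three normalised documents are assumed; substitute your own corpus.
corpus = [
    "i like oranges",
    "i also like bananas",
    "oranges and bananas are good for health",
]

# Vocabulary, in the order used by the table on this page
vocabulary = ["i", "like", "oranges", "also", "bananas",
              "and", "are", "good", "for", "health"]

def document_vector(document, vocabulary):
    """Count how many times each vocabulary word occurs in the document."""
    words = document.split()
    return [words.count(word) for word in vocabulary]

table = [document_vector(doc, vocabulary) for doc in corpus]
for row in table:
    print(row)
# → [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
#   [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]
#   [0, 0, 1, 0, 1, 1, 1, 1, 1, 1]
```

Each row is one document; each column position corresponds to one vocabulary word.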
Term Frequency and Inverse Document Frequency (TFIDF)
This method is considered better than the Bag of Words algorithm because BoW only records how often each word occurs in a document, whereas the numeric value TFIDF assigns also tells us how important each word is to that document. TFIDF was introduced as a statistical measure of the important words in a document. Each word in a document is given a numeric value, computed in the steps shown below:
Term Frequency
Term Frequency is the number of times a word occurs in one document. It can be read directly from the Document Vector Table of the example above:

          I   like   oranges   also   bananas   and   are   good   for   health
Doc 1     1    1        1        0       0       0     0      0     0       0
Doc 2     1    1        0        1       1       0     0      0     0       0
Doc 3     0    0        1        0       1       1     1      1     1       1
Document Frequency
Document Frequency is the number of documents in which a word occurs, irrespective of how many times it occurs within those documents. For the example above:

          I   like   oranges   also   bananas   and   are   good   for   health
DF        2    2        2        1       2       1     1      1     1       1
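Document frequency is a count over documents, not over words. A minimal sketch, again assuming three illustrative normalised documents consistent with the tables on this page:

```python
# Document Frequency: in how many documents does each word occur?
# The corpus below is assumed for illustration.
corpus = [
    "i like oranges",
    "i also like bananas",
    "oranges and bananas are good for health",
]
vocabulary = ["i", "like", "oranges", "also", "bananas",
              "and", "are", "good", "for", "health"]

def document_frequency(word, corpus):
    """Count the documents containing the word (at least once each)."""
    return sum(1 for document in corpus if word in document.split())

df = {word: document_frequency(word, corpus) for word in vocabulary}
print(df)
# → {'i': 2, 'like': 2, 'oranges': 2, 'also': 1, 'bananas': 2,
#    'and': 1, 'are': 1, 'good': 1, 'for': 1, 'health': 1}
```

Note that a word counts once per document even if it were repeated inside that document.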
Inverse Document Frequency
Inverse Document Frequency is obtained by putting the document frequency in the denominator and the total number of documents in the numerator. For the example above:

          I    like   oranges   also   bananas   and   are   good   for   health
IDF      3/2   3/2      3/2      3/1     3/2     3/1   3/1    3/1   3/1    3/1
Therefore, the formula of TFIDF for any word W becomes:

TFIDF(W) = TF(W) * log(IDF(W))

Here, IDF(W) is the ratio computed above (total number of documents divided by the document frequency of W), and log is to the base of 10.
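Putting the three steps together, the formula can be sketched in Python. The corpus is assumed for illustration, and `math.log10` gives the base-10 logarithm used here:

```python
import math

# Assumed normalised corpus, consistent with the tables on this page
corpus = [
    "i like oranges",
    "i also like bananas",
    "oranges and bananas are good for health",
]

def tfidf(word, document, corpus):
    """TFIDF(W) = TF(W) * log10(N / DF(W))."""
    tf = document.split().count(word)                          # term frequency
    df = sum(1 for d in corpus if word in d.split())           # document frequency
    idf = len(corpus) / df                                     # N / DF(W)
    return tf * math.log10(idf)

# "i" occurs once in document 1 and in 2 of the 3 documents:
print(round(tfidf("i", corpus[0], corpus), 3))      # → 0.176  (1 * log10(3/2))
print(round(tfidf("also", corpus[1], corpus), 3))   # → 0.477  (1 * log10(3/1))
print(tfidf("health", corpus[0], corpus))           # → 0.0    (word absent)
```

Words that occur in fewer documents get a larger IDF, and hence a larger TFIDF value, which is exactly how the measure flags words that are important to a particular document.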

