Page 298 - AI Ver 1.0 Class 10
Now, let us calculate the TFIDF(W) for the above example:
I           like        oranges     also        bananas     and         are         good        for         health
1*log(3/2)  1*log(3/2)  1*log(3/2)  0*log(3)    0*log(3/2)  0*log(3)    0*log(3)    0*log(3)    0*log(3)    0*log(3)
1*log(3/2)  1*log(3/2)  0*log(3/2)  1*log(3)    1*log(3/2)  0*log(3)    0*log(3)    0*log(3)    0*log(3)    0*log(3)
0*log(3/2)  0*log(3/2)  1*log(3/2)  0*log(3)    1*log(3/2)  1*log(3)    1*log(3)    1*log(3)    1*log(3)    1*log(3)
After calculating all the values, we get:
I        like     oranges  also     bananas  and      are      good     for      health
0.176    0.176    0.176    0        0        0        0        0        0        0
0.176    0.176    0        0.477    0.176    0        0        0        0        0
0        0        0.176    0        0.176    0.477    0.477    0.477    0.477    0.477
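The table above can be reproduced with a short Python sketch. The three word lists below are an assumption: they are inferred from the term frequencies in the table, not copied from the original example. The function name tfidf_table is hypothetical, and the logarithm is taken to base 10 because that matches the values 0.176 and 0.477.

```python
import math

# Vocabulary in the column order used by the table.
vocabulary = ["I", "like", "oranges", "also", "bananas",
              "and", "are", "good", "for", "health"]

# Assumed documents: word membership is inferred from the table's term frequencies.
documents = [
    ["I", "like", "oranges"],
    ["I", "also", "like", "bananas"],
    ["oranges", "and", "bananas", "are", "good", "for", "health"],
]

def tfidf_table(documents, vocabulary):
    """Return one row of TFIDF values per document (hypothetical helper)."""
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each word.
    doc_freq = {w: sum(1 for d in documents if w in d) for w in vocabulary}
    table = []
    for doc in documents:
        row = []
        for w in vocabulary:
            tf = doc.count(w)                       # term frequency in this document
            idf = math.log10(n_docs / doc_freq[w])  # inverse document frequency
            row.append(round(tf * idf, 3))
        table.append(row)
    return table

table = tfidf_table(documents, vocabulary)
for row in table:
    print(row)
```

Each printed row corresponds to one document, in the same order as the rows of the table.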
Finally, the words have been converted to numbers. Each number is the TFIDF value of a word for one document.
From the above process, we can conclude that:
• Stopwords have high term frequencies in all the documents, but because they occur in every document they get
the lowest TFIDF values.
• For a word to have a high TFIDF value, its term frequency should be high but its document frequency should be
low, i.e., the word is important for one document but is not common across the other documents in the
corpus.
• These numeric values indicate which words need to be considered during NLP processing. A higher value means
the word is more important within the corpus.
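The first two points can be checked with a minimal sketch. It assumes a corpus of three documents and the base-10 logarithm used in the example above; the function name idf is hypothetical. It prints the inverse document frequency for a word that appears in one, two, or all three documents.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency with a base-10 logarithm (hypothetical helper)."""
    return math.log10(n_docs / doc_freq)

n_docs = 3  # assumed corpus size, as in the example above

# IDF for a word found in 1, 2 or all 3 documents.
scores = {df: round(idf(n_docs, df), 3) for df in range(1, n_docs + 1)}
print(scores)
```

A word present in every document gets an IDF of log(3/3) = 0, so its TFIDF value is 0 no matter how often it occurs. This is why stopwords end up with the lowest values.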
Applications of TFIDF
Some of the important applications of TFIDF are:
• Document Classification: It helps in classifying the documents scattered across the internet based on their
types, genres, etc.
• Topic Modelling: It helps in predicting the topic of the corpus.
• Information Retrieval System: It searches the corpus and retrieves the information that is most relevant to
the search query.
• Stop Word Filtering: It helps in removing the stop words from the documents in the corpus so that data
retrieval and processing can focus on the words that are important.
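As an illustration of the information retrieval use, the sketch below ranks documents for a query by adding up the TFIDF values of the query words. The corpus, the document names and the functions idf, score and rank are all assumptions made for this example. A real retrieval system would use a more refined scoring scheme.

```python
import math

# Assumed toy corpus: document name -> list of lowercase words.
documents = {
    "doc1": ["i", "like", "oranges"],
    "doc2": ["i", "also", "like", "bananas"],
    "doc3": ["oranges", "and", "bananas", "are", "good", "for", "health"],
}

def idf(word):
    """Base-10 inverse document frequency; 0 for words not in the corpus."""
    df = sum(1 for words in documents.values() if word in words)
    return math.log10(len(documents) / df) if df else 0.0

def score(query_words, words):
    """Sum of TFIDF values of the query words in one document."""
    return sum(words.count(w) * idf(w) for w in query_words)

def rank(query):
    """Return (document name, score) pairs, highest score first."""
    query_words = query.lower().split()
    scored = [(name, round(score(query_words, words), 3))
              for name, words in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank("good health")
print(ranking)
```

Only the third document contains the query words, so it is placed first in the ranking.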
NLTK
The Natural Language Toolkit (NLTK) is one of the most commonly used open-source NLP toolkits. It is made
up of Python libraries and is used for building programs for symbolic and statistical natural language
processing. Its text processing libraries support tokenization, parsing, classification, stemming, tagging
and semantic reasoning.
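Two of these steps, tokenization and stemming, can be tried with a minimal sketch. It assumes that NLTK is installed (for example with pip install nltk). The sample sentence is made up for illustration. The tokenizer and stemmer chosen here are rule based, so they should work without downloading additional NLTK data.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

sentence = "Oranges and bananas are good for health."  # sample text (assumption)

# Tokenization: split the sentence into word and punctuation tokens.
tokens = wordpunct_tokenize(sentence)

# Stemming: reduce each alphabetic token to its stem.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens if token.isalpha()]

print(tokens)
print(stems)
```

The stems are not always dictionary words, because the stemmer only strips common word endings.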
296 Touchpad Artificial Intelligence-X

