Page 298 - AI Ver 1.0 Class 10

Now, let us calculate the TFIDF(W) for the above example:

                            I           like        oranges     also        bananas     and         are         good        for         health

              Document 1    1*log(3/2)  1*log(3/2)  1*log(3/2)  0*log(3)    0*log(3/2)  0*log(3)    0*log(3)    0*log(3)    0*log(3)    0*log(3)

              Document 2    1*log(3/2)  1*log(3/2)  0*log(3/2)  1*log(3)    1*log(3/2)  0*log(3)    0*log(3)    0*log(3)    0*log(3)    0*log(3)

              Document 3    0*log(3/2)  0*log(3/2)  1*log(3/2)  0*log(3)    1*log(3/2)  1*log(3)    1*log(3)    1*log(3)    1*log(3)    1*log(3)


              After calculating all the values, we get:

                            I        like     oranges  also     bananas  and      are      good     for      health

              Document 1    0.176    0.176    0.176    0        0        0        0        0        0        0

              Document 2    0.176    0.176    0        0.477    0.176    0        0        0        0        0

              Document 3    0        0        0.176    0        0.176    0.477    0.477    0.477    0.477    0.477

              Finally, the words have been converted to numbers. These numbers are the values of each word for each document.
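
              The calculation above can be reproduced with a few lines of Python. This is a minimal sketch, not a library implementation: the token lists are reconstructed from the term-frequency rows of the table (the original sentences are an assumption), the function name tfidf_table is hypothetical, and the logarithm is taken to base 10, which is what the values 0.176 and 0.477 imply.

```python
import math

# Token lists reconstructed from the term-frequency rows of the table
# (an assumption; the original sentences are not repeated here).
documents = [
    ["i", "like", "oranges"],
    ["i", "like", "also", "bananas"],
    ["oranges", "bananas", "and", "are", "good", "for", "health"],
]

vocabulary = ["i", "like", "oranges", "also", "bananas",
              "and", "are", "good", "for", "health"]


def tfidf_table(docs, vocab):
    """Return one row of TF-IDF values per document (hypothetical helper)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain the word.
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    table = []
    for d in docs:
        row = []
        for w in vocab:
            tf = d.count(w)                     # term frequency in this document
            idf = math.log10(n_docs / df[w])    # inverse document frequency
            row.append(round(tf * idf, 3))
        table.append(row)
    return table


table = tfidf_table(documents, vocabulary)
for row in table:
    print(row)
```

              Each printed row corresponds to one document, with the columns in the same order as the vocabulary list.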

              From the above process, we can conclude the following:
                 • Stopwords have high term frequencies in all the documents, yet they receive the lowest numeric values.
                 • For a word to have a high TFIDF value, its term frequency should be high but its document
                frequency should be low, i.e., the word is important for one document but is not common across
                the other documents in the corpus.
                 • These numeric values indicate which words need to be considered during natural language processing. A higher
                numeric value means the word is more important within the given corpus.
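
              The first point can be checked directly: a word that occurs in every document has a document frequency equal to the number of documents, so its IDF is log(1) = 0 and its TFIDF value is 0 however often it appears. The sketch below uses three made-up documents and a hypothetical helper named idf, with base-10 logarithms as in the example above.

```python
import math


def idf(word, docs):
    """Inverse document frequency of a word (hypothetical helper)."""
    df = sum(1 for d in docs if word in d)   # documents containing the word
    return math.log10(len(docs) / df)


# Made-up documents for illustration only.
docs = [
    ["the", "sky", "is", "blue"],
    ["the", "sun", "is", "bright"],
    ["the", "cricket", "match"],
]

idf_the = idf("the", docs)          # appears in all 3 documents
idf_is = idf("is", docs)            # appears in 2 of 3 documents
idf_cricket = idf("cricket", docs)  # appears in 1 of 3 documents

print(idf_the, idf_is, idf_cricket)
```

              The rarer the word is across the corpus, the larger its IDF, which is why "cricket" scores higher than "is", and "the" scores zero.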


              Applications of TFIDF

              Some of the important applications of TFIDF are:
                 • Document Classification: It helps in classifying the documents scattered across the internet based on their
                type, genre, etc.
                 • Topic Modelling: It helps in predicting the topic of the corpus.
                 • Information Retrieval System: It searches the corpus and retrieves the information that is most relevant
                to the search query.
                 • Stop Word Filtering: It helps in removing the stop words from the documents in the corpus so that data
                retrieval and processing can focus on the words that are important.
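
              As an illustration of the information retrieval use, documents can be ranked for a query by adding up the TFIDF values of the query words in each document. This is a simplified sketch under stated assumptions: the function names tfidf and rank are hypothetical, the three documents are made up, and real search systems use more refined scoring.

```python
import math


def tfidf(word, doc, docs):
    """TF-IDF of a word in one document (hypothetical helper, base-10 log)."""
    tf = doc.count(word)
    df = sum(1 for d in docs if word in d)
    if df == 0:
        return 0.0   # the word does not occur anywhere in the corpus
    return tf * math.log10(len(docs) / df)


def rank(query, docs):
    """Return document indices ordered from most to least relevant."""
    scores = [sum(tfidf(w, d, docs) for w in query) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Made-up corpus for illustration only.
docs = [
    ["i", "like", "oranges"],
    ["i", "also", "like", "bananas"],
    ["oranges", "and", "bananas", "are", "good", "for", "health"],
]

ranking = rank(["good", "health"], docs)
print(ranking)
```

              The document containing both query words comes first in the ranking because it is the only one with a non-zero score.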


              NLTK

              The Natural Language Toolkit (NLTK) is one of the most commonly used open-source NLP toolkits. It is made
              up of Python libraries and is used for building programs that help in the synthesis and statistical analysis of human
              language. Its text processing libraries support tokenization, parsing, classification,
              stemming, tagging and semantic reasoning.
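
              Two of these steps, tokenization and stemming, can be tried with a short sketch. It assumes NLTK is installed (for example with pip install nltk) and uses wordpunct_tokenize and PorterStemmer, which do not need any extra data downloads; the sample sentence is made up.

```python
from nltk.tokenize import wordpunct_tokenize
from nltk.stem import PorterStemmer

sentence = "I like oranges and bananas."   # made-up sample sentence

# Tokenization: split the sentence into words and punctuation marks.
tokens = wordpunct_tokenize(sentence)

# Stemming: reduce each token to its root form.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

print(tokens)
print(stems)
```

              Stemming does not always return a dictionary word; it only strips common endings so that related forms of a word are counted together.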



                        296   Touchpad Artificial Intelligence-X