Page 388 - AI Ver 3.0 class 10_Flipbook

After calculating all the values, we get:

               Document   preeti   said     will     go       to       the      market   pooja    join     her      sonia    ajay     not
               Doc-1      0.9542   0.4771   0        0.1761   0.4771   0.4771   0.4771   0        0        0        0        0        0
               Doc-2      0        0        0        0        0        0        0        0.4771   0.4771   0.4771   0        0        0
               Doc-3      0        0        0        0.1761   0        0        0        0        0        0        0.4771   0.4771   0.4771
               Finally, the words have been converted to numbers. These numbers are the TF-IDF values of each word for each document.
               Hence, at the end of the above process, we observe:
                  • Stopwords generally have high term frequencies in all documents but tend to have low TF-IDF values.
                  • To achieve a high TF-IDF value, the term frequency (TF) should be high but the document frequency (DF) should
                 be low. In other words, a word may be important for one document without being common across the other
                 documents in the corpus.
                  • These numeric values tell us which words to focus on during NLP processing. A higher TF-IDF value indicates
                 that a word is more significant for distinguishing a document within a given corpus.
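The values in the table above can be reproduced with a short pure-Python sketch. The token lists below are an assumption reconstructed from the table (the chapter's original sentences are not shown here), and the weighting used is raw term count × log10(N / DF), which matches the printed values.

```python
import math

# Assumed token lists, reconstructed to be consistent with the table above
corpus = [
    ["preeti", "said", "preeti", "will", "go", "to", "the", "market"],  # Doc-1
    ["pooja", "join", "her"],                                           # Doc-2
    ["sonia", "ajay", "will", "not"],                                   # Doc-3
]

def tf_idf(corpus):
    """TF-IDF(word, doc) = count of word in doc x log10(N / DF(word))."""
    n = len(corpus)
    vocab = {w for doc in corpus for w in doc}
    # Document frequency: in how many documents does each word occur?
    df = {w: sum(w in doc for doc in corpus) for w in vocab}
    return [
        {w: round(doc.count(w) * math.log10(n / df[w]), 4) for w in vocab}
        for doc in corpus
    ]

weights = tf_idf(corpus)
print(weights[0]["preeti"])  # 0.9542 (appears twice, only in Doc-1)
print(weights[0]["will"])    # 0.1761 (appears in two of the three documents)
```

Note that "will" gets a lower weight than "preeti" even though both occur in Doc-1, because "will" also appears in Doc-3, raising its document frequency.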

               Applications of TF-IDF

               Some of the important applications of TF-IDF are:
                  • Document Classification: It helps in categorising documents based on their content, such as topic, genre or
                 subject matter.
                  • Topic Modelling: It helps in identifying the topics present in a corpus.
                  • Information Retrieval System: It searches the corpus and retrieves the documents most relevant to a given
                 query.
                  • Stop Word Filtering: It helps in removing stop words from the documents in the corpus so that retrieval and
                 processing can focus on the words that actually carry information.
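As a sketch of the information-retrieval application, documents can be ranked for a query by summing the TF-IDF weights of the query terms in each document. The corpus and query below are hypothetical examples, not taken from the chapter.

```python
import math

# Hypothetical toy corpus (lists of tokens)
corpus = [
    ["stock", "market", "rises", "today"],
    ["market", "opens", "in", "the", "city"],
    ["heavy", "rain", "in", "the", "city", "today"],
]

def tfidf_score(doc, term, corpus):
    # TF-IDF weight of a single term in a single document
    df = sum(term in d for d in corpus)
    if df == 0:
        return 0.0
    return doc.count(term) * math.log10(len(corpus) / df)

def rank(query, corpus):
    # Score each document by the sum of TF-IDF weights of the query terms,
    # then return document indices from most to least relevant
    scores = [sum(tfidf_score(doc, t, corpus) for t in query) for doc in corpus]
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)

print(rank(["stock", "market"], corpus))  # [0, 1, 2]
```

Document 0 ranks first because it contains the rare term "stock" as well as "market"; document 2 ranks last because it contains neither query term.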


                             Brainy Fact

                     NLTK was first released in 2001 by Steven Bird and Edward Loper. It is one of the oldest and most well-known
                     Python libraries for processing natural language.

              Natural Language Toolkit (NLTK)


               The Natural Language Toolkit (NLTK) is one of the most commonly used open-source NLP toolkits. It is made
               up of Python libraries and is used for building programs that perform symbolic and statistical analysis of human
               language. Its text processing libraries handle tasks such as tokenization, parsing, classification, stemming,
               tagging and semantic reasoning.
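To make two of these pipeline steps concrete, here is a deliberately simplified pure-Python sketch of tokenization and stemming. It is an illustration only, not NLTK code; in practice NLTK provides `nltk.word_tokenize` and `nltk.stem.PorterStemmer` for these steps, which require the library and its data files to be installed.

```python
import re

def tokenize(text):
    # Lowercase the text and pull out alphabetic word runs
    # (NLTK's word_tokenize is far more sophisticated than this)
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(word):
    # Toy suffix-stripping stand-in for a real stemmer such as PorterStemmer
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

tokens = tokenize("Preeti is going to the market.")
stems = [crude_stem(t) for t in tokens]
print(tokens)  # ['preeti', 'is', 'going', 'to', 'the', 'market']
print(stems)   # ['preeti', 'is', 'go', 'to', 'the', 'market']
```

The length check stops the stemmer from mangling very short words like "is"; a production stemmer applies many more rules than this.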
              Some important NLP tools are:
                  • spaCy: spaCy is a free, open-source library for Natural Language Processing (NLP) in Python. It offers fast
                processing, pre-trained models, and deep learning integration, making it a top choice for text analysis, chatbots,
                and AI applications. spaCy provides advanced capabilities to perform NLP on large volumes of text with high
                speed and efficiency.
                 • Gensim: Gensim is an open-source NLP library designed for topic modelling and document similarity analysis. It
                is highly efficient for processing large-scale text data and is widely used in machine learning and NLP applications.
                 • No-code: No-code tools make it easier for businesses and individuals to leverage AI without programming skills
                by offering in-built models and user-friendly interfaces.
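Document similarity of the kind Gensim is used for is commonly measured as the cosine similarity between two documents' TF-IDF vectors. The sketch below uses hand-written vectors over a shared vocabulary; the values are hypothetical and this is not Gensim's API.

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|); defined as 0.0 if either vector is all zeros
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical TF-IDF vectors over a shared three-word vocabulary
doc_a = [0.4771, 0.1761, 0.0]
doc_b = [0.4771, 0.0,    0.0]
doc_c = [0.0,    0.0,    0.4771]

print(cosine_similarity(doc_a, doc_b))  # close to 1: the documents share a heavily weighted term
print(cosine_similarity(doc_a, doc_c))  # 0.0: no terms in common
```

A similarity of 1 means the vectors point in the same direction (same relative word weights), while 0 means the documents share no weighted terms at all.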
