
These words are valuable because they provide context about the document’s subject.
Now, imagine we have a collection of 10 documents, each discussing a different topic. One might be about renewable energy, another about artificial intelligence, and so on. Do you think “climate” and “change” would still be the most frequent words across the entire corpus? Likely not. Instead, words like and, is, or the would appear most frequently. These high-frequency words do not provide meaningful information about the corpus’s content. While they are essential for humans to comprehend sentences, they are irrelevant for a machine, as they do not contribute to understanding the topics within the corpus. Such words are known as stop words and are usually removed during the pre-processing stage.
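To see this in practice, here is a minimal Python sketch of stop-word removal during pre-processing. The stop-word list used is a small illustrative sample (an assumption for this example); libraries such as NLTK provide much fuller lists.

```python
# A minimal sketch of stop-word removal during pre-processing.
# STOP_WORDS is a small illustrative sample, not a complete list.
STOP_WORDS = {"and", "is", "or", "the", "a", "an", "to", "of", "in"}

def remove_stop_words(text):
    words = text.lower().split()
    return [w for w in words if w not in STOP_WORDS]

print(remove_stop_words("Climate change is the defining issue of the era"))
# ['climate', 'change', 'defining', 'issue', 'era']
```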
The text across all the documents in the corpus can be represented on the following graph:

[Graph: occurrence of words (y-axis) plotted against their value (x-axis). Stop words have the highest occurrence and the least value; frequent words have moderate occurrence and some value; rare/valuable words have the lowest occurrence and the highest value.]
              This graph shows the frequency of words against their significance. Words that appear most frequently across all
              the documents in a corpus hold minimal value and are classified as stop words. These are typically removed during
              the pre-processing stage.
              As we move past the stop words, the frequency of occurrence drops sharply, and words that occur moderately in
              the corpus are considered to have some value. These are referred to as frequent words and generally relate to the
              document’s subject.
              Further down the frequency scale, the occurrence of words decreases, but their significance increases. These are
              known as rare or valuable words. Though they appear the least, they contribute the most meaning to the corpus.
              Therefore, when analysing text, we focus on both frequent and rare words.
TFIDF stands for Term Frequency–Inverse Document Frequency. This method is considered better than the Bag of Words algorithm because, while Bag of Words only gives a numeric vector of word counts for each document, TFIDF gives a numeric value that reflects the importance of each word in the document.
TFIDF was introduced as a statistical measure of the important words in a document. Each word in a document is given a numeric weight based on its occurrence within that document and across the entire corpus.
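As a sketch, this weight can be computed as below, assuming the common formulation TFIDF(W) = TF(W) × log(N / df(W)) with a base-10 logarithm; the exact formula and log base can vary between texts.

```python
import math

# TFIDF weight, assuming TFIDF(W) = TF(W) * log10(N / df(W)), where:
#   tf     - how many times word W occurs in one document
#   n_docs - total number of documents in the corpus (N)
#   df     - number of documents that contain W
def tfidf(tf, n_docs, df):
    return tf * math.log10(n_docs / df)

print(tfidf(tf=1, n_docs=3, df=3))  # 0.0   (word present in every document)
print(tfidf(tf=2, n_docs=3, df=1))  # 0.954 (word unique to one document)
```

Notice that a word appearing in every document gets a weight of zero, which is exactly how stop words lose their importance under TFIDF.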
Let’s continue with the above example. The corpus with 3 documents is given as follows:
                 • Document 1: Preeti said, “Preeti will go to the market.”
                 • Document 2: Pooja will join her.
                 • Document 3: Sonia & Ajay will not go.

Note: As the corpus is very small, we are not removing the stop words.
After normalising the text and creating the dictionary of words, we can now create a Term-Frequency table.
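As a sketch, the normalisation, dictionary and Term-Frequency table for this corpus could be built in Python as follows (the normalisation steps assumed here are lowercasing and stripping punctuation):

```python
import string

docs = [
    'Preeti said, "Preeti will go to the market."',
    "Pooja will join her.",
    "Sonia & Ajay will not go.",
]

def normalise(text):
    # Lowercase and drop punctuation such as , . " &
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# Dictionary: all unique words in the corpus
vocabulary = sorted({w for d in docs for w in normalise(d)})

# Term-Frequency table: one row per document, one column per word
for i, d in enumerate(docs, start=1):
    words = normalise(d)
    print(f"Document {i}:", {w: words.count(w) for w in vocabulary})
```

The row for Document 1 will show preeti with a count of 2, since the word occurs twice in that document.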

