Page 248 - Ai_C10_Flipbook

These words are valuable because they provide context about the document’s subject.
              Now, imagine we have a collection of 10 documents, each discussing
              a different topic. One might be about renewable energy, another
              about artificial intelligence, and so on. Do you think "climate" and
              "change" would still be the most frequent words across the entire
              corpus? Likely not. Instead, words like and, is, or the would appear
              most frequently.
              These high-frequency words, such as and, is, the, etc., do not
              provide meaningful information about the corpus’s content. While
              these words are essential for humans to comprehend sentences,
              they are irrelevant for a machine as they do not contribute to
              understanding the topics within the corpus. These are known as
              stop words and are usually removed during the pre-processing
              stage.

              [Figure: Occurrence plotted against Value, with regions labelled Stop words, Frequent words, and Rare/Valuable words]

              The text across all the documents in a corpus can be represented by a graph of word occurrence against value.
              This graph shows the frequency of words against their significance. Words that appear most frequently across all
              the documents in a corpus hold minimal value and are classified as stop words. These are typically removed during
              the pre-processing stage.
              As we move past the stop words, the frequency of occurrence drops sharply, and words that occur moderately in
              the corpus are considered to have some value. These are referred to as frequent words and generally relate to the
              document’s subject.
              Further down the frequency scale, the occurrence of words decreases, but their significance increases. These are
              known as rare or valuable words. Though they appear the least, they contribute the most meaning to the corpus.
              Therefore, when analysing text, we focus on both frequent and rare words.
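              The claim that stop words dominate a mixed corpus can be checked with a short word count. The sketch
              below is only an illustration: the three sentences and the small stop word list are invented for this
              example and are not a standard list.

```python
import re
from collections import Counter

# Hypothetical mini-corpus; the sentences are invented for illustration.
corpus = [
    "Climate change is the biggest challenge and the climate is warming.",
    "Artificial intelligence is changing the way we work and learn.",
    "Renewable energy is clean and the sun is a source of energy.",
]

# A small hand-picked stop word list (an assumption, not a standard list).
STOP_WORDS = {"and", "is", "the", "a", "of", "we"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


# Count every word across the whole corpus.
all_tokens = [token for doc in corpus for token in tokenize(doc)]
counts = Counter(all_tokens)

# Count again after removing the stop words.
filtered = Counter(token for token in all_tokens if token not in STOP_WORDS)

print(counts.most_common(3))  # the three most frequent words overall
print(filtered["climate"], filtered["energy"])  # topic words that remain
```

              In this small corpus the three most frequent words are is, the and and. Once they are removed, topic
              words such as climate and energy are what remain.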
              TFIDF stands for Term Frequency and Inverse Document Frequency. This method is considered better than the
              Bag of Words algorithm because Bag of Words gives only the count of each word in a document, whereas TFIDF,
              through its numeric value, gives the importance of each word in the document.
              TFIDF was introduced as a statistical measure of important words in a document. Each word in a document is given
              a numeric weight based on its occurrence within that document and across the entire corpus.
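              A minimal sketch of such a weighting is given below. The text has not stated a formula at this point,
              so the sketch assumes one common formulation: weight = term frequency x log10(total documents /
              number of documents containing the word). The function name tf_idf and the tiny corpus are hypothetical
              and used only for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dictionary per document.

    Assumed formula (one common variant, not taken from the text):
        weight = tf * log10(N / df)
    where tf is the count of the word in the document, N is the number
    of documents and df is the number of documents containing the word.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({w: tf[w] * math.log10(n_docs / df[w]) for w in tf})
    return weights


# Hypothetical, already normalised documents (lists of lowercase words).
docs = [
    ["climate", "change", "is", "real"],
    ["energy", "is", "renewable"],
    ["climate", "policy", "is", "new"],
]

weights = tf_idf(docs)
for number, doc_weights in enumerate(weights, start=1):
    print(number, {w: round(v, 3) for w, v in doc_weights.items()})
```

              Under this assumed formula, a word that occurs in every document (here "is") gets a weight of zero,
              while a word found in only one document gets the highest weight.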
              Let’s continue with the above example.
              The corpus with 3 documents is given as follows:
                 • Document 1: Preeti said, “Preeti will go to the market.”
                 • Document 2: Pooja will join her.
                 • Document 3: Sonia & Ajay will not go.
              Note: As the corpus is very small, we are not removing the stop words.
              After normalising the text and creating the dictionary of words, we create a term frequency table.
              Term Frequency

              Term frequency is the frequency of a word in one document. It can easily be read from the document
              vector table, as in the above example:

               Document  preeti  said  will  go  to  the  market  pooja  join  her  sonia  ajay  not
               Doc-1     2       1     1     1   1   1    1       0      0     0    0      0     0
               Doc-2     0       0     1     0   0   0    0       1      1     1    0      0     0
               Doc-3     0       0     1     1   0   0    0       0      0     0    1      1     1
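
              The table above can be reproduced with a few lines of Python. This is a minimal sketch: the
              normalisation step (lowercasing and keeping only alphabetic words) and the names normalise,
              vocabulary and tf_table are assumptions made for illustration.

```python
import re
from collections import Counter

# The three documents of the corpus.
documents = {
    "Doc-1": 'Preeti said, "Preeti will go to the market."',
    "Doc-2": "Pooja will join her.",
    "Doc-3": "Sonia & Ajay will not go.",
}


def normalise(text):
    """Lowercase the text and keep only alphabetic words (assumed normalisation)."""
    return re.findall(r"[a-z]+", text.lower())


tokens = {name: normalise(text) for name, text in documents.items()}

# Dictionary of words, in order of first appearance in the corpus.
vocabulary = []
for words in tokens.values():
    for word in words:
        if word not in vocabulary:
            vocabulary.append(word)

# Term frequency: how often each dictionary word occurs in each document.
tf_table = {}
for name, words in tokens.items():
    word_counts = Counter(words)
    tf_table[name] = [word_counts[word] for word in vocabulary]

print(vocabulary)
for name, row in tf_table.items():
    print(name, row)
```

              Each printed row corresponds to one row of the term frequency table, with the columns in the order of
              the dictionary of words.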


                    246     Artificial Intelligence Play (Ver 1.0)-X