These words are valuable because they provide context about the document’s subject.
Now, imagine we have a collection of 10 documents, each discussing a different topic. One might be about renewable
energy, another about artificial intelligence, and so on. Do you think "climate" and "change" would still be the most
frequent words across the entire corpus? Likely not. Instead, words like and, is, or the would appear most frequently.
These high-frequency words, such as and, is, the, etc., do not provide meaningful information about the corpus’s
content. While these words are essential for humans to comprehend sentences, they are irrelevant for a machine
as they do not contribute to understanding the topics within the corpus. These are known as stop words and are
usually removed during the pre-processing stage.
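To make this idea concrete, here is a minimal Python sketch of stop word removal during pre-processing. The small stop word list and the sample sentence are illustrative assumptions, not taken from the textbook; real projects usually rely on a larger, standard stop word list.

```python
# A minimal sketch of stop word removal (illustrative stop word list, not exhaustive)
stop_words = {"and", "is", "or", "the", "a", "an", "to", "of"}

sentence = "the climate is changing and the impact is global"   # sample text (assumed)

# Normalise to lowercase, split into tokens, and drop the stop words
tokens = sentence.lower().split()
filtered = [word for word in tokens if word not in stop_words]

print(filtered)   # ['climate', 'changing', 'impact', 'global']
```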
The graph representing the text across all the documents in the corpus is shown below:
[Graph: word occurrence (y-axis) plotted against word value (x-axis), with three labelled regions: stop words, frequent words, and rare/valuable words]
This graph shows the frequency of words against their significance. Words that appear most frequently across all
the documents in a corpus hold minimal value and are classified as stop words. These are typically removed during
the pre-processing stage.
As we move past the stop words, the frequency of occurrence drops sharply, and words that occur moderately in
the corpus are considered to have some value. These are referred to as frequent words and generally relate to the
document’s subject.
Further down the frequency scale, the occurrence of words decreases, but their significance increases. These are
known as rare or valuable words. Though they appear the least, they contribute the most meaning to the corpus.
Therefore, when analysing text, we focus on both frequent and rare words.
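To see this frequency-versus-value pattern in practice, we can simply count how often each word appears across a corpus. The tiny corpus below is an illustrative assumption; in a real corpus, the words at the top of such a count are almost always stop words.

```python
from collections import Counter

# Illustrative mini-corpus (assumed): three short documents on related topics
corpus = [
    "the sun is the main source of renewable energy",
    "solar panels convert the energy of the sun",
    "wind is another source of renewable energy",
]

# Count occurrences of every word across all documents
counts = Counter(word for doc in corpus for word in doc.lower().split())

# Words sorted from most to least frequent: stop words such as 'the' and 'of'
# rise to the top, while topical words like 'solar' and 'wind' occur only once
for word, count in counts.most_common():
    print(word, count)
```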
TFIDF stands for Term Frequency and Inverse Document Frequency. This method is considered better than the Bag of Words algorithm because Bag of Words only gives the frequency count of each word in a document as a numeric vector, whereas TFIDF assigns each word a numeric value that reflects its importance in the document.
TFIDF was introduced as a statistical measure of important words in a document. Each word in a document is given
a numeric weight based on its occurrence within that document and across the entire corpus.
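As a sketch of how such a weight can be computed, the Python snippet below uses one common definition: term frequency as the count of a word in a document, and inverse document frequency as the logarithm of the total number of documents divided by the number of documents containing the word. The exact formula the textbook uses may differ slightly; this is only to build intuition, using the example corpus introduced below.

```python
import math

# Example corpus after simple normalisation (assumed tokenisation)
documents = [
    ["preeti", "said", "preeti", "will", "go", "to", "the", "market"],
    ["pooja", "will", "join", "her"],
    ["sonia", "ajay", "will", "not", "go"],
]

def tfidf(word, doc, docs):
    # Term frequency: how often the word occurs in this document
    tf = doc.count(word)
    # Document frequency: in how many documents the word appears
    df = sum(1 for d in docs if word in d)
    # Inverse document frequency: rarer words get a higher weight
    idf = math.log(len(docs) / df)
    return tf * idf

print(tfidf("preeti", documents[0], documents))  # high weight: frequent here, absent elsewhere
print(tfidf("will", documents[0], documents))    # weight 0: appears in every document
```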
Let’s continue with the above example. The corpus with 3 documents is given as follows:
• Document 1: Preeti said, “Preeti will go to the market.”
• Document 2: Pooja will join her.
• Document 3: Sonia & Ajay will not go.
Note: As the corpus is very small, we are not removing the stop words.
After normalising the text and creating the dictionary of words, the next step is to create a Term Frequency table.
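A term frequency table can also be built in code. The sketch below counts each dictionary word in every document of the example corpus; the tokenisation shown (lowercasing and stripping punctuation) is an assumption about how the text is normalised.

```python
import string

docs = [
    'Preeti said, "Preeti will go to the market."',
    "Pooja will join her.",
    "Sonia & Ajay will not go.",
]

# Normalise: lowercase, remove punctuation, split into words
tokenised = []
for doc in docs:
    cleaned = doc.lower().translate(str.maketrans("", "", string.punctuation))
    tokenised.append(cleaned.split())

# Dictionary of unique words across the corpus
dictionary = sorted(set(word for doc in tokenised for word in doc))

# Term frequency table: one row per dictionary word, one column per document
print("word".ljust(8), *(f"Doc{i + 1}" for i in range(len(docs))))
for word in dictionary:
    counts = [doc.count(word) for doc in tokenised]
    print(word.ljust(8), *counts)
```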