These words are valuable because they provide context about the document’s subject.
Now, imagine we have a collection of 10 documents, each discussing a different topic. One might be about renewable energy, another about artificial intelligence, and so on. Do you think "climate" and "change" would still be the most frequent words across the entire corpus? Likely not. Instead, words like and, is, or the would appear most frequently.
These high-frequency words, such as and, is, the, etc., do not provide meaningful information about the corpus’s content. While these words are essential for humans to comprehend sentences, they are irrelevant for a machine as they do not contribute to understanding the topics within the corpus. These are known as stop words and are usually removed during the pre-processing stage.
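To make the idea concrete, here is a minimal Python sketch of stop-word removal during pre-processing. The stop-word list is a small hand-picked one chosen only for illustration, not the full list an NLP library would provide.

```python
# A minimal sketch of stop-word removal during pre-processing.
# The stop-word list below is a tiny, hand-picked illustration.
STOP_WORDS = {"and", "is", "or", "the", "a", "an", "to", "of", "in"}

def remove_stop_words(text):
    # Lower-case the text, split it into words, and drop the stop words.
    words = text.lower().split()
    return [w for w in words if w not in STOP_WORDS]

print(remove_stop_words("Climate change is the defining issue of the decade"))
# ['climate', 'change', 'defining', 'issue', 'decade']
```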
The graph representing the text across all documents in the corpus is shown below:
[Graph: occurrence of words plotted against their value, with the most frequent words labelled as stop words, moderately frequent words as frequent words, and the least frequent words as rare/valuable words]
This graph shows the frequency of words against their significance. Words that appear most frequently across all
the documents in a corpus hold minimal value and are classified as stop words. These are typically removed during
the pre-processing stage.
As we move past the stop words, the frequency of occurrence drops sharply, and words that occur moderately in
the corpus are considered to have some value. These are referred to as frequent words and generally relate to the
document’s subject.
Further down the frequency scale, the occurrence of words decreases, but their significance increases. These are
known as rare or valuable words. Though they appear the least, they contribute the most meaning to the corpus.
Therefore, when analysing text, we focus on both frequent and rare words.
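As a small illustration of this pattern, the sketch below counts word occurrences in a tiny made-up two-document corpus; the most frequent words turn out to be stop words, while the topic words appear only once each.

```python
# Counting word occurrences across a tiny, invented two-document corpus.
from collections import Counter

corpus = [
    "the sun is the main source of renewable energy and the wind is another",
    "artificial intelligence is changing the way the world works and learns",
]

counts = Counter(word for doc in corpus for word in doc.lower().split())
print(counts.most_common(5))
# [('the', 5), ('is', 3), ('and', 2), ...]
# Topic words such as 'energy' or 'intelligence' appear only once.
```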
TFIDF stands for Term Frequency and Inverse Document Frequency. This method is considered better than the Bag of Words algorithm because Bag of Words only gives the count of each word in a document, whereas the numeric value produced by TFIDF reflects the importance of each word in the document.
TFIDF was introduced as a statistical measure of how important a word is in a document. Each word in a document is given a numeric weight based on its occurrence within that document and across the entire corpus.
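In its commonly used standard form, this weight is computed as TFIDF(w, d) = TF(w, d) × log(N / DF(w)), where TF(w, d) is the number of times word w occurs in document d, N is the total number of documents in the corpus, and DF(w) is the number of documents that contain w.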
Let’s continue with the above example.
The corpus with 3 documents is given as follows:
• Document 1: Preeti said, “Preeti will go to the market.”
• Document 2: Pooja will join her.
• Document 3: Sonia & Ajay will not go.
Note: as the corpus is very small, we are not removing the stop words.
After normalising the text and creating the dictionary of words, we now create a Term Frequency table.
Term Frequency
Term Frequency is the frequency of a word in one document. The term frequency can easily be read from the document vector table, as shown below for the above example:
Document  preeti  said  will  go  to  the  market  pooja  join  her  sonia  ajay  not
Doc-1          2     1     1   1   1    1       1      0     0    0      0     0    0
Doc-2          0     0     1   0   0    0       0      1     1    1      0     0    0
Doc-3          0     0     1   1   0    0       0      0     0    0      1     1    1
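The same table can be generated programmatically. The following Python sketch (plain standard-library code written only to mirror this worked example) normalises the three documents, builds the dictionary of words in order of first appearance, and prints each document's term-frequency vector.

```python
# Building the term-frequency table for the three-document corpus above.
from collections import Counter
import re

documents = [
    'Preeti said, "Preeti will go to the market."',  # Doc-1
    "Pooja will join her.",                          # Doc-2
    "Sonia & Ajay will not go.",                     # Doc-3
]

def normalise(text):
    # Lower-case the text and keep only alphabetic words (punctuation dropped).
    return re.findall(r"[a-z]+", text.lower())

# Dictionary of words, in the order they first appear in the corpus.
vocabulary = []
for doc in documents:
    for word in normalise(doc):
        if word not in vocabulary:
            vocabulary.append(word)

# Term-frequency vector for each document.
for i, doc in enumerate(documents, start=1):
    counts = Counter(normalise(doc))
    print(f"Doc-{i}", [counts[word] for word in vocabulary])

# Expected output (columns in the order of `vocabulary`):
# Doc-1 [2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Doc-2 [0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
# Doc-3 [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1]
```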

