Page 246 - Ai_C10_Flipbook
| Aspect | Stemming | Lemmatization |
|---|---|---|
| Output | Produces a truncated form of the word, which may not be a valid word. | Produces a meaningful, valid word (lemma). |
| Approach | Rule-based (simple removal of affixes like -ing, -ed, etc.). | Dictionary or vocabulary-based, requiring morphological analysis. |
| Speed | Faster, as it only involves simple string operations. | Slower, as it involves more computational complexity and context analysis. |
| Accuracy | Less accurate, may produce results that are not meaningful (e.g., “studies” → “studi”). | More accurate, results are meaningful and contextually appropriate. |
| Use Cases | Used when speed is more important than precision, e.g., in search engines. | Used in applications requiring precise understanding of words, e.g., machine translation. |
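The contrast between the two approaches can be sketched in a few lines of Python. This is only a toy illustration: the suffix list `SUFFIXES`, the mini-dictionary `LEMMA_DICT` and the functions `toy_stem` and `toy_lemmatize` are hypothetical names invented for this example, and real stemmers and lemmatizers use far richer rules and vocabularies.

```python
# Toy illustration only: not a real stemming or lemmatization algorithm.

# Hypothetical suffix rules, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix; the result may not be a valid word."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three letters so that very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary mapping inflected forms to their lemma.
LEMMA_DICT = {
    "studies": "study",
    "running": "run",
    "caring": "care",
    "better": "good",
}

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    word = word.lower()
    return LEMMA_DICT.get(word, word)

for w in ["studies", "running", "caring", "better"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
# studies -> stem: studi | lemma: study
# running -> stem: runn | lemma: run
# caring -> stem: car | lemma: care
# better -> stem: better | lemma: good
```

The stemmer is fast because it only slices strings, but it returns forms such as "studi" and "runn". The lemmatizer needs a dictionary lookup, yet every result is a valid word.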
Techniques of Natural Language Processing
Many techniques are used in NLP for extracting information, but the three given below are the most commonly
used:
1. Bag of Words (BoW)
2. Term Frequency and Inverse Document Frequency (TF-IDF)
3. Natural Language Toolkit (NLTK), a Python library of text-processing tools
Let us now study in detail how these techniques can be used for Textual Data Processing in NLP.
Bag of Words
After text normalisation, the corpus becomes a normalised corpus: a collection of meaningful words in no
particular sequence.
Bag of Words is a simple and important technique used in Natural Language Processing for extracting features
from textual data. It converts text sentences into numeric vectors by returning the unique words along with their
number of occurrences.
Bag of Words (BoW)

Text:
Johny Johny, yes Papa
Eating sugar? No, Papa.
Telling lies? No, Papa.
Open your mouth! Ha, ha, ha!

| Word | Frequency |
|---|---|
| eating | 1 |
| ha | 3 |
| johny | 2 |
| lies | 1 |
| mouth | 1 |
| no | 2 |
| open | 1 |
| papa | 3 |
| sugar | 1 |
| telling | 1 |
| yes | 1 |
| your | 1 |
This algorithm is called Bag of Words because it holds the meaningful words (also known as tokens) of a dataset
in no specific order, just like a bag full of scattered words. The Bag of Words algorithm returns:
• A vocabulary of unique words for the corpus.
• The frequency of these words, i.e., the number of occurrences of each word.
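The two outputs listed above can be sketched in plain Python using the rhyme as the corpus. This is a minimal illustration, not a definitive implementation: the `tokenize` and `to_vector` helpers are hypothetical names for this example, and the simple regular expression is only a stand-in for full text normalisation.

```python
import re
from collections import Counter

# The corpus: each line of the rhyme is treated as one sentence.
corpus = [
    "Johny Johny, yes Papa",
    "Eating sugar? No, Papa.",
    "Telling lies? No, Papa.",
    "Open your mouth! Ha, ha, ha!",
]

def tokenize(text):
    """Lowercase the text and keep only runs of letters (drops punctuation)."""
    return re.findall(r"[a-z]+", text.lower())

# Frequency of every word across the whole corpus.
word_counts = Counter(token for line in corpus for token in tokenize(line))

# Vocabulary: the unique words, sorted alphabetically.
vocabulary = sorted(word_counts)

def to_vector(text):
    """Represent one sentence as a list of counts, one entry per vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# One numeric vector per sentence.
vectors = [to_vector(line) for line in corpus]

print(vocabulary)
for word in vocabulary:
    print(word, word_counts[word])
print(vectors[0])  # [0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 1, 0]
```

Each position in a vector corresponds to one vocabulary word, so the first sentence has a 2 in the "johny" position and a 1 in the "papa" and "yes" positions. The order of words within the sentence is lost, which is the defining property of the bag.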
244 Artificial Intelligence Play (Ver 1.0)-X

