Step 3 Removing Stopwords, Special Characters and Numbers
Stopwords are frequently occurring words that do not contribute significant meaning to text analysis but are
necessary for human readability. These words are removed during preprocessing to improve efficiency.
Some examples of stopwords are a, an, the, is, are, was, were, he, she, it, they, we, and, but, or, nor, in, on, at, with,
by, from, to, etc.
At this stage, stopwords, special characters such as #, $, %, @ and !, and numbers (if they are not needed) are
removed from the list of tokens, so that the NLP system can focus on the words that matter for data processing.
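The cleaning described above can be sketched in a few lines of Python. This is only an illustration: the stopword list is the short set of examples given in this step (real lists are much longer), the function name clean_tokens and the keep_numbers option are hypothetical, and the pattern assumes plain English text.

```python
import re

# Illustrative stopword list, copied from the examples in this step.
STOPWORDS = {
    "a", "an", "the", "is", "are", "was", "were", "he", "she", "it",
    "they", "we", "and", "but", "or", "nor", "in", "on", "at", "with",
    "by", "from", "to",
}

def clean_tokens(tokens, keep_numbers=False):
    """Drop stopwords, special characters and (optionally) numbers."""
    cleaned = []
    for token in tokens:
        # Strip every character that is not an English letter or a digit.
        token = re.sub(r"[^A-Za-z0-9]", "", token)
        if not token:
            continue  # the token consisted only of special characters
        if token.isdigit() and not keep_numbers:
            continue  # numbers are dropped unless the task needs them
        if token.lower() in STOPWORDS:
            continue  # stopword check ignores case
        cleaned.append(token)
    return cleaned

tokens = ["We", "bought", "the", "mangoes", "at", "#", "50", "rupees", "!"]
print(clean_tokens(tokens))                     # ['bought', 'mangoes', 'rupees']
print(clean_tokens(tokens, keep_numbers=True))  # ['bought', 'mangoes', '50', 'rupees']
```

Numbers are kept behind an option because, as noted above, they are removed only when they are not needed.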
Step 4 Converting Text to a Common Case
This is an important step because the same word written in different cases should be treated as one token;
otherwise the program becomes case sensitive. We generally convert the whole text to lower case to avoid this
kind of confusion.
ORANGE, Orange, ORANGe, oRANGE, oRaNGE, OrangE  ->  orange
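As a small sketch of this step, Python's built-in str.lower() can be applied to every token. The variable names below are chosen only for illustration.

```python
# Six spellings of the same word that differ only in case.
variants = ["ORANGE", "Orange", "ORANGe", "oRANGE", "oRaNGE", "OrangE"]

# Convert every token to lower case.
lowered = [word.lower() for word in variants]

# After conversion, all six collapse into a single distinct token.
unique_tokens = set(lowered)
print(unique_tokens)  # {'orange'}
```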
Step 5 Stemming
The process of removing affixes from words to reduce them to their root form is called stemming. This
process helps in normalising the text, but its disadvantage is that it strips affixes regardless of
whether the resulting base word is meaningful. Because it does not check meaning, it is a fast process.
Word      Affixes   Stem
flies     -es       fli
flying    -ing      fly
In stemming, the words obtained after removing affixes may not always be meaningful. For example, "jumped,"
"jumping," and "jumper" are all reduced to "jump," whereas "flies" is shortened to "fli," which is not a
meaningful word. Stemming does not consider whether the stemmed word makes sense; it simply removes affixes,
which keeps the process fast.
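The behaviour described above can be imitated with a toy suffix stripper. This is a simplified sketch, not a real stemming algorithm such as the ones shipped in NLP libraries: the suffix list, the function name naive_stem and the minimum stem length are all assumptions made for illustration.

```python
# Suffixes to strip, checked in this order (an assumed, very small list).
SUFFIXES = ("ing", "ed", "er", "es", "s")

def naive_stem(word, min_stem_length=3):
    """Chop off the first matching suffix without checking meaning."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if enough letters remain to form a stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word  # no suffix matched, keep the word unchanged

for w in ["jumped", "jumping", "jumper", "flies", "flying"]:
    print(w, "->", naive_stem(w))
# jumped -> jump
# jumping -> jump
# jumper -> jump
# flies -> fli
# flying -> fly
```

The function never consults a dictionary, which is why it is quick and also why it produces "fli".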
Step 6 Lemmatization
Lemmatization is the process of removing affixes from words to produce a meaningful root word. The word
obtained after removing the affix is called the lemma. Because it always aims to produce a meaningful lemma,
lemmatization takes longer than stemming but gives better results.
Word      Affixes   Lemma
flies     -es       fly
flying    -ing      fly
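One way to picture lemmatization is as a dictionary lookup. The sketch below uses a tiny hand-written table, so it is only an illustration: real lemmatizers rely on a full vocabulary and on the word's part of speech. The names LEMMA_DICTIONARY and lemmatize are hypothetical, and the entry for "flew" is added here to show an irregular form.

```python
# Tiny hand-written lookup table (assumed for illustration only).
LEMMA_DICTIONARY = {
    "flies": "fly",
    "flying": "fly",
    "flew": "fly",  # irregular form that suffix stripping cannot handle
}

def lemmatize(word):
    """Return the dictionary form of a word, or the word itself if unknown."""
    word = word.lower()
    return LEMMA_DICTIONARY.get(word, word)

for w in ["flies", "flying", "flew", "fly"]:
    print(w, "->", lemmatize(w))
# flies -> fly
# flying -> fly
# flew -> fly
# fly -> fly
```

The lookup always returns a real word, but building and searching such a vocabulary is the extra work that makes lemmatization slower.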
Difference between Stemming and Lemmatization
Stemming is simpler and faster but less precise. Lemmatization, on the other hand, is slower and more
computationally intensive but produces contextually meaningful results. Their differences can be summarised
as follows:
Aspect      Stemming                                   Lemmatization
Definition  Reduces a word to its root form by         Reduces a word to its base or dictionary
            chopping off prefixes or suffixes          form (lemma), considering the context
            without considering meaning.               and meaning.

