
Step 3  Removing Stopwords, Special Characters and Numbers

Stopwords are frequently occurring words that make text readable for humans but contribute little meaning to text analysis. They are removed during preprocessing to improve efficiency.
Some examples of stopwords are a, an, the, is, are, was, were, he, she, it, they, we, and, but, or, nor, in, on, at, with, by, from and to.
At this stage, the stopwords, special characters such as #$%@! and numbers (if they are not needed) are removed from the list of tokens, so that the NLP system can focus on the words that are important for data processing.
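
The removal step described above can be sketched in Python. This is only a minimal illustration under stated assumptions: the STOPWORDS set is a small sample built from the examples listed above, and the function name remove_noise is hypothetical rather than part of any library.

```python
import re

# Assumption: a small sample stopword set taken from the examples above;
# real NLP libraries ship much longer lists.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "were", "he", "she", "it",
             "they", "we", "and", "but", "or", "nor", "in", "on", "at",
             "with", "by", "from", "to"}


def remove_noise(tokens):
    """Drop stopwords, numbers and special characters from a token list."""
    cleaned = []
    for token in tokens:
        # Strip every character that is not a letter (digits, #$%@! and so on).
        word = re.sub(r"[^A-Za-z]", "", token)
        # Skip tokens that became empty or that are stopwords.
        if word and word.lower() not in STOPWORDS:
            cleaned.append(word)
    return cleaned


tokens = ["The", "cat", "is", "on", "the", "mat", "#", "42", "!"]
print(remove_noise(tokens))  # ['cat', 'mat']
```

Only "cat" and "mat" survive, because every other token is a stopword, a number or a special character.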

Step 4  Converting Text to a Common Case

This step is important because the same word written in different cases should be treated as a single token; otherwise the program becomes case sensitive. The whole text is generally converted to lower case to avoid this kind of confusion.

ORANGE      Orange      ORANGe      oRANGE      oRaNGE      OrangE
                               ↓
                             orange
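
As a minimal sketch of this step, the case variants shown above can be collapsed into one token with Python's built-in str.lower method. The function name to_common_case is a hypothetical helper used only for illustration.

```python
def to_common_case(tokens):
    """Convert every token to lower case so case variants collapse into one."""
    return [token.lower() for token in tokens]


variants = ["ORANGE", "Orange", "ORANGe", "oRANGE", "oRaNGE", "OrangE"]
lowered = to_common_case(variants)
print(set(lowered))  # {'orange'}
```

All six spellings become the single token "orange".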

Step 5  Stemming

The process of removing the affixes from words to reduce them to their root form is called stemming. It helps in normalising the text, but the disadvantage is that it strips every affix regardless of whether the resulting base word is meaningful. Because it skips that check, it is a fast process.


Word        Affixes     Stem
flies       -es         fli
flying      -ing        fly

In stemming, the stems obtained after removing affixes may not always be meaningful. For example, "jumped", "jumping" and "jumper" are all reduced to "jump", which is a meaningful word, whereas "flies" is shortened to "fli", which is not. Stemming does not consider whether the stemmed word makes sense; it simply removes affixes, making the process faster.
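
The behaviour described above can be sketched with a toy suffix stripper in Python. This is not the Porter stemmer or any library algorithm: the SUFFIXES list and the name toy_stem are assumptions made for illustration, chosen so that the examples in the text can be reproduced.

```python
# Assumption: a tiny, illustrative suffix list; real stemmers such as the
# Porter stemmer use a much larger set of ordered rules.
SUFFIXES = ("ing", "ed", "es", "er", "s")


def toy_stem(word):
    """Strip the first matching suffix without checking the result's meaning."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three letters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["flies", "flying", "jumped", "jumping", "jumper"]:
    print(word, "->", toy_stem(word))
# flies -> fli
# flying -> fly
# jumped -> jump
# jumping -> jump
# jumper -> jump
```

The function never looks the result up anywhere, which is why it is quick and also why it returns "fli".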

Step 6  Lemmatization

Lemmatization also removes the affixes from words, but the result is always a meaningful root word. The word obtained after removing the affix is called the lemma. Because lemmatization always focuses on creating a meaningful lemma, it takes longer than stemming but gives better results.

Word        Affixes     Lemma
flies       -es         fly
flying      -ing        fly
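
A toy version of this idea can be sketched in Python by accepting a candidate root only when it appears in a word list. The VOCABULARY set, the RULES list and the name toy_lemmatize are all assumptions made for illustration; real lemmatizers rely on a full dictionary and usually on the word's part of speech as well.

```python
# Assumption: a tiny hand-written vocabulary standing in for a real dictionary.
VOCABULARY = {"fly", "jump", "study", "run"}

# Candidate rewrite rules: (suffix to remove, text to add in its place).
RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("es", ""), ("s", "")]


def toy_lemmatize(word):
    """Return the first rule result that is a real word in VOCABULARY."""
    word = word.lower()
    if word in VOCABULARY:
        return word
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            # Unlike stemming, accept the candidate only if it is meaningful.
            if candidate in VOCABULARY:
                return candidate
    # No meaningful lemma found: leave the word unchanged.
    return word


for word in ["flies", "flying", "jumped"]:
    print(word, "->", toy_lemmatize(word))
# flies -> fly
# flying -> fly
# jumped -> jump
```

The extra vocabulary check on every candidate is the additional work that makes lemmatization slower than simple suffix stripping.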

Difference between Stemming and Lemmatization

Stemming is simpler and faster but less precise. On the other hand, Lemmatization is slower and more computationally intensive but produces contextually meaningful results. Their differences can be summarised as follows:

Aspect      Stemming                                 Lemmatization
Definition  Reduces a word to its root form by       Reduces a word to its base or dictionary
            chopping off prefixes or suffixes        form (lemma), considering the context
            without considering meaning.             and meaning.
