Page 246 - Ai_C10_Flipbook

              Aspect       Stemming                                          Lemmatization

              Output       Produces a truncated form of the word, which      Produces a meaningful, valid word (lemma).
                           may not be a valid word.
              Approach     Rule-based (simple removal of affixes such as     Dictionary- or vocabulary-based, requiring
                           -ing, -ed, etc.).                                 morphological analysis.
              Speed        Faster, as it involves only simple string         Slower, as it involves more computation
                           operations.                                       and context analysis.
              Accuracy     Less accurate; may produce results that are not   More accurate; results are meaningful and
                           valid words (e.g., “studies” → “studi”).          contextually appropriate.
              Use Cases    Used when speed is more important than            Used in applications requiring precise
                           precision, e.g., in search engines.               understanding of words, e.g., machine translation.
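
              The contrast in the table can be illustrated with a toy sketch. It is not the Porter algorithm or any
              library's lemmatizer: the suffix list SUFFIXES, the mini-dictionary LEMMA_DICT and both function names
              are assumptions made purely for illustration.

```python
# Toy illustration only: real stemmers and lemmatizers use far richer
# rules and dictionaries than the hypothetical ones below.

SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, tiny rule set


def toy_stem(word: str) -> str:
    """Strip the first matching suffix; the result may not be a valid word."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-dictionary mapping inflected forms to their lemmas.
LEMMA_DICT = {"studies": "study", "running": "run", "mice": "mouse"}


def toy_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    word = word.lower()
    return LEMMA_DICT.get(word, word)


for w in ["studies", "running", "mice"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

              The stemmer is only a few string operations but yields forms such as "studi" and "runn", while the
              dictionary lookup returns valid words at the cost of needing a vocabulary.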

                       Techniques of Natural Language Processing


              Many techniques are used in NLP for extracting information, but the three given below are the most
              commonly used:
              1.  Bag of Words (BoW)

              2.  Term Frequency and Inverse Document Frequency (TF-IDF)
              3.  Natural Language Toolkit (NLTK)
              Let us now study in detail how these techniques can be used for Textual Data Processing in NLP.

              Bag of Words

              After text normalisation, the corpus becomes a normalised corpus, which is a collection of meaningful
              words in no particular order.
              Bag of Words is a simple and important technique used in Natural Language Processing for extracting features
              from textual data. It converts text sentences into numeric vectors by returning the unique words along with
              their number of occurrences.
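
              As a minimal sketch of how sentences become numeric vectors, the snippet below builds a vocabulary and
              counts each vocabulary word per sentence. The two sentences, the simple tokenize helper and the
              to_vector function are assumptions for illustration; no text normalisation beyond lowercasing is applied.

```python
import re

# Illustrative sentences (not taken from the textbook).
sentences = ["Papa eats sugar", "Johny eats sugar and more sugar"]


def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Vocabulary: every unique word in the corpus, sorted for a stable column order.
vocabulary = sorted({token for s in sentences for token in tokenize(s)})


def to_vector(sentence):
    """Count how often each vocabulary word occurs in the sentence."""
    tokens = tokenize(sentence)
    return [tokens.count(word) for word in vocabulary]


vectors = [to_vector(s) for s in sentences]

print(vocabulary)  # ['and', 'eats', 'johny', 'more', 'papa', 'sugar']
for vector in vectors:
    print(vector)
# [0, 1, 0, 0, 1, 1]
# [1, 1, 1, 1, 0, 2]
```

              Each position in a vector corresponds to one vocabulary word, so word order within the sentence is lost.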

                                      Bag of Words (BoW)

              Text:
                  Johny Johny, yes Papa
                  Eating sugar? No, Papa.
                  Telling lies? No, Papa.
                  Open your mouth! Ha, ha, ha!

              Word       Count
              eating     1
              ha         3
              johny      2
              lies       1
              mouth      1
              no         2
              open       1
              papa       3
              sugar      1
              telling    1
              yes        1
              your       1

              This algorithm is called Bag of Words because it treats the text as a collection of meaningful words (also
              known as tokens) with no specific order, like words scattered in a bag. The Bag of Words algorithm returns:
                 • A vocabulary of unique words for the corpus.
                 • The frequency of these words, i.e., the number of occurrences of each word.
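
              The two outputs listed above can be sketched with Python's standard library, using the rhyme from the
              figure as the corpus. The bag_of_words and tokenize names are assumptions for illustration, and
              tokenisation here is just lowercasing and dropping punctuation.

```python
import re
from collections import Counter

corpus = [
    "Johny Johny, yes Papa",
    "Eating sugar? No, Papa.",
    "Telling lies? No, Papa.",
    "Open your mouth! Ha, ha, ha!",
]


def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Return the sorted vocabulary and the count of each word in the corpus."""
    counts = Counter()
    for document in documents:
        counts.update(tokenize(document))
    vocabulary = sorted(counts)
    return vocabulary, counts


vocabulary, counts = bag_of_words(corpus)
for word in vocabulary:
    print(f"{word:<8} {counts[word]}")
```

              The printed counts should agree with the word and count table shown for the rhyme, for example 3 for
              "papa" and "ha" and 2 for "johny" and "no".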



                    244     Artificial Intelligence Play (Ver 1.0)-X