Page 295 - AI Ver 1.0 Class 10

Let us now study in detail how these techniques can be used for textual data processing in NLP.


                 Bag of Words

                 After text normalisation, the corpus becomes a normalised corpus: a collection of meaningful words
                 with no particular sequence.

                 Bag of Words is a simple and important technique used in Natural Language Processing for extracting features
                 from textual data. It converts text sentences into numeric vectors by returning the unique words along with
                 their number of occurrences.
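
As a minimal sketch of this idea, the Python snippet below counts the unique words in one short sentence using `collections.Counter` from the standard library. The sample sentence is hypothetical, and the lower-casing and whitespace split are simplifying assumptions that stand in for proper text normalisation.

```python
from collections import Counter

# Hypothetical sample sentence (an assumption, not taken from the text).
sentence = "the test is still a test of the machine"

# Assumed, simplified normalisation: lower-case and split on whitespace.
tokens = sentence.lower().split()

# Counter maps each unique word to its number of occurrences.
bag = Counter(tokens)

# Fixing an order for the unique words turns the counts into a numeric vector.
vocabulary = sorted(bag)
vector = [bag[word] for word in vocabulary]

print(vocabulary)  # ['a', 'is', 'machine', 'of', 'still', 'test', 'the']
print(vector)      # [1, 1, 1, 1, 1, 2, 2]
```

Sorting the vocabulary is only one possible way to fix the order; any order works as long as it is the same for every sentence.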


                 Figure: Bag of words (BoW). A sample passage (input) and some of the word counts returned for it (output).

                 Sample passage:

                 Alan Turing was a brilliant British Mathematician, Biologist and Computer Scientist. His Turing
                 Machine was one of the first, basic computers created. In 1950, Alan Turing published a ground
                 breaking seminal paper “Computing Machinery and Intelligence” on the topic of artificial
                 intelligence. It introduced the concept of what is now known as Turing Test.

                 The test is still a matter of standards today. It establishes that if a computer can have a simple
                 dialogue with a person via a printer, then that itself is a proof that the machine is “thinking”.

                 It was for this work that led him to be regarded as the Father of Theoretical Computer Science and
                 Artificial Intelligence.

                 Word counts:

                 ('Alan', 2)
                 ('Turing', 4)
                 ('the', 6)
                 ('is', 4)
                 ('a', 7)
                 ('Test', 2)
                 ('intelligence', 3)
                 ('of', 5)
                 ('artificial', 2)
                 ('known', 1)
                 ('as', 2)
                 ('and', 3)
                 ('it', 3)
                 ('in', 1)
                 ('ground', 1)
                 ('breaking', 1)
                 ('mathematician', 1)
                 ('machinery', 1)
                 .................



                 This algorithm is called Bag of Words because it treats the text as a collection of meaningful words
                 (also known as tokens) in no specific order, just like a bag full of scattered words. The Bag of
                 Words algorithm returns:
                    • A vocabulary of words for the corpus.

                    • The frequency of these words i.e., the number of occurrences of each word.
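
The two outputs listed above can be sketched in Python as follows. The function name `bag_of_words` and the tiny corpus are hypothetical, and the documents are assumed to be normalised already, so a whitespace split is enough to obtain the tokens.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, frequencies) for a list of normalised documents."""
    frequencies = Counter()
    for document in documents:
        # Add the word counts of this document to the running totals.
        frequencies.update(document.split())
    # The vocabulary is the set of unique words, sorted here for a stable order.
    vocabulary = sorted(frequencies)
    return vocabulary, frequencies


# Hypothetical, already-normalised corpus of three short documents.
corpus = ["turing test", "turing machine", "machine thinking test"]
vocabulary, frequencies = bag_of_words(corpus)

print(vocabulary)             # ['machine', 'test', 'thinking', 'turing']
print(frequencies["turing"])  # 2

# Word order does not matter: both orderings give the same bag.
print(bag_of_words(["turing test"]) == bag_of_words(["test turing"]))  # True
```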

                 The steps involved in the Bag of Words algorithm are:
                    • Text Normalisation: The collection of data is processed to get a normalised corpus.

                    • Create Dictionary: Create a list of all the unique words available in the normalised corpus.
                    • Create Document Vectors: For each document in the corpus, count how many times each word in the
                   dictionary occurs in that document.
                    • Create Document Vectors for all the Documents: Repeat the previous step for all documents in the
                   corpus to create a “Document Vector Table”.
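
The four steps above can be sketched in Python as shown below. This is an illustrative sketch under stated assumptions, not a definitive implementation: the helper names (`normalise`, `create_dictionary`, `document_vector`), the small stopword set and the three sample documents are all hypothetical, and the normalisation step is reduced to lower-casing, keeping alphabetic tokens and dropping stopwords.

```python
import re

# Assumed, very small stopword list used only for this illustration.
STOPWORDS = {"a", "is", "the", "and"}


def normalise(text):
    # Step 1 (simplified): lower-case, keep alphabetic tokens, drop stopwords.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


def create_dictionary(normalised_docs):
    # Step 2: collect the unique words in order of first appearance.
    dictionary = []
    for tokens in normalised_docs:
        for token in tokens:
            if token not in dictionary:
                dictionary.append(token)
    return dictionary


def document_vector(tokens, dictionary):
    # Step 3: count how often each dictionary word occurs in one document.
    return [tokens.count(word) for word in dictionary]


# Hypothetical sample corpus of three documents.
documents = [
    "The machine is thinking.",
    "The test is a thinking test.",
    "Turing and the machine.",
]

normalised = [normalise(doc) for doc in documents]
dictionary = create_dictionary(normalised)

# Step 4: repeat step 3 for every document to build the document vector table.
table = [document_vector(tokens, dictionary) for tokens in normalised]

print(dictionary)  # ['machine', 'thinking', 'test', 'turing']
for row in table:
    print(row)
# [1, 1, 0, 0]
# [0, 1, 2, 0]
# [1, 0, 0, 1]
```

Each row of the table is the vector for one document, and each column corresponds to one word of the dictionary, so every document is represented by a list of numbers of the same length.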


