Page 292 - AI Ver 1.0 Class 10

gm        →  good morning
gr8, grt  →  great
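A mapping like the one above can be applied with a small dictionary lookup. This is a minimal sketch; the short-forms and their expansions are taken from the examples above, and the function name is illustrative.

```python
# A tiny illustrative mapping of chat short-forms to full words
# (the short-forms come from the examples above).
SHORT_FORMS = {"gm": "good morning", "gr8": "great", "grt": "great"}

def expand_short_forms(text):
    """Replace known short-forms in a message with their full words."""
    return " ".join(SHORT_FORMS.get(word, word) for word in text.split())

print(expand_short_forms("gm all, the show was gr8"))
# good morning all, the show was great
```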


              The entire textual data from all the documents taken together is known as the corpus. Using this textual data,
              let us now do an activity while learning the important steps of Text Normalisation:


              Step 1: Sentence Segmentation

              Sentence segmentation is the process of Sentence Boundary Detection, which divides the corpus into individual
              sentences. Most human languages use punctuation marks to indicate sentence boundaries, and this feature helps
              bring a big data set down to a lower, less complicated level of data processing. After this step, each sentence
              becomes a separate piece of data.


              Before sentence segmentation:
                   Artificial Intelligence is the science and engineering of making intelligent machines. AI is a technique
                   of getting machines to work and behave like humans. The machines that are incorporated with human-like
                   intelligence to perform tasks as we do.

              After sentence segmentation:
                   1. Artificial Intelligence is the science and engineering of making intelligent machines.
                   2. AI is a technique of getting machines to work and behave like humans.
                   3. The machines that are incorporated with human-like intelligence to perform tasks as we do.
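The segmentation above can be sketched in a few lines of Python. This is a minimal version that splits on sentence-ending punctuation followed by a space; real NLP libraries (e.g. NLTK's `sent_tokenize`) also handle harder cases such as abbreviations.

```python
import re

def segment_sentences(corpus):
    """Split a corpus into sentences at punctuation boundaries.

    Splits after '.', '!' or '?' when followed by whitespace.
    """
    sentences = re.split(r'(?<=[.!?])\s+', corpus.strip())
    return [s for s in sentences if s]

corpus = ("Artificial Intelligence is the science and engineering of "
          "making intelligent machines. AI is a technique of getting "
          "machines to work and behave like humans. The machines that "
          "are incorporated with human-like intelligence to perform "
          "tasks as we do.")

for i, sentence in enumerate(segment_sentences(corpus), start=1):
    print(f"{i}. {sentence}")
```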



              Step 2: Tokenization
              Tokenization is the process of dividing each sentence further into tokens. A token can be any word, number or
              special character that forms part of a sentence. This is done mainly by finding the boundaries of a word,
              i.e., where one word ends and the next begins. In English, the space between two words is an important
              word-boundary marker.


              Before tokenization:
                   The machines that are incorporated with human-like intelligence to perform tasks as we do.

              After tokenization:
                   The | machines | that | are | incorporated | with | human | - | like | intelligence | to | perform |
                   tasks | as | we | do | .
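The tokenization above can be sketched with a simple regular expression that treats each run of word characters, and each individual punctuation mark, as a separate token. Library tokenizers (e.g. NLTK's `word_tokenize`) use more elaborate rules, but the idea is the same.

```python
import re

def tokenize(sentence):
    """Split a sentence into word, number and punctuation tokens.

    Each punctuation mark becomes its own token, so 'human-like'
    yields three tokens: 'human', '-', 'like'.
    """
    return re.findall(r"\w+|[^\w\s]", sentence)

sentence = ("The machines that are incorporated with human-like "
            "intelligence to perform tasks as we do.")
print(tokenize(sentence))
# ['The', 'machines', 'that', 'are', 'incorporated', 'with', 'human',
#  '-', 'like', 'intelligence', 'to', 'perform', 'tasks', 'as', 'we',
#  'do', '.']
```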



              Step 3: Removing Stopwords, Special Characters and Numbers
              Stopwords are frequently occurring words that help make a sentence meaningful for humans, but for the machine
              they are a waste as they do not provide any information regarding the corpus. Hence, they are mostly removed
              at the pre-processing stage itself.
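This step can be sketched as a simple filter over the token list. The stopword list below is a small illustrative one; libraries such as NLTK ship fuller lists (e.g. `nltk.corpus.stopwords.words('english')`).

```python
# A small illustrative stopword list (an assumption for this sketch).
STOPWORDS = {"the", "that", "are", "with", "to", "as", "we", "do",
             "a", "an", "of", "and", "is"}

def remove_stopwords(tokens):
    """Drop stopwords, special characters and numbers, keeping only
    the content-bearing word tokens."""
    return [t for t in tokens
            if t.isalpha() and t.lower() not in STOPWORDS]

tokens = ['The', 'machines', 'that', 'are', 'incorporated', 'with',
          'human', '-', 'like', 'intelligence', 'to', 'perform',
          'tasks', 'as', 'we', 'do', '.']
print(remove_stopwords(tokens))
# ['machines', 'incorporated', 'human', 'like', 'intelligence',
#  'perform', 'tasks']
```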





                        290   Touchpad Artificial Intelligence-X