
Text Normalisation

Text normalisation is the process of cleaning textual data by converting it into a standard form. It is considered the pre-processing stage of NLP, as it is the first step before actual data processing begins. This process helps reduce the complexity of the language. Words used as slang, short forms, misspellings, abbreviations, or special characters with specific meanings need to be converted into their canonical form during text normalisation. For example:

    Words                  Canonical Form
    B4, beefor, bifore     before
    2morrow, 2mrow         tomorrow
    btw                    by the way
    ty                     thank you
    gm                     good morning
    gr8, grt               great

A corpus is a large collection of text, such as articles, rhymes, or emails. A document is a single piece of text within the corpus, like a sentence in an article, a line in a rhyme, or a section of an email. The entire set of text from all the documents together is known as the corpus.
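The canonical-form conversion above can be sketched with a simple lookup table. This is only a minimal illustration using the example words from the table; real normalisation systems rely on much larger dictionaries.

```python
# Hypothetical mini-dictionary mapping slang/short forms to canonical forms.
CANONICAL = {
    "b4": "before", "beefor": "before", "bifore": "before",
    "2morrow": "tomorrow", "2mrow": "tomorrow",
    "btw": "by the way", "ty": "thank you",
    "gm": "good morning", "gr8": "great", "grt": "great",
}

def normalise(text):
    # Lower-case each word and replace it if a canonical form is known.
    words = text.lower().split()
    return " ".join(CANONICAL.get(w, w) for w in words)

print(normalise("gm ty see you 2morrow"))
# → good morning thank you see you tomorrow
```

Words not found in the dictionary are left unchanged, so the function is safe to run over any text.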

Steps for Text Normalisation

The steps for text normalisation are as follows:

Sentence Segmentation → Tokenisation → Removing Stopwords, Special Characters and Numbers → Converting Text to a Common Case → Stemming → Lemmatization

Step 1  Sentence Segmentation

Sentence segmentation is the process of detecting sentence boundaries, which divides the corpus into sentences or documents. Most human languages use punctuation marks to mark the boundaries of sentences, and this feature helps reduce a large, complex data set into smaller, less complicated units for processing. After this step, each sentence is treated as a separate document.

Before segmentation:
    Artificial Intelligence is the science and engineering of making intelligent machines. AI is a technique of getting machines to work and behave like humans. The machines that are incorporated with human-like intelligence to perform tasks as we do.

After segmentation:
    1. Artificial Intelligence is the science and engineering of making intelligent machines.
    2. AI is a technique of getting machines to work and behave like humans.
    3. The machines that are incorporated with human-like intelligence to perform tasks as we do.
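A simple way to sketch sentence segmentation in code is to split the corpus at sentence-ending punctuation. This is only a rough illustration; it assumes every full stop, question mark, or exclamation mark ends a sentence, which real segmenters (which must handle abbreviations like "Dr.") do not.

```python
import re

def segment(corpus):
    # Split wherever a sentence-ending mark (. ! ?) is followed by whitespace.
    parts = re.split(r'(?<=[.!?])\s+', corpus.strip())
    return [p for p in parts if p]

corpus = ("Artificial Intelligence is the science and engineering of "
          "making intelligent machines. AI is a technique of getting "
          "machines to work and behave like humans.")
for i, sentence in enumerate(segment(corpus), start=1):
    print(i, sentence)
```

Each numbered sentence printed here corresponds to one document in the segmented corpus.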


Step 2  Tokenization
Tokenization is the process of dividing sentences further into tokens. A token can be any word, number, or special character that forms part of a sentence. This is done mainly by finding the boundaries of a word, i.e., where one word ends and the next begins. In English, the space between two words is an important word-boundary marker.


Sentence:
    The machines that are incorporated with human-like intelligence to perform tasks as we do.

Tokens:
    The | machines | that | are | incorporated | with | human | – | like | intelligence | to | perform | tasks | as | we | do | .
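The tokenization above can be sketched with a short regular expression that treats runs of letters or digits as words and keeps each punctuation mark as its own token, matching how the hyphen and full stop become separate tokens in the example.

```python
import re

def tokenize(sentence):
    # \w+ matches a run of letters/digits; [^\w\s] matches a single
    # punctuation character, so "human-like" becomes three tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize("The machines that are incorporated with human-like "
                  "intelligence to perform tasks as we do.")
print(tokens)
```

This is a deliberately simple sketch; practical NLP libraries use more elaborate rules, but the idea of splitting at word boundaries is the same.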

                    242     Artificial Intelligence Play (Ver 1.0)-X