Page 381 - AI Ver 3.0 class 10_Flipbook

Most human languages use punctuation marks to mark the boundaries of sentences. This feature helps reduce a large data set to a smaller, less complicated level of data processing. After this step, each sentence is treated as a separate document.


Original text:

Artificial Intelligence is the science and engineering of making intelligent machines. AI is a technique of getting machines to work and behave like humans. The machines that are incorporated with human-like intelligence to perform tasks as we do.

After sentence segmentation:

1. Artificial Intelligence is the science and engineering of making intelligent machines.
2. AI is a technique of getting machines to work and behave like humans.
3. The machines that are incorporated with human-like intelligence to perform tasks as we do.
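The segmentation step above can be sketched in Python. This is a simplified illustration using a regular expression that splits at sentence-ending punctuation; real NLP toolkits (such as NLTK) handle abbreviations and other edge cases more carefully.

```python
import re

def segment_sentences(text):
    # Split at ., ! or ? followed by whitespace (a simplified rule).
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

text = ("Artificial Intelligence is the science and engineering of making "
        "intelligent machines. AI is a technique of getting machines to "
        "work and behave like humans.")

# Print each sentence on its own numbered line, as in the example above.
for i, sentence in enumerate(segment_sentences(text), start=1):
    print(f"{i}. {sentence}")
```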

                 Step 2  Tokenization

Tokenization is the process of dividing the sentences further into tokens. A token can be any word, number or special character that forms part of a sentence. This is done mainly by finding the boundaries of a word, i.e., where one word ends and the next begins. In English, the space between two words is an important word-boundary detector.

Sentence:

The machines that are incorporated with human-like intelligence to perform tasks as we do.

After tokenization:

The | machines | that | are | incorporated | with | human | - | like | intelligence | to | perform | tasks | as | we | do | .
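The tokenization step can be sketched with a short regular expression: runs of word characters become word tokens, and every other non-space character (punctuation, the hyphen, the full stop) becomes a token of its own, matching the example above. This is a simplified sketch, not a full tokenizer.

```python
import re

def tokenize(sentence):
    # \w+ matches a word or number; [^\w\s] matches a single
    # punctuation or special character (hyphen, full stop, etc.).
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize("The machines that are incorporated with human-like "
                  "intelligence to perform tasks as we do.")
print(tokens)
# → ['The', 'machines', 'that', 'are', 'incorporated', 'with', 'human',
#    '-', 'like', 'intelligence', 'to', 'perform', 'tasks', 'as', 'we',
#    'do', '.']
```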


                 Step 3  Removing Stopwords, Special Characters and Numbers

                 Stopwords are frequently occurring words that do not contribute significant meaning to text analysis but are
                 necessary for human readability. These words are removed during preprocessing to improve efficiency.

Some examples of stopwords and special characters are:

Stopwords: a, an, and, are, as, for, from, is, into, in, if, on, or, such, the, there, to

Special characters: # @ $ % !



At this stage, all stopwords, special characters (such as # $ % @ !) and numbers (if not needed) are removed from the list of tokens, making it easier for the NLP system to focus on the words that are important for data processing.
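This filtering step can be sketched as below. The stopword list here is only the small illustrative set shown above, not a complete list; real systems use much larger stopword lists.

```python
# A small illustrative stopword set (not exhaustive).
STOPWORDS = {"a", "an", "and", "are", "as", "for", "from", "is", "into",
             "in", "if", "on", "or", "such", "the", "there", "to"}

def remove_stopwords(tokens):
    cleaned = []
    for tok in tokens:
        if tok.lower() in STOPWORDS:   # drop stopwords
            continue
        if not tok.isalpha():          # drop special characters and numbers
            continue
        cleaned.append(tok)
    return cleaned

tokens = ["The", "machines", "are", "incorporated", "with", "intelligence", "!"]
print(remove_stopwords(tokens))
# → ['machines', 'incorporated', 'with', 'intelligence']
```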
Step 4  Converting Text to a Common Case

This is a very important step: we want the same word written in different cases to be treated as one token, so that the program does not become case sensitive. The whole text is generally converted to lower case to avoid this kind of confusion and sensitivity in the system.
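Lowercasing is a one-line operation on the token list. The sketch below shows how three differently cased spellings collapse into a single token after conversion.

```python
tokens = ["Machines", "machines", "MACHINES"]

# Convert every token to lower case so casing differences disappear.
lowered = [tok.lower() for tok in tokens]
print(lowered)       # → ['machines', 'machines', 'machines']
print(set(lowered))  # → {'machines'}  (all three count as one token)
```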

