
Word tokenization splits a sentence into individual words (tokens):

                 [1]:   import nltk
                        data = "Hello friends. Hope you are enjoying doing NLP. Wish you a wonderful experience"
                        word_token = nltk.word_tokenize(data)
                        print(word_token)

                        ['Hello', 'friends', '.', 'Hope', 'you', 'are', 'enjoying', 'doing', 'NLP', '.',
                        'Wish', 'you', 'a', 'wonderful', 'experience']

              Stemming is the process of extracting the base word from a given word:

                 [1]:   from nltk.stem import PorterStemmer
                        ps = PorterStemmer()
                        ps.stem('studies')

                        'studi'

                 [1]:   ps.stem('learning')

                        'learn'
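In practice the stemmer is applied to every token in a list. A minimal sketch (the word list is illustrative):

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
# Apply the stemmer to each word; note that a stem such as 'studi'
# need not be a valid English word
words = ['studies', 'studying', 'learning', 'learners']
print([ps.stem(w) for w in words])
# 'studies' and 'studying' both reduce to the same stem, 'studi'
```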
              Lemmatization is the process of extracting a base word, called the lemma. It is considered a better approach than
              stemming because stemming simply removes the suffix without considering the actual meaning of the word.

              It is done using the WordNetLemmatizer class:

                 [1]:   from nltk.stem.wordnet import WordNetLemmatizer
                        Lem = WordNetLemmatizer()
                        Lem.lemmatize("studies")
                        'study'

              Using NLTK Stopwords Corpus


              Stopwords such as is, am, are, this, a, an, the, etc. carry little meaning in a sentence. They are removed to avoid
              noise in the text.
                 [1]:   from nltk.corpus import stopwords
                        Eng_stopwords = set(stopwords.words('english'))
                        print(Eng_stopwords)
                        {'the', 'yours', 'there', "wasn't", "hasn't", 'further', "that'll",  'am', 'itself',
                        'here', 'not', "aren't", 'under', 'having', 'now', 'his', 'an', 'of', 'below', 'few',
                        'such', 'by', 'needn', 'isn', 'again', 'these', "she's", 'about', 's', 'a', "couldn't",
                        'as',  'whom',  'were',  'off',  "mustn't",  'before',  'didn',  'what',  'has',  'himself',
                        'is', 'other', 'then', 'haven', "weren't", 'won', 'through', 'only', 'wouldn', "hadn't",
                        "you'd", "it's", "haven't", 'at', 'for', 'some', 'myself', 'above', 'ours', 'shouldn',
                        'been',  'its',  'so',  "don't",  'when',  'don',  'during',  "you'll",  'yourselves',  'if',
                        'over', 'up', "should've",  'yourself',  'very', 'theirs', 'or', 'any', 'between',
                        'couldn', 'same', 'mustn', 'we', 'they', 'out', 'hers', 'was', 'from', 'that', 'your',
                        "mightn't", 'had', 're', 'd', 'will', 'because', 'each', 'our', 'both', 'mightn', 'are',
                        'doing', 'o', 'them', 'own', "shouldn't", 'those', 'more', 'themselves', 'until', 'in',
                        'my', "doesn't", 'does', 'aren', 'i', 'herself', 'with', 'wasn', 'hasn', 'this', 'ma',
                        'their', 'all', 't', 'll', 'ourselves', 'into', 'did', 'once', 'doesn', "shan't", 'hadn',
                        "needn't", 'being', 'can', 'too', 'weren', 'do', 'which', 'm', 'no', 'against', 'than',
                        'him', 'and', 'on', 'most', "you're", 'have', 'you', 'down', 've', 'be', 'where', 'me',
                        'ain', "you've", 'after', 'it', 'who', "isn't", "wouldn't", 'why', 'nor', 'y', 'while',
                        'should', 'she', 'to', "won't", 'how', "didn't", 'he', 'her', 'shan', 'but', 'just'}





                    470     Touchpad Artificial Intelligence (Ver. 3.0)-X