Page 472 - AI Ver 3.0 class 10_Flipbook
Word tokenization splits a sentence into individual words and punctuation marks.
[1]: word_token = nltk.word_tokenize(data)
print(word_token)
['Hello', 'friends', '.', 'Hope', 'you', 'are', 'enjoying', 'doing', 'NLP', '.',
'Wish', 'you', 'a', 'wonderful', 'experience']
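To get a feel for what word tokenization does, the splitting can be roughly approximated with a regular expression. This is only a minimal sketch and not how nltk.word_tokenize works internally. The function name simple_word_tokenize is hypothetical, and the sample string in data is an assumption reconstructed from the output shown above.

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Rough approximation only: nltk.word_tokenize handles many more
    cases (contractions, abbreviations, quotes and so on).
    """
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

# Assumed sample text, reconstructed from the output above
data = "Hello friends. Hope you are enjoying doing NLP. Wish you a wonderful experience"
tokens = simple_word_tokenize(data)
print(tokens)
```

For this simple sentence the sketch yields the same tokens as the output above. On more complex text the two are likely to differ.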
Stemming is the process of reducing a word to its base form (stem), which may not be a valid dictionary word:
[1]: from nltk.stem import PorterStemmer
ps = PorterStemmer()
ps.stem('studies')
'studi'
[1]: ps.stem('learning')
'learn'
Lemmatization is the process of reducing a word to its dictionary base form, called the lemma. It is generally considered
better than stemming because stemming just removes the suffix without considering the actual meaning of the word.
It is done using the WordNetLemmatizer class:
[1]: from nltk.stem.wordnet import WordNetLemmatizer
Lem = WordNetLemmatizer()
Lem.lemmatize("studies")
'study'
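The difference between the two approaches can be illustrated with a toy example. This is a simplified sketch and not the Porter algorithm or the WordNet lemmatizer: the suffix list, the lookup table and the function names toy_stem and toy_lemmatize are all hypothetical, chosen only to show suffix stripping versus dictionary lookup.

```python
# Hypothetical suffix list and lookup table, for illustration only
SUFFIXES = ("ing", "es", "s")
LEMMA_TABLE = {"studies": "study", "learning": "learn", "mice": "mouse"}

def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a table of known forms; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

for w in ("studies", "learning", "mice"):
    print(w, toy_stem(w), toy_lemmatize(w))
```

Here the suffix rule turns "studies" into "studi" and leaves "mice" unchanged, while the lookup returns the real words "study" and "mouse". A lookup only works for words it knows, which is why a real lemmatizer relies on a large dictionary such as WordNet.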
Using NLTK Stopwords Corpus
Stopwords such as is, am, are, this, a, an, the, etc. carry little meaning in a sentence. They are removed to reduce noise
in the text.
[1]: from nltk.corpus import stopwords
Eng_stopwords = set(stopwords.words('english'))
print(Eng_stopwords)
{'the', 'yours', 'there', "wasn't", "hasn't", 'further', "that'll", 'am', 'itself',
'here', 'not', "aren't", 'under', 'having', 'now', 'his', 'an', 'of', 'below', 'few',
'such', 'by', 'needn', 'isn', 'again', 'these', "she's", 'about', 's', 'a', "couldn't",
'as', 'whom', 'were', 'off', "mustn't", 'before', 'didn', 'what', 'has', 'himself',
'is', 'other', 'then', 'haven', "weren't", 'won', 'through', 'only', 'wouldn', "hadn't",
"you'd", "it's", "haven't", 'at', 'for', 'some', 'myself', 'above', 'ours', 'shouldn',
'been', 'its', 'so', "don't", 'when', 'don', 'during', "you'll", 'yourselves', 'if',
'over', 'up', "should've", 'yourself', 'very', 'theirs', 'or', 'any', 'between',
'couldn', 'same', 'mustn', 'we', 'they', 'out', 'hers', 'was', 'from', 'that', 'your',
"mightn't", 'had', 're', 'd', 'will', 'because', 'each', 'our', 'both', 'mightn', 'are',
'doing', 'o', 'them', 'own', "shouldn't", 'those', 'more', 'themselves', 'until', 'in',
'my', "doesn't", 'does', 'aren', 'i', 'herself', 'with', 'wasn', 'hasn', 'this', 'ma',
'their', 'all', 't', 'll', 'ourselves', 'into', 'did', 'once', 'doesn', "shan't", 'hadn',
"needn't", 'being', 'can', 'too', 'weren', 'do', 'which', 'm', 'no', 'against', 'than',
'him', 'and', 'on', 'most', "you're", 'have', 'you', 'down', 've', 'be', 'where', 'me',
'ain', "you've", 'after', 'it', 'who', "isn't", "wouldn't", 'why', 'nor', 'y', 'while',
'should', 'she', 'to', "won't", 'how', "didn't", 'he', 'her', 'shan', 'but', 'just'}
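Once the stopword set is available, removing stopwords is a matter of filtering the token list. The sketch below assumes a small hand-picked subset of the set printed above, so that it runs without downloading the corpus; the variable names sample_stopwords and filtered are hypothetical. The token list is copied from the tokenization output shown earlier.

```python
# Small hand-picked subset of the stopword set printed above (illustration only)
sample_stopwords = {"you", "are", "a", "doing", "is", "am", "the"}

# Token list copied from the tokenization output shown earlier
tokens = ['Hello', 'friends', '.', 'Hope', 'you', 'are', 'enjoying', 'doing', 'NLP', '.',
          'Wish', 'you', 'a', 'wonderful', 'experience']

# Compare in lower case because the stopword list is all lower case
filtered = [t for t in tokens if t.lower() not in sample_stopwords]
print(filtered)
```

With the full corpus, the same list comprehension could use Eng_stopwords in place of the sample set. Punctuation tokens such as '.' are not stopwords, so they would need a separate filter if they are unwanted.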

