Most human languages use punctuation marks to mark the boundaries of sentences. This feature helps reduce a big data set to a lower, less complicated level of data processing. After this step, each sentence is treated as a separate document.
Artificial Intelligence is the science and engineering of making intelligent machines. AI is a technique of getting machines to work and behave like humans. The machines that are incorporated with human-like intelligence to perform tasks as we do.
After sentence segmentation:
1. Artificial Intelligence is the science and engineering of making intelligent machines.
2. AI is a technique of getting machines to work and behave like humans.
3. The machines that are incorporated with human-like intelligence to perform tasks as we do.
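Sentence segmentation of this kind can be sketched with a simple regular expression that splits text at a full stop, question mark or exclamation mark followed by whitespace. This is a minimal illustration only; practical NLP libraries (for example NLTK's `sent_tokenize`) handle abbreviations and other edge cases far better.

```python
import re

def split_sentences(text):
    """Split text into sentences at ., ! or ? followed by whitespace.

    A minimal sketch of sentence segmentation; it does not handle
    abbreviations such as "Dr." or "e.g."
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = ("Artificial Intelligence is the science and engineering of making "
        "intelligent machines. AI is a technique of getting machines to "
        "work and behave like humans.")
for number, sentence in enumerate(split_sentences(text), start=1):
    print(number, sentence)
```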
Step 2 Tokenization
Tokenization is the process of dividing the sentences further into tokens. A token can be any word, number or special character that forms a part of a sentence. This process is done mainly by finding the boundaries of a word, i.e., where one word ends and the next begins. In English, the space between two words is an important word-boundary detector.
Sentence: The machines that are incorporated with human-like intelligence to perform tasks as we do.
Tokens: The | machines | that | are | incorporated | with | human | - | like | intelligence | to | perform | tasks | as | we | do | .
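The tokenization above, where words, punctuation and the hyphen each become separate tokens, can be sketched with one regular expression: runs of word characters form one token, and every other non-space character forms its own token.

```python
import re

def tokenize(sentence):
    # \w+ matches a run of word characters (a word or number);
    # [^\w\s] matches any single character that is neither a word
    # character nor whitespace (punctuation, special characters).
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenize("The machines that are incorporated with human-like "
               "intelligence to perform tasks as we do."))
```

Note that this splits "human-like" into three tokens (`human`, `-`, `like`), matching the example in the text.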
Step 3 Removing Stopwords, Special Characters and Numbers
Stopwords are frequently occurring words that do not contribute significant meaning to text analysis but are
necessary for human readability. These words are removed during preprocessing to improve efficiency.
Some examples of stopwords are:
a, an, and, are, as, for, from, is, into, in, if, on, or, such, the, there, to
Special characters that are also removed: #, @, $, %, !
At this stage, all the stopwords, special characters such as #, $, %, @, ! and numbers (if not needed) are removed from the list of tokens, making it easier for the NLP system to focus on the words that are important for data processing.
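A sketch of this filtering step, using the small stopword list shown above (real systems use much larger lists, such as NLTK's stopwords corpus):

```python
# Illustrative stopword set taken from the examples in the text.
STOPWORDS = {"a", "an", "and", "are", "as", "for", "from", "is", "into",
             "in", "if", "on", "or", "such", "the", "there", "to"}

def remove_stopwords(tokens):
    kept = []
    for tok in tokens:
        if tok.lower() in STOPWORDS:   # drop stopwords
            continue
        if not tok.isalpha():          # drop special characters and numbers
            continue
        kept.append(tok)
    return kept

print(remove_stopwords(["The", "machines", "are", "intelligent", "#", "2"]))
```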
Step 4 Converting Text to a Common Case
This is a very important step, as we want the same word in different cases to be treated as one token so that the program does not become case-sensitive. We generally convert the whole content to lower case to avoid this kind of confusion and sensitivity in the system.
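Case conversion is a one-line operation per token. The sketch below shows how tokens that differ only in case collapse to a single form:

```python
def normalize_case(tokens):
    # Convert every token to lower case so "AI", "Ai" and "ai"
    # are all treated as the same token.
    return [tok.lower() for tok in tokens]

print(normalize_case(["AI", "Machines", "machines", "INTELLIGENT"]))
```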
Natural Language Processing (Theory) 379

