Page 292 - AI Ver 1.0 Class 10
gm good morning
gr8, grt great
The whole textual data from all the documents taken together is known as the corpus. Using this textual data, let us
now do an activity while learning the important steps of Text Normalisation:
Step 1: Sentence Segmentation
Sentence segmentation is the process of Sentence Boundary Detection, which divides the corpus into individual sentences.
Most human languages used across the world have punctuation marks that mark the boundaries of a sentence. This
feature helps break a large data set down into smaller units that are less complicated to process.
After this step, each sentence is treated as a separate piece of data.
Corpus:
Artificial Intelligence is the science and engineering of making intelligent machines. AI is a technique of getting
machines to work and behave like humans. The machines that are incorporated with human-like intelligence to
perform tasks as we do.
After sentence segmentation:
1. Artificial Intelligence is the science and engineering of making intelligent machines.
2. AI is a technique of getting machines to work and behave like humans.
3. The machines that are incorporated with human-like intelligence to perform tasks as we do.
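The segmentation shown above can be sketched in Python. This is only a minimal illustration, assuming that every sentence ends with a full stop, question mark or exclamation mark followed by a space. The function name segment_sentences is our own, and real text with abbreviations such as "Dr." would need a more careful rule.

```python
import re

def segment_sentences(corpus):
    """Split text after '.', '!' or '?' whenever whitespace follows.

    Simplified rule for illustration only: abbreviations such as
    'Dr.' or 'e.g.' would be split incorrectly.
    """
    parts = re.split(r"(?<=[.!?])\s+", corpus.strip())
    return [part for part in parts if part]

corpus = (
    "Artificial Intelligence is the science and engineering of making "
    "intelligent machines. AI is a technique of getting machines to work "
    "and behave like humans. The machines that are incorporated with "
    "human-like intelligence to perform tasks as we do."
)

sentences = segment_sentences(corpus)
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```

Each item in the list sentences is now a separate piece of data that the next step can work on.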
Step 2: Tokenization
Tokenization is the process of dividing the sentences further into tokens. A token can be any word, number or
special character that forms a part of a sentence. This is done mainly by finding the boundaries of a word,
i.e., where one word ends and the next word begins. In English, the space between two words is an important
word boundary marker.
Sentence:
The machines that are incorporated with human-like intelligence to perform tasks as we do.
Tokens:
The | machines | that | are | incorporated | with | human | - | like | intelligence | to | perform | tasks | as | we | do | .
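The same tokens can be produced with a short Python sketch. It assumes a simple rule that is our own choice for illustration: a run of letters or digits forms one token, and every other non-space character (such as a hyphen or full stop) forms a token by itself. The function name tokenize is hypothetical, and NLP libraries use more detailed rules.

```python
import re

def tokenize(sentence):
    """Return the word, number and special-character tokens of a sentence."""
    # \w+      : a run of letters, digits or underscores (one word or number)
    # [^\w\s]  : a single character that is neither a word character nor a space
    return re.findall(r"\w+|[^\w\s]", sentence)

sentence = (
    "The machines that are incorporated with human-like intelligence "
    "to perform tasks as we do."
)

tokens = tokenize(sentence)
print(tokens)
```

Notice that the spaces are used only to find word boundaries and do not become tokens themselves.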
Step 3: Removing Stopwords, Special Characters and Numbers
Stopwords are frequently occurring words that help to make a sentence meaningful for humans but are of little use
to the machine, as they do not provide any information about the corpus. Hence, they are mostly
removed at the pre-processing stage itself.
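A minimal Python sketch of this step is given below. The STOPWORDS set is a small list made up for this example and is not a standard list. The function name remove_stopwords is also our own. The sketch assumes that any token which is not made up purely of letters is a special character or a number and can be dropped.

```python
# A tiny stopword list chosen only for this example (an assumption,
# not a standard list; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "as", "do", "is", "of",
             "that", "the", "to", "we", "with"}

def remove_stopwords(tokens):
    """Keep only alphabetic tokens that are not in STOPWORDS."""
    kept = []
    for token in tokens:
        if not token.isalpha():
            # Drops special characters, numbers and mixed tokens.
            continue
        if token.lower() in STOPWORDS:
            # Comparison is done in lowercase so "The" matches "the".
            continue
        kept.append(token)
    return kept

tokens = ["The", "machines", "that", "are", "incorporated", "with",
          "human", "-", "like", "intelligence", "to", "perform",
          "tasks", "as", "we", "do", "."]

filtered = remove_stopwords(tokens)
print(filtered)
```

Only the words that carry information about the corpus remain in the list filtered.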
290 Touchpad Artificial Intelligence-X

