Page 244 - Ai_C10_Flipbook
Text Normalisation
Text normalisation is the process of cleaning textual data by converting it into a standard form. It is considered the pre-processing stage of NLP, as it is the first step before actual data processing begins. This process helps reduce the complexity of the language. Words used as slang, short forms, misspellings, abbreviations, or special characters with specific meanings need to be converted into their canonical form during text normalisation. For example:

Words                    Canonical Form
B4, beefor, bifore       before
2morrow, 2mrow           tomorrow
btw                      by the way
ty                       thank you
gm                       good morning
gr8, grt                 great

A corpus is a large collection of text, such as articles, rhymes, or emails. A document is a single piece of text within the corpus, like a sentence in an article, a line in a rhyme, or a section of an email. The entire set of text from all the documents together is known as the corpus.
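Replacing slang and short forms with their canonical forms can be done with a simple lookup table. The sketch below is a minimal illustration, assuming a hypothetical `CANONICAL` dictionary built from the examples in the table above; it is not a complete normaliser.

```python
# Hypothetical slang-to-canonical mapping, drawn from the table above.
CANONICAL = {
    "b4": "before", "beefor": "before", "bifore": "before",
    "2morrow": "tomorrow", "2mrow": "tomorrow",
    "btw": "by the way", "ty": "thank you",
    "gm": "good morning", "gr8": "great", "grt": "great",
}

def normalise(text):
    """Replace each known slang/short-form word with its canonical form."""
    return " ".join(CANONICAL.get(word.lower(), word) for word in text.split())

print(normalise("gm btw I will be there 2morrow"))
# → good morning by the way I will be there tomorrow
```

A real system would also handle punctuation attached to words and spelling variations not listed in the table.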
Steps for Text Normalisation
The steps for text normalisation are as follows:
1. Sentence Segmentation
2. Tokenisation
3. Removing Stopwords, Special Characters, and Numbers
4. Converting Text to a Common Case
5. Stemming
6. Lemmatization
Step 1 Sentence Segmentation
Sentence segmentation is the process of detecting sentence boundaries, which divides the corpus into sentences or documents.
Most human languages use punctuation marks to mark the boundaries of sentences; this feature helps reduce a large, complex data set to a simpler, more manageable level of data processing. After this step, each sentence is treated as a separate document.
Before segmentation:
Artificial Intelligence is the science and engineering of making intelligent machines. AI is a technique of getting machines to work and behave like humans. The machines that are incorporated with human-like intelligence to perform tasks as we do.

After segmentation:
1. Artificial Intelligence is the science and engineering of making intelligent machines.
2. AI is a technique of getting machines to work and behave like humans.
3. The machines that are incorporated with human-like intelligence to perform tasks as we do.
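Sentence segmentation can be sketched with a regular expression that splits the corpus after sentence-ending punctuation. This is a minimal illustration of the idea, not a production segmenter (it would mis-split abbreviations such as "Dr.").

```python
import re

def segment_sentences(corpus):
    """Split a corpus into sentences at '.', '?', or '!' boundaries (simplified rule)."""
    parts = re.split(r"(?<=[.!?])\s+", corpus.strip())
    return [p for p in parts if p]

corpus = ("Artificial Intelligence is the science and engineering of making "
          "intelligent machines. AI is a technique of getting machines to "
          "work and behave like humans.")
for i, sentence in enumerate(segment_sentences(corpus), start=1):
    print(f"{i}. {sentence}")
```

Each sentence returned by this function would then be treated as a separate document in the next step.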
Step 2 Tokenisation
Tokenisation is the process of dividing the sentences further into tokens. A token can be any word, number, or special character that forms a part of a sentence. This is done mainly by finding the boundaries of a word, i.e., where one word ends and the next word begins. In English, the space between two words is an important indicator of a word boundary.
Before tokenisation:
The machines that are incorporated with human-like intelligence to perform tasks as we do.

After tokenisation:
The | machines | that | are | incorporated | with | human | - | like | intelligence | to | perform | tasks | as | we | do | .
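The example above can be reproduced with a small regular expression that treats every run of word characters as one token and every punctuation mark (including the hyphen and the final full stop) as its own token. This is a simplified sketch, not the only way to tokenise.

```python
import re

def tokenize(sentence):
    """Split a sentence into word/number tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize("The machines that are incorporated with human-like "
                  "intelligence to perform tasks as we do.")
print(tokens)
# Note how 'human-like' becomes three tokens: 'human', '-', 'like',
# and the sentence-final '.' becomes its own token.
```

NLP libraries ship more sophisticated tokenisers, but the space-and-punctuation rule shown here captures the core idea.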

