Page 19 - CT_AI_Class-8
For developing an AI system, a large amount of data is required. However, the data collected from
different sources is usually raw, unstructured and unorganised. Such data cannot be directly used
to train an AI model because it may contain errors, missing information or unnecessary details.
Therefore, data preparation is an important step in the AI project cycle, in which the collected
data is cleaned, organised and formatted before it is used for training.
During data collection, the dataset may contain the following issues:
Missing values: These are data points that are not recorded, left blank or unavailable in the
dataset. Missing values can affect the accuracy of the AI model and must be handled carefully.
Duplicate entries: These are repeated records that appear more than once in the dataset.
Because repeated records are counted more than once, they can lead to biased results and must
be removed.
Incorrect information: These are inaccurate or wrongly entered data values such as incorrect
numbers, spelling mistakes or invalid entries.
Irrelevant data: These are data points that are not useful for solving the given problem. Such
data should be removed to improve model performance.
Inconsistent data: Sometimes data is recorded in different formats, such as using "Yes/No" in
one place and "Y/N" in another. This inconsistency must be corrected.
Outliers: These are unusually high or low values that differ significantly from the rest of the
data and may affect model performance.
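Some of the issues listed above can be spotted with a short program. The following is a minimal sketch in Python, not a standard method: the records, the function names and the "more than three times the median" rule for outliers are all assumptions made for illustration.

```python
from statistics import median

# Hypothetical survey records: (name, age, daily screen time in hours)
records = [
    ("Asha", 13, 2.5),
    ("Ravi", 14, None),    # missing value
    ("Asha", 13, 2.5),     # duplicate entry
    ("Meena", 13, 3.0),
    ("Kabir", 14, 40.0),   # outlier: far above the others
    ("Zoya", 13, 2.0),
]

def count_missing(rows):
    """Count the rows that have at least one blank (None) field."""
    return sum(1 for row in rows if any(value is None for value in row))

def count_duplicates(rows):
    """Count the rows that repeat an earlier row exactly."""
    seen = set()
    repeats = 0
    for row in rows:
        if row in seen:
            repeats += 1
        else:
            seen.add(row)
    return repeats

def find_outliers(values, factor=3):
    """Return values larger than factor times the median (an assumed rule of thumb)."""
    known = [v for v in values if v is not None]
    limit = factor * median(known)
    return [v for v in known if v > limit]

screen_times = [row[2] for row in records]
print("Rows with missing values:", count_missing(records))
print("Duplicate rows:", count_duplicates(records))
print("Possible outliers:", find_outliers(screen_times))
```

A check like this only finds the problems. Deciding what to do with them is part of data cleaning.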
Since raw data contains many such issues, it must be cleaned and organised before it is used to
train an AI model. This process is known as Data Preparation.
The steps in Data Preparation are as follows:
1. Data cleaning: Data cleaning involves removing errors and improving data quality.
The following steps are included:
Remove incorrect or irrelevant data
Fill missing values using suitable methods
Remove duplicate entries
Correct inconsistent values
Handle outliers if required
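The cleaning steps listed above could be sketched in Python as follows. This is only an illustration under stated assumptions: the records, the `clean` function, the 24-hour limit used to detect incorrect values, and the choice of the median to fill missing values are all hypothetical, and other methods may suit a real dataset better.

```python
from statistics import median

# Hypothetical records collected from a screen-time survey
rows = [
    {"name": "Asha", "screen_hours": 2.5},
    {"name": "Ravi", "screen_hours": None},    # missing value
    {"name": "Asha", "screen_hours": 2.5},     # duplicate entry
    {"name": "Meena", "screen_hours": 3.0},
    {"name": "Kabir", "screen_hours": 40.0},   # incorrect: a day has only 24 hours
    {"name": "Zoya", "screen_hours": 2.0},
]

def clean(rows, max_hours=24.0):
    # Step 1: remove incorrect values (hours outside the range 0 to max_hours).
    # Missing values are kept for now so that they can be filled in step 3.
    valid = [
        r for r in rows
        if r["screen_hours"] is None or 0 <= r["screen_hours"] <= max_hours
    ]

    # Step 2: remove duplicate entries, keeping the first copy of each record.
    seen = set()
    unique = []
    for r in valid:
        key = (r["name"], r["screen_hours"])
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # Step 3: fill missing values with the median of the known values.
    known = [r["screen_hours"] for r in unique if r["screen_hours"] is not None]
    fill_value = median(known)
    return [
        {
            "name": r["name"],
            "screen_hours": fill_value if r["screen_hours"] is None else r["screen_hours"],
        }
        for r in unique
    ]

cleaned = clean(rows)
for record in cleaned:
    print(record)
```

Removing incorrect values before filling the blanks matters here: if the value 40.0 were still present, it would be included when calculating the value used to fill the missing entry.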
For example, fixing inconsistent labels in the Screen Time Level column:
High and H → High
Low and L → Low
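The label correction in this example can be written as a small lookup table. The sketch below is in Python; the names `LABEL_MAP` and `standardise` and the sample list are assumptions made for illustration, and any label that is not in the table is left unchanged so that it can be checked by hand.

```python
# Lookup table: every known spelling (in lower case) points to one standard label
LABEL_MAP = {
    "high": "High",
    "h": "High",
    "low": "Low",
    "l": "Low",
}

def standardise(label):
    """Return the standard label, ignoring extra spaces and letter case."""
    key = label.strip().lower()
    return LABEL_MAP.get(key, label)   # unknown labels are returned as they are

# Hypothetical values from the Screen Time Level column
levels = ["High", "H", "Low", "L", "high "]
fixed_levels = [standardise(level) for level in levels]
print(fixed_levels)   # ['High', 'High', 'Low', 'Low', 'High']
```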

