Page 19 - CT_AI_Class-8

                 Developing an AI system requires a large amount of data. However, data collected from
                 different sources is usually raw, unstructured and unorganised. Such data cannot be used directly
                 to train an AI model because it may contain errors, missing information or unnecessary details.
                 Data preparation is therefore an important step in the AI project cycle, in which the collected
                 data is cleaned, organised and formatted before it is used for training.

                 After data collection, the dataset may contain the following issues:

                   Missing values: These are data points that are not recorded, left blank or unavailable in the
                    dataset. Missing values can affect the accuracy of the AI model and must be handled carefully.

                   Duplicate entries: These are repeated records that appear more than once in the dataset.
                    Duplicate data can lead to biased results and must be removed.

                   Incorrect information: These are inaccurate or wrongly entered data values such as incorrect
                    numbers, spelling mistakes or invalid entries.

                   Irrelevant data: These are data points that are not useful for solving the given problem. Such
                    data should be removed to improve model performance.

                   Inconsistent data: Sometimes data is recorded in different formats, such as using "Yes/No" in
                    one place and "Y/N" in another. This inconsistency must be corrected.

                   Outliers: These are unusually high or low values that differ significantly from the rest of the
                    data and may affect model performance.
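                 The issues listed above can be spotted with a short program. The following is a minimal
                 sketch in Python, not a complete checker. The records, the field names, the function name
                 find_issues and the 12-hour limit used to flag an outlier are all assumptions made up for
                 this illustration.

```python
# Hypothetical survey records; every name and value here is made up for illustration.
records = [
    {"name": "Asha",  "screen_time_hours": 2.5,  "level": "Low"},
    {"name": "Ravi",  "screen_time_hours": None, "level": "High"},  # missing value
    {"name": "Asha",  "screen_time_hours": 2.5,  "level": "Low"},   # duplicate entry
    {"name": "Meena", "screen_time_hours": 3.0,  "level": "L"},     # inconsistent label
    {"name": "Kabir", "screen_time_hours": 16.0, "level": "High"},  # unusually high value
]

ALLOWED_LEVELS = ("High", "Low")
OUTLIER_LIMIT_HOURS = 12  # assumed limit; a real project would choose this from the data


def find_issues(rows):
    """Return a dict that maps each issue type to the positions of the affected rows."""
    issues = {"missing": [], "duplicate": [], "inconsistent": [], "outlier": []}
    seen = set()
    for position, row in enumerate(rows):
        hours = row["screen_time_hours"]
        if hours is None:
            issues["missing"].append(position)
        elif hours > OUTLIER_LIMIT_HOURS:
            issues["outlier"].append(position)

        key = tuple(row.items())  # a row counts as a duplicate if every field matches
        if key in seen:
            issues["duplicate"].append(position)
        else:
            seen.add(key)

        if row["level"] not in ALLOWED_LEVELS:
            issues["inconsistent"].append(position)
    return issues


print(find_issues(records))
# {'missing': [1], 'duplicate': [2], 'inconsistent': [3], 'outlier': [4]}
```

                 Positions are counted from 0, so position 1 is the second record. Finding the issues is
                 only the first half of the job; deciding how to fix each one is part of data cleaning.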

                 Since raw data contains many such issues, it must be cleaned and organised before it is used
                 to train an AI model. This process is known as Data Preparation.

                 The steps in Data Preparation are as follows:

                 1.  Data cleaning: Data cleaning involves removing errors and improving data quality.
                     It includes the following steps:

                     Remove incorrect or irrelevant data
                     Fill missing values using suitable methods
                     Remove duplicate entries
                     Correct inconsistent values
                     Handle outliers if required

                     For example:

                     Fixing Inconsistent Labels

                     Screen Time Level:
                       High and H → High
                       Low and L → Low
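                 The label fix above, together with two of the other cleaning steps, can be sketched in
                 Python as follows. This is only one possible approach under stated assumptions: the
                 records, the field names and the functions standardise_level and clean are hypothetical,
                 and filling a missing value with the average of the known values is just one of several
                 suitable methods.

```python
# Every spelling found in the data is mapped to one standard label.
# "H" and "L" come from the example above; accepting lowercase spellings is an added assumption.
LABEL_MAP = {"high": "High", "h": "High", "low": "Low", "l": "Low"}


def standardise_level(value):
    """Return the standard label for a known spelling; leave unknown values unchanged."""
    return LABEL_MAP.get(str(value).strip().lower(), value)


def clean(rows):
    """Return a cleaned copy of rows: labels fixed, duplicates removed, gaps filled."""
    # Step 1: correct inconsistent labels.
    rows = [{**row, "level": standardise_level(row["level"])} for row in rows]

    # Step 2: remove duplicate entries, keeping the first copy of each record.
    unique, seen = [], set()
    for row in rows:
        key = tuple(row.items())
        if key not in seen:
            seen.add(key)
            unique.append(row)

    # Step 3: fill missing hours with the average of the known values.
    known = [row["hours"] for row in unique if row["hours"] is not None]
    if not known:
        return unique  # nothing to average, so the gaps are left as they are
    average = sum(known) / len(known)
    return [
        {**row, "hours": average if row["hours"] is None else row["hours"]}
        for row in unique
    ]


# Hypothetical raw records, made up for illustration.
raw = [
    {"name": "Asha",  "hours": 2.0,  "level": "L"},
    {"name": "Asha",  "hours": 2.0,  "level": "Low"},   # a duplicate once labels are fixed
    {"name": "Ravi",  "hours": None, "level": "H"},     # missing value
    {"name": "Meena", "hours": 6.0,  "level": "High"},
]

for row in clean(raw):
    print(row)
# {'name': 'Asha', 'hours': 2.0, 'level': 'Low'}
# {'name': 'Ravi', 'hours': 4.0, 'level': 'High'}
# {'name': 'Meena', 'hours': 6.0, 'level': 'High'}
```

                 The labels are corrected before duplicates are removed because the two Asha records only
                 become identical once "L" has been changed to "Low".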









                                                                                           AI Project Lifecycle    17