
The data cleaning process is an essential step in data management and analysis. It means preparing raw data so that
it becomes accurate, complete, and ready for use. Since data is often collected from different sources, it may contain
errors, missing information, or inconsistencies. Data cleaning helps remove these problems and ensures that the data
can be trusted for decision-making or research.
                 1.  Data collection: The process begins with gathering data from various sources such as surveys, sensors, websites,
                    databases, or reports. At this stage, the data may not be perfect: it can contain errors, incomplete entries, or
                    inconsistent formats.
                 2.  Data inspection: Before cleaning, the collected data is carefully examined to identify problems. This step helps detect:
                    • Missing or blank values
                    • Duplicate records
                    • Typing or spelling mistakes
                    • Incorrect data types
                    • Inconsistent formats

                    [Figure: Missing Values, Duplicate Data, Spelling Errors, Incorrect Data Types, Inconsistent Formats]
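
The inspection step can be sketched in a few lines of Python. This is only an illustration: the record layout (a list of dictionaries), the field names, and the helper name `inspect_records` are assumptions made for the example, not part of any particular tool.

```python
from collections import Counter

def inspect_records(records, expected_types):
    """Count common problems in a list of dict records (hypothetical layout)."""
    report = {"missing": 0, "duplicates": 0, "wrong_type": 0}
    seen = Counter()
    for row in records:
        # Two rows are treated as duplicates when all their fields are equal.
        key = tuple(sorted(row.items(), key=lambda kv: kv[0]))
        seen[key] += 1
        for field, expected in expected_types.items():
            value = row.get(field)
            if value is None or value == "":
                report["missing"] += 1      # blank or absent value
            elif not isinstance(value, expected):
                report["wrong_type"] += 1   # e.g. a number stored as text
    # Every copy beyond the first counts as one duplicate.
    report["duplicates"] = sum(n - 1 for n in seen.values() if n > 1)
    return report

records = [
    {"name": "Asha", "age": 14},
    {"name": "Ravi", "age": None},   # missing value
    {"name": "Asha", "age": 14},     # duplicate record
    {"name": "Meena", "age": "15"},  # age stored as text, not a number
]
report = inspect_records(records, {"name": str, "age": int})
print(report)
```

The report only counts problems; deciding how to fix each one is left to the later steps.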


                 3.  Handling missing data: Missing or incomplete information reduces the quality of data. To handle it, one of the
                    following methods is used:
                    • Remove missing entries if they are very few.
                    • Replace missing values with suitable estimates such as the average or most common value.
                    • Recollect data if the missing part is important.
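
The first two options above (removing entries, or replacing them with the average or most common value) might look like this in Python. It is a minimal sketch that assumes missing values are stored as `None`; the function names and sample data are made up for illustration.

```python
from statistics import mean, mode

def drop_missing(values):
    """Remove missing entries (represented here as None)."""
    return [v for v in values if v is not None]

def fill_with_mean(values):
    """Replace missing numbers with the average of the known ones."""
    average = mean(drop_missing(values))
    return [average if v is None else v for v in values]

def fill_with_mode(values):
    """Replace missing entries with the most common known value."""
    most_common = mode(drop_missing(values))
    return [most_common if v is None else v for v in values]

scores = [70, None, 80, 90]
cities = ["Delhi", "Pune", None, "Delhi"]

print(drop_missing(scores))
print(fill_with_mean(scores))
print(fill_with_mode(cities))
```

The average suits numeric fields, while the most common value suits categories such as city names.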
                 4.  Removing duplicates: Sometimes, the same record appears more than once due to repeated data entry or merging
                    files. Duplicate records are removed to avoid counting the same information multiple times.
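
As a rough sketch, duplicates can be removed by remembering which records have already been seen. The choice of which fields identify a record (`key_fields`) is an assumption that depends on the dataset.

```python
def remove_duplicates(records, key_fields):
    """Keep the first occurrence of each record, preserving the original order."""
    seen = set()
    unique = []
    for row in records:
        # Records count as the same when all the chosen key fields match.
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

entries = [
    {"roll_no": 1, "name": "Asha"},
    {"roll_no": 2, "name": "Ravi"},
    {"roll_no": 1, "name": "Asha"},  # entered twice
]
cleaned = remove_duplicates(entries, ["roll_no", "name"])
print(cleaned)
```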
                 5.  Correcting errors: Errors in data can include wrong spellings, typing mistakes, or incorrect figures. These are
                    corrected by comparing with reliable sources or using validation rules.
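
Both ideas, comparing with a reliable source and applying a validation rule, can be sketched as follows. The reference list `KNOWN_CITIES`, the similarity cutoff of 0.8, and the 0 to 100 score range are assumptions chosen for the example.

```python
from difflib import get_close_matches

KNOWN_CITIES = ["Mumbai", "Kolkata", "Delhi"]  # hypothetical reliable reference list

def correct_spelling(value, reference=KNOWN_CITIES, cutoff=0.8):
    """Return the closest reference spelling, or the value unchanged if none is close."""
    matches = get_close_matches(value, reference, n=1, cutoff=cutoff)
    return matches[0] if matches else value

def is_valid_score(score):
    """Validation rule (assumed): a score must be a number from 0 to 100."""
    return isinstance(score, (int, float)) and 0 <= score <= 100

print(correct_spelling("Mumbay"))
print(correct_spelling("Kolkatta"))
print(is_valid_score(300))
```

Automatic spelling correction can guess wrongly, so values that change are worth reviewing by hand.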
                 6.  Standardising data: Data from different sources may follow different formats or units. For example:
                    • Dates might be written as “05-10-2025” or “2025/10/05”.
                    • Weights might be in kilograms or pounds.
                    Standardisation converts all data into a common and consistent format so that it can be easily compared and
                    analysed.
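
A minimal sketch of standardisation, using the two examples above. It assumes that "05-10-2025" means day-month-year (the text does not say), that the common date format is year-month-day, and that weights are converted to kilograms; the function names are illustrative.

```python
from datetime import datetime

# Assumption: the first pattern is day-month-year, the second year/month/day.
DATE_FORMATS = ["%d-%m-%Y", "%Y/%m/%d"]
KG_PER_POUND = 0.45359237

def standardise_date(text):
    """Convert a date in any known format to YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"Unrecognised date format: {text!r}")

def to_kilograms(value, unit):
    """Convert a weight to kilograms; unit is 'kg' or 'lb'."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value * KG_PER_POUND
    raise ValueError(f"Unknown unit: {unit!r}")

print(standardise_date("05-10-2025"))
print(standardise_date("2025/10/05"))
print(to_kilograms(10, "lb"))
```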
                 7.  Checking data consistency: This step ensures that data values make sense together. For example:
                    • A student’s age should match their grade level.
                    • The date of delivery should not be earlier than the date of order.
                    Consistency checks help identify and correct such logical errors.
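
The two consistency rules above could be written as simple checks. The age rule here is only an assumed rule of thumb (typical age is about grade + 5, give or take 2 years); a real school system would supply its own rule.

```python
from datetime import date

def delivery_is_consistent(order_date, delivery_date):
    """A delivery cannot happen before the order was placed."""
    return delivery_date >= order_date

def age_matches_grade(age, grade, offset=5, tolerance=2):
    """Assumed rule of thumb: typical age is grade + offset, within the tolerance."""
    return abs(age - (grade + offset)) <= tolerance

print(delivery_is_consistent(date(2025, 10, 5), date(2025, 10, 8)))
print(delivery_is_consistent(date(2025, 10, 5), date(2025, 10, 1)))
print(age_matches_grade(14, 9))
print(age_matches_grade(25, 3))
```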
                 8.  Handling outliers: Outliers are unusual or extreme values that do not fit with the rest of the data. For example, if most
                    students score between 60 and 90 but one score is 300, it is likely an error. Such values are examined carefully and
                    corrected or removed if necessary.
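
One common way to flag such values is the interquartile range (IQR) rule, shown here as a sketch. The text does not prescribe a method, so the IQR rule and the usual multiplier of 1.5 are assumptions; flagged values still need to be examined before they are corrected or removed.

```python
from statistics import quantiles

def find_outliers(values, k=1.5):
    """Flag values lying more than k * IQR outside the middle half of the data."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

scores = [60, 65, 70, 75, 80, 85, 90, 300]
print(find_outliers(scores))
```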
                 9.  Verifying the cleaned data: After cleaning, the data is reviewed again to ensure that:
                    • There are no missing or duplicate records.
                    • All data follows the same format.
                    • Values are reasonable and accurate.
                    Verification ensures that the data is now reliable for further use.
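
The three verification checks can be combined into one final pass, sketched below. The field names (`date`, `score`), the YYYY-MM-DD date format, and the 0 to 100 score range are assumptions made for the example.

```python
import re

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # assumed common date format

def verify(records):
    """Return a list of remaining problems; an empty list means the data passed."""
    problems = []
    seen = set()
    for i, row in enumerate(records):
        if any(v is None or v == "" for v in row.values()):
            problems.append(f"row {i}: missing value")
        key = tuple(sorted(row.items()))
        if key in seen:
            problems.append(f"row {i}: duplicate record")
        seen.add(key)
        if not ISO_DATE.match(str(row.get("date", ""))):
            problems.append(f"row {i}: date not in YYYY-MM-DD format")
        score = row.get("score")
        if not (isinstance(score, (int, float)) and 0 <= score <= 100):
            problems.append(f"row {i}: score out of range")
    return problems

clean = [
    {"name": "Asha", "date": "2025-10-05", "score": 82},
    {"name": "Ravi", "date": "2025-10-06", "score": 74},
]
dirty = [{"name": "Asha", "date": "05-10-2025", "score": 300}]

print(verify(clean))
print(verify(dirty))
```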


                                                                      Theoretical and Practical Aspects of Data Processing  195