Touchpad AI - Page 197
The data cleaning process is an essential step in data management and analysis. It means preparing raw data so that
it becomes accurate, complete, and ready for use. Since data is often collected from different sources, it may contain
errors, missing information, or inconsistencies. Data cleaning helps remove these problems and ensures that the data
can be trusted for decision-making or research.
1. Data collection: The process begins with gathering data from various sources such as surveys, sensors, websites,
databases, or reports. At this stage, the data may not be perfect — it can have errors, incomplete entries, or different
formats.
2. Data inspection: Before cleaning, the collected data is carefully examined to identify problems. This step helps detect:
• Missing or blank values
• Duplicate records
• Typing or spelling mistakes
• Incorrect data types
• Inconsistent formats
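The inspection step can be sketched in a few lines of Python. The small `records` table below, with its deliberately missing and duplicated entries, is a hypothetical example for illustration only.

```python
# Inspect a small table (a list of dictionaries) for the problems
# listed above: missing values and duplicate records.
records = [
    {"name": "Asha", "age": 13, "grade": "8"},
    {"name": "Ravi", "age": None, "grade": "8"},   # missing value
    {"name": "Asha", "age": 13, "grade": "8"},     # duplicate record
]

# Count blank or None fields per column.
missing = {}
for row in records:
    for column, value in row.items():
        if value is None or value == "":
            missing[column] = missing.get(column, 0) + 1

# Detect duplicate rows by comparing their sorted (key, value) pairs.
seen = set()
duplicates = 0
for row in records:
    key = tuple(sorted(row.items()))
    if key in seen:
        duplicates += 1
    seen.add(key)

print("Missing values per column:", missing)
print("Duplicate rows:", duplicates)
```

Running the same kind of scan on every column gives a quick picture of how much cleaning the data needs before analysis begins.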
3. Handling missing data: Missing or incomplete information reduces the quality of data. To handle it, one of the
following methods is used:
• Remove missing entries if they are very few.
• Replace missing values with suitable estimates such as the average or the most common value.
• Recollect the data if the missing part is important.
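The second option, replacing missing values with the average, can be sketched as follows. The `marks` list is a hypothetical example, with `None` standing for a missing entry.

```python
# Replace missing values (None) with the average of the known values.
marks = [72, None, 85, 91, None, 68]

known = [m for m in marks if m is not None]
average = sum(known) / len(known)

filled = [m if m is not None else average for m in marks]
print(filled)
```

Using the average keeps the overall total and mean of the column unchanged, which is why it is a common choice for numeric data.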
4. Removing duplicates: Sometimes, the same record appears more than once due to repeated data entry or merging
files. Duplicate records are removed to avoid counting the same information multiple times.
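Duplicate removal can be sketched like this; the approach keeps the first occurrence of each record and preserves the original order. The `entries` list is a hypothetical example.

```python
# Remove duplicates while keeping the first occurrence of each entry.
entries = ["Meera", "Arjun", "Meera", "Sana", "Arjun"]

seen = set()
unique = []
for entry in entries:
    if entry not in seen:
        seen.add(entry)
        unique.append(entry)

print(unique)
```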
5. Correcting errors: Errors in data can include wrong spellings, typing mistakes, or incorrect figures. These are
corrected by comparing with reliable sources or using validation rules.
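One way to sketch error correction against a reliable source is fuzzy matching with Python's standard `difflib` module. The city names below are hypothetical; entries with no close match in the reference list are flagged for manual review rather than guessed.

```python
import difflib

# Correct spelling mistakes by matching each raw entry against a
# reliable reference list; unmatched entries become None for review.
valid_cities = ["Delhi", "Mumbai", "Chennai"]
raw_cities = ["Delhi", "mumbay", "CHENNAI"]

corrected = []
for city in raw_cities:
    match = difflib.get_close_matches(city.title(), valid_cities,
                                      n=1, cutoff=0.8)
    corrected.append(match[0] if match else None)

print(corrected)
```

The `cutoff` value controls how similar a raw entry must be to a reference name before it is auto-corrected; a higher cutoff means fewer, safer corrections.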
6. Standardising data: Data from different sources may follow different formats or units. For example:
• Dates might be written as “05-10-2025” or “2025/10/05”.
• Weights might be in kilograms or pounds.
Standardisation converts all data into a common and consistent format so that it can be easily compared and
analysed.
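Both examples above can be sketched in Python: converting mixed date formats to a single ISO format, and converting all weights to kilograms. The list of known input formats and the sample values are assumptions for illustration.

```python
from datetime import datetime

# Standardise dates: try each known input format, output YYYY-MM-DD.
raw_dates = ["05-10-2025", "2025/10/05"]
known_formats = ["%d-%m-%Y", "%Y/%m/%d"]

iso_dates = []
for text in raw_dates:
    for fmt in known_formats:
        try:
            iso_dates.append(datetime.strptime(text, fmt).strftime("%Y-%m-%d"))
            break
        except ValueError:
            continue

# Standardise weights: convert pounds to kilograms.
POUNDS_TO_KG = 0.453592
weights = [("60", "kg"), ("132", "lb")]
weights_kg = [round(float(v) * (POUNDS_TO_KG if u == "lb" else 1), 1)
              for v, u in weights]

print(iso_dates)
print(weights_kg)
```

After this step, every date and weight in the dataset follows one format, so records from different sources can be compared directly.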
7. Checking data consistency: This step ensures that data values make sense together. For example:
• A student’s age should match their grade level.
• The date of delivery should not be earlier than the date of order.
Consistency checks help identify and correct such logical errors.
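The delivery-date rule above can be sketched as a simple check; the `orders` records are a hypothetical example.

```python
from datetime import date

# Consistency check: delivery must not come before the order date.
orders = [
    {"id": 1, "ordered": date(2025, 10, 1), "delivered": date(2025, 10, 4)},
    {"id": 2, "ordered": date(2025, 10, 6), "delivered": date(2025, 10, 2)},
]

inconsistent = [o["id"] for o in orders if o["delivered"] < o["ordered"]]
print("Orders failing the check:", inconsistent)
```

Records that fail such a rule are not deleted automatically; they are flagged so someone can find out which of the two dates is wrong.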
8. Handling outliers: Outliers are unusual or extreme values that do not fit with the rest of the data. For example, if most
students score between 60 and 90 but one score is 300, it is likely an error. Such values are examined carefully and
corrected or removed if necessary.
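The marks example can be sketched with a simple expected-range check; the 0 to 100 bounds and the `scores` list are assumptions matching the scenario described above.

```python
# Flag values outside the expected range as outliers.
scores = [72, 88, 65, 300, 81]

LOW, HIGH = 0, 100
outliers = [s for s in scores if s < LOW or s > HIGH]
cleaned = [s for s in scores if LOW <= s <= HIGH]

print("Outliers:", outliers)
print("Cleaned:", cleaned)
```

For data without a known valid range, statistical rules (such as flagging values far from the mean) are used instead of fixed bounds.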
9. Verifying the cleaned data: After cleaning, the data is reviewed again to ensure that:
• There are no missing or duplicate records.
• All data follows the same format.
• Values are reasonable and accurate.
Verification ensures that the data is now reliable for further use.
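A final verification pass can combine the earlier checks into a single result. The `cleaned` table and the age range used here are hypothetical examples.

```python
# Verify the cleaned data: no missing fields, no duplicate rows,
# and every age within a reasonable range.
cleaned = [
    {"name": "Asha", "age": 13},
    {"name": "Ravi", "age": 14},
]

no_missing = all(v is not None and v != ""
                 for row in cleaned for v in row.values())
no_duplicates = len({tuple(sorted(r.items())) for r in cleaned}) == len(cleaned)
ages_valid = all(5 <= row["age"] <= 20 for row in cleaned)

verified = no_missing and no_duplicates and ages_valid
print("Data verified:", verified)
```

If any of the three checks fails, the relevant cleaning step is repeated before the data is used.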
Theoretical and Practical Aspects of Data Processing 195

