Page 308 - AI_Ver_3.0_class_11
• Outliers (extreme values): Outliers are data points that deviate significantly from most of the dataset, typically
due to errors or uncommon occurrences. Managing outliers includes detecting and excluding them, transforming
the data, or applying robust statistical techniques to minimise their influence.
• Inconsistent Data: Inconsistent data, such as typographical errors or variations in data types, is rectified to
ensure uniformity and coherence across the dataset.
• Duplicate Data: Duplicate data is identified and eliminated to maintain data integrity and accuracy.
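The three cleaning tasks above can be illustrated with a minimal sketch that uses only Python's standard library. The function names, the sample marks, and the choice of the interquartile range (IQR) rule with a multiplier of 1.5 are assumptions made for illustration; the text does not prescribe a particular outlier detection method.

```python
from statistics import quantiles

def remove_outliers_iqr(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

def standardise_labels(labels):
    """Fix inconsistent text by trimming spaces and unifying letter case."""
    return [label.strip().lower() for label in labels]

def drop_duplicates(rows):
    """Remove repeated rows, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

# Hypothetical sample data: 300 is an obvious data-entry error.
marks = [52, 55, 58, 60, 61, 63, 65, 300]
print(remove_outliers_iqr(marks))                      # [52, 55, 58, 60, 61, 63, 65]
print(standardise_labels([" Delhi", "DELHI ", "delhi"]))  # ['delhi', 'delhi', 'delhi']
print(drop_duplicates([(1, "a"), (2, "b"), (1, "a")]))    # [(1, 'a'), (2, 'b')]
```

Excluding an outlier is only one of the options listed above. Whether a value should be removed, transformed, or kept depends on whether it is an error or a genuine rare occurrence.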
2. Data Transformation: This process involves converting data into a format suitable
for analysis. Common techniques include normalisation, standardisation, and
discretisation. Normalisation scales the data to a common range, standardisation
adjusts the data to have a zero mean and unit variance, and discretisation converts
continuous data into discrete categories. Existing features may also be adjusted as
necessary.
3. Data Reduction: This process decreases the data volume, making analysis
easier while yielding the same or nearly the same results. It also helps to save
storage space. Common data reduction techniques include dimensionality
reduction (reducing the number of features in a dataset) and data compression.
4. Data Integration and Normalisation: Data from multiple sources or formats
is combined, or aggregated so that it is presented in the form of a summary.
The combined data is then normalised to ensure a uniform scale and distribution
across all features, which enhances the effectiveness of machine learning models.
Data integration is a key component of data management.
5. Feature Selection: This step involves choosing a subset of the most important
features from the dataset, so that irrelevant or redundant features are
eliminated.
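The three transformation techniques named in step 2 (normalisation, standardisation, and discretisation) can be sketched as follows. This is a minimal illustration using only Python's standard library. The function names, the sample heights, and the bin edges are assumptions, and min-max scaling to the range 0 to 1 is assumed as the form of normalisation.

```python
from statistics import mean, pstdev

def normalise(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # all values equal, so avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def standardise(values):
    """Z-score scaling: the result has zero mean and unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:                    # all values equal, so avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def discretise(values, edges, labels):
    """Give each value the label of the first bin whose upper edge it does not exceed.

    `labels` must have one more entry than `edges`; the last label covers
    every value above the final edge.
    """
    result = []
    for v in values:
        for edge, label in zip(edges, labels):
            if v <= edge:
                result.append(label)
                break
        else:
            result.append(labels[-1])
    return result

# Hypothetical heights in centimetres.
heights = [150, 160, 170, 180, 190]
print(normalise(heights))   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(discretise(heights, [160, 180], ["short", "medium", "tall"]))
# ['short', 'short', 'medium', 'medium', 'tall']
```

Normalisation is useful when features must share a fixed range, while standardisation is often preferred when features have very different spreads.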
Data in Modelling and Evaluation
Once data preprocessing is complete, the data is divided into two sets: the Training data and the Testing data.
Figure: Data is split into Training (70%) and Testing (30%).
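A 70/30 split like the one shown can be sketched in a few lines of Python. The function name, the fixed seed, and the sample records are assumptions for illustration. Shuffling before splitting is also an assumption, chosen so that the order in which the data was collected does not decide which records land in each set.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows, then return (training_rows, testing_rows)."""
    shuffled = list(rows)                  # copy, so the original list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset of 10 record IDs.
records = list(range(1, 11))
train, test = train_test_split(records)
print(len(train), len(test))  # 7 3
```

The model learns only from the training rows. The testing rows are held back so that the model can be evaluated on data it has not seen.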
306 Touchpad Artificial Intelligence (Ver. 3.0)-XI

