Training data is used to teach machine learning models, while testing data is used to assess how well the trained
models perform. During modelling, suitable machine learning algorithms are selected based on the problem type
(e.g., classification, regression, clustering) and the characteristics of the dataset.
Training data vs. Testing data
| Feature | Training data | Testing data |
| --- | --- | --- |
| Purpose | Training data is used in the learning phase. In general, the more training data the model has, the better it can make predictions. | Testing data is used to check the performance of the model. |
| Exposure | The model learns from the training data to make accurate predictions. | The testing data is not exposed to the model before evaluation. It is new, unseen data for the model. |
| Distribution | The distribution of the training data should be like the distribution of the real-world data that the model will be used on. | The distribution of the testing data may differ from the real-world data; the closer it is to the real-world data, the better the test reflects real-world performance. |
| Size | The training data is larger, as the model needs to analyse and observe patterns in it to make accurate predictions. | The testing data is smaller than the training data, because it is used only to evaluate the performance of the model that has been trained on the training data. |
Various techniques, such as train-test split, cross-validation, and error analysis, are employed to gauge the model’s
generalisation ability and pinpoint areas for improvement. In the train-test split technique, the dataset is divided into
two sets: training and testing. The model is trained on the training data, and its performance is assessed on the
testing data. Cross-validation checks that model performance is consistent across different subsets of the data. You will
study these techniques in detail in Class XII.
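The train-test split and cross-validation ideas described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names `train_test_split` and `k_fold_indices`, the 80/20 split, and the 10-record stand-in dataset are assumptions made for this example and are not taken from the chapter or from any particular library.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists.

    Hypothetical helper for illustration; test_fraction and seed are
    assumed defaults, not values prescribed by the text.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a repeatable split
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, test_idx) pairs so that every row is tested exactly once."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


data = list(range(10))  # stand-in dataset of 10 records (assumed)

# Train-test split: the larger part trains the model, the smaller part tests it.
train, test = train_test_split(data, test_fraction=0.2)
print(len(train), len(test))  # 8 training records and 2 testing records

# Cross-validation: each fold takes a turn as the testing data.
for train_idx, test_idx in k_fold_indices(len(data), k=5):
    print("test fold:", test_idx, "| training records:", len(train_idx))
```

In cross-validation, a model would be trained and evaluated once per fold, and the scores compared to see whether performance is consistent across the different subsets.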
Different evaluation techniques are applied depending on the problem type. For classification problems, metrics such as
accuracy, precision, recall, F1-score, and the ROC curve are commonly used. For regression tasks, metrics like Mean Squared
Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared are frequently used.
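To make the metrics named above concrete, the following sketch computes them from lists of actual and predicted values using their standard definitions. The function names `classification_metrics` and `regression_metrics` and the small sample lists are assumptions made for this example; the ROC curve is left out because it needs predicted scores rather than class labels.

```python
import math


def classification_metrics(actual, predicted, positive=1):
    """Accuracy, precision, recall and F1-score for a two-class problem.

    Hypothetical helper for illustration; `positive` marks the class of interest.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # true positives
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # false positives
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # false negatives
    tn = sum(1 for a, p in pairs if a != positive and p != positive)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def regression_metrics(actual, predicted):
    """MSE, RMSE, MAE and R-squared for numeric predictions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    squared_error_sum = sum(e * e for e in errors)
    mse = squared_error_sum / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    total_variation = sum((a - mean_actual) ** 2 for a in actual)
    # R-squared is undefined when all actual values are identical.
    r2 = 1 - squared_error_sum / total_variation if total_variation else float("nan")
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


# Assumed sample labels: 1 = positive class, 0 = negative class.
actual_labels = [1, 0, 1, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 0, 1, 0, 1, 1, 0]
cls = classification_metrics(actual_labels, predicted_labels)
print({name: round(value, 3) for name, value in cls.items()})

# Assumed sample values for a regression task.
actual_values = [3.0, 5.0, 7.0, 9.0]
predicted_values = [2.0, 5.0, 8.0, 9.0]
reg = regression_metrics(actual_values, predicted_values)
print({name: round(value, 3) for name, value in reg.items()})
```

For these sample lists, 6 of the 8 labels are predicted correctly, so the accuracy is 0.75; the regression errors are 1, 0, -1 and 0, so the MSE is 0.5.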
In today’s era, proficiency in handling data is crucial. With the rise of Artificial Intelligence, understanding data
allows us to leverage information effectively. It is akin to having a map for navigating a large city; being adept with data
empowers us to make informed decisions and utilise technology wisely.
At a Glance
• Data literacy involves the ability to find and use data proficiently.
• Data can be structured, semi-structured, or unstructured.
• AI data analysis employs AI techniques and Data Science to enhance the processes of cleaning, inspecting,
and modelling both structured and unstructured data.
• Data collection means gathering data from many sources, both offline and online.
• Primary and secondary sources are the two main sources from which data is collected.
• Primary data is obtained directly from the source and has not been previously published or analysed by
others.
• Secondary data can be obtained from research articles, books, reports, and internet databases.
• The method used to measure a collection of data is known as the level of measurement.

