Splitting data into training and testing sets is essential in machine learning for several reasons:
• Evaluation of model performance: By splitting the dataset, you can evaluate how well your model generalises
to new, unseen data. The testing set serves as a proxy for real-world data, allowing you to assess the model’s
performance accurately.
• Avoiding overfitting: Overfitting occurs when a model learns to memorise the training data’s patterns instead of
learning the underlying relationships. Splitting the data ensures that you can evaluate the model’s performance on
data it has not seen during training. If the model performs well on the testing set, it indicates that it has learned to
generalise rather than memorise.
• Model selection: When comparing different models or algorithms, it’s crucial to have a standardised testing set
for fair comparison. Splitting the data ensures that each model is evaluated on the same set of unseen examples,
allowing you to make informed decisions about which model performs best.
Let us now split the data into a training set and a testing set.
Program 61: To split the data of the IRIS dataset into training set and testing set
from sklearn.model_selection import train_test_split
# load dataset
from sklearn.datasets import load_iris
iris = load_iris()
# separate the data into features and target
X = iris.data
y = iris.target
# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Printing the shapes of the training and testing sets to verify the split
print("Training set - Features:", X_train.shape, " Labels:", y_train.shape)
print("Testing set - Features:", X_test.shape, " Labels:", y_test.shape)
Output:
Training set - Features: (120, 4) Labels: (120,)
Testing set - Features: (30, 4) Labels: (30,)
Adding a Classifier: KNeighborsClassifier
Scikit-learn provides a diverse set of machine learning (ML) methods, each following a standard interface for tasks such
as model fitting, prediction, and performance metrics like accuracy and recall. One of the most important and commonly
used methods is the K-Nearest Neighbors (KNN) classifier.
K-Nearest Neighbors (KNN) is a simple classification algorithm used in supervised learning. It’s primarily employed for
classification tasks, although it can also be adapted for regression. The goal of this classifier is to assign labels to new
instances based on their resemblance to those in the training set.
KNN produces predictions based on the majority class of the ‘k’ nearest data points. This method is very useful for small
to medium-sized datasets and is simple to apply and analyse.
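As a quick preview, the following minimal sketch fits a KNeighborsClassifier on the split produced in Program 61 and measures its accuracy on the testing set. The choice of n_neighbors=3 here is an illustrative assumption, not a value prescribed by the text.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load and split the IRIS data exactly as in Program 61
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
# Create a KNN classifier that votes among the 3 nearest neighbours (k=3 is an illustrative choice)
knn = KNeighborsClassifier(n_neighbors=3)
# Fit the classifier on the training set
knn.fit(X_train, y_train)
# Predict labels for the unseen testing set and compare them with the true labels
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))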