Splitting data into training and testing sets is essential in machine learning for several reasons:
    • Evaluation of model performance: By splitting the dataset, you can evaluate how well your model generalises to new, unseen data. The testing set serves as a proxy for real-world data, allowing you to assess the model's performance accurately.
    • Avoiding overfitting: Overfitting occurs when a model learns to memorise the training data's patterns instead of learning the underlying relationships. Splitting the data ensures that you can evaluate the model's performance on data it has not seen during training. If the model performs well on the testing set, it indicates that it has learned to generalise rather than memorise.
    • Model selection: When comparing different models or algorithms, it's crucial to have a standardised testing set for fair comparison. Splitting the data ensures that each model is evaluated on the same set of unseen examples, allowing you to make informed decisions about which model performs best.
              Let us now split the data into a training set and a testing set.

Program 61: To split the data of the IRIS dataset into a training set and a testing set

   from sklearn.model_selection import train_test_split

   # load the dataset
   from sklearn.datasets import load_iris
   iris = load_iris()

   # separate the data into features and target
   X = iris.data
   y = iris.target

   # split the data into training and testing sets (80% training, 20% testing)
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

   # print the shapes of the training and testing sets to verify the split
   print("Training set - Features:", X_train.shape, " Labels:", y_train.shape)
   print("Testing set - Features:", X_test.shape, " Labels:", y_test.shape)
              Output:
                  Training set - Features: (120, 4)  Labels: (120,)

                  Testing set - Features: (30, 4)  Labels: (30,)

              Adding a Classifier: KNeighborsClassifier
Scikit-learn provides a diverse set of machine learning (ML) methods, each following a standard interface for tasks such as model fitting, prediction, and performance evaluation using metrics like accuracy and recall. One of the most important and commonly used methods is the K-Nearest Neighbors (KNN) classifier.

K-Nearest Neighbors (KNN) is a simple classification algorithm used in supervised learning. It is primarily employed for classification tasks, although it can also be adapted for regression. The goal of this classifier is to assign labels to new instances based on their similarity to those in the training set.

KNN predicts the label of a new data point from the majority class among its 'k' nearest neighbours in the training set. This method is very useful for small to medium-sized datasets and is simple to apply and analyse.
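Continuing from the split made in Program 61, the following is a minimal sketch of how a KNeighborsClassifier could be trained and evaluated on the IRIS dataset; the value n_neighbors=3 here is an illustrative choice, not a prescribed one.

   from sklearn.datasets import load_iris
   from sklearn.model_selection import train_test_split
   from sklearn.neighbors import KNeighborsClassifier
   from sklearn.metrics import accuracy_score

   # load and split the Iris dataset as in Program 61
   iris = load_iris()
   X_train, X_test, y_train, y_test = train_test_split(
       iris.data, iris.target, test_size=0.2, random_state=42)

   # create a KNN classifier; k=3 means each prediction is the
   # majority class among the 3 nearest training points
   knn = KNeighborsClassifier(n_neighbors=3)

   # fit the model on the training set
   knn.fit(X_train, y_train)

   # predict labels for the unseen testing set
   y_pred = knn.predict(X_test)

   # measure how well the model generalises to unseen data
   print("Accuracy on the testing set:", accuracy_score(y_test, y_pred))

Trying a few different values of 'k' and comparing the testing accuracy for each is a simple instance of the model selection idea described earlier: every candidate model is judged on the same set of unseen examples.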
