
Disadvantages of K-Means
                 Some of the disadvantages of K-Means Clustering are:
                 •  The value of K has to be chosen manually, which is not an easy process.
                 •  The algorithm is dependent on the initial values of the centroids (see the sketch after this list).
                 •  Outliers greatly affect the clustering process.
                 •  The algorithm has trouble grouping data where clusters are of varying sizes and densities.
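
                 The dependence on initial values can be seen in a small experiment. The sketch below is a minimal
                 Python example, assuming scikit-learn is available and using a synthetic make_blobs data set (both
                 are our choices for illustration, not part of the textbook): it runs k-means twice with a single
                 random initialization each and compares the resulting within-cluster sum of squares (inertia).

                     # Minimal sketch: k-means run twice with different starting centroids.
                     from sklearn.cluster import KMeans
                     from sklearn.datasets import make_blobs

                     # Synthetic two-dimensional data with four natural groups.
                     X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.5, random_state=7)

                     # n_init=1 keeps only one random initialization per run, so the final
                     # inertia can differ from seed to seed, showing how the result depends
                     # on the initial centroid positions.
                     for seed in (0, 42):
                         km = KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X)
                         print(f"seed={seed}  inertia={km.inertia_:.2f}")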

                        K-Means Generalization

                 The biggest advantage of the k-means algorithm is that it can cluster large data sets quite
                 efficiently. The basic algorithm, however, assumes roughly circular clusters of similar width. To
                 cluster naturally imbalanced data, k-means can be modified or generalized so that regions with
                 different cluster widths are modelled as elliptical rather than circular clusters, which greatly
                 improves the result.
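
                 The textbook does not name a particular generalization. One common way to obtain elliptical
                 clusters of different widths in practice is a Gaussian mixture model with full covariance
                 matrices; the sketch below is a minimal Python illustration of that idea (scikit-learn and the
                 synthetic data set are our assumptions).

                     # Minimal sketch: a Gaussian mixture as a generalization of k-means
                     # that allows clusters of different widths (elliptical clusters).
                     from sklearn.mixture import GaussianMixture
                     from sklearn.datasets import make_blobs

                     # Synthetic data with deliberately unequal cluster widths.
                     X, _ = make_blobs(n_samples=500, centers=3,
                                       cluster_std=[0.5, 1.5, 3.0], random_state=0)

                     gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
                     labels = gmm.fit_predict(X)    # cluster assignment for each point
                     print(gmm.covariances_.shape)  # (3, 2, 2): one 2x2 covariance (ellipse) per cluster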

                        Why is Clustering Unsupervised?


                 Clustering is an unsupervised machine learning technique that automatically divides the data into clusters, or
                 groups of similar elements. The algorithm does this without any advance knowledge of how the groups should look.
                 Clustering is therefore used for the discovery of knowledge rather than for prediction; it provides an idea of the
                 natural groupings present within the data.
                 Without advance knowledge of what a cluster includes, how can a computer know where a group begins or ends? The
                 answer is simple. Clustering is driven by the principle that objects within a group should be very similar to each
                 other, but very different from the objects outside it. The similarity function can vary across applications, but the
                 basic idea is always the same: group the data so that related elements are placed together.
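
                 This label-free workflow is easy to see in code. The minimal sketch below (Python with
                 scikit-learn; the library and the synthetic data are our assumptions) fits a clustering model
                 using only the feature matrix X, with no target labels supplied at any point.

                     # Minimal sketch: clustering receives only the features, never the labels.
                     from sklearn.cluster import KMeans
                     from sklearn.datasets import make_blobs

                     X, _ = make_blobs(n_samples=200, centers=3, random_state=1)  # any labels are discarded

                     km = KMeans(n_clusters=3, n_init=10, random_state=1)
                     groups = km.fit_predict(X)  # fit uses only X; no target vector is supplied
                     print(groups[:10])          # discovered cluster ids, not predefined classes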



                           At a Glance


                       •  Classification is the process of sorting a set of data (structured or unstructured) into different classes or
                       groups, where a label is assigned to each class.
                       •  Classification problems/tasks that have only two class labels use binary classification.
                       •  Multiclass classification comprises those classification tasks that have more than two class labels.
                       •  Logistic regression is used to predict binary outcomes.
                       •  A confusion matrix is used to verify the performance of a classification model, i.e. how good the classifier's
                       predictions are (a small worked example of these evaluation metrics follows this list).
                       •  Accuracy is described as the percentage of correct predictions out of all the samples.
                       •  Precision is described as the percentage of positive identifications which were correct.
                       •  Recall is defined as the proportion of positive cases that are correctly identified.
                       •  F1 Score gives a measure of the balance between precision and recall.
                       •  Clustering is the task of grouping a data set into sets of similar items. It is an unsupervised technique.
                       •  Clustering is used in fake news detection and recommender systems.
                       •  Different types of clustering include centroid-based, density-based, distribution-based, and hierarchical
                       clustering. The most popular is centroid-based clustering.
                       •  The K-means algorithm identifies k centroids and then assigns every data point to the nearest centroid,
                       keeping the resulting clusters as compact as possible.
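
                       The worked example below ties the evaluation points together. It is a minimal Python sketch
                       (assuming scikit-learn; the labels are made up for illustration) that prints the confusion
                       matrix and the accuracy, precision, recall and F1 score for a small set of binary predictions.

                           # Minimal sketch: confusion matrix and the four evaluation metrics.
                           from sklearn.metrics import (confusion_matrix, accuracy_score,
                                                        precision_score, recall_score, f1_score)

                           y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # actual class labels (illustrative)
                           y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]  # classifier's predictions

                           print(confusion_matrix(y_true, y_pred))              # rows: actual, columns: predicted
                           print("Accuracy :", accuracy_score(y_true, y_pred))  # correct predictions / all samples
                           print("Precision:", precision_score(y_true, y_pred)) # correct positives / predicted positives
                           print("Recall   :", recall_score(y_true, y_pred))    # correct positives / actual positives
                           print("F1 score :", f1_score(y_true, y_pred))        # balance of precision and recall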



