Page 315 - Artificial Intellegence_v2.0_Class_11
Disadvantages of K-Means
Some of the disadvantages of K-Means Clustering are:
• The value of K has to be chosen manually, and picking a good value is not easy.
• The result depends on the initial positions chosen for the centroids.
• Outliers greatly affect the clustering process.
• The algorithm has trouble grouping data where clusters are of varying sizes and densities.
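The dependence on initial values can be seen in a small experiment. The following is a minimal sketch and not a reference implementation: the function names `kmeans` and `sse`, the four sample points and the two sets of starting centroids are all made up for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iter=100):
    """Plain k-means: assign points to the nearest centroid, then move each
    centroid to the mean of its points, until nothing changes."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [squared_distance(p, c) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


def sse(centroids, clusters):
    """Sum of squared distances from every point to its own centroid."""
    return sum(
        squared_distance(p, c)
        for c, cl in zip(centroids, clusters)
        for p in cl
    )


# Four points at the corners of a long, flat rectangle (hypothetical data).
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Start 1: one centroid on the left side, one on the right side.
good_centroids, good_clusters = kmeans(points, [(0, 0), (4, 0)])
# Start 2: both centroids on the left side.
bad_centroids, bad_clusters = kmeans(points, [(0, 0), (0, 1)])

good_sse = sse(good_centroids, good_clusters)
bad_sse = sse(bad_centroids, bad_clusters)
print("left/right split:", good_clusters, "error:", good_sse)
print("top/bottom split:", bad_clusters, "error:", bad_sse)
```

Both runs stop because the centroids no longer move, but the second start settles on a much worse grouping. This is why k-means is often run several times with different starting centroids, keeping the run with the smallest error.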
K-Means Generalization
The biggest advantage of the k-means algorithm is that it can cluster large
data sets quite efficiently. The algorithm can also be modified, or generalized,
to handle naturally imbalanced clusters. As the following graph shows, the
generalized version lets the regions have different cluster widths, so the
resulting clusters are elliptical. This greatly improves the result.
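One simple way to picture this idea is to give each cluster its own width along each axis and to divide distances by those widths. This is an assumption made for illustration, not necessarily the method behind the graph. In the sketch below, the names `plain_distance` and `scaled_distance`, the cluster centres and the widths are all hypothetical.

```python
import math


def plain_distance(point, centre):
    """Ordinary Euclidean distance, as used by standard k-means."""
    return math.dist(point, centre)


def scaled_distance(point, centre, widths):
    """Distance where each axis is divided by the cluster's width on that axis.
    A wide cluster therefore 'reaches' further along its wide axis."""
    return math.sqrt(
        sum(((p - c) / w) ** 2 for p, c, w in zip(point, centre, widths))
    )


# Two hypothetical clusters: one stretched along the x-axis, one compact.
clusters = {
    "wide": {"centre": (0.0, 0.0), "widths": (4.0, 1.0)},
    "narrow": {"centre": (5.0, 0.0), "widths": (1.0, 1.0)},
}
point = (3.0, 0.0)

plain_choice = min(
    clusters, key=lambda name: plain_distance(point, clusters[name]["centre"])
)
scaled_choice = min(
    clusters,
    key=lambda name: scaled_distance(
        point, clusters[name]["centre"], clusters[name]["widths"]
    ),
)
print("standard k-means assigns the point to:", plain_choice)
print("width-aware assignment gives:", scaled_choice)
```

With plain distances the point goes to the compact cluster because its centre is nearer. Once the widths are taken into account, the point falls inside the stretched, elliptical region of the wide cluster instead.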
Why is Clustering Unsupervised?
Clustering is an unsupervised machine learning technique that automatically divides the data into clusters or groups of
similar elements. The algorithm does this without knowing in advance how the groups should look. So, clustering
is used for the discovery of knowledge rather than for prediction. It provides an idea of the natural groupings that
exist within the data.
Without advance knowledge of what a cluster includes, how can a computer know where a group begins or ends? The
answer is simple. Clustering is driven by the principle that objects within a group should be very similar to each other,
but very different from the objects outside. The similarity function can vary across different applications, but the basic
idea is always the same—group the data so that the related elements are placed together.
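The principle above can be checked with a few lines of code. This is only a sketch: the two groups of points are invented, and Euclidean distance is assumed as the similarity measure (a small distance means high similarity).

```python
import math
from itertools import combinations, product

# Two hypothetical groups of 2-D points.
group_a = [(1, 1), (1, 2), (2, 1)]
group_b = [(8, 8), (9, 8), (8, 9)]


def mean(values):
    """Average of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Average distance between pairs of points that share a group.
within = mean(
    math.dist(p, q)
    for group in (group_a, group_b)
    for p, q in combinations(group, 2)
)

# Average distance between pairs of points taken from different groups.
between = mean(math.dist(p, q) for p, q in product(group_a, group_b))

print("average distance within a group:", round(within, 2))
print("average distance between groups:", round(between, 2))
```

A good grouping gives a small value for `within` and a large value for `between`. A different application could swap `math.dist` for another similarity function without changing the idea.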
At a Glance
• Classification is the process of sorting a set of data (structured or unstructured) into different classes or
groups, where each class is assigned a label.
• Classification tasks that have only two class labels are called binary classification.
• Multiclass classification comprises those classification tasks that have more than two class labels.
• Logistic regression is used to predict binary outcomes.
• A confusion matrix is used to verify the performance of a classification model, that is, how good the classifier’s
predictions are.
• Accuracy is described as the percentage of correct predictions out of all the samples.
• Precision is described as the percentage of positive predictions that were actually correct.
• Recall is defined as the proportion of actual positive cases that are correctly identified.
• F1 Score gives a measure of the balance between precision and recall.
• Clustering is the task of grouping the items of a data set into sets of similar items. It is an unsupervised learning technique.
• Clustering is used in fake news detection and recommender systems.
• Different types of clustering include centroid-based, density-based, distribution-based and hierarchical
clustering. The most popular is centroid-based clustering.
• The K-means algorithm identifies k centroids, and then assigns every data point to the nearest
centroid, while trying to keep the distances between the points and their centroids as small as possible.
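The metric definitions in the summary above can be worked out from the four cells of a binary confusion matrix. The counts below are invented for illustration. The F1 formula used here is the standard harmonic mean of precision and recall, which the summary describes only as a balance between the two.

```python
# Hypothetical confusion matrix counts for a binary classifier.
tp = 40  # true positives: predicted positive, actually positive
fp = 10  # false positives: predicted positive, actually negative
fn = 5   # false negatives: predicted negative, actually positive
tn = 45  # true negatives: predicted negative, actually negative

total = tp + fp + fn + tn

# Correct predictions out of all samples.
accuracy = (tp + tn) / total

# Of everything predicted positive, how much was really positive.
precision = tp / (tp + fp)

# Of everything really positive, how much was found.
recall = tp / (tp + fn)

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.2f}")
print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
print(f"f1 score  = {f1:.2f}")
```

Here the classifier misses few real positives (high recall) but raises some false alarms (lower precision), and the F1 score lands between the two values.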

