Page 311 - Artificial Intellegence_v2.0_Class_11
How is Clustering Done?
In order to cluster the data, the following steps are conducted:
1. Data Preparation: Data preparation means selecting effective data features for the clustering algorithm. The
input dataset must include the descriptive features, along with any new features that are derived from the
original set.
2. Creating Similarity Metric: The algorithm needs to know how similar each pair of samples is. You quantify
the similarity between samples by creating a similarity metric. This requires a clear understanding of your data and
of how to derive similarity from the data features. For example, consider the pin codes of an Indian state. If the difference
between two pin codes is small, this suggests that the two regions denoted by the pin codes are close to each other
and have a higher similarity. When you can define the metric by hand in this way, it is called a ‘manual similarity measure’.
3. Run the Clustering Algorithm: A clustering algorithm uses the similarity metric developed in step 2 to cluster the data.
Some clustering algorithms handle large datasets efficiently. However, many need to compute the similarity between
all pairs of points, so their running time grows quickly as the dataset gets larger.
4. Result Interpretation: Because clustering is unsupervised, there are no labels to check the output against, so the
interpretation of results is crucial and is usually handled by a human expert. The results are verified against
expectations and, if improvement is required, the above steps are repeated.
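Steps 2 and 3 above can be illustrated with a minimal Python sketch. It is only one possible manual similarity measure for the pin code example: the function names, the scaling constant 899999 (the largest possible gap between two six-digit numbers) and the sample pin codes are assumptions made for illustration, not part of the original text.

```python
from itertools import combinations

# Assumption: pin codes are six-digit numbers, so the largest possible
# difference between two of them is 999999 - 100000 = 899999.
MAX_DIFFERENCE = 899_999

def pin_code_similarity(pin_a, pin_b):
    """Manual similarity measure: 1.0 for identical pin codes,
    decreasing towards 0.0 as the numeric difference grows."""
    return 1.0 - abs(pin_a - pin_b) / MAX_DIFFERENCE

def pairwise_similarities(pins):
    """Compute the similarity for every pair of pin codes.
    For n pin codes this produces n * (n - 1) / 2 values."""
    return {(a, b): pin_code_similarity(a, b) for a, b in combinations(pins, 2)}

# Illustrative sample values only
pins = [110001, 110005, 110092, 400001]
sims = pairwise_similarities(pins)

# The pair with the highest similarity has the smallest numeric difference
closest_pair = max(sims, key=sims.get)
print(len(sims), closest_pair)
```

Because every pair is compared, doubling the number of pin codes roughly quadruples the number of similarity values to compute.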
Types of Clustering
Several types of clustering algorithms are in common use. Let us learn about some of them.
Centroid-based Clustering
Centroid-based clustering arranges the data into non-hierarchical clusters, each represented by a central point
called the centroid. K-means clustering is the most popular centroid-based clustering algorithm. Centroid-based
algorithms are efficient but sensitive to the initial conditions and to outliers.
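The idea behind K-means can be sketched in a few lines of plain Python. This is a simplified illustration and not a production implementation: the function names, the fixed random seed and the sample points are assumptions chosen for the example. It repeats two steps: assign every point to its nearest centroid, then move each centroid to the mean of the points assigned to it.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, iterations=100, seed=0):
    """Return (centroids, clusters) after running a basic K-means loop."""
    rng = random.Random(seed)
    # Initial condition: pick k of the data points as starting centroids
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # no centroid moved, so the assignment is stable
        centroids = new_centroids
    return centroids, clusters

# Illustrative data: two well-separated groups of points
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, k=2)
print(clusters)
```

Changing the seed changes the starting centroids. On well-separated data like this the result stays the same, but on less clearly separated data a different start can give different clusters, which is the sensitivity to initial conditions mentioned above.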
Density-based Clustering
Density-based clustering connects high-density areas into clusters. This allows clusters of arbitrary shape, as long
as the dense areas can be connected. The data points in the low-density regions that separate the clusters are
considered outliers and are not assigned to any cluster.
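A minimal sketch of this idea is given below. It follows the general approach of DBSCAN, a well-known density-based algorithm that the text above does not name, so treat it as an illustration only. The function names, the parameters eps (neighbourhood radius) and min_points (how many neighbours make an area dense) and the sample points are assumptions chosen for the example.

```python
def region_query(points, index, eps):
    """Indices of all points within distance eps of points[index], itself included."""
    px, py = points[index]
    return [j for j, (qx, qy) in enumerate(points)
            if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

def density_cluster(points, eps, min_points):
    """Return one label per point: 0, 1, ... for clusters and -1 for outliers."""
    UNVISITED, OUTLIER = None, -1
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_points:
            labels[i] = OUTLIER  # low-density area (may later become a border point)
            continue
        # Point i lies in a dense area: start a new cluster and grow it
        labels[i] = cluster_id
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == OUTLIER:
                labels[j] = cluster_id  # border point reached from a dense area
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_points:
                queue.extend(j_neighbours)  # j is dense too, so keep connecting
        cluster_id += 1
    return labels

# Illustrative data: two dense groups and one isolated point
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = density_cluster(points, eps=1.5, min_points=3)
print(labels)
```

Unlike the centroid-based approach, the number of clusters is not fixed in advance here. It follows from how the dense areas connect, and the isolated point receives the outlier label instead of being forced into a cluster.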
Classification & Clustering 309

