
How is Clustering Done?
To cluster data, the following steps are carried out:
1. Data Preparation: Prepare the data by including effective features for the clustering algorithm. The input dataset must contain the descriptive features of each sample, along with any new features derived from the original set.
2. Creating a Similarity Metric: The algorithm needs to understand how similar pairs of samples are. You quantify the similarity between samples by creating a similarity metric. This requires a clear understanding of your data and of how similarity can be derived from its features. For example, consider the PIN codes of an Indian state: if the difference between two PIN codes is small, the two regions they denote are close to each other and have a higher similarity. When you can quantify the metric manually in this way, it is called a 'manual similarity measure' (a small sketch of such a measure follows this list).
3. Run the Clustering Algorithm: The clustering algorithm uses the similarity metric developed in step 2 to group the data into clusters. Clustering algorithms can process large datasets efficiently; however, they do need to compute the similarity between all pairs of points.
4. Result Interpretation: Because clustering is unsupervised, interpreting the results is crucial and is usually done by a human expert. The results are checked against expectations, and if improvement is needed, the steps above are repeated.
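
The following is a minimal sketch in Python of a 'manual similarity measure' for step 2, based on the PIN code example above. The function name and the sample PIN codes are illustrative assumptions, not part of any standard library.

```python
def pin_code_similarity(pin_a: int, pin_b: int) -> float:
    """Return a similarity score in [0, 1]; closer PIN codes score higher."""
    max_gap = 999999            # largest possible gap between two 6-digit PIN codes
    gap = abs(pin_a - pin_b)    # a small gap means nearby regions, hence high similarity
    return 1 - gap / max_gap

# Two PIN codes from the same city come out far more similar than
# two PIN codes from distant regions.
print(pin_code_similarity(110001, 110003))   # close to 1.0 (very similar)
print(pin_code_similarity(110001, 700001))   # noticeably lower
```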


                 Types of Clustering
Several clustering algorithms are in popular use. Let us learn about some of them.

                 Centroid-based Clustering
Centroid-based clustering arranges the data into non-hierarchical clusters. K-means clustering is the most popular centroid-based clustering algorithm. Centroid-based algorithms are efficient but easily affected by the initial conditions and outliers.










[Figure: Data points plotted on X and Y axes, grouped into centroid-based clusters]
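
Below is a minimal K-means sketch, assuming the scikit-learn library is available. The (X, Y) points are made up purely for illustration.

```python
from sklearn.cluster import KMeans

points = [
    [1.0, 1.2], [1.1, 0.9], [0.9, 1.0],      # a group near (1, 1)
    [5.0, 5.1], [5.2, 4.9], [4.8, 5.0],      # a group near (5, 5)
    [9.0, 1.0], [9.1, 1.2], [8.9, 0.8],      # a group near (9, 1)
]

# K-means needs the number of clusters (k) up front; here k = 3.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(points)

print(labels)                   # cluster label assigned to each point
print(kmeans.cluster_centers_)  # the centroid of each cluster
```

Because the algorithm starts from randomly chosen centroids, different initial conditions can give different clusters, which is why the paragraph above notes its sensitivity to initialisation and outliers.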


Density-based Clustering
Density-based clustering groups areas of high density into clusters. This allows clusters of arbitrary shape, as long as the dense areas are connected. Data points lying in the sparse regions that separate the clusters are treated as outliers and are not assigned to any cluster.
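
Here is a minimal density-based clustering sketch using DBSCAN, assuming scikit-learn is installed. The points and the parameter values are illustrative assumptions.

```python
from sklearn.cluster import DBSCAN

points = [
    [1.0, 1.0], [1.2, 1.1], [0.9, 1.3], [1.1, 0.8],   # dense region A
    [6.0, 6.0], [6.2, 6.1], [5.9, 5.8], [6.1, 6.3],   # dense region B
    [3.5, 9.0],                                        # isolated point in a sparse region
]

# eps is the neighbourhood radius; min_samples is the minimum number of
# neighbours a point needs to lie inside a dense region.
labels = DBSCAN(eps=0.6, min_samples=3).fit_predict(points)

print(labels)   # points in sparse regions get the label -1, i.e. outliers
```

Unlike K-means, DBSCAN does not need the number of clusters in advance; it simply connects dense regions and leaves the low-density points unassigned.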






