
In document DATAMINING, Gábor Berend (pages 191–194)



9.1 What is clustering?

Clustering deals with partitioning a dataset into coherent subsets in an unsupervised manner. The lack of supervision means that these algorithms are given a dataset of observations without any target label that we would like to recover or predict accurately based on the variables describing the observations.

Instead, we are only given the data points as m-dimensional observations, and we are interested in finding a grouping of the data points such that those that are similar to each other – in some previously unspecified sense – are grouped together. Since similarity of the data points is an important aspect of clustering techniques, the techniques discussed previously in Chapter 4 and Chapter 5 are of great relevance for performing clustering.

As a concrete example, we could perform clustering over a collection of novels based on their textual contents in order to find the ones that have a high topical overlap. As another example, users with similar financial habits could be identified by looking at their credit card history. Those people identified as having similar behavior could then be offered the same products by some financial institution, or, even better, special offers could be provided for some dedicated cluster of people with high business value.

Learning Objectives:

• The task of clustering

• Difference between agglomerative and partitioning approaches

• Hierarchical clustering techniques

• Improving hierarchical clustering

• k-means clustering

• Bradley–Fayyad–Reina algorithm

9.1.1 Illustrating the difficulty of clustering

As mentioned earlier, the main goal of clustering algorithms is to find groups of coherently behaving data points based on the commonalities in their representation. The main difficulty of clustering is that coherent and arguably sensible subsets of the input data can be formed in multiple ways. Figure 9.1 illustrates this problem. If we were given the raw dataset as illustrated in Figure 9.1 (a), we could argue that the data points located on the same y-coordinate form a cluster, as depicted in Figure 9.1 (b). One could argue that a similar clustering which assigns points with the same x-coordinate to the same cluster would also make sense.

It turns out, however, that the ‘true’ distribution of our data follows the one illustrated in Figure 9.1 (c), as the clusters within the data correspond to letters from the Braille alphabet. (Can you decode what is written in Figure 9.1 (c)?) The difficulty of clustering is that the true distribution is never known in reality. Indeed, if we already knew the underlying data distribution, there would be no reason to perform clustering in the first place. This example also illustrates that the relation which makes data points belong together can sometimes be a subtle, non-linear one.

Clustering is hence not a well-determined problem, in the sense that multiple different solutions can be obtained for the same input data. On the other hand, clustering is of great practical importance, as it lets us find hidden (and hopefully valid) structure within datasets.

(a) The raw unlabeled data.

(b) One sensible clustering of the data based on their y-coordinates.

(c) The real structure of the data.

Figure 9.1: A schematic illustration of clustering.


9.1.2 What makes a good clustering?

Kleinberg imposed three desired properties regarding the behavior of clustering algorithms (Kleinberg, 2002). The three criteria were scale invariance, richness, and consistency.

What these concepts mean for a set of data points S and an associated distance d : S × S → R+ ∪ {0} over them is described below.

Let us think of the output of some clustering algorithm as a function f(d, S) which – when provided with a notion of pairwise distances over the points from S – produces a disjoint partitioning of the dataset S that we denote by Γ. Having introduced these notations, we can now revisit the three desiderata introduced by Kleinberg.

Scale invariance means that the clustering algorithm should be insensitive to a rescaling of the pairwise distances and provide the exact same output for the same dataset if the notion of pairwise distances changes by a constant factor. That is, ∀d, ∀α > 0 : f(d, S) = f(αd, S). Intuitively, if our observations are described by vectors that indicate the size of a certain object, the output of the clustering algorithm should not differ whether we provide our measurements in millimeters, centimeters, or miles.
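The following toy check (an illustrative sketch, not code from the book; the function name and the data are made up) makes this concrete. A minimal single-linkage clusterer merges clusters until k of them remain; it is run once with a Euclidean distance and once with that distance rescaled by α = 3.7, and the resulting partitions agree.

```python
from itertools import combinations

def single_linkage(points, dist, k):
    """Agglomerative single-linkage: repeatedly merge the two
    clusters whose closest members are nearest, until k remain."""
    clusters = [{i} for i in range(len(points))]
    while len(clusters) > k:
        # pick the pair of clusters with the smallest inter-point distance
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: min(dist(points[i], points[j])
                                      for i in clusters[ab[0]]
                                      for j in clusters[ab[1]]))
        clusters[a] |= clusters.pop(b)
    return sorted(tuple(sorted(c)) for c in clusters)

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
d = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
scaled_d = lambda p, q: 3.7 * d(p, q)   # the same distance rescaled by α = 3.7

# the partition is unaffected by rescaling all pairwise distances
assert single_linkage(pts, d, 2) == single_linkage(pts, scaled_d, 2)
```

Fixing the number of clusters, rather than a distance threshold, is what makes the procedure immune to rescaling: the argmin over cluster pairs is unchanged when every distance is multiplied by the same positive constant.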

The richness property requires that, if we have the freedom of changing d, i.e., the notion of pairwise distances over S, the clustering algorithm f should be able to output every possible partitioning of the dataset. To put it more formally, ∀Γ ∃d : f(d, S) = Γ.
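For a simple algorithm, richness is easy to exhibit. The sketch below (illustrative code, not from the book; the helper name `components` is made up) clusters points by linking every pair closer than a fixed threshold and taking the connected components. Any target partition Γ can then be forced by choosing small within-cluster and large between-cluster distances.

```python
def components(n, dist, threshold):
    """Cluster the indices 0..n-1: link any two points closer than
    `threshold`; the clusters are the connected components."""
    parent = list(range(n))
    def find(x):                       # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if dist(i, j) < threshold:
                parent[find(i)] = find(j)
    return sorted({tuple(sorted(k for k in range(n) if find(k) == find(r)))
                   for r in range(n)})

# any target partition Γ can be forced by a suitable distance:
target = [(0, 3), (1,), (2, 4)]
cluster_of = {p: c for c, block in enumerate(target) for p in block}
d = lambda i, j: 0.5 if cluster_of[i] == cluster_of[j] else 2.0
assert components(5, d, threshold=1.0) == target
```

Note that this threshold-based rule is exactly the kind of algorithm that trades scale invariance for richness: multiplying all distances by a large enough constant would push every pair past the threshold and change the output.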

The consistency criterion for a clustering algorithm f requires the output of f to remain the same whenever d is modified by a Γ-transformation. A Γ-transformation of some distance d with respect to a clustering function f is a transformation for which the resulting distances d′ are such that the distances between pairs of points

• assigned to the same cluster by f do not increase,

• assigned to different clusters by f do not decrease.

A clustering algorithm f fulfils consistency if, for any d′ obtained by a Γ-transformation, we have f(d, S) = Γ ⇒ f(d′, S) = Γ.
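The definition can be turned into a direct check. The snippet below (an illustrative sketch, not from the book; the helper name is made up) verifies the two defining conditions for a pair of distance matrices over four points clustered as Γ = {0, 1} ∪ {2, 3}.

```python
def is_gamma_transformation(d, d2, partition):
    """Check the two defining conditions of a Γ-transformation:
    within-cluster distances may only shrink, between-cluster
    distances may only grow."""
    cluster_of = {p: c for c, block in enumerate(partition) for p in block}
    n = len(d)
    for i in range(n):
        for j in range(i + 1, n):
            if cluster_of[i] == cluster_of[j]:
                if d2[i][j] > d[i][j]:     # an intra-cluster distance increased
                    return False
            elif d2[i][j] < d[i][j]:       # an inter-cluster distance decreased
                return False
    return True

# four points, clustered as Γ = {0, 1} ∪ {2, 3}
gamma = [(0, 1), (2, 3)]
d  = [[0, 1, 5, 6],
      [1, 0, 5, 6],
      [5, 5, 0, 1],
      [6, 6, 1, 0]]
d2 = [[0, 0.5, 9, 11],                     # intra shrunk, inter stretched
      [0.5, 0, 9, 11],
      [9, 9, 0, 0.5],
      [11, 11, 0.5, 0]]
assert is_gamma_transformation(d, d2, gamma)
assert not is_gamma_transformation(d2, d, gamma)   # the reverse direction fails
```

Intuitively, a Γ-transformation only makes the clustering "more pronounced" — points in the same cluster move closer while points in different clusters move apart — which is why consistency demands that the output stay fixed.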

Although these properties might sound intuitive, Kleinberg pointed out that it is impossible to construct a clustering algorithm f that meets all three criteria at the same time.

Constructing clustering algorithms that meet two of the three desiderata is nonetheless feasible, and this is the best we can hope for.
