

**9.1 What is clustering?**

**Clustering** deals with the partitioning of datasets into coherent subsets
in an unsupervised manner. The lack of supervision means that these
algorithms are given a dataset of observations *without* any target
label that we would like to be able to recover or give accurate
predictions for based on the further variables describing the observations.

Instead, we are only given the data points as m-dimensional observations, and we are interested in finding a grouping of the data points such that those that are *similar to each other* – in some previously unspecified sense – would be grouped together. Since similarity of the data points is an important aspect of clustering techniques, the techniques discussed previously in Chapter 4 and Chapter 5 are of great relevance for performing clustering.
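To make the notion of similarity concrete, the following minimal sketch (with made-up data points, not taken from the chapter) computes pairwise Euclidean distances between unlabeled observations – exactly the kind of quantity a clustering algorithm consumes:

```python
from math import dist  # Euclidean distance, available since Python 3.8

# Three 2-dimensional observations; note that no labels are given.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0)]

# Pairwise Euclidean distances: clustering groups points whose
# mutual distances are small.
pairwise = {(i, j): dist(points[i], points[j])
            for i in range(len(points))
            for j in range(i + 1, len(points))}

# The first two points are far closer to each other than to the third,
# so any reasonable clustering would group them together.
closest = min(pairwise, key=pairwise.get)
print(closest)  # (0, 1)
```

Any of the distance and similarity notions from Chapter 4 and Chapter 5 could be substituted for the Euclidean distance here.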

As a concrete example, we could perform clustering over a collection of novels based on their textual contents in order to find the ones that have a high topical overlap. As another example, users with similar financial habits could be identified by looking at their credit card history. Those people identified as having similar behavior could then be offered the same products by some financial institution, or – even better – special offers could be provided for some dedicated cluster of people with high business value.

**Learning Objectives:**

• The task of clustering

• Difference between agglomerative and partitioning approaches

• Hierarchical clustering techniques

• Improving hierarchical clustering

• k-means clustering

• Bradley-Fayyad-Reina algorithm

*9.1.1* Illustrating the difficulty of clustering

As mentioned earlier, the main goal of clustering algorithms is to find groups of coherently behaving data points based on the commonalities in their representation. The main difficulty of clustering is that coherent and arguably sensible subsets of the input data can be formed in multiple ways. Figure 9.1 illustrates this problem. If we were given the raw dataset as illustrated in Figure 9.1(a), we could argue that those data points that are located on the same y-coordinate form a cluster, as depicted in Figure 9.1(b). One could argue that a similar clustering which assigns points with the same x-coordinate to the same cluster would also make sense.

It turns out, however, that the 'true' distribution of our data follows the one illustrated in Figure 9.1(c), as the clusters within the data correspond to letters from the **Braille** alphabet. (Can you decode what is written in Figure 9.1(c)?) The difficulty of clustering is that the *true* distribution is never known in reality.

Indeed, if we already knew the underlying data distribution, there would be no reason to perform clustering in the first place. This example also illustrates that the relation which makes the data points belong together can sometimes be a subtle, non-linear one.

Clustering is hence not a well-determined problem, in the sense that multiple different solutions could be obtained for the same input data. On the other hand, clustering is of great practical importance, as it can uncover hidden (and hopefully valid) structure within datasets.

(a) The raw unlabeled data.

(b) One sensible clustering of the data based on their y-coordinates.

(c) The real structure of the data.

Figure9.1: A schematic illustration of clustering.


*9.1.2* What makes a good clustering?

Kleinberg imposed three desired properties regarding the behavior of clustering algorithms^{1}. The three criteria were **scale invariance**, **richness**, and **consistency**.

^{1} Kleinberg 2002

What these concepts mean for a set of data points S and an associated distance over them, d : S × S → **R**^{+} ∪ {0}, is described below. Let us think of the output of some clustering algorithm as a function f(d, S) which – when provided with a notion of pairwise distances over the points from S – produces a disjoint partitioning of the dataset S that we denote by Γ. Having introduced these notations, we can now revisit the three desiderata introduced by Kleinberg.

**Scale invariance** means that the clustering algorithm should be insensitive to a rescaling of the pairwise distances and provide the exact same output for the same dataset if the notion of pairwise distances changes by a constant factor. That is, ∀d, α > 0 : f(d, S) = f(αd, S). Intuitively, imagining that our observations are described by vectors that indicate the size of a certain object, the output of the clustering algorithm should not differ whether we provide our measurements in millimeters, centimeters, or miles.

The **richness** property requires that if we have the freedom of changing d, i.e., the notion of pairwise distances over S, the clustering algorithm f should be able to output every possible partitioning of the dataset. To put it more formally, ∀Γ ∃d : f(d, S) = Γ.
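Richness can be illustrated constructively: for any target partition Γ, choose d to be small within clusters and large across them. The sketch below uses a simple connected-components rule (a hypothetical f chosen for this illustration, not an algorithm from the chapter) and shows it outputs exactly the chosen Γ:

```python
def components(dists, n, threshold):
    """Cluster by connecting any two points closer than `threshold`
    (connected components of the resulting graph), via union-find."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for (i, j), v in dists.items():
        if v < threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), set()).add(i)
    return frozenset(frozenset(g) for g in groups.values())

# An arbitrary target partition Gamma of five points.
target = frozenset({frozenset({0, 2}), frozenset({1}), frozenset({3, 4})})
same = lambda i, j: any(i in g and j in g for g in target)

# Construct d: distance 1 inside a cluster, 10 across clusters.
d = {(i, j): (1.0 if same(i, j) else 10.0)
     for i in range(5) for j in range(i + 1, 5)}

assert components(d, 5, threshold=5.0) == target  # f(d, S) = Gamma
```

The same construction works for every partition Γ, which is exactly what richness demands of f.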

The **consistency** criterion for a clustering algorithm f requires the output of f to be the same whenever d is modified by a Γ-transformation. A *Γ-transformation* is a transformation of some distance d with respect to a clustering function f, such that under the transformed distances d′ the distances between pairs of points

• assigned to the same cluster by f do not increase,

• assigned to different clusters by f do not decrease.

A clustering algorithm fulfils consistency if, for any d′ obtained by a Γ-transformation, we have f(d, S) = Γ ⇒ f(d′, S) = Γ.
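Whether a modified distance d′ is a Γ-transformation of d is easy to check mechanically. The sketch below (a hypothetical helper with made-up distances, not from the text) verifies the two bullet conditions – within-cluster distances may only shrink, between-cluster distances may only grow:

```python
def is_gamma_transform(d, d2, partition):
    """Return True iff d2 is a Gamma-transformation of d with respect
    to `partition` (a collection of disjoint sets of point indices)."""
    same = lambda i, j: any(i in g and j in g for g in partition)
    return all(d2[p] <= d[p] if same(*p) else d2[p] >= d[p] for p in d)

gamma = [{0, 1}, {2, 3}]  # the partition produced by f
d  = {(0, 1): 1.0, (2, 3): 2.0,
      (0, 2): 5.0, (0, 3): 5.0, (1, 2): 6.0, (1, 3): 6.0}
# Shrink within-cluster distances, grow between-cluster ones:
d2 = {(0, 1): 0.5, (2, 3): 1.0,
      (0, 2): 7.0, (0, 3): 7.0, (1, 2): 9.0, (1, 3): 9.0}

print(is_gamma_transform(d, d2, gamma))  # True
```

Consistency then demands that f(d′, S) = Γ for every such d′ – intuitively, making clusters "tighter" and pushing them further apart should not change the clustering.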

Although these properties might sound intuitive, Kleinberg pointed out that it is impossible to construct a clustering algorithm f that meets all three criteria at the same time.

Constructing clustering algorithms that meet two out of the three desiderata is nonetheless feasible, and this is the best we can hope for.