
**9.2 Agglomerative clustering**

The first family of clustering techniques we introduce is **agglomerative clustering**. These algorithms initially assign each data point to a cluster of its own, then gradually merge clusters together until all the data points belong to a single cluster. This bottom-up strategy builds up a hierarchy based on the arrangement of the data points, which is why this kind of approach is also referred to as **hierarchical clustering**.

The pseudocode for agglomerative clustering is provided in Algorithm 5. In the beginning every data point is assigned to a unique cluster; these clusters then get merged into a hierarchical structure by repeated mergers of pairs of clusters based on their inter-cluster distance. Applying different strategies for determining the inter-cluster distances can produce different clustering outcomes. Hence, an important question, which we discuss next, is how to determine these inter-cluster distances.

**Algorithm 5:** Pseudocode for agglomerative clustering.

**Require:** Data points D
**Ensure:** Hierarchical clustering of D

1: **function** AgglomerativeClustering(D)
2:   i = 0
3:   **for** d ∈ D **do**
4:     i = i + 1
5:     C_i = {d}  // each data point gets assigned to an individual cluster
6:   **end for**
7:   **for** (k = 1; k < i; ++k) **do**
8:     [C_i*, C_j*] = arg min_{(C_i, C_j) ∈ C × C} d(C_i, C_j)  // find the closest pair of clusters
9:     C_k = C_i* ∪ C_j*  // merge the closest pair of clusters
10:  **end for**
11: **end function**
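Algorithm 5 can be sketched as a short Python program. This is an illustrative implementation with names of our own choosing; it records the merge history instead of overwriting cluster indices, and it uses complete linkage (introduced in Section 9.2.1) as the inter-cluster distance:

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)

def complete_linkage(ci, cj):
    # inter-cluster distance: the most distant pair of points (cf. Section 9.2.1)
    return max(dist(x, y) for x in ci for y in cj)

def agglomerative_clustering(points, cluster_distance=complete_linkage):
    """Merge the closest pair of clusters n - 1 times; return the merge history."""
    clusters = [(p,) for p in points]  # each point starts in its own cluster
    history = []
    while len(clusters) > 1:
        # line 8: find the closest pair of clusters
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        # line 9: merge the closest pair of clusters
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        history.append((a, b, cluster_distance(a, b)))
    return history

# running it on the dataset of Table 9.1
history = agglomerative_clustering([(-3, 3), (-2, 2), (-5, 4), (1, 2), (2, 2)])
print(history[0])  # first merge: clusters ((1, 2),) and ((2, 2),) at distance 1.0
```

The naive `min` over all cluster pairs in every iteration is what makes this version cubic in the number of points, as discussed in Section 9.2.4.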

*9.2.1* Strategies for merging clusters

A key component of agglomerative clustering is how we select the pair of clusters to be merged in each step (cf. line 8 of Algorithm 5).

Chapter 4 provided a variety of distances that can be used to determine the dissimilarity between a pair of individual points. Here, however, we require a methodology that assigns a distance to a pair of clusters, with each cluster possibly consisting of multiple data points. The choice of this strategy is important, as different ways of calculating inter-cluster distances might produce different results.

We can think of the inter-cluster distance as a measure that tells us the cost of merging a pair of clusters. In each iteration of agglomerative clustering, we are interested in selecting the pair of clusters with the lowest cost of being merged. There are multiple strategies one can follow when determining the inter-cluster distances. We next review some of the frequently used ones.

Let us assume that C_i and C_j denote two clusters, each of which refers to a set of m-dimensional points. Additionally, we have a distance function d that we can use for quantifying the distance between any pair of m-dimensional data points.

**Complete linkage** performs a pessimistic calculation for the distance between a pair of clusters, as it is calculated as

$$d(C_i, C_j) = \max_{\mathbf{x}_i \in C_i,\, \mathbf{x}_j \in C_j} d(\mathbf{x}_i, \mathbf{x}_j),$$

meaning that the distance it assigns to a pair of clusters equals the distance between the pair of most distant points from the two clusters.

**Single linkage** behaves oppositely to complete linkage in that it measures the cost of merging two clusters as the smallest distance between a pair of points from the two clusters, i.e.,

$$d(C_i, C_j) = \min_{\mathbf{x}_i \in C_i,\, \mathbf{x}_j \in C_j} d(\mathbf{x}_i, \mathbf{x}_j).$$

**Average linkage** computes the distance between a pair of clusters by taking the pairwise distances between all pairs of data points that can be formed from the members of the two clusters and simply averaging these pairwise distances according to

$$d(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{\mathbf{x}_i \in C_i} \sum_{\mathbf{x}_j \in C_j} d(\mathbf{x}_i, \mathbf{x}_j).$$
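The three strategies introduced so far can be compared directly on a pair of toy clusters. A minimal sketch (the function names are ours, and the two clusters are illustrative):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def complete_linkage(ci, cj):
    # pessimistic: the most distant pair of points across the two clusters
    return max(dist(x, y) for x in ci for y in cj)

def single_linkage(ci, cj):
    # optimistic: the closest pair of points across the two clusters
    return min(dist(x, y) for x in ci for y in cj)

def average_linkage(ci, cj):
    # the mean of all |Ci| * |Cj| cross-cluster point distances
    return sum(dist(x, y) for x in ci for y in cj) / (len(ci) * len(cj))

ci = [(0, 0), (1, 0)]
cj = [(3, 0), (5, 0)]
print(single_linkage(ci, cj))    # 2.0, from (1, 0) and (3, 0)
print(complete_linkage(ci, cj))  # 5.0, from (0, 0) and (5, 0)
print(average_linkage(ci, cj))   # 3.5, the mean of 3, 5, 2 and 4
```

As the example shows, single linkage is always a lower bound and complete linkage an upper bound on the average-linkage distance.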

A further option is to define the cost between a pair of clusters as

$$d(C_i, C_j) = \max_{\mathbf{x} \in C_i \cup C_j} d(\mathbf{x}, \boldsymbol{\mu}_{ij}),$$

where $\boldsymbol{\mu}_{ij}$ denotes the mean of the data points that we would get if we merged all the members of clusters $C_i$ and $C_j$ together, i.e.,

$$\boldsymbol{\mu}_{ij} = \frac{1}{|C_i| + |C_j|} \sum_{\mathbf{x} \in C_i \cup C_j} \mathbf{x}.$$

**Ward's method**² quantifies the amount of increase in the variation that would be caused by merging a certain pair of clusters. That is,

$$d(C_i, C_j) = \sum_{\mathbf{x} \in C_i \cup C_j} \|\mathbf{x} - \boldsymbol{\mu}_{ij}\|_2^2 - \left( \sum_{\mathbf{x} \in C_i} \|\mathbf{x} - \boldsymbol{\mu}_i\|_2^2 + \sum_{\mathbf{x} \in C_j} \|\mathbf{x} - \boldsymbol{\mu}_j\|_2^2 \right),$$

where $\boldsymbol{\mu}_{ij}$ is the same as before, $\boldsymbol{\mu}_i = \frac{1}{|C_i|} \sum_{\mathbf{x} \in C_i} \mathbf{x}$ and $\boldsymbol{\mu}_j = \frac{1}{|C_j|} \sum_{\mathbf{x} \in C_j} \mathbf{x}$.

² Ward 1963

The formula applied in Ward's method can be equivalently expressed in the more efficiently calculable form of

$$d(C_i, C_j) = \frac{|C_i||C_j|}{|C_i| + |C_j|} \|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|_2^2.$$

Applying Ward’s method has the advantage that it tends to produce more even-sized clusters.
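The equivalence of the two Ward formulas can be spot-checked numerically. The sketch below (our own code, under the chapter's assumption of Euclidean distance) computes both forms on random clusters and confirms they agree:

```python
import random

def mean(cluster):
    # component-wise mean of a list of equal-length points
    d = len(cluster[0])
    return [sum(x[k] for x in cluster) / len(cluster) for k in range(d)]

def sq_dist(x, y):
    # squared Euclidean distance
    return sum((a - b) ** 2 for a, b in zip(x, y))

def ward_direct(ci, cj):
    # increase in total within-cluster variation caused by merging ci and cj
    merged = ci + cj
    mu_ij, mu_i, mu_j = mean(merged), mean(ci), mean(cj)
    return (sum(sq_dist(x, mu_ij) for x in merged)
            - sum(sq_dist(x, mu_i) for x in ci)
            - sum(sq_dist(x, mu_j) for x in cj))

def ward_fast(ci, cj):
    # equivalent closed form that only needs the two cluster means and sizes
    ni, nj = len(ci), len(cj)
    return ni * nj / (ni + nj) * sq_dist(mean(ci), mean(cj))

random.seed(0)
ci = [(random.random(), random.random()) for _ in range(4)]
cj = [(random.random(), random.random()) for _ in range(3)]
assert abs(ward_direct(ci, cj) - ward_fast(ci, cj)) < 1e-9
```

The fast form is what makes Ward's method practical: it avoids iterating over the merged cluster entirely.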

Obviously, not only the strategy for determining the aggregated inter-cluster distances, but also the choice of the function d – which determines the distance between a pair of data points – plays a decisive role in agglomerative clustering. In general, one could choose any distance measure here, and this choice can also affect the outcome of the clustering. For simplicity, we assume throughout this chapter that the distance measure we utilize is the standard Euclidean distance.

*9.2.2* Hierarchical clustering via an example

We now illustrate the mechanism of hierarchical clustering on the example 2-dimensional dataset included in Table 9.1. As mentioned earlier, we determine the distance between a pair of data points by relying on their Euclidean (ℓ₂) distance.

data point   location
A            (−3, 3)
B            (−2, 2)
C            (−5, 4)
D            (1, 2)
E            (2, 2)

Table 9.1: Example 2-dimensional clustering dataset.
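Before tracing the algorithm, it is worth computing the initial pairwise distances for Table 9.1 by hand or with a few lines of code (a sketch of our own):

```python
from math import dist  # Euclidean distance (Python 3.8+)

# the dataset of Table 9.1
points = {"A": (-3, 3), "B": (-2, 2), "C": (-5, 4), "D": (1, 2), "E": (2, 2)}

# all pairwise Euclidean distances (each unordered pair once)
pairs = {(p, q): dist(points[p], points[q])
         for p in points for q in points if p < q}

closest = min(pairs, key=pairs.get)
print(closest, round(pairs[closest], 3))  # ('D', 'E') 1.0
```

Since D and E are the closest pair of singleton clusters, they are merged in the first step of the algorithm, which is exactly what Table 9.2 records.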

Table 9.2 includes all the pairwise distances between the pairs of clusters throughout the algorithm. Since distances are symmetric, we use the upper and lower triangular parts of the inter-cluster distance matrices in Table 9.2 to denote the distances obtained by the complete linkage and single linkage strategies, respectively.

Why is it that the distances in the upper and lower triangular parts of the inter-cluster distance matrix in Table 9.2(a) are exactly the same?

We separately highlight the cost of the cheapest cluster merger for both the complete linkage and the single linkage strategies in the upper and lower triangular parts of the inter-cluster distance matrices in Table 9.2. It is also worth noticing that many of the values in Table 9.2 do not change between two consecutive steps. This is something we could exploit to make the algorithm more efficient.

Table 9.2: Pairwise cluster distances during the execution of hierarchical clustering. The upper and lower triangular parts of the matrix include the between-cluster distances obtained when using complete linkage and single linkage, respectively. Boxed distances indicate the pair of clusters that get merged in a particular step of hierarchical clustering.

The entire trajectory of hierarchical clustering can be visualized by a **dendrogram**, which acts as a tree-structured log visualizing the cluster mergers performed during hierarchical clustering. Figure 9.2(a) contains the dendrogram we get for the example dataset introduced in Table 9.1 when using Euclidean distance and the complete linkage strategy for merging clusters. The lengths of the edges in Figure 9.2(a) are proportional to the inter-cluster distance that was calculated for the particular pair of clusters.

How would the dendrogram differ if we applied different strategies for determining the inter-cluster distances, such as single linkage or average linkage?

Figure 9.2(a) also illustrates a possible way to obtain an actual partitioning of the input data. We can introduce some fixed threshold – indicated by a red dashed line in Figure 9.2(a) – and say that the data points belonging to the same subtree after cutting the dendrogram at the given threshold form a cluster. This threshold-based strategy results in the cluster structure that is also illustrated in Figure 9.2(b).

Which of the nice properties introduced in Section 9.1.2 is not met by the threshold-driven inducing of clusters?
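The threshold-driven flat clustering can be sketched by simply stopping the merging once the cheapest merge would exceed the cut value. The code below is our own illustration using complete linkage on the Table 9.1 data; the threshold of 4.0 is an assumption for illustration, not necessarily the value of the red dashed line in Figure 9.2(a):

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)

def cut_clusters(points, threshold):
    """Complete-linkage agglomerative clustering, cut at a distance threshold."""
    clusters = [frozenset([p]) for p in points]
    while len(clusters) > 1:
        # find the cheapest merge under complete linkage
        a, b = min(combinations(clusters, 2),
                   key=lambda pr: max(dist(x, y) for x in pr[0] for y in pr[1]))
        if max(dist(x, y) for x in a for y in b) > threshold:
            break  # the cut: every remaining merge is at least this expensive
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return clusters

# points A..E from Table 9.1
data = {(-3, 3), (-2, 2), (-5, 4), (1, 2), (2, 2)}
flat = cut_clusters(data, threshold=4.0)
print([sorted(c) for c in flat])  # two clusters: {A, B, C} and {D, E}
```

With this threshold the merges D+E (cost 1.0), A+B (cost ≈ 1.41) and {A, B}+C (cost ≈ 3.61) are performed, while the final merge (cost ≈ 7.28) is rejected.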

Figure 9.2: Illustration of the hierarchical cluster structure found for the data points from Table 9.1. (a) The dendrogram structure over the points A–E, with distances based on complete linkage. (b) Geometric view of hierarchical clustering.

*9.2.3* Finding a representative element for a cluster

When we would like to determine a representative element for a collection of elements, we can simply take their **centroid**, which corresponds to the average of the representations of the data points that belong to a particular cluster. There are cases, however, when averaging cannot be performed due to the peculiarities of the data. This is the case, for instance, when our objects are characterized by nominal attributes.

The typical solution for handling this kind of situation is to determine the **medoid** (also called **clustroid**) of the cluster members instead of their centroid. The medoid is the element of some cluster C which lies closest to all the other data points from the same cluster C in some aggregated sense (e.g., after calculating the sum or the maximum of the within-cluster distances).

**Example 9.1.** Suppose the members of some cluster are described by the following strings: C = {ecdab, abecb, aecdb, abcd}. We would like to find the most representative element of that group, i.e., the member of the cluster that is the least dissimilar from the other members.

When measuring the dissimilarity of strings, we can rely on the **edit distance** (cf. Section 4.5). Table 9.3(a) contains all the pairwise edit distances for the members of the cluster.

Table 9.3(b) contains the aggregated distances for each member of the cluster according to multiple strategies, i.e., summing, taking the maximum, or taking the sum of squares of the within-cluster distances. According to all of the aggregations, the object aecdb is the least dissimilar from the remaining data points in the given cluster, hence it should be treated as the representative element of that cluster.

        ecdab   abecb   aecdb   abcd
ecdab     0       4       2      5
abecb     4       0       2      3
aecdb     2       2       0      3
abcd      5       3       3      0

(a) Pairwise distances between the cluster members.

           Sum   Max   Sum of squares
ecdab       11    5          45
abecb        9    4          29
**aecdb**    7    3          17
abcd        11    5          43

(b) Different aggregations of the within-cluster distances for each cluster member.

Table 9.3: Illustration of the calculation of the medoid of a cluster.
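The aggregations of Example 9.1 are easy to reproduce in code. The sketch below hardcodes the pairwise edit distances from Table 9.3(a) (our own script; a real pipeline would compute the distances with an edit-distance routine):

```python
# pairwise edit distances copied from Table 9.3(a)
members = ["ecdab", "abecb", "aecdb", "abcd"]
D = {
    "ecdab": {"ecdab": 0, "abecb": 4, "aecdb": 2, "abcd": 5},
    "abecb": {"ecdab": 4, "abecb": 0, "aecdb": 2, "abcd": 3},
    "aecdb": {"ecdab": 2, "abecb": 2, "aecdb": 0, "abcd": 3},
    "abcd":  {"ecdab": 5, "abecb": 3, "aecdb": 3, "abcd": 0},
}

# three ways to aggregate a member's distances to the rest of the cluster
aggregations = {
    "sum": lambda ds: sum(ds),
    "max": lambda ds: max(ds),
    "sum_sq": lambda ds: sum(d * d for d in ds),
}

for name, agg in aggregations.items():
    medoid = min(members, key=lambda m: agg([D[m][o] for o in members if o != m]))
    print(name, medoid)  # aecdb wins under every aggregation
```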

Example 9.1 might seem to suggest that the way we aggregate the within-cluster distances for obtaining the medoid of a cluster does not make a difference, i.e., the same object was the least dissimilar to all the other data points no matter whether we took the sum, the maximum, or the sum of squares of the per-instance distances. Table 9.4, however, includes an example within-cluster distance matrix for which the way aggregation is performed does make a difference. Hence, we can conclude that by aggregating the within-cluster distances differently, we can obtain different representative elements for the same set of data points.

     A   B   C   D
A    0   3   1   5
B    3   0   4   3
C    1   4   0   6
D    5   3   6   0

(a) Pairwise within-cluster distances.

     Sum    Max    Sum of squares
A    **9**    5         35
B    10    **4**      **34**
C    11      6         53
D    14      6         70

(b) Different aggregation strategies.

Table 9.4: An example matrix of within-cluster distances for which different ways of aggregation yield different medoids.
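Checking Table 9.4 with a few lines of our own code confirms that the medoid indeed depends on the aggregation here: the sum picks A, while the maximum and the sum of squares pick B:

```python
# within-cluster distances copied from Table 9.4(a)
members = ["A", "B", "C", "D"]
D = {"A": {"A": 0, "B": 3, "C": 1, "D": 5},
     "B": {"A": 3, "B": 0, "C": 4, "D": 3},
     "C": {"A": 1, "B": 4, "C": 0, "D": 6},
     "D": {"A": 5, "B": 3, "C": 6, "D": 0}}

def medoid(agg):
    # the member minimizing the aggregated distance to all other members
    return min(members, key=lambda m: agg([D[m][o] for o in members if o != m]))

print(medoid(sum))                                # 'A' (smallest sum, 9)
print(medoid(max))                                # 'B' (smallest maximum, 4)
print(medoid(lambda ds: sum(d * d for d in ds)))  # 'B' (smallest sum of squares, 34)
```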

*9.2.4* On the effectiveness of agglomerative clustering

Agglomerative clustering initially introduces n distinct clusters, i.e., as many as we have data points. Since we merge two clusters at a time, we can perform n − 1 merge steps before we find ourselves with a single gigantic cluster containing all the observations from our dataset.

In the first iteration, we have n clusters (one for each data point). In the second iteration, we have to deal with n − 1 clusters. In general, during iteration i we have n + 1 − i clusters from which to choose the most promising pair to merge. This is in line with the observation that in the last iteration of the algorithm (remember, we can perform n − 1 merge steps at most) we need to merge n + 1 − (n − 1) = 2 clusters together (which is a trivial task).

Remember that deciding on the pair of clusters to be merged can be performed in O(k²) steps if the number of clusters to choose from is k. Since

$$\sum_{k=1}^{n} k^2 = \frac{n(n+1)(2n+1)}{6},$$

we get that the total computation performed during agglomerative clustering is

$$\sum_{i=1}^{n-1} (n+1-i)^2 = O(n^3).$$
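The exact count behind this bound can be verified numerically with a small script of our own; the ratio of the exact count to n³ settles near 1/3, confirming cubic growth:

```python
def total_comparisons(n):
    # iteration i chooses among n + 1 - i clusters, i.e. (n + 1 - i)^2 pair checks
    return sum((n + 1 - i) ** 2 for i in range(1, n))

for n in (10, 100, 1000):
    exact = total_comparisons(n)
    print(n, exact, exact / n ** 3)  # the ratio approaches 1/3, i.e. Theta(n^3)
```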

Algorithms that are cubic in the input size are simply prohibitive for inputs that include more than a few thousand examples, hence they do not scale well to really massive datasets.

There is a way we can improve the performance of agglomerative clustering. Notice that most of the pairwise distances calculated in the current iteration can be reused in the next iteration. To see why this is the case, recall that the first iteration requires the calculation of O(n²) pairwise distances, and by the end of the first iteration we end up with n − 1 clusters as a result of merging a pair of clusters together.

We could then proceed by calculating all the pairwise distances for the n − 1 clusters that we are left with, but if we did so, we would repeat quite a lot of work, namely the calculation of the distances for those pairs of clusters that we had already considered during the previous iteration. Actually, it suffices to calculate the distances from the single cluster that was just created to all the clusters that were not involved in the last merging step. So, agglomerative clustering requires the calculation of only n − 2 new distances in its second iteration.

Storing the inter-cluster distances in a **heap** data structure can hence improve the performance of agglomerative clustering in a meaningful way. A good property of heaps is that they offer O(log h) operations for insertion and modification, with h denoting the number of elements stored in the heap. Since this time we store pairwise inter-cluster distances in the heap, h = O(n²), i.e., the number of elements in our heap is upper-bounded by the squared number of data points. This means that every operation costs O(log n²) = O(2 log n) = O(log n). Together with the fact that the number of per-iteration operations needed during agglomerative clustering is O(n) and that the number of iterations performed is O(n), we get that the total algorithm can be implemented in O(n² log n) time.
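One common way to realize the heap-based variant is a min-heap of candidate pairs with lazy invalidation: entries involving already-merged clusters are simply skipped when popped, instead of being deleted eagerly. The sketch below (our own code, using single linkage) illustrates the bookkeeping; note that it still recomputes point-level distances naively, so reaching the full O(n² log n) bound additionally requires constant-time distance updates, e.g., via the Lance–Williams recurrence:

```python
import heapq
from math import dist  # Euclidean distance (Python 3.8+)

def heap_agglomerative(points):
    """Single-linkage agglomerative clustering via a heap; returns merge history."""
    clusters = {i: (p,) for i, p in enumerate(points)}
    alive = set(clusters)
    next_id = len(points)
    # seed the heap with all O(n^2) initial pairwise distances
    heap = [(dist(clusters[i][0], clusters[j][0]), i, j)
            for i in clusters for j in clusters if i < j]
    heapq.heapify(heap)
    history = []
    while len(alive) > 1:
        d, i, j = heapq.heappop(heap)
        if i not in alive or j not in alive:
            continue  # lazy invalidation: one side was merged away already
        alive -= {i, j}
        merged = clusters[i] + clusters[j]
        # push distances from the new cluster to every surviving cluster
        for k in alive:
            dk = min(dist(x, y) for x in merged for y in clusters[k])
            heapq.heappush(heap, (dk, k, next_id))
        clusters[next_id] = merged
        alive.add(next_id)
        history.append((d, i, j, next_id))
        next_id += 1
    return history

history = heap_agglomerative([(-3, 3), (-2, 2), (-5, 4), (1, 2), (2, 2)])
print(history[0])  # the first merge joins clusters 3 and 4 (points D and E)
```

Each heap operation costs O(log n²) = O(log n), matching the analysis above.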

While O(n² log n) is a noticeable improvement over O(n³), it is still insufficient for cases when we have hundreds of thousands of data points. When n > 10⁵, one can either combine agglomerative clustering with some approximate technique, such as the ones discussed in Chapter 5, or resort to the more efficient clustering techniques introduced in the following sections.