# Agglomerative clustering

In document: DATA MINING, GÁBOR BEREND (pages 194-200)


### 9.2 Agglomerative clustering

The first family of clustering techniques we introduce is agglomerative clustering. These algorithms initially assign every data point to a cluster of its own, then gradually merge clusters until all data points belong to a single cluster. This bottom-up strategy builds up a hierarchy based on the arrangement of the data points, which is why this kind of approach is also referred to as hierarchical clustering.

The pseudocode for agglomerative clustering is provided in Algorithm 5. It illustrates that in the beginning every data point is assigned to a unique cluster; these clusters then get merged into a hierarchical structure by repeated mergers of pairs of clusters based on their inter-cluster distance. Applying different strategies for determining the inter-cluster distances could produce different clustering outcomes. Hence, an important question is how we determine these inter-cluster distances, which we discuss next.

Algorithm 5: Pseudocode for agglomerative clustering.

Require: Data points $D$

Ensure: Hierarchic clustering of $D$

1: function AgglomerativeClustering($D$)

2: $i = 0$

3: for $d \in D$ do

4: $i = i + 1$

5: $C_i = \{d\}$ // each data point gets assigned to an individual cluster

6: end for

7: for ($k = 1$; $k < i$; $++k$) do

8: $[C_i, C_j] = \arg\min_{(C_i, C_j) \in \mathcal{C} \times \mathcal{C}} d(C_i, C_j)$ // find the closest pair of clusters

9: $C_k = C_i \cup C_j$ // merge the closest pair of clusters

10: end for

11: end function
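To make the loop of Algorithm 5 concrete, here is a minimal Python sketch of the naive procedure. It is an illustration under our own assumptions rather than a reference implementation: the function names are hypothetical, clusters are stored as sets of point indices, only distinct clusters are compared, and complete linkage (defined in the next subsection) is used as the default inter-cluster distance.

```python
from itertools import combinations
from math import dist


def complete_linkage(ci, cj):
    """Largest pairwise Euclidean distance between two lists of points."""
    return max(dist(x, y) for x in ci for y in cj)


def agglomerative_clustering(points, linkage=complete_linkage):
    """Return the performed merges as (indices_a, indices_b, distance) triples."""
    # Each data point starts in a cluster of its own (stored by index).
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []

    def cost(pair):
        a, b = pair
        return linkage([points[i] for i in a], [points[i] for i in b])

    while len(clusters) > 1:
        # Find the closest pair of distinct clusters.
        a, b = min(combinations(clusters, 2), key=cost)
        merges.append((sorted(a), sorted(b), cost((a, b))))
        # Replace the two clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


merges = agglomerative_clustering([(0, 0), (0, 1), (5, 5), (5, 7)])
for a, b, d in merges:
    print(a, b, round(d, 3))
```

For the four toy points, the two nearby pairs are merged first and the two resulting clusters are merged last. Every iteration rescans all cluster pairs, which is exactly the cost that the efficiency discussion later in this section tries to reduce.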

#### 9.2.1 Strategies for merging clusters

A key component of agglomerative clustering is how we select the pair of clusters to be merged in each step (cf. line 8 of Algorithm 5). Chapter 4 provided a variety of distances that can be used to determine the dissimilarity between a pair of individual points. We would, however, require a methodology which assigns a distance to a pair of clusters, with each cluster possibly consisting of multiple data points. The choice of this strategy is important, as different choices for calculating inter-cluster distances might lead to different results.

We could think of the inter-cluster distance as a measure which tells us the cost of merging a pair of clusters. In each iteration of agglomerative clustering, we are interested in selecting the pair of clusters with the lowest cost of being merged. There are multiple strategies one can follow when determining the inter-cluster distances. We next review some of the frequently used strategies.

Let us assume that $C_i$ and $C_j$ denote two clusters, each of which refers to a set of $m$-dimensional points. Additionally, we have a distance function $d$ that we can use for quantifying the distance between any pair of $m$-dimensional data points.

Complete linkage performs a pessimistic calculation of the distance between a pair of clusters, as it is calculated as

$$d(C_i, C_j) = \max_{x_i \in C_i,\, x_j \in C_j} d(x_i, x_j),$$

meaning that the distance it assigns to a pair of clusters equals the distance between the pair of most distant points from the two clusters.

Single linkage behaves oppositely to complete linkage in that it measures the cost of merging two clusters as the smallest distance between a pair of points from the clusters, i.e.,

$$d(C_i, C_j) = \min_{x_i \in C_i,\, x_j \in C_j} d(x_i, x_j).$$

Average linkage computes the distance between a pair of clusters by taking the pairwise distances between all pairs of data points that can be formed from the members of the two clusters and simply averaging them according to

$$d(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{x_i \in C_i} \sum_{x_j \in C_j} d(x_i, x_j).$$

A further option could be to define the cost between a pair of clusters as

$$d(C_i, C_j) = \max_{x \in C_i \cup C_j} d(x, \mu_{ij}),$$

where $\mu_{ij}$ denotes the mean of the data points that we would get if we merged all the members of clusters $C_i$ and $C_j$ together, i.e.,

$$\mu_{ij} = \frac{1}{|C_i| + |C_j|} \sum_{x \in C_i \cup C_j} x.$$
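The four strategies introduced so far translate almost literally into code. The sketch below is only an illustration: the function names are our own, clusters are assumed to be lists of coordinate tuples, and $d$ is assumed to be the Euclidean distance.

```python
from math import dist


def complete_linkage(ci, cj):
    """Distance between the two most distant points of the clusters."""
    return max(dist(x, y) for x in ci for y in cj)


def single_linkage(ci, cj):
    """Distance between the two closest points of the clusters."""
    return min(dist(x, y) for x in ci for y in cj)


def average_linkage(ci, cj):
    """Mean of all |Ci| * |Cj| pairwise distances."""
    return sum(dist(x, y) for x in ci for y in cj) / (len(ci) * len(cj))


def merged_radius(ci, cj):
    """Largest distance of any point from the mean of the merged cluster."""
    merged = list(ci) + list(cj)
    mean = tuple(sum(coords) / len(merged) for coords in zip(*merged))
    return max(dist(x, mean) for x in merged)


ci = [(0.0, 0.0), (0.0, 2.0)]
cj = [(3.0, 0.0), (3.0, 2.0)]
for f in (single_linkage, average_linkage, complete_linkage, merged_radius):
    print(f.__name__, round(f(ci, cj), 3))
```

For any pair of clusters, the single linkage value never exceeds the average linkage value, which in turn never exceeds the complete linkage value, since they are the minimum, mean and maximum of the same set of pairwise distances.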

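Ward's method (Ward, 1963) quantifies the increase in variation that would be caused by merging a certain pair of clusters. That is,

$$d(C_i, C_j) = \sum_{x \in C_i \cup C_j} \|x - \mu_{ij}\|_2^2 - \left( \sum_{x \in C_i} \|x - \mu_i\|_2^2 + \sum_{x \in C_j} \|x - \mu_j\|_2^2 \right),$$

where $\mu_{ij}$ is the same as before, $\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$ and $\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$.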

The formula applied in Ward's method can be equivalently expressed in the more efficiently calculable form of

$$d(C_i, C_j) = \frac{|C_i||C_j|}{|C_i| + |C_j|} \|\mu_i - \mu_j\|_2^2.$$

Applying Ward’s method has the advantage that it tends to produce more even-sized clusters.
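The equivalence of the two forms of Ward's cost can be checked numerically. The following sketch uses hypothetical helper names and a small made-up pair of clusters; it computes the increase in the within-cluster sum of squared distances directly and via the closed form.

```python
def mean(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def sse(points):
    """Sum of squared Euclidean distances of the points from their mean."""
    mu = mean(points)
    return sum(sum((a - b) ** 2 for a, b in zip(x, mu)) for x in points)


def ward_increase(ci, cj):
    """Increase in variation caused by merging, computed from the definition."""
    return sse(ci + cj) - (sse(ci) + sse(cj))


def ward_closed_form(ci, cj):
    """The same quantity, computed from the two cluster means only."""
    mu_i, mu_j = mean(ci), mean(cj)
    squared_gap = sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))
    return len(ci) * len(cj) / (len(ci) + len(cj)) * squared_gap


ci = [(0.0, 0.0), (2.0, 0.0)]
cj = [(5.0, 4.0)]
print(ward_increase(ci, cj), ward_closed_form(ci, cj))
```

Both calls return 64/3 up to floating-point rounding. The closed form needs only the sizes and means of the two clusters, so it avoids revisiting the individual data points.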

Not only the strategy for determining the aggregated inter-cluster distances plays a decisive role in agglomerative clustering, but so does the choice of the function $d$, which determines the distance between a pair of data points. In general, one could choose any distance measure for that purpose, and this choice could affect the outcome of the clustering. For simplicity, we assume throughout this chapter that the distance measure we use is the standard Euclidean distance.

#### 9.2.2 Hierarchical clustering via an example

We now illustrate the mechanism of hierarchical clustering for the example 2-dimensional dataset included in Table 9.1. As mentioned earlier, we determine the distance between a pair of data points by relying on their Euclidean ($\ell_2$) distance.

| data point | location |
|---|---|
| A | (−3, 3) |
| B | (−2, 2) |
| C | (−5, 4) |
| D | (1, 2) |
| E | (2, 2) |

Table 9.1: Example 2-dimensional clustering dataset.

Table 9.2 includes all the pairwise distances between the pairs of clusters throughout the algorithm. Since distances are symmetric, we make use of the upper and lower triangular parts of the inter-cluster distance matrices in Table 9.2 to denote the distances obtained by the complete linkage and single linkage strategies, respectively. We separately highlight the cost of the cheapest cluster merger for both the complete linkage and the single linkage strategies in the upper and lower triangular parts of the inter-cluster distance matrices in Table 9.2. It is also worth noticing that many of the values in Table 9.2 do not change between two consecutive steps. This is something we could exploit for making the algorithm more efficient.

Question: Why are the distances in the upper and lower triangular parts of the inter-cluster distance matrix in Table 9.2(a) exactly the same?

Table 9.2: Pairwise cluster distances during the execution of hierarchical clustering. The upper and lower triangular parts of each matrix include the between-cluster distances obtained when using complete linkage and single linkage, respectively. Boxed distances indicate the pair of clusters that get merged in a particular step of hierarchical clustering.

The entire trajectory of hierarchical clustering can be visualized by a dendrogram, which acts as a tree-structured log visualizing the cluster mergers performed during hierarchical clustering. Figure 9.2(a) contains the dendrogram we get for the example dataset introduced in Table 9.1 when using Euclidean distance and the complete linkage strategy for merging clusters. The lengths of the edges in Figure 9.2(a) are proportional to the inter-cluster distance that was calculated for the particular pair of clusters.

Question: How would the dendrogram differ if we used a different strategy for determining the inter-cluster distances, such as single linkage or average linkage?

Figure 9.2(a) also illustrates a possible way to obtain an actual partitioning of the input data. We can introduce some fixed threshold (indicated by a red dashed line in Figure 9.2(a)) and say that data points belonging to the same subtree after cutting the dendrogram at the given threshold form a cluster. The proposed threshold-based strategy would result in the cluster structure which is also illustrated in Figure 9.2(b).

Question: Which of the nice properties introduced in Section 9.1.2 is not met by the threshold-driven way of inducing clusters?
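The threshold-based partitioning can be tried out on the data points of Table 9.1. The sketch below is a simplified illustration with hypothetical names: instead of building the dendrogram explicitly, it stops merging as soon as the cheapest complete-linkage merger would exceed the threshold. For complete linkage the merge costs never decrease from one step to the next, so this yields the same clusters as cutting the dendrogram. The threshold value of 2.0 is our own assumption; the value used in Figure 9.2(a) is not stated in the text.

```python
from itertools import combinations
from math import dist

POINTS = {"A": (-3, 3), "B": (-2, 2), "C": (-5, 4), "D": (1, 2), "E": (2, 2)}


def complete_linkage(a, b):
    """Complete linkage between two clusters given as sets of point names."""
    return max(dist(POINTS[p], POINTS[q]) for p in a for q in b)


def cut_dendrogram(threshold, linkage=complete_linkage):
    """Merge greedily until the cheapest merger costs more than threshold."""
    clusters = [frozenset([name]) for name in POINTS]
    heights = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        if linkage(a, b) > threshold:
            break
        heights.append(round(linkage(a, b), 3))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return sorted("".join(sorted(c)) for c in clusters), heights


print(cut_dendrogram(2.0))   # assumed threshold, for illustration only
print(cut_dendrogram(10.0))  # large enough to perform every merger
```

With the assumed threshold of 2.0 the points fall into the clusters {A, B}, {C} and {D, E}, while the second call lists the heights of all four mergers.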

[Figure 9.2: Illustration of the hierarchical cluster structure found for the data points from Table 9.1. (a) The dendrogram structure, with leaves A, B, C, D, E and an axis showing distance; distances are based on complete linkage. (b) Geometric view of hierarchical clustering.]

#### 9.2.3 Finding a representative element for a cluster

When we would like to determine a representative element for a collection of elements, we can simply take their centroid, which corresponds to the averaged representation of the data points that belong to a particular cluster. There are cases, however, when averaging cannot be performed due to the peculiarities of the data. This could be the case, for instance, when our objects are characterized by nominal attributes.

The typical solution for handling this kind of situation is to determine the medoid (also called clustroid) of the cluster members instead of their centroid. The medoid is the element of some cluster $C$ which lies closest to all the other data points from the same cluster $C$ in some aggregated sense (e.g., after calculating the sum or the maximum of the within-cluster distances).

Example 9.1. Suppose members of some cluster are described by the following strings: C = {ecdab, abecb, aecdb, abcd}. We would like to determine the most representative element of that group, i.e., the member of the cluster that is the least dissimilar from the other members.

When measuring the dissimilarity of strings, we could rely on the edit distance (cf. Section 4.5). Table 9.3(a) contains all the pairwise edit distances for the members of the cluster.

Table 9.3(b) contains the aggregated distances for each member of the cluster according to multiple strategies, i.e., taking the sum, the maximum or the sum of squares of the within-cluster distances. According to each of these aggregations, the object aecdb is the least dissimilar from the remaining data points in the given cluster, hence it should be treated as the representative element of that cluster.

|  | ecdab | abecb | aecdb | abcd |
|---|---|---|---|---|
| ecdab | 0 | 4 | 2 | 5 |
| abecb | 4 | 0 | 2 | 3 |
| aecdb | 2 | 2 | 0 | 3 |
| abcd | 5 | 3 | 3 | 0 |

(a) Pairwise distances between the cluster members.

|  | Sum | Max | Sum of squares |
|---|---|---|---|
| ecdab | 11 | 5 | 45 |
| abecb | 9 | 4 | 29 |
| aecdb | 7 | 3 | 17 |
| abcd | 11 | 5 | 43 |

(b) Different aggregations of the within-cluster distances for each cluster member.

Table 9.3: Illustration of the calculation of the medoid of a cluster.
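The computation behind Table 9.3 can be reproduced with a short sketch. We assume here that the edit distance counts insertions and deletions only, so that it equals $|s| + |t| - 2 \cdot \mathrm{LCS}(s, t)$; this variant reproduces every entry of Table 9.3(a), but whether it is exactly the definition given in Section 4.5 is our assumption. The function names are hypothetical.

```python
from functools import lru_cache


def edit_distance(s, t):
    """Insertion/deletion-only edit distance: |s| + |t| - 2 * LCS(s, t)."""

    @lru_cache(maxsize=None)
    def lcs(i, j):
        # Length of the longest common subsequence of s[i:] and t[j:].
        if i == len(s) or j == len(t):
            return 0
        if s[i] == t[j]:
            return 1 + lcs(i + 1, j + 1)
        return max(lcs(i + 1, j), lcs(i, j + 1))

    return len(s) + len(t) - 2 * lcs(0, 0)


def medoid(members, distance, aggregate=sum):
    """Member with the smallest aggregated distance to the other members."""
    return min(
        members,
        key=lambda x: aggregate(distance(x, y) for y in members if y != x),
    )


def sum_of_squares(distances):
    return sum(d * d for d in distances)


cluster = ["ecdab", "abecb", "aecdb", "abcd"]
for aggregate in (sum, max, sum_of_squares):
    print(aggregate.__name__, medoid(cluster, edit_distance, aggregate))
```

All three aggregations select aecdb, in agreement with Table 9.3(b).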

Example 9.1 might seem to suggest that the way we aggregate the within-cluster distances for obtaining the medoid of a cluster does not make a difference, i.e., the same object was the least dissimilar to all the other data points no matter whether we took the sum, the maximum or the sum of squares of the per-instance distances. Table 9.4 includes an example within-cluster distance matrix for which the way aggregation is performed does make a difference: A has the smallest sum of distances, whereas B has the smallest maximum and the smallest sum of squares. Hence, we can conclude that by aggregating the within-cluster distances differently, we could obtain different representative elements for the same set of data points.

|  | A | B | C | D |
|---|---|---|---|---|
| A | 0 | 3 | 1 | 5 |
| B | 3 | 0 | 4 | 3 |
| C | 1 | 4 | 0 | 6 |
| D | 5 | 3 | 6 | 0 |

(a) Pairwise within-cluster distances.

|  | Sum | Max | Sum of squares |
|---|---|---|---|
| A | 9 | 5 | 35 |
| B | 10 | 4 | 34 |
| C | 11 | 6 | 53 |
| D | 14 | 6 | 70 |

(b) Different aggregation strategies.

Table 9.4: An example matrix of within-cluster distances for which different ways of aggregation yield different medoids.
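The disagreement between the aggregation strategies can be verified directly from the distance matrix of Table 9.4(a). The snippet below is a small sketch with names of our own choosing; it recomputes the aggregated values and picks the medoid under each strategy.

```python
# Pairwise within-cluster distances copied from Table 9.4(a).
DIST = {
    "A": {"A": 0, "B": 3, "C": 1, "D": 5},
    "B": {"A": 3, "B": 0, "C": 4, "D": 3},
    "C": {"A": 1, "B": 4, "C": 0, "D": 6},
    "D": {"A": 5, "B": 3, "C": 6, "D": 0},
}

AGGREGATES = {
    "sum": sum,
    "max": max,
    "sum of squares": lambda distances: sum(d * d for d in distances),
}


def medoid(dist, aggregate):
    """Element whose aggregated distance to the other elements is smallest."""
    return min(
        dist,
        key=lambda x: aggregate(d for y, d in dist[x].items() if y != x),
    )


result = {name: medoid(DIST, agg) for name, agg in AGGREGATES.items()}
print(result)
```

The sum selects A, while the maximum and the sum of squares both select B, because squaring and taking the maximum penalize A's single large distance of 5 more heavily than summing does.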

#### 9.2.4 On the effectiveness of agglomerative clustering

Agglomerative clustering initially introduces $n$ distinct clusters, i.e., as many clusters as we have data points. Since we merge two clusters at a time, we can perform $n-1$ merge steps before we find ourselves with a single gigantic cluster containing all the observations from our dataset.

In the first iteration, we have $n$ clusters (one for each data point). Then, in the second iteration, we have to deal with $n-1$ clusters. In general, during iteration $i$ we have $n+1-i$ clusters from which to choose the most promising pair to merge. This is in line with the observation that in the last iteration of the algorithm (remember, we can perform at most $n-1$ merge steps) we would need to merge $n+1-(n-1)=2$ clusters, which is a trivial task.

Recall that deciding on the pair of clusters to be merged can be performed in $O(k^2)$ if the number of clusters to choose from is $k$. Since

$$\sum_{k=1}^{n} k^2 = \frac{n(n+1)(2n+1)}{6},$$

we get that the total computation performed during agglomerative clustering is

$$\sum_{i=1}^{n-1} (n+1-i)^2 = O(n^3).$$
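The step from the sum-of-squares identity to the cubic bound can be made explicit by substituting $k = n+1-i$, so that $k$ runs from $n$ down to $2$ as $i$ runs from $1$ to $n-1$:

$$\sum_{i=1}^{n-1} (n+1-i)^2 = \sum_{k=2}^{n} k^2 = \frac{n(n+1)(2n+1)}{6} - 1 = O(n^3).$$

As a quick check, for $n = 5$ the left-hand side is $25 + 16 + 9 + 4 = 54$ and the right-hand side is $\frac{5 \cdot 6 \cdot 11}{6} - 1 = 54$.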

Algorithms that are cubic in the input size are prohibitively expensive for inputs that include more than a few thousand examples, hence they do not scale well to really massive datasets.

There is a way we could improve the performance of agglomerative clustering. Notice that most of the pairwise distances calculated in the current iteration can be reused during the next iteration. To see why this is the case, recall that the first iteration requires the calculation of $O(n^2)$ pairwise distances, and by the end of the first iteration we would end up having $n-1$ clusters as a result of merging a pair of clusters together.

We could then proceed by calculating all the pairwise distances for the $n-1$ clusters that we are left with, but if we did so, we would repeat much of the work regarding the calculation of the distances for those pairs of clusters that we had already considered during the previous iteration. Actually, it would suffice to calculate the distance of the single cluster that we just created in the last iteration to all the other clusters, which were not involved in the last merging step. So, agglomerative clustering would require the calculation of $n-2$ distances in its second iteration.
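The reuse of distances can be sketched as follows. The code assumes complete or single linkage, for which the distance from a merged cluster to any other cluster is simply the maximum or minimum of the two old distances; other linkage strategies would need a different update rule. The table layout and the function name are hypothetical choices for this illustration.

```python
from itertools import combinations
from math import dist as euclid


def merge_and_update(table, a, b, combine=max):
    """Merge clusters a and b in a table keyed by frozenset({x, y}).

    combine=max corresponds to complete linkage, combine=min to single
    linkage. Returns the merged label and the number of new distances.
    """
    merged = a + b
    others = {c for pair in table for c in pair} - {a, b}
    for c in others:
        # Only distances to the new cluster are computed; the rest are kept.
        table[frozenset({merged, c})] = combine(
            table.pop(frozenset({a, c})), table.pop(frozenset({b, c}))
        )
    del table[frozenset({a, b})]
    return merged, len(others)


points = {"A": (-3, 3), "B": (-2, 2), "C": (-5, 4), "D": (1, 2), "E": (2, 2)}
table = {
    frozenset({p, q}): euclid(points[p], points[q])
    for p, q in combinations(points, 2)
}
merged, n_new = merge_and_update(table, "D", "E")
print(merged, n_new, len(table))
```

Starting from the $n = 5$ points of Table 9.1, merging D and E requires only $n - 2 = 3$ new distances, while the distances among A, B and C are carried over unchanged.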

Storing inter-cluster distances in a heap data structure could hence improve the performance of agglomerative clustering in a meaningful way. A useful property of heaps is that they offer $O(\log h)$ operations for insertion and modification, with $h$ denoting the number of elements stored in the heap. Since we would store pairwise inter-cluster distances in the heap, $h = O(n^2)$, i.e., the number of elements in our heap is upper-bounded by the squared number of data points. This means that every operation would take $O(\log n^2) = O(2 \log n) = O(\log n)$ time. Together with the fact that the number of heap operations needed per iteration is $O(n)$ and that the number of iterations performed is $O(n)$, we get that the entire algorithm can be implemented in $O(n^2 \log n)$ time.
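A heap-based variant might look like the sketch below. Two simplifications are our own and not taken from the text: instead of modifying or removing heap entries, outdated entries are left in the heap and skipped when popped (lazy deletion), and complete linkage is assumed so that new distances follow from old ones by taking a maximum. The function name and the cluster numbering scheme are hypothetical.

```python
import heapq
from itertools import combinations
from math import dist


def heap_agglomerative(points):
    """Complete-linkage clustering driven by a heap with lazy deletion."""
    # Cluster ids 0..n-1 are the singletons; merged clusters get fresh ids.
    members = {i: [i] for i in range(len(points))}
    d = {}     # (smaller id, larger id) -> inter-cluster distance
    heap = []  # entries of the form (distance, id_a, id_b)
    for i, j in combinations(range(len(points)), 2):
        d[i, j] = dist(points[i], points[j])
        heap.append((d[i, j], i, j))
    heapq.heapify(heap)

    merges, next_id = [], len(points)
    while len(members) > 1:
        cost, a, b = heapq.heappop(heap)
        if a not in members or b not in members:
            continue  # stale entry: one side was merged away earlier
        merged = members.pop(a) + members.pop(b)
        for c in members:
            # Complete linkage: distance to the union is the larger one.
            d[c, next_id] = max(d[min(a, c), max(a, c)], d[min(b, c), max(b, c)])
            heapq.heappush(heap, (d[c, next_id], c, next_id))
        members[next_id] = merged
        merges.append((sorted(merged), cost))
        next_id += 1
    return merges


pts = [(-3, 3), (-2, 2), (-5, 4), (1, 2), (2, 2)]  # A, B, C, D, E of Table 9.1
for cluster, cost in heap_agglomerative(pts):
    print(cluster, round(cost, 3))
```

Each merger pushes at most $n$ new entries, so the heap never holds more than $O(n^2)$ elements and the total number of push and pop operations is $O(n^2)$, each costing $O(\log n)$. Lazy deletion trades some extra memory for not having to locate and update entries inside the heap.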

While $O(n^2 \log n)$ is a noticeable improvement over $O(n^3)$, it is still insufficient to scale to cases in which we have hundreds of thousands of data points. When $n > 10^5$, one could either combine agglomerative clustering with some approximate technique, such as the ones discussed in Chapter 5, or resort to more efficient clustering techniques, to be introduced in what follows.
