Distances for distributions

In document DATAMINING GÁBORBEREND (Pldal 68-73)

4 | DISTANCES AND SIMILARITIES

4.6 Distances for distributions

We finally turn our attention to distances that are especially used for measuring dissimilarity between probability distributions. These distances can be used, for example, for feature selection², but they can also be used to measure the distance between data points that are described by probability distributions.

² Euisun Choi and Chulhee Lee. Feature extraction based on the Bhattacharyya distance. Pattern Recognition, 36(8):1703–1709, 2003. ISSN 0031-3203. doi: https://doi.org/10.1016/S0031-3203(03)00035-9. URL http://www.sciencedirect.com/science/article/pii/S0031320303000359

The distances covered in the following all work for both categorical and continuous distributions. Here, we discuss the variants of the distances for the discrete case only. We shall add, however, that the definitions provided next can be easily extended to the continuous case by replacing the summations with integrals over the support of the random variables.

These distances can be applied both at the feature level and at the level of instances, i.e., to the columns and rows of our data matrix, respectively. Here, we illustrate the usage of these distances via the latter case, i.e., we would like to determine the similarity of objects that can be characterized by some categorical distribution. In the following, we assume that we are given two categorical probability distributions $P(X)$ and $Q(X)$ for the random variable $X$.

4.6.1 Bhattacharyya and Hellinger distances

Bhattacharyya distance is defined as

$$d_B(P,Q) = -\ln BC(P,Q) \tag{4.10}$$

with $BC(P,Q)$ denoting the Bhattacharyya coefficient, which can be calculated according to the formula

$$BC(P,Q) = \sum_{x \in X} \sqrt{P(x)Q(x)}. \tag{4.11}$$

The Bhattacharyya coefficient is maximized when $P(x) = Q(x)$ for all $x \in X$, in which case the coefficient simply equals $\sum_{x \in X} P(x) = 1$.

This is illustrated via the example of two Bernoulli distributions $P$ and $Q$ in Figure 4.9. Recall that the Bernoulli distribution is a discrete probability distribution which assigns some fixed amount of probability mass $p$ to the random variable taking the value 1 and a probability mass of $1-p$ to the random variable being equal to 0.

Figure 4.9: The visualization of the Bhattacharyya coefficient for a pair of Bernoulli distributions $P$ and $Q$. (Axes: $P$, $Q$ and $BC(P,Q)$, each ranging from 0 to 1.)

The Bhattacharyya coefficient can hence be treated as a measure of similarity, i.e., the more similar two distributions are, the closer their Bhattacharyya coefficient is to one. This behavior of the Bhattacharyya coefficient further implies that whenever two distributions are the same, their Bhattacharyya distance is equal to zero, as $\ln 1 = 0$. The Bhattacharyya distance is also symmetric, which follows from the commutativity of the multiplication and summation involved in its calculation.
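As a minimal sketch of Eqs. (4.10) and (4.11), the two quantities can be written in plain Python as shown next. The function names and the representation of a distribution as a list of probabilities are illustrative assumptions, not taken from the text.

```python
import math

def bhattacharyya_coefficient(p, q):
    """Eq. (4.11): sum over all outcomes of sqrt(P(x) * Q(x))."""
    return sum(math.sqrt(px * qx) for px, qx in zip(p, q))

def bhattacharyya_distance(p, q):
    """Eq. (4.10): negative natural logarithm of the coefficient."""
    # Guard against floating-point results that land slightly above 1.
    bc = min(bhattacharyya_coefficient(p, q), 1.0)
    # Assumed convention: distributions with disjoint support (BC = 0)
    # are reported as infinitely far apart.
    return math.inf if bc == 0.0 else -math.log(bc)

# A Bernoulli(p) distribution written as [P(X=1), P(X=0)].
P = [0.2, 0.8]
Q = [0.6, 0.4]

print(bhattacharyya_coefficient(P, P))   # coefficient of identical distributions
print(bhattacharyya_distance(P, P))      # distance of identical distributions
print(bhattacharyya_distance(P, Q), bhattacharyya_distance(Q, P))
```

Running the sketch should show a coefficient of (numerically) one and a distance of zero for identical distributions, and the same value for both argument orders.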

The triangle inequality, however, does not hold for the Bhattacharyya distance. This is because of the non-linear nature of the logarithm applied in Eq. (4.10). The logarithm severely punishes cases when its argument is small, i.e., close to zero, and adds only a very small penalty whenever $P(x) \approx Q(x)$ holds.

Example 4.7. In order to illustrate that the triangle inequality does not hold for the Bhattacharyya distance, consider the two Bernoulli distributions $P \sim \text{Bernoulli}(0.2)$ and $Q \sim \text{Bernoulli}(0.6)$ that are visualized in Figure 4.10(a). Applying the formula for the Bhattacharyya distance from Eq. (4.10), we get

$$d_B(P,Q) = -\ln\left(\sqrt{0.2 \cdot 0.6} + \sqrt{0.8 \cdot 0.4}\right) = 0.092.$$

If the triangle inequality held, we would need to have $d_B(P,R) + d_B(R,Q) \geq d_B(P,Q)$ for any possible $R$.

However, this is clearly not the case, as also depicted in Figure 4.10(b), where we can see that it is possible to find a distribution $R$ such that the Bhattacharyya distance $d_B(P,Q)$ exceeds the sum of the Bhattacharyya distances $d_B(P,R)$ and $d_B(R,Q)$.

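Indeed, the sum of Bhattacharyya distances gets minimized near $R \sim \text{Bernoulli}(0.4)$, i.e., when $R$ lies “half way” between the distributions $P$ and $Q$. For $R \sim \text{Bernoulli}(0.4)$ we get

$$d_B(P,R) + d_B(R,Q) = -\ln\left(\sqrt{0.2 \cdot 0.4} + \sqrt{0.8 \cdot 0.6}\right) - \ln\left(\sqrt{0.4 \cdot 0.6} + \sqrt{0.6 \cdot 0.4}\right) = 0.025 + 0.020 = 0.045,$$

which is smaller than the previously calculated Bhattacharyya distance $d_B(P,Q) = 0.092$.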

Figure 4.10: Illustration for the Bhattacharyya distance not obeying the property of triangle inequality. (a) Two Bernoulli distributions $P$ and $Q$. (b) The direct Bhattacharyya distance of $P$ and $Q$ and that as a sum introducing an intermediate distribution $R$.
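The counterexample of Example 4.7 can be checked numerically with a short sketch such as the one below. The helper names are hypothetical, and a Bernoulli distribution is again assumed to be stored as the list [p, 1 - p].

```python
import math

def bhattacharyya_distance(p, q):
    """Eq. (4.10), assuming the two distributions share some support."""
    bc = sum(math.sqrt(px * qx) for px, qx in zip(p, q))
    return -math.log(bc)

def bernoulli(p):
    """Bernoulli(p) as the list [P(X=1), P(X=0)]."""
    return [p, 1.0 - p]

P, Q, R = bernoulli(0.2), bernoulli(0.6), bernoulli(0.4)

direct = bhattacharyya_distance(P, Q)
detour = bhattacharyya_distance(P, R) + bhattacharyya_distance(R, Q)

print(f"direct: {direct:.3f}, via R: {detour:.3f}")
print("triangle inequality violated:", detour < direct)
```

The direct distance should come out at roughly 0.092 and the detour through R at roughly 0.045, so the sum of the two legs is shorter than the direct distance.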

Hellinger distance is a close relative of the Bhattacharyya distance for which the triangle inequality holds and which is formally defined as

$$d_H(P,Q) = \sqrt{1 - BC(P,Q)}, \tag{4.12}$$

with $BC(P,Q)$ denoting the same Bhattacharyya coefficient as already defined in Eq. (4.11).

Figure 4.11(b) illustrates via the distributions from Example 4.7 that, contrary to the Bhattacharyya distance, the Hellinger distance fulfils the triangle inequality. Hellinger distance further differs from Bhattacharyya distance in its range, i.e., the former takes values from the interval $[0, 1]$, whereas the latter takes values between zero and infinity.

An equivalent expression for the Hellinger distance reveals its relation to the Euclidean distance, i.e., it can be alternatively expressed as

$$d_H(P,Q) = \frac{1}{\sqrt{2}} \left\lVert \sqrt{P} - \sqrt{Q} \right\rVert_2,$$

treating some multivariate probability distributions $P$ and $Q$ with $k$ possible outcomes as vectors in the $k$-dimensional space. In order to

Figure 4.11: Illustration for the Hellinger distance for the distributions from Example 4.7. (a) The two Bernoulli distributions $P$ and $Q$ from Example 4.7. (b) The direct Hellinger distance of $P$ and $Q$ and that as a sum introducing an intermediate distribution $R$.
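The following sketch compares the definition of the Hellinger distance based on the Bhattacharyya coefficient with its Euclidean form, taken here to be the Euclidean distance between the element-wise square roots of the two distributions divided by the square root of two. It also repeats the triangle inequality check for the distributions of Example 4.7. All function names are assumptions made for illustration.

```python
import math

def hellinger_bc(p, q):
    """Eq. (4.12): sqrt(1 - BC(P, Q))."""
    bc = sum(math.sqrt(px * qx) for px, qx in zip(p, q))
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, 1.0 - bc))

def hellinger_euclid(p, q):
    """Euclidean form: ||sqrt(P) - sqrt(Q)||_2 / sqrt(2)."""
    squared = sum((math.sqrt(px) - math.sqrt(qx)) ** 2 for px, qx in zip(p, q))
    return math.sqrt(squared / 2.0)

# Bernoulli(0.2), Bernoulli(0.6) and the intermediate Bernoulli(0.4).
P, Q, R = [0.2, 0.8], [0.6, 0.4], [0.4, 0.6]

direct = hellinger_bc(P, Q)
detour = hellinger_bc(P, R) + hellinger_bc(R, Q)

print("both forms agree:", math.isclose(direct, hellinger_euclid(P, Q)))
print(f"direct: {direct:.3f}, via R: {detour:.3f}")
print("triangle inequality holds here:", detour >= direct)
```

The two forms agree because the squared Euclidean norm expands to $2 - 2\,BC(P,Q)$ when $P$ and $Q$ each sum to one.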

4.6.2 Kullback-Leibler and Jensen-Shannon divergences

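Kullback-Leibler divergence (KL divergence for short) is formally given as

$$KL(P \| Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}.$$

Whenever $P(x) = 0$, the corresponding term is taken to be zero (following the convention $0 \log 0 = 0$), i.e., it cancels out, so there is no problem with the log 0 in the expression.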

For the KL divergence to be defined, $Q(x) = 0$ has to imply $P(x) = 0$. Should this implication not hold, we cannot calculate the KL divergence for the pair of distributions $P$ and $Q$. Note that a similar implication in the reverse direction is not mandatory, i.e., $P(x) = 0$ does not have to imply $Q(x) = 0$ for the KL divergence between the distributions $P$ and $Q$ to be quantifiable.

The previous property of KL divergence tells us that it is not a symmetric function. Indeed, not only do there exist distributions $P$ and $Q$ for which $KL(P \| Q) \neq KL(Q \| P)$, but it is also possible that $KL(P \| Q)$ is defined, whereas $KL(Q \| P)$ is not.
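A small sketch can make the asymmetry concrete. It assumes the usual definition $KL(P \| Q) = \sum_x P(x) \log (P(x)/Q(x))$ with the natural logarithm, and it reports the undefined case as infinity, which is a convention chosen here for convenience rather than something stated in the text. The example distributions are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) with natural logarithm; math.inf marks the undefined case."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue            # a term with P(x) = 0 contributes nothing
        if qx == 0.0:
            return math.inf     # Q(x) = 0 while P(x) > 0: not defined
        total += px * math.log(px / qx)
    return total

# Hypothetical distributions over three outcomes.
P = [0.5, 0.5, 0.0]
Q = [0.25, 0.25, 0.5]

print("KL(P || Q) =", kl_divergence(P, Q))   # defined: P(x) = 0 is harmless
print("KL(Q || P) =", kl_divergence(Q, P))   # not defined: P(x) = 0 but Q(x) > 0
```

Here $KL(P \| Q)$ equals $\ln 2$, while $KL(Q \| P)$ cannot be computed because the third outcome has zero probability under $P$ but not under $Q$.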

It is useful to know that KL divergence can be understood as the difference between the cross entropy of distributions $P$ and $Q$ and the Shannon entropy of $P$, i.e.,

$$KL(P \| Q) = H(P;Q) - H(P).$$

Cross entropy is a similar concept to Shannon entropy introduced earlier in Section 3.3.1. Notice the slight notational difference between cross entropy ($H(P;Q)$) and joint entropy ($H(P,Q)$), the latter also discussed in Section 3.3.1. Formally, cross entropy is given as

$$H(P;Q) = -\sum_{x \in X} P(x) \log Q(x),$$

which quantifies the expected surprise factor for distribution $Q$ assuming that its possible outcomes are observed according to distribution $P$. In that sense, it is capable of quantifying the discrepancy between two distributions. This quantity is not symmetric in its arguments, i.e., $H(P;Q) = H(Q;P)$ does not have to hold, which again corroborates the non-symmetric nature of the KL divergence.

For a fixed $P$, cross entropy gets minimized when $P(x) = Q(x)$ holds for every $x$ from the support of the random variable $X$. It can be proven by using either the Gibbs inequality or the log sum inequality that $H(P;Q) \geq H(P)$ holds for any two distributions $P$ and $Q$, which implies that $KL(P \| Q) \geq 0$, with equality holding when $P(x) = Q(x)$ for every value $x$ of $X$. The latter observation follows from the fact that $H(P;P) = H(P)$.
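The relations between entropy, cross entropy and KL divergence stated above can be verified numerically. The sketch below uses the natural logarithm and two made-up distributions; it assumes that $Q(x) > 0$ wherever $P(x) > 0$, since otherwise the logarithm would fail.

```python
import math

def entropy(p):
    """Shannon entropy H(P) with natural logarithm."""
    return -sum(px * math.log(px) for px in p if px > 0.0)

def cross_entropy(p, q):
    """Cross entropy H(P;Q); assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0.0)

def kl_divergence(p, q):
    """KL(P || Q) under the same assumption as cross_entropy."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0.0)

# Hypothetical distributions over three outcomes.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.5, 0.3]

gap = cross_entropy(P, Q) - entropy(P)
print("H(P;Q) - H(P) =", gap)
print("KL(P || Q)    =", kl_divergence(P, Q))
print("H(P;Q) vs H(Q;P):", cross_entropy(P, Q), cross_entropy(Q, P))
print("H(P;P) equals H(P):", math.isclose(cross_entropy(P, P), entropy(P)))
```

The first two printed values should coincide up to rounding, and the two cross entropies should differ, reflecting the lack of symmetry.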

Jensen-Shannon divergence (JS divergence for short) is derived from KL divergence, additionally obeying the symmetry property. It is so closely related to KL divergence that it can essentially be expressed as an average of KL divergences as

$$JS(P,Q) = \frac{1}{2}\left(KL(P \| M) + KL(Q \| M)\right),$$

with $M$ denoting the average distribution of $P$ and $Q$, i.e.,

$$M = \frac{1}{2}(P + Q).$$
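A minimal sketch of the JS divergence follows, built on a KL helper that uses the natural logarithm. The function names and example distributions are assumptions. Because $M(x) > 0$ whenever $P(x) > 0$ or $Q(x) > 0$, both KL terms are always defined, even for pairs of distributions whose KL divergence is not.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0.0)

def js_divergence(p, q):
    """Average of KL(P || M) and KL(Q || M), with M the mean of P and Q."""
    m = [(px + qx) / 2.0 for px, qx in zip(p, q)]
    return 0.5 * (kl_divergence(p, m) + kl_divergence(q, m))

# Hypothetical distributions for which neither KL(P || Q) nor KL(Q || P)
# is defined, since each assigns zero mass to an outcome the other supports.
P = [0.5, 0.5, 0.0]
Q = [0.0, 0.5, 0.5]

print("JS(P, Q) =", js_divergence(P, Q))
print("JS(Q, P) =", js_divergence(Q, P))
```

Both calls should return the same finite value, which for this pair works out to $\frac{1}{2}\ln 2$.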

Example 4.8. Imagine Jack spends 75%, 24% and 1% of his spare time reading some book, going to the cinema and hiking, respectively. Two of Jack’s colleagues, Mary and Paul, devote their spare time according to the probability distributions $[0.60, 0.30, 0.10]$ and $[0.55, 0.40, 0.05]$ for the same activities.

Let us quantify which colleague spends their spare time in the way that is least dissimilar to Jack’s, i.e., whose distribution lies the closest to that of Jack.

For measuring the distances between the distributions, let us make use of the different distances that we covered in this chapter, including the ones that are not specifically designed to be used for probability distributions.


The results of the calculations are included in Table 4.1, and the code snippet written in vectorized style that could be used to reproduce the results can be found in Figure 4.13. A crucial thing to notice is that depending on which notion of distance we rely on, we arrive at different answers regarding which colleague of Jack has a more similar preference for spending their spare time. This implies that data mining algorithms which rely on some notion of distance might produce different outputs if we modify how we define the distance between the data points.

The penultimate row of Table 4.1 and the one before it also highlight the non-symmetric nature of KL divergence.

Figure 4.12: Visualization of the distributions regarding the spare time activities in Example 4.8. R, C and H along the x-axis refer to the activities reading, going to the cinema and hiking, respectively.

                          Mary     Paul
city block distance       0.300*   0.400
Euclidean distance        0.185*   0.259
Chebyshev distance        0.150*   0.200
cosine distance           0.204*   0.324
Bhattacharyya distance    0.030    0.026*
Hellinger distance        0.171    0.160*
KL(Jack ‖ colleague)      0.091*   0.094
KL(colleague ‖ Jack)      0.163    0.114*
JS divergence             0.027    0.025*

Table 4.1: Various distances between the distribution regarding Jack’s leisure activities and those of his colleagues. The smaller value for each notion of distance is marked with an asterisk.
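The values of Table 4.1 can be recomputed with a plain Python sketch such as the one below. It is not the vectorized snippet of Figure 4.13, and all function names are assumptions. Two choices are inferred from the numbers in the table rather than stated in the text: the logarithms are natural, and the cosine distance is taken in its angular form, i.e., as the arccosine of the cosine similarity.

```python
import math

def city_block(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def chebyshev(p, q):
    return max(abs(a - b) for a, b in zip(p, q))

def cosine_angle(p, q):
    # Assumption: angular form, arccos of the cosine similarity.
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def bhattacharyya(p, q):
    return -math.log(sum(math.sqrt(a * b) for a, b in zip(p, q)))

def hellinger(p, q):
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.sqrt(max(0.0, 1.0 - bc))

def kl(p, q):
    # Assumes q[i] > 0 wherever p[i] > 0, which holds for the data below.
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def js(p, q):
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return 0.5 * (kl(p, m) + kl(q, m))

jack = [0.75, 0.24, 0.01]
colleagues = {"Mary": [0.60, 0.30, 0.10], "Paul": [0.55, 0.40, 0.05]}

measures = [
    ("city block distance", city_block),
    ("Euclidean distance", euclidean),
    ("Chebyshev distance", chebyshev),
    ("cosine distance", cosine_angle),
    ("Bhattacharyya distance", bhattacharyya),
    ("Hellinger distance", hellinger),
    ("KL(Jack || colleague)", kl),
    ("KL(colleague || Jack)", lambda p, q: kl(q, p)),
    ("JS divergence", js),
]

# table[measure name][colleague name] -> distance from Jack
table = {
    name: {who: f(jack, dist) for who, dist in colleagues.items()}
    for name, f in measures
}

print(f"{'':<24}" + "".join(f"{who:>8}" for who in colleagues))
for name, row in table.items():
    print(f"{name:<24}" + "".join(f"{row[who]:>8.3f}" for who in colleagues))
```

Rounded to three decimals, the printed values should agree with Table 4.1, including the disagreement between measures about which colleague is closer to Jack.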
