**4 | Distances and similarities**

**4.6 Distances for distributions**

We finally turn our attention to distances that are especially used for measuring dissimilarity between probability distributions. These distances can be used, for example, for feature selection^{2}, but they can also be used to measure the distance between data points that are described by probability distributions.

^{2} Euisun Choi and Chulhee Lee. Feature extraction based on the Bhattacharyya distance. Pattern Recognition, 36(8):1703–1709, 2003. ISSN 0031-3203. doi: 10.1016/S0031-3203(03)00035-9. URL: http://www.sciencedirect.com/science/article/pii/S0031320303000359

The distances covered in the following all work for both categorical and continuous distributions. Here, we discuss the variants of the distances for the discrete case only. We shall add, however, that the definitions provided next can easily be extended to the continuous case if we replace the summations with integrals over the support of the random variables.

These distances can be applied both at the feature level and at the level of instances, i.e., the columns and rows of our data matrix, respectively. Here, we will illustrate the usage of these distances via the latter case, i.e., we would like to determine the similarity of objects that can be characterized by some categorical distribution. In the following, we will assume that we are given two categorical probability distributions P(X) and Q(X) for the random variable X.

*4.6.1* Bhattacharyya and Hellinger distances
**Bhattacharyya distance** is defined as

$$d_B(P, Q) = -\ln BC(P, Q) \tag{4.10}$$

with $BC(P, Q)$ denoting the **Bhattacharyya coefficient**, which can be calculated according to the formula

$$BC(P, Q) = \sum_{x \in \mathcal{X}} \sqrt{P(x) Q(x)}. \tag{4.11}$$

The Bhattacharyya coefficient is maximized when $P(x) = Q(x)$ for all $x \in \mathcal{X}$, in which case the coefficient simply equals $\sum_{x \in \mathcal{X}} P(x) = 1$.
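As a quick sketch of Eqs. (4.10) and (4.11) — assuming NumPy, with distributions given as plain probability vectors and function names of our own choosing — the coefficient and distance can be computed as:

```python
import numpy as np

def bhattacharyya_coefficient(p, q):
    """BC(P, Q) = sum over x of sqrt(P(x) * Q(x)), cf. Eq. (4.11)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(np.sqrt(p * q)))

def bhattacharyya_distance(p, q):
    """d_B(P, Q) = -ln BC(P, Q), cf. Eq. (4.10)."""
    return -np.log(bhattacharyya_coefficient(p, q))

# For identical distributions the coefficient is 1 and the distance is 0
p = [0.75, 0.24, 0.01]
q = [0.60, 0.30, 0.10]
print(bhattacharyya_coefficient(p, p))  # close to 1.0
print(bhattacharyya_distance(p, q))     # a positive number
```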

This is illustrated via the example of two Bernoulli distributions P and Q in Figure 4.9. Recall that the Bernoulli distribution is a discrete probability distribution which assigns some fixed amount of


Figure 4.9: The visualization of the Bhattacharyya coefficient for a pair of Bernoulli distributions P and Q.

probability mass p for the random variable taking the value 1 and a probability mass of 1 − p for the random variable being equal to 0.

The Bhattacharyya coefficient can hence be treated as a measure of similarity, i.e., the more similar two distributions are, the closer their Bhattacharyya coefficient is to one. This behavior of the Bhattacharyya coefficient further implies that whenever two distributions are the same, their Bhattacharyya distance is going to be equal to zero, as ln 1 = 0. The Bhattacharyya distance naturally obeys symmetry, which follows from the commutative nature of the multiplication and summation involved in the calculation of the distance.

The triangle inequality, however, does not hold for the Bhattacharyya distance. This is because of the non-linear nature of the logarithm applied in Eq. (4.10). The term involving the logarithm severely punishes cases when its argument is small, i.e., close to zero, and adds an infinitesimally small penalty whenever P(x) ≈ Q(x) holds.

**Example 4.7.** In order to illustrate that the triangle inequality does not hold for the Bhattacharyya distance, consider the two Bernoulli distributions P ∼ Bernoulli(0.2) and Q ∼ Bernoulli(0.6) that are visualized in Figure 4.10(a). Applying the formula for the Bhattacharyya distance from Eq. (4.10), we get

$$d_B(P, Q) = -\ln\left(\sqrt{0.2 \cdot 0.6} + \sqrt{0.8 \cdot 0.4}\right) = 0.092.$$

If the triangle inequality held, we would need to have

$$d_B(P, R) + d_B(R, Q) \geq d_B(P, Q)$$

for any possible R.

However, this is clearly not the case, as also depicted in Figure 4.10(b), where we can see that it is possible to find a distribution R such that the Bhattacharyya distance d_B(P, Q) exceeds the sum of the Bhattacharyya distances d_B(P, R) and d_B(R, Q).

Indeed, the sum of the Bhattacharyya distances gets minimized for R ∼ Bernoulli(0.4), i.e., when R lies “half way” between the distributions P and Q. In that case we get d_B(P, R) + d_B(R, Q) = 0.025 + 0.020 = 0.045, which is smaller than the previously calculated Bhattacharyya distance d_B(P, Q) = 0.092.
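The violation can be verified numerically. The sketch below (our own helper, assuming NumPy; Bernoulli distributions written as two-element probability vectors) reproduces the numbers from Example 4.7:

```python
import numpy as np

def d_b(p, q):
    """Bhattacharyya distance between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.log(np.sum(np.sqrt(p * q)))

# Bernoulli distributions from Example 4.7, written as [P(X=1), P(X=0)]
P = [0.2, 0.8]
Q = [0.6, 0.4]
R = [0.4, 0.6]  # the intermediate distribution "half way" between P and Q

direct = d_b(P, Q)              # about 0.092
via_r = d_b(P, R) + d_b(R, Q)   # about 0.045

# The triangle inequality would require via_r >= direct; here it fails:
print(direct > via_r)  # → True
```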

Figure 4.10: Illustration for the Bhattacharyya distance not obeying the triangle inequality. (a) Two Bernoulli distributions P and Q. (b) The direct Bhattacharyya distance of P and Q and that as a sum introducing an intermediate distribution R.

**Hellinger distance** is a close relative of the Bhattacharyya distance for which the triangle inequality holds and which is formally defined as

$$d_H(P, Q) = \sqrt{1 - BC(P, Q)}, \tag{4.12}$$

with $BC(P, Q)$ denoting the same Bhattacharyya coefficient as already defined in Eq. (4.11).

Figure 4.11(b) illustrates via the distributions from Example 4.7 that – contrary to the Bhattacharyya distance – the Hellinger distance fulfils the triangle inequality. The Hellinger distance further differs from the Bhattacharyya distance in its range, i.e., the former takes values from the interval [0, 1], whereas the latter takes values between zero and infinity.

An equivalent expression for the Hellinger distance reveals its relation to the Euclidean distance, i.e., it can alternatively be expressed as

$$d_H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{x \in \mathcal{X}} \left( \sqrt{P(x)} - \sqrt{Q(x)} \right)^2},$$

treating the probability distributions P and Q with k possible outcomes as vectors in the k-dimensional space.
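The agreement between the two formulations of the Hellinger distance is easy to check numerically; the sketch below (our own helpers, assuming NumPy) compares Eq. (4.12) with the scaled Euclidean distance between the square-root probability vectors:

```python
import numpy as np

def hellinger_via_bc(p, q):
    """d_H from Eq. (4.12): sqrt(1 - BC(P, Q))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(1.0 - np.sum(np.sqrt(p * q)))

def hellinger_via_euclidean(p, q):
    """d_H as the Euclidean distance of sqrt-probability vectors, scaled by 1/sqrt(2)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2.0)

P = [0.2, 0.8]
Q = [0.6, 0.4]
print(hellinger_via_bc(P, Q))         # about 0.296
print(hellinger_via_euclidean(P, Q))  # the same value
```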


Figure 4.11: Illustration for the Hellinger distance for the distributions from Example 4.7. (a) The two Bernoulli distributions P and Q from Example 4.7. (b) The direct Hellinger distance of P and Q and that as a sum introducing an intermediate distribution R.

*4.6.2* Kullback-Leibler and Jensen-Shannon divergences

**Kullback-Leibler divergence** (KL divergence for short) is formally given as

$$KL(P \,\|\, Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}.$$

Note that whenever P(x) = 0, the corresponding term of the sum cancels out, so there is no problem with the log 0 in the expression.

In order for the KL divergence to be defined, Q(x) = 0 has to imply P(x) = 0. Should the previous implication not hold, we cannot calculate the KL divergence for the pair of distributions P and Q. Recall that a similar implication in the reverse direction is not mandatory, i.e., P(x) = 0 does not have to imply Q(x) = 0 in order for the KL divergence between distributions P and Q to be quantifiable.

The previous property of the KL divergence tells us that it is not a symmetric function. Indeed, not only do there exist distributions P and Q for which KL(P ‖ Q) ≠ KL(Q ‖ P), but it is also possible that KL(P ‖ Q) is defined, whereas KL(Q ‖ P) is not.
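A minimal sketch (our own `kl` helper with natural logarithm, assuming NumPy) demonstrating both the asymmetry and the undefined case:

```python
import numpy as np

def kl(p, q):
    """KL(P || Q) with natural logarithm; terms with P(x) = 0 contribute zero."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    if np.any(q[mask] == 0):
        # Q(x) = 0 while P(x) > 0: the divergence is not defined (treated as infinite)
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

P = [0.5, 0.5, 0.0]  # P assigns zero mass to the third outcome
Q = [0.4, 0.3, 0.3]

print(kl(P, Q))  # finite, since Q(x) = 0 implies P(x) = 0 holds here
print(kl(Q, P))  # inf, since Q puts mass on an outcome where P has none
```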

It is useful to know that the KL divergence can be understood as the difference between the cross-entropy of distributions P and Q and the Shannon entropy of P, denoted as

$$KL(P \,\|\, Q) = H(P; Q) - H(P).$$

Cross-entropy is a concept similar to the Shannon entropy introduced earlier in Section 3.3.1. Notice the slight notational difference that we employ for the cross-entropy (H(P; Q)) and the joint entropy (H(P, Q)) discussed earlier in Section 3.3.1. Formally, the cross-entropy is given as

$$H(P; Q) = -\sum_{x \in \mathcal{X}} P(x) \log Q(x),$$

which quantifies the expected surprise factor for distribution Q, assuming that its possible outcomes are observed according to distribution P. In that sense, it is capable of quantifying the discrepancy between two distributions. This quantity is not symmetric in its arguments, i.e., H(P; Q) = H(Q; P) does not have to hold, which again corroborates the non-symmetric nature of the KL divergence.

The cross-entropy gets minimized when P(x) = Q(x) holds for every x from the support of the random variable X. It can be proven by using either the **Gibbs inequality** or the **log sum inequality** that H(P; Q) ≥ H(P) holds for any two distributions P and Q, which implies that KL(P ‖ Q) ≥ 0, with equality holding when P(x) = Q(x) for every value of X. The latter observation follows from the fact that H(P; P) = H(P).
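These identities are easy to verify numerically. The sketch below (our own helpers, natural logarithm, assuming NumPy) checks that H(P; Q) ≥ H(P) and that KL(P ‖ P) = 0:

```python
import numpy as np

def cross_entropy(p, q):
    """H(P; Q) = -sum_x P(x) log Q(x); terms with P(x) = 0 are skipped."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(q[mask])))

def entropy(p):
    """Shannon entropy H(P), i.e., H(P; P)."""
    return cross_entropy(p, p)

def kl(p, q):
    """KL(P || Q) = H(P; Q) - H(P)."""
    return cross_entropy(p, q) - entropy(p)

P = [0.75, 0.24, 0.01]
Q = [0.60, 0.30, 0.10]
print(cross_entropy(P, Q) >= entropy(P))  # Gibbs inequality: True
print(kl(P, P))                           # 0.0, as H(P; P) = H(P)
```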

**Jensen-Shannon divergence** (JS divergence for short) is derived from the KL divergence and additionally obeys the symmetry property.

It is so closely related to the KL divergence that it can essentially be expressed as an average of KL divergences as

$$JS(P, Q) = \frac{1}{2}\left( KL(P \,\|\, M) + KL(Q \,\|\, M) \right),$$

with M denoting the average distribution of P and Q, i.e.,

$$M = \frac{1}{2}(P + Q).$$
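A sketch of the JS divergence built on top of a KL helper (our own functions, assuming NumPy). Note that the mixture M is strictly positive wherever P or Q is, so the inner KL terms are always defined:

```python
import numpy as np

def kl(p, q):
    """KL(P || Q) with natural logarithm; terms with P(x) = 0 contribute zero."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """JS(P, Q) = (KL(P || M) + KL(Q || M)) / 2 with M = (P + Q) / 2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = (p + q) / 2.0  # M(x) > 0 wherever P(x) > 0 or Q(x) > 0
    return 0.5 * (kl(p, m) + kl(q, m))

P = [0.75, 0.24, 0.01]
Q = [0.60, 0.30, 0.10]
print(abs(js(P, Q) - js(Q, P)) < 1e-12)  # → True: JS is symmetric
```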

**Example 4.8.** Imagine Jack spends 75%, 24% and 1% of his spare time reading books, going to the cinema and hiking, respectively. Two of Jack's colleagues – Mary and Paul – devote their spare time according to the probability distributions [0.60, 0.30, 0.10] and [0.55, 0.40, 0.05] for the same activities.

Let us quantify which of Jack's colleagues spends their spare time the least dissimilarly to Jack, i.e., whose distribution lies closest to Jack's.

For measuring the distances between the distributions, let us make use of the various distances that we covered in this chapter, including the ones that are not specifically designed to be used with probability distributions.


The results of the calculations are included in Table 4.1, and a code snippet written in a vectorized style that could be used to reproduce the results can be found in Figure 4.13. A crucial thing to notice is that depending on which notion of distance we rely on, we arrive at different answers regarding which colleague of Jack has a more similar preference for spending their spare time. This implies that data mining algorithms which rely on some notion of distance might produce different outputs if we modify how we define the distance between the data points.

The two KL divergence rows of Table 4.1 also highlight the non-symmetric nature of the KL divergence.

Figure 4.12: Visualization of the distributions regarding the spare time activities in Example 4.8. R, C and H along the x-axis refer to the activities reading, going to the cinema and hiking, respectively.

|                          | Mary      | Paul      |
|--------------------------|-----------|-----------|
| city block distance      | **0.300** | 0.400     |
| Euclidean distance       | **0.185** | 0.259     |
| Chebyshev distance       | **0.150** | 0.200     |
| cosine distance          | **0.204** | 0.324     |
| Bhattacharyya distance   | 0.030     | **0.026** |
| Hellinger distance       | 0.171     | **0.160** |
| KL(Jack ‖ colleague)     | **0.091** | 0.094     |
| KL(colleague ‖ Jack)     | 0.163     | **0.114** |
| JS divergence            | 0.027     | **0.025** |

Table 4.1: Various distances between the distribution regarding Jack's leisure activities and those of his colleagues. The smaller value for each notion of distance is highlighted in bold.
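The book's own snippet (Figure 4.13) is not reproduced in this excerpt; the following independent sketch (our own function and dictionary keys, assuming NumPy; note that the cosine distance is computed here as the angle between the vectors, which is what matches the values in Table 4.1) produces the numbers above:

```python
import numpy as np

jack = np.array([0.75, 0.24, 0.01])
mary = np.array([0.60, 0.30, 0.10])
paul = np.array([0.55, 0.40, 0.05])

def kl(p, q):
    """KL divergence with natural logarithm; terms with P(x) = 0 contribute zero."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def distances(p, q):
    bc = np.sum(np.sqrt(p * q))  # Bhattacharyya coefficient
    m = (p + q) / 2.0            # mixture for the JS divergence
    return {
        "city block":    float(np.sum(np.abs(p - q))),
        "Euclidean":     float(np.linalg.norm(p - q)),
        "Chebyshev":     float(np.max(np.abs(p - q))),
        # cosine distance taken as the angle between the vectors
        "cosine":        float(np.arccos(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))),
        "Bhattacharyya": float(-np.log(bc)),
        "Hellinger":     float(np.sqrt(1.0 - bc)),
        "KL(Jack||col)": kl(p, q),
        "KL(col||Jack)": kl(q, p),
        "JS":            0.5 * (kl(p, m) + kl(q, m)),
    }

for name, col in (("Mary", mary), ("Paul", paul)):
    print(name, {k: round(v, 3) for k, v in distances(jack, col).items()})
```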