**3 | BASIC CONCEPTS**

**3.3 Information theory and its application in data mining**

We next review important concepts from information theory and how they can be utilized when dealing with datasets.

*3.3.1* Mutual information

**Mutual information** between two random variables X and Y is formally given as

MI(X;Y) = H(X) + H(Y) − H(X,Y),  (3.6)

i.e., it is the difference between the sum of the Shannon entropies of the individual variables, H(X) and H(Y), and their joint entropy, H(X,Y). What mutual information tells us about a pair of random variables X and Y is the amount of uncertainty which can be eliminated once the true value of either of the random variables is revealed to us. In other words, once we know the outcome of one of the random variables, this is the amount by which our uncertainty about the other random variable decreases.

**Example 3.3.** Imagine there are two random variables describing the weather conditions (W) and the mood of our boss (M). Throughout a series of 100 days, we jointly keep track of the daily weather and the mood of our boss and we record the **contingency table**^{10} given in Table 3.5.

^{10} Contingency tables store the observation statistics of multiple random variables.

|           | M=happy | M=blue | **Total** |
|-----------|---------|--------|-----------|
| W=sunny   | 38      | 7      | **45**    |
| W=rainy   | 12      | 43     | **55**    |
| **Total** | **50**  | **50** | **100**   |

Table 3.5: An example contingency table for two variables.

As illustrated by Table 3.5, without knowing anything about the weather conditions, we are totally clueless about the mood of our boss on average


**Shannon entropy** is a quantity which tells us about the expected unpredictability of some random variable X. When X is a discrete random variable, it is formally defined as

H(X) = ∑_{x∈X} P(X = x) log₂ (1 / P(X = x)).  (3.7)

The logarithmic part in Eq. (3.7), as illustrated in Figure 3.8, can be thought of as a quantity which measures how surprised we get when we observe an event with probability P(X = x). Indeed, if we assume that observing a certain event has probability 1.0, we should not get surprised at all, since log₂(1/1) = 0. On the other hand, if we observe an event with an infinitesimally small probability ε, we should definitely become very surprised, which is also reflected by the quantity lim_{ε→0} log₂(1/ε) = ∞.

Summing this quantity over the possible range of X and weighting each term with the probability of observing that value can be interpreted as an expected amount of surprise.

When we have more than just a single random variable, say X and Y, the concept of entropy can naturally be expanded to get their so-called joint entropy, i.e.,

H(X,Y) = ∑_{x∈X} ∑_{y∈Y} P(X = x, Y = y) log₂ (1 / P(X = x, Y = y)).

**MATH REVIEW | SHANNON ENTROPY**

Figure 3.7: Shannon entropy

since the marginal distribution of the mood random variable behaves totally unpredictably, i.e.,

P(M=happy) = P(M=blue) = 0.5.

This unpredictability is also reflected by the fact that the entropy of the random variable M takes its maximal possible value, i.e.,

H(M) = −0.5 log₂ 0.5 − 0.5 log₂ 0.5 = 1.

Analogous calculations result in H(W) = 0.993, H(M,W) = 1.690 and MI(M,W) = 0.303.
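The quantities of Example 3.3 can be reproduced with a short computation. The sketch below (plain Python, standard library only) derives the entropies and the mutual information directly from the counts in Table 3.5:

```python
import math

def entropy(probs):
    """Shannon entropy in bits, Eq. (3.7); zero-probability outcomes contribute 0."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Counts from Table 3.5: rows are W = sunny/rainy, columns are M = happy/blue.
counts = [[38, 7], [12, 43]]
n = sum(sum(row) for row in counts)  # 100 days in total

joint = [c / n for row in counts for c in row]                # P(W, M)
p_w = [sum(row) / n for row in counts]                        # marginal P(W)
p_m = [sum(row[j] for row in counts) / n for j in range(2)]   # marginal P(M)

h_m, h_w, h_mw = entropy(p_m), entropy(p_w), entropy(joint)
mi = h_m + h_w - h_mw  # Eq. (3.6)
# h_m = 1.0, h_w ≈ 0.993, h_mw ≈ 1.690, mi ≈ 0.303
```

Rounded to three decimals, the script recovers exactly the values computed above.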

*3.3.2* Applying mutual information to data mining: Feature selection

Imagine that we are performing **classification**, that is, we are given a series of multidimensional vectors that describe objects based on their predictive features, based on which our goal is to infer some special categorical target variable about the objects. For instance,

Figure 3.8: The amount of surprise for some event as a function of the probability of the event (x-axis: probability of an observation; y-axis: surprise factor).

we might be given information about the customers of a financial institution who apply for a loan, and we want to decide in advance for an upcoming applicant, described by the feature vector **x**_{i}, whether this customer is a safe choice to provide the loan for. In that case, the categorical variable that we are about to predict would be a binary one, indicating whether the applicant is likely going to be able to repay the loan (class label Yes) or not (class label No).

While solving a classification problem, we might want to reduce the number of predictive features or simply order them according to their perceived utility towards predicting the target class variable.

The goal of **feature selection** is to choose the best subset of the predictors/features for our data mining application. Calculating mutual information is one of the many options to quantify the usefulness of predictive features towards a target variable (often denoted by Y in the literature).

Molina et al. [2002]^{11} provides a thorough survey of the alternative

^{11} Luis Carlos Molina, Lluís Belanche, and Àngela Nebot. Feature selection algorithms: A survey and experimental evaluation. In Proceedings of the 2002 IEEE International Conference on Data Mining, ICDM '02, pages 306–, Washington, DC, USA, 2002. IEEE Computer Society. ISBN 0-7695-1754-4. URL http://dl.acm.org/citation.cfm?id=844380.844722

approaches for finding the best performing subset of the predictive features for a given task. Note that the task of feature selection is a hard problem, since when we have m features, there are exponentially many (2^m) possibilities to formulate subsets of features, for which reason heuristics to speed up the process are employed most of the time.
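One simple heuristic of this kind scores each feature independently by its mutual information with the class label and ranks the features accordingly, avoiding the search over all 2^m subsets. The sketch below illustrates this univariate approach; the dataset, feature names, and labels are invented for illustration:

```python
from collections import Counter
import math

def mutual_information(xs, ys):
    """Estimate MI(X;Y) = H(X) + H(Y) - H(X,Y) from paired samples, Eq. (3.6)."""
    def entropy(values):
        n = len(values)
        return sum((c / n) * math.log2(n / c) for c in Counter(values).values())
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Hypothetical loan dataset: two binary predictive features and the class label.
features = {
    "employed":  [1, 1, 0, 1, 0, 0, 1, 1],
    "has_phone": [1, 0, 1, 0, 1, 0, 1, 0],
}
labels = ["Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No"]

# Rank the features by their mutual information with the target variable Y.
ranking = sorted(features,
                 key=lambda f: mutual_information(features[f], labels),
                 reverse=True)
```

In this toy data, "employed" agrees with the label on 7 of the 8 observations and so carries high mutual information, while "has_phone" is independent of the label and scores zero, placing it last in the ranking. Note that such univariate scoring ignores interactions between features, which is precisely the trade-off a heuristic makes.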

*3.3.3* Applying mutual information to data mining: Feature discretization

When representing our data with continuous features, it is sometimes also desired to turn the continuous features into discrete ones. This process is called **feature discretization**, meaning that instead of measuring the actual numeric values for a particular random variable, we transform its range into discrete bins and create a feature value


which indicates which particular interval of values a particular observation falls into. That is, instead of treating the salary of a person as a specific numeric value, we can form three bins of the salaries observed in our dataset (low, medium and high) and represent the salary of each individual by the range it falls into.

The question is now: how do we determine the intervals which form the discrete bins for a random variable? There are multiple answers to this question. Some of the approaches are uninformed (also called unsupervised) in the sense that the bins into which we split the range of our random variable are formed without considering that the different feature values might describe data points that belong to different target classes y ∈ Y. These simple forms of feature discretization might strive to determine the bins of feature ranges such that an equal number of observations belongs to each bin. A different form of partitioning can go along the formulation of bins with equal widths.
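The two uninformed strategies just mentioned can be sketched in a few lines; `k` is the desired number of bins, and the function names are ours:

```python
def equal_width_bins(values, k):
    """Split the observed range into k intervals of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Map each value to a bin index 0..k-1; the maximum is clamped into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign bin indices so that each bin receives (roughly) the same number of observations."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins
```

On a skewed sample such as the salaries `[10, 12, 14, 16, 90, 95]` with k = 3, equal-width binning leaves the middle bin empty while equal-frequency binning puts two observations into each bin, which illustrates how differently the two schemes can behave on the same data.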

These forms of equipartitioning approaches all have their potential drawbacks that information theory–based approaches can remedy.

A more principled way for feature discretization relies on mutual information. We can quantify with the help of mutual information the effects of performing discretization over X when assuming different values as boundaries for our bins. By calculating the various mutual information scores that we get if we perform the discretization for X at various thresholds, we can select the boundary that is the most advantageous.

The mutual information–based feature discretization operates by calculating mutual information between a categorical class label Y and the discretized versions of X that we obtain by binning the observations into discrete categories at different thresholds. The choice of the threshold providing us with the highest mutual information can be regarded as the most meaningful way to form the different discrete intervals of our initially numeric variable X. Notice that this mutual information–based approach is more informed compared to the simple unsupervised equipartitioning approach, since it also relies on the class labels Y of the observations, not only the distribution of X. For this reason the mutual information–based discretization belongs to the family of informed (or supervised) discretization techniques.

**Example 3.4.** Imagine that we have some numeric feature from which we have 10 measurements originating from 10 distinct instances. The actual numeric values observed for the 10 data points are depicted in Figure 3.9. Besides the actual values feature X takes on for the different observations, Figure 3.9 also reveals their class label Y. This information is encoded by the color of the dots representing each observation.

Let us compare the cases when we form the two bins of the feature X to be