# Information theory and its application in data mining

In document: Data Mining, Gábor Berend (pages 46–50)

## 3 | Basic concepts

### 3.3 Information theory and its application in data mining

We next review important concepts from information theory and their potential utilization when dealing with datasets.

3.3.1 Mutual information

Mutual information between two random variables X and Y is formally given as

$$MI(X;Y) = \left(H(X) + H(Y)\right) - H(X,Y), \tag{3.6}$$

i.e., it is the difference between the sum of the Shannon entropies of the individual variables, H(X) and H(Y), and their joint entropy H(X,Y). What mutual information tells us about a pair of random variables X and Y is the amount of uncertainty that is eliminated once the true value of either of the random variables is revealed to us. In other words, once we know the outcome of one of the random variables, this is the amount by which our uncertainty about the remaining random variable decreases.

Example 3.3. Imagine there are two random variables describing the weather conditions (W) and the mood of our boss (M). Throughout a series of 100 days, we jointly keep track of the daily weather and the mood of our boss and we record the contingency table¹⁰ given in Table 3.5. (¹⁰ Contingency tables store the observation statistics of multiple random variables.)

|          | M=happy | M=blue | Total |
|----------|---------|--------|-------|
| W=sunny  | 38      | 7      | 45    |
| W=rainy  | 12      | 43     | 55    |
| Total    | 50      | 50     | 100   |

Table 3.5: An example contingency table for two variables.

As illustrated by Table 3.5, without knowing anything about the weather conditions, we are totally clueless about the mood of our boss on average


Shannon entropy is a quantity which tells us the expected unpredictability of some random variable X. When X is a discrete random variable, it is formally defined as

$$H(X) = \sum_{x \in \mathcal{X}} P(X=x) \log_2 \frac{1}{P(X=x)}. \tag{3.7}$$

The logarithmic part in Eq. (3.7), as illustrated in Figure 3.8, can be thought of as a quantity which measures how surprised we get when we observe an event with probability P(X=x). Indeed, if observing a certain event has probability 1.0, we should not get surprised at all, since $\log_2 \frac{1}{1} = 0$. On the other hand, if we observe an event with an infinitesimally small probability $\varepsilon$, we should definitely become very surprised, which is also reflected by the quantity $\lim_{\varepsilon \to 0} \log_2 \frac{1}{\varepsilon} = \infty$.

Summing this quantity over the possible range of X and weighting each term with the probability of observing that value can be interpreted as an expected amount of surprise.
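This expected amount of surprise is straightforward to compute directly from Eq. (3.7); the following minimal sketch does so in Python (the function name and example distributions are ours, for illustration):

```python
import math

def shannon_entropy(probs):
    """Expected surprise of a discrete distribution, in bits.

    Each term weights the surprise log2(1/p) by the probability p
    of observing that outcome; outcomes with p == 0 contribute nothing.
    """
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# A fair coin is maximally unpredictable: 1 bit of entropy.
print(shannon_entropy([0.5, 0.5]))            # 1.0
# A certain event carries no surprise at all.
print(shannon_entropy([1.0]))                 # 0.0
# A biased coin is easier to predict, so its entropy is lower.
print(round(shannon_entropy([0.9, 0.1]), 3))  # 0.469
```

Note that the biased coin's entropy falls strictly between the two extremes, matching the intuition that partial predictability means partial surprise.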

When we have more than just a single random variable, say X and Y, the concept of entropy can naturally be extended to their so-called joint entropy, i.e.,

$$H(X,Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P(X=x,Y=y) \log_2 \frac{1}{P(X=x,Y=y)}.$$

MATH REVIEW | SHANNON ENTROPY

Figure 3.7: Shannon entropy

since the marginal distribution of the mood random variable behaves totally unpredictably, i.e.,

P(M=happy) = P(M=blue) = 0.5.

This unpredictability is also reflected by the fact that the entropy of the random variable M takes its maximal possible value, i.e.,

$$H(M) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1.$$

Analogous calculations result in H(W) = 0.993, H(M,W) = 1.690 and MI(M,W) = 0.303.
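All four quantities can be reproduced directly from the counts in Table 3.5; a minimal sketch (the variable names are ours):

```python
import math

def entropy(probs):
    # Shannon entropy in bits, as in Eq. (3.7).
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Joint counts from Table 3.5: rows are W (sunny, rainy),
# columns are M (happy, blue), out of 100 observed days.
counts = [[38, 7],
          [12, 43]]
total = sum(sum(row) for row in counts)

joint = [c / total for row in counts for c in row]
p_w = [sum(row) / total for row in counts]          # marginal of W
p_m = [sum(col) / total for col in zip(*counts)]    # marginal of M

H_M = entropy(p_m)
H_W = entropy(p_w)
H_MW = entropy(joint)
MI = H_M + H_W - H_MW  # Eq. (3.6)

print(round(H_M, 3), round(H_W, 3), round(H_MW, 3), round(MI, 3))
# 1.0 0.993 1.69 0.303
```

The printed values agree with the figures quoted above for Example 3.3.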

3.3.2 Applying mutual information to data mining: Feature selection

Imagine that we are performing classification, that is, we are given a series of multidimensional vectors that describe objects based on their predictive features, based on which our goal is to infer some special categorical target variable for the objects. For instance,

Figure 3.8: The amount of surprise for some event as a function of the probability of the event (the surprise factor plotted against the probability of an observation).

we might be given information about the customers of a financial institution who apply for a loan, and we want to decide in advance for an upcoming applicant, described by the feature vector x_i, whether this customer is a safe choice to provide the loan for. In that case, the categorical variable that we are about to predict would be a binary one, indicating whether the applicant is likely going to be able to repay the loan (class label Yes) or not (class label No).

While solving a classification problem, we might want to reduce the number of predictive features or simply order them according to their perceived utility towards predicting the target class variable.

The goal of feature selection is to choose the best subset of the predictors/features for our data mining application. Calculating mutual information is one of the many options to quantify the usefulness of predictive features towards a target variable (often denoted by Y in the literature).

Molina et al.¹¹ provide a thorough survey of the alternative approaches for finding the best performing subset of the predictive features for a given task. (¹¹ Luis Carlos Molina, Lluís Belanche, and Àngela Nebot. Feature selection algorithms: A survey and experimental evaluation. In Proceedings of the 2002 IEEE International Conference on Data Mining, ICDM '02, pages 306–, Washington, DC, USA, 2002. IEEE Computer Society. ISBN 0-7695-1754-4. URL http://dl.acm.org/citation.cfm?id=844380.844722) Note that the task of feature selection is a hard problem, since when we have m features, there are exponentially many (2^m) possibilities to form subsets of features, for which reason heuristics to speed up the process are employed most of the time.
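A common such heuristic is to score each feature individually by its mutual information with the target and keep the top-scoring ones, rather than searching all 2^m subsets. The following sketch ranks discrete features this way over a tiny made-up loan dataset (the feature names and values are invented for illustration; MI is estimated from observed frequencies via Eq. (3.6)):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI(X;Y) in bits, estimated from paired discrete observations
    as H(X) + H(Y) - H(X,Y), i.e. Eq. (3.6)."""
    n = len(xs)
    def entropy(counter):
        return sum((c / n) * math.log2(n / c) for c in counter.values())
    return (entropy(Counter(xs)) + entropy(Counter(ys))
            - entropy(Counter(zip(xs, ys))))

# Hypothetical loan data: each list holds one discrete predictive
# feature across eight applicants; y is the repayment outcome.
features = {
    "employed":  [1, 1, 0, 1, 0, 0, 1, 0],
    "has_phone": [1, 0, 1, 0, 1, 0, 1, 0],
}
y = ["Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No"]

# Rank features by how much uncertainty about y each one removes.
ranking = sorted(features,
                 key=lambda f: mutual_information(features[f], y),
                 reverse=True)
print(ranking)  # ['employed', 'has_phone']
```

In this toy data "employed" perfectly predicts y (its MI equals H(Y)), while "has_phone" is independent of y (MI of zero), so the ranking puts "employed" first. Such univariate scoring ignores interactions between features, which is exactly the trade-off the heuristics mentioned above accept for speed.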

3.3.3 Applying mutual information to data mining: Feature discretization

When representing our data with continuous features, it is sometimes also desirable to turn the continuous features into discrete ones. This process is called feature discretization, meaning that instead of measuring the actual numeric values for a particular random variable, we transform its range into discrete bins and create a feature value


which indicates which particular interval of values a particular observation falls into. That is, instead of treating the salary of a person as a specific numeric value, we can form three bins of the salaries observed in our dataset (low, medium and high) and represent the salary of each individual by the range it falls into.

The question now is how to determine the intervals which form the discrete bins for a random variable. There are multiple answers to this question. Some of the approaches are uninformed (also called unsupervised) in the sense that the bins into which we split the range of our random variable are formed without considering that the different feature values might describe data points that belong to different target classes y ∈ Y. These simple forms of feature discretization might strive to determine bins of feature ranges such that an equal number of observations falls into each bin. A different form of partitioning can instead form bins of equal width.
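The two unsupervised schemes can be sketched as follows (the salary values are made up for illustration):

```python
# Two unsupervised ways of binning ten hypothetical salary values
# into three discrete bins.
salaries = [18, 21, 24, 30, 33, 41, 55, 62, 90, 120]
n_bins = 3

# Equal-width binning: split the observed range into three intervals
# of identical length, regardless of how the values cluster.
lo, hi = min(salaries), max(salaries)
width = (hi - lo) / n_bins
equal_width = [min(int((s - lo) / width), n_bins - 1) for s in salaries]
print(equal_width)  # [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]

# Equal-frequency binning: put (roughly) the same number of
# observations into each bin instead.
order = sorted(range(len(salaries)), key=lambda i: salaries[i])
equal_freq = [0] * len(salaries)
for rank, i in enumerate(order):
    equal_freq[i] = rank * n_bins // len(salaries)
print(equal_freq)   # [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Note how equal-width binning crowds six of the ten skewed salary values into the first bin, while equal-frequency binning spreads the observations evenly; neither scheme looks at any class label.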

These equipartitioning approaches all have potential drawbacks that information theory–based approaches can remedy.

A more principled way of feature discretization relies on mutual information. With its help we can quantify the effects of discretizing X when assuming different values as boundaries for our bins. By calculating the mutual information scores that we get if we perform the discretization of X at various thresholds, we can select the boundary that is the most advantageous.

Mutual information–based feature discretization operates by calculating the mutual information between a categorical class label Y and the discretized versions of X that we obtain by binning the observations into discrete categories at different thresholds. The threshold providing us with the highest mutual information can be regarded as the most meaningful way to form the discrete intervals of our initially numeric variable X. Notice that this mutual information–based approach is more informed than the simple unsupervised equipartitioning approaches, since it also relies on the class labels Y of the observations, not only the distribution of X. For this reason, mutual information–based discretization belongs to the family of informed (or supervised) discretization techniques.
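A minimal sketch of this threshold search follows, over made-up measurements (the values and labels below are invented for illustration, not the data of Example 3.4):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    # MI(X;Y) in bits, estimated from observed frequencies
    # as H(X) + H(Y) - H(X,Y), i.e. Eq. (3.6).
    n = len(xs)
    def entropy(counter):
        return sum((c / n) * math.log2(n / c) for c in counter.values())
    return (entropy(Counter(xs)) + entropy(Counter(ys))
            - entropy(Counter(zip(xs, ys))))

# Hypothetical numeric measurements of X with their class labels Y.
x = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0, 5.5, 6.0, 7.0]
y = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b']

# Candidate thresholds: midpoints between consecutive sorted values.
xs = sorted(x)
candidates = [(u + v) / 2 for u, v in zip(xs, xs[1:])]

def binarize(values, t):
    # Discretize into two bins: at most t, or above t.
    return [value <= t for value in values]

# Keep the threshold whose induced binary feature shares the most
# information with the class label.
best = max(candidates,
           key=lambda t: mutual_information(binarize(x, t), y))
print(best)  # 2.75
```

Here the winning threshold 2.75 separates the two classes perfectly, so the binarized feature attains the maximum possible mutual information with Y, namely H(Y) itself.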

Example 3.4. Imagine that we have some numeric feature from which we have 10 measurements, originating from 10 distinct instances. The actual numeric values observed for the 10 data points are depicted in Figure 3.9.

Besides the actual values feature X takes on for the different observations, Figure 3.9 also reveals their class label Y. This information is encoded by the color of the dots representing each observation.

Let us compare the cases when we form the two bins of the feature X to be
