**8.2 Markov processes**

Markov processes are often used to model real-world events of a sequential nature. For instance, a Markov model can be applied to model weather conditions or the outcomes of sports events. In many cases, modeling sequential events is straightforward: we determine a square matrix of transition probabilities M, such that its entry m_{ij} quantifies the probability of observing the j-th possible outcome as our upcoming observation, given that we currently observe outcome i.

The main simplifying assumption in Markovian modeling is that the next thing we observe does not depend on *all* our observations from the past, but only on our last (few) observations, i.e., we operate with a limited memory. It is as if we were making our weather forecast for the upcoming day purely based on the weather conditions we experience today. This is obviously a simplification in our modeling, but it also makes sense not to condition our predictions for the future on events that happened in the distant past.

To put it more formally, suppose we have some random variable X with possible outcomes indicated as {1, 2, . . . , n}. Denoting our observation of the random variable X at time t as X_t, the Markovian assumption says that

P(X_t = j | X_{t−1} = i, . . . , X_0 = x_0) = P(X_t = j | X_{t−1} = i) = m_{ij}.
Naturally, since we would like to treat the elements of M as probabilities and every row of matrix M as a probability distribution, it has to hold that

m_{ij} ≥ 0, ∀ 1 ≤ i, j ≤ n,

∑_{j=1}^{n} m_{ij} = 1, ∀ 1 ≤ i ≤ n. (8.1)

Those matrices M that obey the properties described in (8.1) are called **row stochastic matrices**. As mentioned earlier, entry m_{ij} of such a matrix can be viewed as the probability of observing the outcome of our random variable to be j immediately after an observation of outcome i.
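The two requirements in (8.1) are mechanical to check. Below is a minimal sketch in Python (the book's own snippets are in Octave; the function name here is ours):

```python
def is_row_stochastic(M, tol=1e-9):
    """Check the two conditions of (8.1): non-negative entries, rows summing to one."""
    return all(
        all(entry >= -tol for entry in row) and abs(sum(row) - 1.0) <= tol
        for row in M
    )

# every row is a probability distribution, so this qualifies
print(is_row_stochastic([[0.5, 0.5], [0.1, 0.9]]))   # True

# the first row sums to 1.1, so this does not
print(is_row_stochastic([[0.5, 0.6], [0.2, 0.8]]))   # False
```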

**Example 8.1.** Matrices such as

[0.5 0.5; 0.1 0.9] and [1 0; 0 1]

qualify as row stochastic matrices. On the other hand, matrices such as

[0.5 0.6; 0.2 0.8] and [0.3 0.9; 0.7 0.1]

do not count as row stochastic matrices (although the last one is a column stochastic matrix).

A similar set of criteria can also be given for column stochastic matrices, in which every column – rather than every row – is required to form a probability distribution.

For stochastic matrices (be they row or column stochastic), 1.0 is always the largest eigenvalue. To see that 1 is indeed an eigenvalue, simply look at the matrix-vector product M**1** for any row stochastic matrix M, with **1** denoting the vector consisting of all ones. Since every row of M defines a probability distribution, we have M**1** = **1**, meaning that 1 is always one of the eigenvalues of any (row) stochastic matrix.

It turns out that this eigenvalue of one is always going to be the largest among the eigenvalues of any stochastic matrix.

In order to see why this is the case, let us define the **trace** matrix operator, which simply sums the elements of a matrix along its main diagonal. That is, for some matrix M,

trace(M) = ∑_{i=1}^{n} m_{ii}.

From this definition, it follows that the trace of any n-by-n stochastic matrix is always smaller than or equal to n, since every diagonal entry is at most 1. Another property of the trace is that it can be expressed as the sum of the eigenvalues of the matrix. That is,

∑_{i=1}^{n} λ_i = ∑_{i=1}^{n} m_{ii} = trace(M) ≤ n.

What this means is that in the 2-by-2 case – given that λ_1 = 1 – it has to hold that λ_2 ≤ 1. From that point, we can use induction to see that for any n-by-n stochastic matrix, 1 is the largest (principal) eigenvalue.
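Both claims – that the eigenvalues sum to the trace and that the principal eigenvalue of a stochastic matrix is 1 – can be spot-checked numerically. A sketch using NumPy (the matrix is the weather example used later in this chapter):

```python
import numpy as np

# a 3-by-3 row stochastic matrix
M = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.4, 0.4],
              [0.2, 0.1, 0.7]])

eigenvalues = np.linalg.eigvals(M)

# the eigenvalue of largest magnitude is 1 ...
print(max(abs(eigenvalues)))
# ... and the eigenvalues sum to the trace, which is at most n
print(eigenvalues.sum().real, np.trace(M))
```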

**MATH REVIEW | EIGENVALUES OF STOCHASTIC MATRICES**

Figure 8.3: Eigenvalues of stochastic matrices

*8.2.1* The stationary distribution of Markov chains

A **stochastic vector** **p** ∈ **R**^{n}_{≥0} is a non-negative vector whose entries sum to one, i.e., which forms a valid distribution. With such a vector we can express a probabilistic belief over the different states of a Markov chain.

Notice that when we calculate **pM** for some stochastic vector **p** and row stochastic matrix M, we are essentially expressing the probability distribution over the different states of our Markov process for the next time step – relative to the current time step – based on our current beliefs for observing the individual states, as expressed by **p**.
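As a quick sketch of this update (in Python rather than the chapter's Octave; the matrix is the weather example introduced later in this section):

```python
def next_distribution(p, M):
    """Compute pM: the belief over states one time step later."""
    n = len(p)
    return [sum(p[i] * M[i][j] for i in range(n)) for j in range(n)]

M = [[0.8, 0.1, 0.1],
     [0.2, 0.4, 0.4],
     [0.2, 0.1, 0.7]]

p = [1.0, 0.0, 0.0]              # currently certain we are in state 1
print(next_distribution(p, M))   # [0.8, 0.1, 0.1], i.e. row 1 of M
```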

data mining from networks 173

In order to make the dependence of our stochastic belief vector on the individual time step explicit, we will introduce **p**^{(t)} to refer to the stochastic vector at time step t. That is, **p**^{(t+1)} = **p**^{(t)}M. If we denote our initial configuration as **p**^{(0)}, we can notice that **p**^{(t)} could just as well be expressed as **p**^{(0)}M^{t}, because **p**^{(t)} is defined recursively in terms of **p**^{(0)}.
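The equivalence of the two formulations can be verified directly; a sketch with plain Python lists (the matrix is again the weather example, and t = 4 is arbitrary):

```python
def vec_mat(p, M):
    """Row vector times matrix."""
    return [sum(p[i] * M[i][j] for i in range(len(p))) for j in range(len(M[0]))]

def mat_mat(A, B):
    """Matrix times matrix, one row at a time."""
    return [vec_mat(row, B) for row in A]

M = [[0.8, 0.1, 0.1],
     [0.2, 0.4, 0.4],
     [0.2, 0.1, 0.7]]
p0 = [1.0, 0.0, 0.0]
t = 4

# p^(t) obtained by t successive updates ...
p_t = p0
for _ in range(t):
    p_t = vec_mat(p_t, M)

# ... agrees with p^(0) M^t, where M is raised to the t-th power first
M_t = M
for _ in range(t - 1):
    M_t = mat_mat(M_t, M)

print(p_t)
print(vec_mat(p0, M_t))  # the same distribution up to rounding
```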

It turns out that – under mild assumptions that we shall discuss later – there exists a unique **steady state distribution**, also called a **stationary distribution**, to which Markov chains described by a particular state transition matrix converge. This distribution tells us the long-term observation probability of each individual state of our Markov chain.

The stationary distribution for a Markov chain described by a row stochastic matrix M is the distribution **p**^{∗} fulfilling the equation

**p**^{∗} = lim_{t→∞} **p**^{(0)}M^{t}.

In other words, **p**^{∗} is the fixed point of right multiplication by the row stochastic matrix M. That is,

**p**^{∗} = **p**^{∗}M,

which implies that **p**^{∗} is a left eigenvector of M corresponding to the eigenvalue 1. For row stochastic matrices, we can always be sure that 1 is among the eigenvalues and that it is the largest one (often called the **principal eigenvalue**) of the given matrix.

Finding the eigenvector belonging to the principal eigenvalue is as simple as choosing an initial stochastic vector **p** and keeping it multiplied by the row stochastic matrix M until there is no substantial change in the resulting vector. This simple procedure is named the **power method** and it is illustrated in Algorithm 3. The algorithm is guaranteed to converge to the same principal eigenvector no matter how we choose our initial stochastic vector **p**.

**Algorithm 3:** The power algorithm for determining the principal eigenvector of matrix M.

**Require:** input matrix M ∈ **R**^{n×n} and a tolerance threshold ε
**Ensure:** principal eigenvector **p** ∈ **R**^{n}

1: **function** PowerIteration(M, ε)
2:   **p** = [1/n, . . . , 1/n] // an n-element vector consisting of all 1/n
3:   **while** ‖**p** − **pM**‖_2 > ε **do**
4:     **p** = **pM**
5:   **end while**
6:   **return p**
7: **end function**
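A direct Python rendering of Algorithm 3 might look as follows (the default tolerance value is our choice):

```python
def power_iteration(M, eps=1e-10):
    """Iterate p <- pM from the uniform distribution until p barely changes."""
    n = len(M)
    p = [1.0 / n] * n  # an n-element vector consisting of all 1/n
    while True:
        p_next = [sum(p[i] * M[i][j] for i in range(n)) for j in range(n)]
        # stop once the Euclidean distance ||p - pM||_2 drops below eps
        if sum((a - b) ** 2 for a, b in zip(p, p_next)) ** 0.5 <= eps:
            return p_next
        p = p_next

M = [[0.8, 0.1, 0.1],
     [0.2, 0.4, 0.4],
     [0.2, 0.1, 0.7]]
print(power_iteration(M))  # approaches [1/2, 1/7, 5/14]
```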

*8.2.2* Markov processes and random walks — through a concrete example

For a concrete example, let us assume that we would like to model the weather conditions of a town by a Markov chain. The three different weather conditions we can observe are Sunny (observation #1, abbreviated as S), Overcast (observation #2, abbreviated as O) and Rainy (observation #3, abbreviated as R).

Hence, we have n = 3, i.e., the random variable describing the weather of the town has three possible outcomes. This further implies that the Markov chain can be described by a 3-by-3 row stochastic matrix M, which tells us the probability of observing each pair of weather conditions on two consecutive days in all possible combinations (that is, sunny-sunny, sunny-overcast, sunny-rainy, overcast-sunny, overcast-overcast, overcast-rainy, rainy-sunny, rainy-overcast, rainy-rainy).

The transition probabilities needed for matrix M can simply be obtained by taking the maximum likelihood estimate (MLE) for any pair of observations. All we need to do to obtain MLE estimates for the transition probabilities is to keep a long record of the weather conditions and rely on the definition of conditional probability, which states that

P(A|B) = P(A, B) / P(B).

The definition of conditional probability, employed for determining the transition probabilities of our Markov chain, simply means that for a certain pair of weather conditions (w_i, w_j), in order to quantify the probability P(w_j|w_i), we have to count the number of times a day with weather condition w_i was followed by weather condition w_j and divide that by the number of days weather condition w_i was observed (irrespective of the weather condition on the upcoming day). Suppose we have observed the weather to be sunny 5,000 times and that we have seen the weather after a sunny day to be overcast 500 times and rainy 500 times. This also implies that in the remaining 5,000 − 500 − 500 = 4,000 cases, a sunny day was followed by another sunny day. In such a case, we would infer the following transition probabilities:

P(sunny|sunny) = 4000/5000 = 0.8,
P(overcast|sunny) = 500/5000 = 0.1,
P(rainy|sunny) = 500/5000 = 0.1.
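The computation above is nothing more than counting and dividing; a sketch with the counts from the running example:

```python
# next-day weather counts following the 5,000 observed sunny days
counts_after_sunny = {"sunny": 4000, "overcast": 500, "rainy": 500}

total = sum(counts_after_sunny.values())
mle = {weather: count / total for weather, count in counts_after_sunny.items()}
print(mle)  # {'sunny': 0.8, 'overcast': 0.1, 'rainy': 0.1}
```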

In a similar manner, one could collect the MLE estimates for all the transition probabilities of a Markov chain. Suppose we did so,


and obtained the matrix of transition probabilities between the three possible states of our Markov chain as

M = [0.8 0.1 0.1;
     0.2 0.4 0.4;
     0.2 0.1 0.7]. (8.2)

A state transition matrix such as the one above can also be imagined graphically. Figure 8.4 provides such a network visualization of our Markov chain. In the network interpretation of a Markov chain, each state of our random process corresponds to a node in the network, and the state transition probabilities correspond to edge weights that tell us the probability of traversing from a certain start node to a particular neighboring node. In the network view, the row stochasticity of the transition matrix means that the weights of the outgoing edges of any particular node always sum to one.

We can now rely on this graphical representation of the state transition probability matrix to model stochastic processes described by a Markov process in the form of random walks.

A **random walk** describes a stochastic temporal process in which one picks a state of the Markov process sequentially and stochastically. That is, when the random walk is at a given state, it chooses its next state randomly, proportionally to the state transition probabilities described in the matrix of transition probabilities. The stationary distribution, in this view, can be interpreted as the fraction of time a random walk spends at each state of the Markov chain.
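This interpretation can be checked by simulation. The sketch below (in Python; the step count and seed are arbitrary choices) runs a long random walk on the weather chain from this section and measures the fraction of time spent in each state:

```python
import random

M = [[0.8, 0.1, 0.1],
     [0.2, 0.4, 0.4],
     [0.2, 0.1, 0.7]]

random.seed(42)
state = 0                 # start the walk from state 1 (sunny)
visits = [0, 0, 0]
steps = 200_000
for _ in range(steps):
    # choose the next state proportionally to the current row of M
    state = random.choices(range(3), weights=M[state])[0]
    visits[state] += 1

# the visit frequencies approach the stationary distribution [1/2, 1/7, 5/14]
print([v / steps for v in visits])
```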

In our concrete example – where the weather condition is described by a Markov process with three states – we have n = 3. A stochastic vector **p** = [1, 0, 0] can be interpreted as observing weather condition #1, i.e., sunny weather, with absolute certainty. Vector **p** = [0.5, 0.14, 0.36], on the other hand, describes a stochastic weather configuration in which sunny weather is the most likely outcome (50%), followed by rainy weather (36%), with the overcast weather condition regarded as the least likely (14%).

In Figure 8.4, the radius of each node – marking one of the states of the Markov chain – is drawn proportionally to the stationary distribution of the corresponding state. Figure 8.5 contains sample code for approximating the principal eigenvector of our state transition probability matrix M. The output of the sample code also illustrates the **global convergence** to the stationary distribution, since the different initializations for **p** (the rows of matrix P) all converged to very similar solutions in as few as 10 iterations. If the number of iterations were increased, the resulting vectors would resemble each other even more closely.

Figure 8.4: An illustration of the Markov chain given by the transition probabilities from (8.2). The radii of the nodes – corresponding to the different states of our Markov chain – are proportional to their stationary distribution.

In our concrete case, the way we should interpret the stationary distribution **p**^{∗} = [0.5, 0.14, 0.36] is that, in the long run, we would expect half of the days to be sunny and approximately 2.5 times as many rainy days as overcast days, given that our state transition probabilities accurately approximate the conditional probabilities between the weather conditions of two consecutive days.
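The reported values are rounded: solving **p**^{∗} = **p**^{∗}M for the matrix in (8.2) by hand gives the exact stationary distribution [1/2, 1/7, 5/14] ≈ [0.5, 0.143, 0.357]. A sketch verifying the fixed-point property with exact rational arithmetic:

```python
from fractions import Fraction as F

# the transition matrix from (8.2), in exact rational arithmetic
M = [[F(8, 10), F(1, 10), F(1, 10)],
     [F(2, 10), F(4, 10), F(4, 10)],
     [F(2, 10), F(1, 10), F(7, 10)]]

p_star = [F(1, 2), F(1, 7), F(5, 14)]  # candidate stationary distribution
p_next = [sum(p_star[i] * M[i][j] for i in range(3)) for j in range(3)]

print(p_next == p_star)   # True: p* is exactly the fixed point p* = p*M
print(sum(p_star) == 1)   # True: it is a valid distribution
```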

M = [0.8 0.1 0.1;
     0.2 0.4 0.4;
     0.2 0.1 0.7];    % the state transition probability matrix

rand("seed", 42)      % fixing the random seed
P = rand(5, 3);       % generate 5 random vectors from [0;1]
P = P ./ sum(P, 2);   % turn rows into probability distributions

for k = 1:15
  P = P * M;          % power iteration for 15 steps
endfor
disp(P)

>> 0.49984 0.14286 0.35731
   0.49984 0.14286 0.35731
   0.49991 0.14286 0.35723
   0.49996 0.14286 0.35718
   0.49995 0.14286 0.35719

**CODE SNIPPET**

Figure 8.5: The uniqueness of the stationary distribution and its global convergence property are illustrated by the fact that all 5 random initializations for **p** converged to very similar distributions.
