Markov processes

In document: DATA MINING, GÁBOR BEREND (pages 171-177)


8.2 Markov processes

Markov processes are often used to model real-world events of a sequential nature. For instance, a Markov model can be applied to modeling weather conditions or the outcome of sports events. In many cases, modeling sequential events is straightforward: we determine a square matrix of transition probabilities $M$, such that its entry $m_{ij}$ quantifies the probability of observing the $j$-th possible outcome as our next observation, given that we currently observe outcome $i$.

The main simplifying assumption in Markovian modeling is that the next thing we observe does not depend on all our observations from the past, but only on our last (few) observations, i.e., we operate with a limited memory. It is as if we were making our weather forecast for the upcoming day purely based on the weather conditions we experience today. This is obviously a simplification, but it also makes sense not to condition our predictions for the future on events that happened in the distant past.

To put it more formally, suppose we have some random variable $X$ with possible outcomes $\{1, 2, \ldots, n\}$. Denoting our observation of the random variable $X$ at time $t$ by $X_t$, the Markovian assumption says that

$$P(X_t = j \mid X_{t-1} = i, \ldots, X_0 = x_0) = P(X_t = j \mid X_{t-1} = i) = m_{ij}.$$

Naturally, since we would like to treat the elements of $M$ as probabilities and every row of matrix $M$ as a probability distribution, it has to hold that

$$m_{ij} \geq 0, \quad \forall\, 1 \leq i, j \leq n, \qquad \sum_{j=1}^{n} m_{ij} = 1, \quad \forall\, 1 \leq i \leq n. \tag{8.1}$$

Those matrices $M$ that obey the properties described in 8.1 are called row stochastic matrices. As mentioned earlier, entry $m_{ij}$ of such a matrix can be viewed as the probability of observing outcome $j$ immediately after observing outcome $i$.

Example 8.1. The next matrices qualify as row stochastic matrices:


. On the other hand, matrices


do not count as row stochastic matrices (although the last one is a column stochastic matrix).

Matrices obeying the properties from 8.1 are called row stochastic matrices. A similar set of criteria can also be given for column stochastic matrices, in which every column (rather than every row) sums to one.
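The two conditions in 8.1 are easy to check mechanically. The following is a minimal sketch, assuming matrices are stored as nested Python lists; the function name `is_row_stochastic`, the tolerance `tol`, and the matrices `A`, `B`, `C` are hypothetical illustrations and are not the matrices of Example 8.1.

```python
def is_row_stochastic(matrix, tol=1e-9):
    """Return True if the matrix is square, non-negative, and every row sums to one."""
    n = len(matrix)
    for row in matrix:
        if len(row) != n:                  # must be a square matrix
            return False
        if any(entry < 0 for entry in row):  # m_ij >= 0
            return False
        if abs(sum(row) - 1.0) > tol:      # sum_j m_ij = 1 (up to rounding)
            return False
    return True


# Hypothetical matrices chosen for illustration only.
A = [[0.5, 0.5], [0.1, 0.9]]    # every row sums to one
B = [[0.5, 0.1], [0.5, 0.9]]    # every column sums to one, the rows do not
C = [[1.2, -0.2], [0.3, 0.7]]   # rows sum to one, but one entry is negative

B_transposed = [list(col) for col in zip(*B)]

print(is_row_stochastic(A))             # True
print(is_row_stochastic(B))             # False
print(is_row_stochastic(B_transposed))  # True: B is column stochastic
print(is_row_stochastic(C))             # False
```

Checking the transpose is one simple way to test for column stochasticity with the same helper.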

For stochastic matrices (be they row or column stochastic), we always have 1 as their largest eigenvalue. To see that 1 is an eigenvalue, simply look at the matrix-vector product $M\mathbf{1}$ for any row stochastic matrix $M$, with $\mathbf{1}$ denoting the vector consisting of all ones.

Because every row of $M$ defines a probability distribution, we have $M\mathbf{1} = \mathbf{1}$, meaning that 1 is always one of the eigenvalues of any (row) stochastic matrix.

It turns out that this eigenvalue of one is always going to be the largest among the eigenvalues of any stochastic matrix.

In order to see why this is the case, let us define the trace matrix operator, which simply sums all the elements of a matrix along its main diagonal. That is, for some matrix $M$,

$$\operatorname{trace}(M) = \sum_{i=1}^{n} m_{ii}.$$
From this definition, it follows that the trace of any $n$-by-$n$ stochastic matrix is always smaller than or equal to $n$. Another property of the trace is that it equals the sum of the eigenvalues of the matrix. That is,

$$\sum_{i=1}^{n} m_{ii} = \sum_{i=1}^{n} \lambda_i.$$
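In the 2-by-2 case this means that, given $\lambda_1 = 1$, $\lambda_2 \leq 1$ has to hold. More generally, no eigenvalue of a row stochastic matrix can exceed 1 in absolute value: if $Mv = \lambda v$ and $k$ is the index of the entry of $v$ with the largest absolute value, then $|\lambda|\,|v_k| = \left|\sum_j m_{kj} v_j\right| \leq \sum_j m_{kj} |v_j| \leq |v_k|$, so $|\lambda| \leq 1$. Hence, for any $n$-by-$n$ stochastic matrix we have 1 as the largest (principal) eigenvalue.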
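As a small worked check of the trace argument, consider a 2-by-2 row stochastic matrix. The matrix below is a hypothetical example chosen for illustration; it does not come from the text.

```latex
M = \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{bmatrix},
\qquad
\operatorname{trace}(M) = 0.9 + 0.5 = 1.4 \leq 2.

\det(M - \lambda I) = (0.9 - \lambda)(0.5 - \lambda) - 0.1 \cdot 0.5
                    = \lambda^2 - 1.4\lambda + 0.4
                    = (\lambda - 1)(\lambda - 0.4).

\lambda_1 = 1, \qquad \lambda_2 = 0.4, \qquad
\lambda_1 + \lambda_2 = 1.4 = \operatorname{trace}(M).
```

The eigenvalues sum to the trace, and the second eigenvalue is indeed no larger than 1.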


Figure 8.3: Eigenvalues of stochastic matrices

8.2.1 The stationary distribution of Markov chains

A stochastic vector $p \in \mathbb{R}^n_{\geq 0}$ is a vector whose entries sum to one, i.e., one that forms a valid probability distribution. With such a vector we can express a probabilistic belief over the different states of a Markov chain.

Notice that when we calculate $pM$ for some stochastic vector $p$ and row stochastic matrix $M$, we are essentially expressing the probability distribution over the different states of our Markov process for the next time step, based on our current beliefs for observing the individual states, as expressed by $p$.

data m i n i n g f ro m n e t w o r k s 173

In order to make explicit the dependence of our stochastic belief vector on the individual time step in the simulation, we introduce $p^{(t)}$ to refer to the stochastic vector at time step $t$. That is, $p^{(t+1)} = p^{(t)} M$. If we denote our initial configuration by $p^{(0)}$, we can notice that $p^{(t)}$ could just as well be expressed as $p^{(0)} M^t$, because $p^{(t)}$ is defined recursively in terms of $p^{(0)}$.
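The equivalence between the recursive form and the closed form can be checked numerically. The sketch below is only an illustration: the helper names `vec_mat` and `mat_mat`, the 3-state matrix `M`, and the starting vector `p0` are assumptions made for this example.

```python
def vec_mat(p, M):
    """Row vector times matrix: returns p M."""
    n = len(M)
    return [sum(p[i] * M[i][j] for i in range(n)) for j in range(n)]


def mat_mat(A, B):
    """Matrix product A B, computed one row of A at a time."""
    return [vec_mat(row, B) for row in A]


# Illustrative row stochastic matrix and a start in state 1 with certainty.
M = [[0.8, 0.1, 0.1],
     [0.2, 0.4, 0.4],
     [0.2, 0.1, 0.7]]
p0 = [1.0, 0.0, 0.0]

# Recursive form: p(t+1) = p(t) M, applied three times.
p_recursive = p0
for _ in range(3):
    p_recursive = vec_mat(p_recursive, M)

# Closed form: p(3) = p(0) M^3.
M_cubed = mat_mat(mat_mat(M, M), M)
p_closed = vec_mat(p0, M_cubed)

print([round(x, 4) for x in p_recursive])
print([round(x, 4) for x in p_closed])
```

Both lines should print approximately [0.608, 0.139, 0.253], and the entries still sum to one after every step.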

It turns out that, under mild assumptions that we shall discuss later, there exists a unique steady state distribution, also called a stationary distribution, to which a Markov chain described by a particular state transition matrix converges. This distribution tells us the long-term observation probability of the individual states of our Markov chain.

The stationary distribution for a Markov chain described by a row stochastic matrix $M$ is the distribution $p$ fulfilling the equation

$$p = \lim_{t \to \infty} p^{(0)} M^t.$$

In other words, $p$ is the fixed point of the operation of right multiplication with the row stochastic matrix $M$. That is,

$$pM = p,$$

which implies that $p$ is the left eigenvector of $M$ corresponding to the eigenvalue 1. For row stochastic matrices, we can always be sure that 1 is among their eigenvalues and that it is the largest eigenvalue (often called the principal eigenvalue) of the given matrix.

Finding the eigenvector belonging to the principal eigenvalue is as simple as choosing an initial stochastic vector $p$ and repeatedly multiplying it by the row stochastic matrix $M$ until there is no substantial change in the resulting vector. This simple procedure is called the power method and it is illustrated in Algorithm 3. Under the mild assumptions mentioned above, the algorithm is guaranteed to converge to the same principal eigenvector no matter how we choose our initial stochastic vector $p$.

Algorithm 3: The power algorithm for determining the principal eigenvector of matrix M.

Require: Input matrix M ∈ R^{n×n} and a tolerance threshold ε
Ensure: principal eigenvector p ∈ R^n

function PowerIteration(M, ε)
    p = [1/n]_{i=1}^{n}    // an n-element vector consisting of all 1/n
    while ‖p − pM‖_2 > ε do
        p = pM
    end while
    return p
end function
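Algorithm 3 can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `power_iteration`, the default tolerance, the `max_iter` safety cap (which is not part of Algorithm 3), and the example matrix are assumptions made here.

```python
import math


def power_iteration(M, eps=1e-10, max_iter=10_000):
    """Approximate the left principal eigenvector of a row stochastic matrix M."""
    n = len(M)
    p = [1.0 / n] * n                      # uniform start: all entries 1/n
    for _ in range(max_iter):              # safety cap, an addition to Algorithm 3
        p_next = [sum(p[i] * M[i][j] for i in range(n)) for j in range(n)]
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, p_next)))
        if change <= eps:                  # ||p - pM||_2 <= eps: stop
            return p
        p = p_next                         # p = pM
    return p


# Illustrative 3-state row stochastic matrix.
M = [[0.8, 0.1, 0.1],
     [0.2, 0.4, 0.4],
     [0.2, 0.1, 0.7]]

p = power_iteration(M)
print([round(x, 5) for x in p])
```

For this matrix the result should be close to [0.5, 0.14286, 0.35714], i.e. the vector [1/2, 1/7, 5/14].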

8.2.2 Markov processes and random walks through a concrete example

For a concrete example, let us assume that we would like to model the weather conditions of a town by a Markov chain. The three different weather conditions we can observe are Sunny (observation #1, abbreviated as S), Overcast (observation #2, abbreviated as O) and Rainy (observation #3, abbreviated as R).

Hence, we have $n = 3$, i.e., the random variable describing the weather of the town has three possible outcomes. This further implies that the Markov chain can be described by a 3-by-3 row stochastic matrix $M$, which tells us the probability of observing one weather condition on the day after another, for all possible combinations (that is, sunny-sunny, sunny-overcast, sunny-rainy, overcast-sunny, overcast-overcast, overcast-rainy, rainy-sunny, rainy-overcast, rainy-rainy).

The transition probabilities needed for matrix $M$ can be obtained simply by taking the maximum likelihood estimate (MLE) for every pair of observations. All we need to do to obtain the MLE of the transition probabilities is to keep a long record of the weather conditions and rely on the definition of conditional probability, which states that

$$P(A \mid B) = \frac{P(A, B)}{P(B)}.$$

Applied to the transition probabilities of our Markov chain, the definition of conditional probability means that, for a certain pair of weather conditions $(w_i, w_j)$, we quantify the probability $P(w_j \mid w_i)$ by counting the number of times a day with weather condition $w_i$ was followed by weather condition $w_j$ and dividing that by the number of days on which weather condition $w_i$ was observed (irrespective of the weather condition on the following day). Suppose we have observed the weather to be sunny 5,000 times and that a sunny day was followed by an overcast day 500 times and by a rainy day 500 times.

This also implies that in the remaining 5,000 - 500 - 500 = 4,000 cases, a sunny day was followed by another sunny day. In such a case, we would infer the following transition probabilities:

$$P(\text{sunny} \mid \text{sunny}) = \frac{4000}{5000} = 0.8, \qquad P(\text{overcast} \mid \text{sunny}) = \frac{500}{5000} = 0.1, \qquad P(\text{rainy} \mid \text{sunny}) = \frac{500}{5000} = 0.1.$$
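The counting procedure can be sketched as follows. This is an illustration under stated assumptions: the function name `estimate_transitions` and the short observation sequence are hypothetical, and each row is normalized by the number of days that have a recorded successor (so the very last observation is not counted as a starting day).

```python
from collections import Counter


def estimate_transitions(observations, states):
    """MLE of transition probabilities from one sequence of observed states."""
    # Count how often state a was immediately followed by state b.
    pair_counts = Counter(zip(observations, observations[1:]))
    M = []
    for a in states:
        total = sum(pair_counts[(a, b)] for b in states)
        # A state that never has a successor yields an all-zero row here.
        M.append([pair_counts[(a, b)] / total if total else 0.0 for b in states])
    return M


# Hypothetical toy record of ten days: S = sunny, O = overcast, R = rainy.
observations = list("SSSSOSRRSS")
M_hat = estimate_transitions(observations, states="SOR")
for row in M_hat:
    print([round(x, 3) for x in row])

# The counts quoted in the text for sunny days: 4,000, 500 and 500 out of 5,000.
sunny_row = [count / 5000 for count in (4000, 500, 500)]
print(sunny_row)
```

With only ten observations the estimates are very rough; the point of the text is that a long record is needed for them to be reliable.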

In a similar manner, one could collect the MLE for all the transition probabilities of a Markov chain. Suppose we did so, and obtained the matrix of transition probabilities between the three possible states of our Markov chain as

$$M = \begin{bmatrix} 0.8 & 0.1 & 0.1 \\ 0.2 & 0.4 & 0.4 \\ 0.2 & 0.1 & 0.7 \end{bmatrix}. \tag{8.2}$$

A state transition matrix, such as the one above, can also be pictured graphically. Figure 8.4 provides such a network visualization of our Markov chain. In the network interpretation of a Markov chain, each state of our random process corresponds to a node in the network, and state transition probabilities correspond to the edge weights that tell us the probability of traversing from a certain start node to a particular neighboring node. In the network view, the row stochasticity of the transition matrix means that the weights of the outgoing edges of any node always sum to one.

We can now rely on this graphical representation of the state transition probability matrix to model stochastic processes that are described by a Markov process in the form of random walks.

A random walk describes a stochastic temporal process in which one picks a state of the Markov process sequentially and stochastically. That is, when the random walk is at a given state, it chooses its next state randomly, in proportion to the state transition probabilities described in the matrix of transition probabilities. In that view, the stationary distribution can be interpreted as the fraction of time a random walk spends at each state of the Markov chain.
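The "fraction of time" interpretation can be illustrated by simulating a random walk and counting visits. The sketch below is an assumption-laden illustration: the function name `random_walk_frequencies`, the seed, the number of steps, and the example matrix are choices made here, and the result is only a stochastic approximation.

```python
import random


def random_walk_frequencies(M, start, steps, seed=0):
    """Simulate a random walk on M and return the fraction of time spent in each state."""
    rng = random.Random(seed)              # fixed seed for reproducibility
    n = len(M)
    visits = [0] * n
    state = start
    for _ in range(steps):
        visits[state] += 1
        # Pick the next state in proportion to the current row of M.
        state = rng.choices(range(n), weights=M[state])[0]
    return [v / steps for v in visits]


# Illustrative 3-state row stochastic matrix.
M = [[0.8, 0.1, 0.1],
     [0.2, 0.4, 0.4],
     [0.2, 0.1, 0.7]]

freq = random_walk_frequencies(M, start=0, steps=100_000)
print([round(f, 3) for f in freq])
```

For a long enough walk, the printed frequencies should be close to the stationary distribution of this matrix, roughly [0.5, 0.143, 0.357], regardless of the start state.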

In our concrete example, where the weather condition is described by a Markov process with three states, we have $n = 3$. A stochastic vector $p = [1, 0, 0]$ can be interpreted as observing weather condition #1, i.e., sunny weather, with absolute certainty. Vector $p = [0.5, 0.14, 0.36]$, on the other hand, describes a stochastic weather configuration in which sunny weather is the most likely outcome (50%), followed by rainy weather (36%), with overcast weather being the least likely (14%).

In Figure 8.4, the radius of each node, marking one of the states of the Markov chain, is drawn proportionally to the stationary probability of that state. Figure 8.5 contains sample code for approximating the principal eigenvector of our state transition probability matrix $M$. The output of the sample code also illustrates the global convergence of the stationary distribution, since the different initializations for $p$ (the rows of matrix $P$) all converged to a very similar solution after only 15 iterations. If the number of iterations were increased, the resulting vectors would resemble each other even more closely.



Figure 8.4: An illustration of the Markov chain given by the transition probabilities from 8.2. The radii of the nodes, corresponding to the different states of our Markov chain, are proportional to their stationary distribution.

In our concrete case, we should interpret the stationary distribution $p = [0.5, 0.14, 0.36]$ as follows: in the long run, we would expect half of the days to be sunny and approximately 2.5 times as many rainy days as overcast days, given that our state transition probabilities accurately approximate the conditional probabilities between the weather conditions of two consecutive days.
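The rounded values above can be recovered by hand. The derivation below is a worked check, not part of the original text: it solves $pM = p$ for the matrix in 8.2 together with the normalization constraint, writing the unknown stationary vector as $p = [p_1, p_2, p_3]$ (notation introduced here).

```latex
p_1 + p_2 + p_3 = 1

% first column of M:
p_1 = 0.8\,p_1 + 0.2\,p_2 + 0.2\,p_3
  \;\Rightarrow\; p_1 = p_2 + p_3
  \;\Rightarrow\; p_1 = \tfrac{1}{2}

% second column of M:
p_2 = 0.1\,p_1 + 0.4\,p_2 + 0.1\,p_3
  \;\Rightarrow\; 0.6\,p_2 = 0.1\,(1 - p_2)
  \;\Rightarrow\; p_2 = \tfrac{1}{7} \approx 0.143

% normalization:
p_3 = 1 - \tfrac{1}{2} - \tfrac{1}{7} = \tfrac{5}{14} \approx 0.357,
\qquad \frac{p_3}{p_2} = 2.5
```

So the exact stationary distribution is [1/2, 1/7, 5/14], which rounds to the [0.5, 0.14, 0.36] quoted in the text.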

M = [0.8 0.1 0.1;
     0.2 0.4 0.4;
     0.2 0.1 0.7];   % the state transition probability matrix
rand("seed", 42)     % fixing the random seed
P = rand(5, 3);      % generate 5 random vectors with entries from [0, 1]
P = P ./ sum(P, 2);  % turn rows into probability distributions
for k = 1:15
  P = P * M;         % power iteration for 15 steps
endfor
disp(P)              % print the 5 resulting distributions, one per row

>> 0.49984 0.14286 0.35731
   0.49984 0.14286 0.35731
   0.49991 0.14286 0.35723
   0.49996 0.14286 0.35718
   0.49995 0.14286 0.35719


Figure 8.5: The uniqueness of the stationary distribution and its global convergence property are illustrated by the fact that all 5 random initializations for $p$ converged to a very similar distribution.
