DRAFT
9.3 Log-loss and Tracking
In Section 5.2 we considered the problem of assigning probabilities to symbols coming from a finite set Y under the log-loss. The decision set is
\[
D = \Big\{ p \in [0,1]^Y : \sum_{y \in Y} p(y) = 1 \Big\}.
\]
In every round t, the environment picks a symbol y_t \in Y. The loss of prediction p \in D is \ell_t(p) = -\ln p(y_t). Hence,
\[
\mathcal{L} = \big\{ \ell^{(y)} \,:\, \ell^{(y)} \colon D \to [0,\infty),\ \ell^{(y)}(p) = -\ln p(y) \big\}.
\]
Assume we have N experts to help us with this prediction problem, each expert predicting some probability distribution f_{t,i} \in D over Y in every time step. The probability distribution f_{t,i} chosen by expert i at round t is allowed to depend on the past symbols y_1, \dots, y_{t-1}. To emphasize this, we write p_t(\cdot|y_{1:t-1}, i) instead of f_{t,i}. Thus, the experts are identified by these maps (p_t(\cdot|\cdot, i))_{t,i}. This notation helped us to make a connection to Bayesian prediction. As before, the goal is to compete with the best expert in hindsight: after n rounds, the regret of a forecaster that predicts p_t \in D in round t is
\[
R_n = \sum_{t=1}^n \ell_t(p_t) - \min_{1 \le i \le N} \sum_{t=1}^n \ell_t(f_{t,i}).
\]
Select a probability mass function p0 over {1, . . . , N}. We can think of p0 as assigning the “prior” probability p0(i) to expert i. The maps (pt(·|·, i))t,i together with p0 define a probability distribution over Yn, the space of sequences of length n:
\[
P(x_1, \dots, x_n) = \sum_{i=1}^N p_0(i) \underbrace{\prod_{t=1}^n p(x_t|x_{1:t-1}, i)}_{P(x_1, \dots, x_n|i)}, \qquad x_1, \dots, x_n \in Y. \tag{9.1}
\]
This is a mixture of the N probability distributions P(\cdot|1), \dots, P(\cdot|N). Drawing X_1, \dots, X_n from this probability distribution P can be thought of as first picking the index I of an expert and then drawing X_1, X_2, \dots sequentially from p_t(\cdot|X_{1:t-1}, I).
As derived before, according to Bayes' law, for any x_1, \dots, x_t \in Y,
\[
P(x_t|x_{1:t-1}) = \frac{\sum_i P(x_t|x_{1:t-1}, i) P(x_{1:t-1}|i) p_0(i)}{\sum_i P(x_{1:t-1}|i) p_0(i)}.
\]
We have also seen that the regret after n rounds satisfies
\[
R_n \le \ln N.
\]
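The mixture forecaster and its \ln N regret bound are easy to check numerically. The following is a minimal sketch, assuming constant experts on a binary alphabet and an i.i.d. Bernoulli environment; the experts, the environment, and the function name are illustrative choices, not from the notes:

```python
import math
import random

def bayes_mixture_regret(n=200, N=5, seed=0):
    """Bayesian mixture over N constant experts under log-loss.

    Each expert i always assigns probability q_i to the symbol 1;
    the forecaster predicts with the posterior mixture.  Returns the
    regret against the best expert, which is always in [0, ln N].
    """
    rng = random.Random(seed)
    q = [(i + 1) / (N + 1) for i in range(N)]  # expert i's fixed prediction
    w = [1.0 / N] * N                          # prior p0(i), becomes posterior
    loss_alg = 0.0
    loss_exp = [0.0] * N
    for _ in range(n):
        y = rng.random() < 0.7                 # environment: Bernoulli(0.7)
        # Mixture prediction P(y_t | y_{<t}) = sum_i w_i p_i(y_t)
        p_y = sum(w[i] * (q[i] if y else 1 - q[i]) for i in range(N))
        loss_alg += -math.log(p_y)
        for i in range(N):
            p_i = q[i] if y else 1 - q[i]
            loss_exp[i] += -math.log(p_i)
            w[i] *= p_i                        # unnormalized Bayes update
        s = sum(w)
        w = [wi / s for wi in w]               # normalize to the posterior
    return loss_alg - min(loss_exp)
```

Whatever the sequence, the returned regret stays below \ln N, matching the bound above.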
For exp-concave losses, we would get the same bound. If we consider the meta-experts of the previous section, this gives
\[
R_n = O(B \ln(n) + B \ln(N)).
\]
Hence, the regret now grows with the number of rounds, as opposed to the \ln N bound above, which is independent of n.
Now assume that the meta-experts only use at most M base experts out of the N experts, where possibly M \ll N:
\[
E_{n,B} = \Big\{ f \colon \{1,\dots,n\} \to \{1,\dots,N\} : \sum_{t=1}^{n-1} \mathbb{I}\{f(t) \ne f(t+1)\} \le B-1,\ |\{f(1), \dots, f(n)\}| \le M \Big\}.
\]
We do not know which of the experts are to be used, but we assume that only a few experts are needed to get a good loss. This is a form of sparsity bias. The number of meta-experts in this case is at most
\[
M'(n, N, B, M) = \binom{N}{M} M(n, M, B),
\]
which can be much smaller than M(n, N, B). Taking the log,
\[
\ln M' \approx M \ln N + B(\ln M + \ln n).
\]
If we were able to implement EWA, we would get
\[
\mathbb{E}[R_n] \le \ln \binom{N}{M} + \ln \binom{n-1}{B-1} + B \ln M. \tag{9.2}
\]
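The gain from sparsity can be inspected numerically. A small sketch, where we take the comparable bound without the sparsity restriction to be \ln\binom{n-1}{B-1} + B \ln N (an assumption based on counting roughly N choices per segment; the function names are made up):

```python
import math

def sparse_tracking_bound(n, N, B, M):
    """Right-hand side of the regret bound (9.2):
    ln C(N, M) + ln C(n-1, B-1) + B ln M."""
    return (math.log(math.comb(N, M))
            + math.log(math.comb(n - 1, B - 1))
            + B * math.log(M))

def dense_tracking_bound(n, N, B):
    """Comparable bound when any of the N experts may be used in each
    of the B segments: ln C(n-1, B-1) + B ln N (illustrative)."""
    return math.log(math.comb(n - 1, B - 1)) + B * math.log(N)
```

For example, with n = 1000, N = 10000, B = 10, M = 3, the sparse bound replaces the B \ln N \approx 92 term by \ln\binom{N}{M} + B \ln M \approx 37, a substantial saving when M \ll N.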
9.3.1 Specialist Framework
A partition specialist \pi is given by a subset W \subset \{1, \dots, n\} of awake times and an expert index 1 \le m \le N. If the awake times of some partition specialists \pi_1, \dots, \pi_k form a partition of \{1, \dots, n\}, then together they represent a single meta-expert: in each round, predict as the expert of the unique specialist that is awake.
One meta-expert can be represented by at most M specialists. If a specialist is awake, then it predicts as the underlying expert; otherwise it predicts as the algorithm.
The regret against specialist (W, m) is
\[
R_n(W, m) = \sum_{t \in W} \big( -\ln P(y_t|y_{<t}) + \ln P(y_t|y_{<t}, m) \big).
\]
Then,
\[
R_n(W, m) = -\ln P(y_{\le n}) + \ln P(y_{\le n}|m, W).
\]
Take any distribution U over the set of specialists. Then,
\[
R_n(U) \doteq \sum_{(W,m)} U(W, m) R_n(W, m) = \sum_{(W,m)} U(W, m) \big( -\ln P(y_{\le n}) + \ln P(y_{\le n}|m, W) \big).
\]
U will be uniform over the M specialists representing a single meta-expert.
For two distributions U, P over the same finite domain, define the divergence to be
\[
D(U, P) = \sum_i U(i) \ln \frac{U(i)}{P(i)}.
\]
Let P be the initial distribution over the set of specialists, and predict using Bayes' rule. Applying Bayes' rule, we get
\[
R_n(U) = D(U, P) - D(U, P(\cdot|y_{\le n})).
\]
But can we calculate the Bayes update? By our convention that the specialists predict as the algorithm when they are asleep,
\[
P(y_t|y_{<t}) = \sum_{(W,m):\, t \in W} P(y_t|y_{<t}, m) P((W, m)|y_{<t}) + \sum_{(W,m):\, t \notin W} P(y_t|y_{<t}) P((W, m)|y_{<t}).
\]
Solving this equation for P(y_t|y_{<t}) we get
\[
P(y_t|y_{<t}) = \frac{\sum_{(W,m):\, t \in W} P(y_t|y_{<t}, m) P((W, m)|y_{<t})}{1 - \sum_{(W,m):\, t \notin W} P((W, m)|y_{<t})}
= \frac{\sum_{(W,m):\, t \in W} P(y_t|y_{<t}, m) P((W, m)|y_{<t})}{\sum_{(W,m):\, t \in W} P((W, m)|y_{<t})}.
\]
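The last expression says that the prediction is the posterior mixture of the awake specialists, renormalized over the awake set. A minimal sketch of this computation (variable names are illustrative):

```python
def specialist_prediction(preds, posterior, awake):
    """Prediction P(y_t | y_{<t}) from awake specialists only.

    preds[s]     -- probability specialist s assigns to the observed
                    symbol (only used for awake specialists);
    posterior[s] -- current posterior weight P(s | y_{<t});
    awake        -- indices of specialists awake at time t.
    """
    num = sum(posterior[s] * preds[s] for s in awake)
    den = sum(posterior[s] for s in awake)  # = P(awake set | y_{<t})
    return num / den
```

In particular, when only one specialist is awake, the algorithm simply predicts as that specialist, as the convention requires.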
================== Lecture on Oct 28: ==============
Prediction with log-loss, N experts; predictions of the mth expert: p(y_t|y_{<t}, m). The losses are
\[
\hat{L}_n = -\ln p(y_{\le n}), \qquad L_{n,m} = -\ln p(y_{\le n}|m).
\]
Choose U(m), a distribution over the experts. Define the weighted regret:
\[
R_n(U) = \sum_{m=1}^N U(m) \big( -\ln p(y_{\le n}) + \ln p(y_{\le n}|m) \big).
\]
By Bayes' theorem,
\[
p(y_{\le n}|m) = \frac{p(y_{\le n}) \, p(m|y_{\le n})}{p(m)},
\]
and by substitution,
\begin{align*}
R_n(U) &= \sum_{m=1}^N U(m) \big( -\ln p(y_{\le n}) + \ln p(y_{\le n}) - \ln p(m) + \ln p(m|y_{\le n}) \big) \\
&= \sum_{m=1}^N \Big( U(m) \ln \frac{U(m)}{p(m)} - U(m) \ln \frac{U(m)}{p(m|y_{\le n})} \Big) \\
&= \mathrm{KL}(U \| p) - \mathrm{KL}(U \| p(m = \cdot \,|y_{\le n})) \\
&\le \mathrm{KL}(U \| p),
\end{align*}
where in the last line we used that the KL divergences are nonnegative (cf. Exercise 9.1).
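The identity R_n(U) = KL(U\|p) - KL(U\|p(\cdot|y_{\le n})) can be verified numerically for any prior, per-expert sequence probabilities, and comparator U; a sketch (function names are made up):

```python
import math

def kl(u, p):
    """KL divergence sum_i u_i ln(u_i / p_i), with 0 ln 0 = 0."""
    return sum(ui * math.log(ui / pi) for ui, pi in zip(u, p) if ui > 0)

def weighted_regret_identity(prior, seq_probs, u):
    """Return both sides of R_n(U) = KL(U||prior) - KL(U||posterior).

    prior[m]     -- p(m), the prior over experts;
    seq_probs[m] -- p(y_{<=n} | m), expert m's sequence probability;
    u[m]         -- the comparator distribution U(m).
    """
    mix = sum(p0 * pm for p0, pm in zip(prior, seq_probs))  # p(y_{<=n})
    posterior = [p0 * pm / mix for p0, pm in zip(prior, seq_probs)]
    lhs = sum(um * (-math.log(mix) + math.log(pm))
              for um, pm in zip(u, seq_probs))
    rhs = kl(u, prior) - kl(u, posterior)
    return lhs, rhs
```

The two returned values agree up to floating-point error, for any choice of the three distributions.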
Now, on to the specialist framework.
A partition specialist \pi is given by a subset W \subset \{1, \dots, n\} of awake times and an expert index 1 \le m \le N. If the awake times of some partition specialists \pi_1, \dots, \pi_k form a partition of \{1, \dots, n\}, then together they represent a single meta-expert: in each round, predict as the expert of the unique specialist that is awake.
Let W_t be the set of specialists who are awake at time t. Put a prior p_0 over the specialists. Define
\[
\mathrm{Pred}_t(y_t) = \frac{\sum_{s \in W_t} p(y_t|y_{<t}, s) \, p(s|y_{<t})}{p(W_t|y_{<t})}.
\]
Let the extended specialist based on specialist s predict \mathrm{Pred}_t when s is sleeping at time t and predict like specialist s otherwise. Let \mathrm{Pred}(y_t|y_{<t}, s) denote the prediction of such an extended specialist, and define p' using these extended specialists. Then, one can show that
\[
\mathrm{Pred}_t(y_t) = p'(y_t|y_{<t}).
\]
For simplicity, we will use p in place of p'. Regret against (x, m):
\[
R_n(x, m) = \sum_{t:\, x_t = w} \big( -\ln \mathrm{Pred}_t(y_t) + \ln p(y_t|y_{<t}, m) \big)
= \sum_{t=1}^n \big( -\ln \mathrm{Pred}_t(y_t) + \ln \mathrm{Pred}(y_t|y_{<t}, (x, m)) \big).
\]
Weighted regret:
\[
\sum_{(x,m)} U((x, m)) R_n(x, m) \le \mathrm{KL}(U \| p_0).
\]
Going back to tracking a small but unknown number of experts. Time is partitioned into at most B segments. A meta-expert is defined by b-1 switch points with b \le B and expert indices i_1, \dots, i_b \in \{1, \dots, N\}, where i_j \ne i_{j+1} for 1 \le j \le b-1.
We consider only the case where i_1, \dots, i_b belong to a fixed subset E \subset \{1, \dots, N\} of cardinality at most M \le N.
Solving tracking by specialists!
Each meta-expert e can be equivalently represented by at most M partition specialists, whose set we denote by S_e: to each expert index in \{i_1, \dots, i_b\} we assign a specialist who will be awake exactly when the corresponding expert is used by the meta-expert e. Choose p_0 to be a distribution over the set \cup_e S_e.
The cumulative loss of this meta-expert is the same as the sum of the cumulative losses of the specialists defined for it:
\[
L_{n,e} = \sum_{s \in S_e} L_n(s).
\]
Define U_e(\cdot) to be the uniform distribution over S_e. The algorithm that uses the extended specialists will have regret
\[
\hat{L}_n - L_{n,e} = |S_e| \sum_{s \in S_e} U_e(s) R_n(s) \le |S_e| \, \mathrm{KL}(U_e \| p_0).
\]
Choose
\[
p_0((x, m)) = p_0(x) p_0(m) = p_0(x)/N.
\]
For simplicity assume |S_e| = M. Then,
\[
M \, \mathrm{KL}(U_e \| p_0) = M \ln \frac{N}{M} + \sum_{(x,m) \in S_e} \ln \frac{1}{p_0(x)}.
\]
It remains to select p_0 over the wake-sleep patterns so that the regret is small, yet we can calculate the updates efficiently. We will choose p_0 = p_0(x) to be a Markovian prior:
\[
p_0(x) = \theta(x_1) \theta(x_2|x_1) \cdots \theta(x_n|x_{n-1})
\]
(recall that x = (x_1, \dots, x_n) \in \{s, w\}^n). We have
\[
\sum_{(x,m) \in S_e} \ln \frac{1}{p_0(x)} = -\ln \prod_{(x,m) \in S_e} p_0(x).
\]
We need a lower bound on \prod_{(x,m) \in S_e} p_0(x). Reordering the terms in p_0, and exploiting that, due to the definition of S_e, exactly one specialist is awake at any time,
\[
\prod_{(x,m) \in S_e} p_0(x) = \theta(w) \theta(s)^{M-1} \theta(s|s)^{(M-1)(n-1)-(B-1)} \theta(w|w)^{(n-1)-(B-1)} \theta(w|s)^{B-1} \theta(s|w)^{B-1}.
\]
Now, optimizing \theta gives \theta(w) = 1/M, \theta(s|w) = \frac{B-1}{n-1} and \theta(w|s) = \frac{B-1}{(M-1)(n-1)}. Then,
\[
R_n \le M \ln \frac{N}{M} + \ln M + (M-1) \ln \frac{M}{M-1} + (n-1) \, h\Big(\frac{B-1}{n-1}\Big) + (M-1)(n-1) \, h\Big(\frac{B-1}{(n-1)(M-1)}\Big),
\]
where
\[
h(p) = -p \ln p - (1-p) \ln(1-p)
\]
is the binary entropy function. We can simplify this by using n h(k/n) \le k \ln(n/k) + k:
\[
R_n \le M \ln \frac{N}{M} + 2(B-1) \ln \frac{n-1}{B-1} + B \ln M + 2B.
\]
Compare this with (9.2): we lose essentially a factor of 2.
Implementation: let b \in \{w, s\} and define
\[
v_t(b, m) = p(x_t = b, m|y_{<t}) = \sum_{x:\, x_t = b} p((x, m)|y_{<t}),
\]
where
\[
v_1(b, m) = \frac{\theta(b)}{N}.
\]
Round t: receive p(y_t|y_{<t}, m) for each m. Predict:
\[
\mathrm{Pred}'_t(y_t) = \sum_m p(y_t|y_{<t}, m) \, v_t(m|w), \qquad \text{where} \quad v_t(m|w) = \frac{v_t(w, m)}{\sum_{m'} v_t(w, m')}.
\]
Let us show that v_t can be efficiently computed. A little calculation shows
\[
v_{t+1}(b, m) = \theta(b|w) \frac{p(y_t|y_{<t}, m)}{\mathrm{Pred}_t(y_t)} v_t(w, m) + \theta(b|s) v_t(b, m).
\]
One can show that \mathrm{Pred}_t = \mathrm{Pred}'_t holds for all t = 1, \dots, n. The proof of this hinges upon
\[
p((x, m)|y_{\le t}) =
\begin{cases}
\dfrac{p(y_t|y_{<t}, (x, m)) \, p((x, m)|y_{<t})}{\mathrm{Pred}_t(y_t)}, & \text{if } x_t = w; \\[1ex]
p((x, m)|y_{<t}), & \text{otherwise}.
\end{cases}
\]
Notice that the v_t update has complexity O(N) and storing these values takes memory O(N).
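The O(N) update can be sketched as follows; a minimal implementation assuming the Markovian prior with parameters \theta(w), \theta(s|w), \theta(w|s) as above (the class and parameter names are made up for illustration):

```python
class SleepingTracker:
    """O(N)-per-round tracking with the Markovian wake/sleep prior.

    v["w"][m], v["s"][m] hold v_t(w, m) and v_t(s, m), initialized to
    theta(b)/N and updated by the recursion from the notes.
    """

    def __init__(self, N, theta_w, theta_sw, theta_ws):
        self.N = N
        self.v = {"w": [theta_w / N] * N, "s": [(1 - theta_w) / N] * N}
        self.t_ww = 1 - theta_sw   # theta(w | w)
        self.t_sw = theta_sw       # theta(s | w)
        self.t_ws = theta_ws       # theta(w | s)
        self.t_ss = 1 - theta_ws   # theta(s | s)

    def predict(self, expert_probs):
        """Pred'_t: expert_probs[m] = p(y_t | y_{<t}, m) for observed y_t."""
        ww = sum(self.v["w"])
        return sum(p * vw for p, vw in zip(expert_probs, self.v["w"])) / ww

    def update(self, expert_probs):
        """Apply the v_{t+1} recursion; returns the round's prediction."""
        pred = self.predict(expert_probs)
        w_new, s_new = [], []
        for m in range(self.N):
            awake = expert_probs[m] / pred * self.v["w"][m]  # Bayes step
            asleep = self.v["s"][m]                          # unchanged
            w_new.append(self.t_ww * awake + self.t_ws * asleep)
            s_new.append(self.t_sw * awake + self.t_ss * asleep)
        self.v = {"w": w_new, "s": s_new}
        return pred
```

Each round costs O(N) time and O(N) memory, as claimed; the total mass \sum_{b,m} v_t(b,m) stays 1 across updates, and the awake weight drifts toward whichever expert currently predicts well.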