
9.3 Log-loss and Tracking

In Section 5.2 we considered the problem of assigning probabilities to symbols coming from a finite set $Y$ under the log-loss. The decision set is
$$D = \Big\{ p \in [0,1]^{Y} \,:\, \sum_{y \in Y} p(y) = 1 \Big\}.$$
In every round $t$, the environment picks a symbol $y_t \in Y$. The loss of prediction $p \in D$ is $\ell_t(p) = -\ln p(y_t)$. Hence,
$$L = \big\{ \ell^{(y)} \,:\, \ell^{(y)} : D \to [0,\infty),\ \ell^{(y)}(p) = -\ln p(y) \big\}.$$

Assume we have $N$ experts to help us with this prediction problem, each expert predicting some probability distribution $f_{t,i} \in D$ over $Y$ in every time step. The probability distribution $f_{t,i}$ chosen by expert $i$ at round $t$ is allowed to depend on the past symbols $y_1, \dots, y_{t-1}$. To emphasize this, we write $p_t(\cdot|y_{1:t-1}, i)$ instead of $f_{t,i}$. Thus, the experts are identified by these maps $(p_t(\cdot|\cdot, i))_{t,i}$. This notation helped us to make a connection to Bayesian prediction. As before, the goal is to compete with the best expert in hindsight: after $n$ rounds, the regret of a forecaster that predicts $p_t \in D$ in round $t$ is

$$R_n = \sum_{t=1}^{n} \ell_t(p_t) - \min_{1 \le i \le N} \sum_{t=1}^{n} \ell_t(f_{t,i}).$$

Select a probability mass function $p_0$ over $\{1, \dots, N\}$. We can think of $p_0$ as assigning the "prior" probability $p_0(i)$ to expert $i$. The maps $(p_t(\cdot|\cdot, i))_{t,i}$ together with $p_0$ define a probability distribution over $Y^n$, the space of sequences of length $n$:

$$P(x_1, \dots, x_n) = \sum_{i=1}^{N} p_0(i) \underbrace{\prod_{t=1}^{n} p_t(x_t|x_{1:t-1}, i)}_{P(x_1,\dots,x_n|i)}, \qquad x_1, \dots, x_n \in Y. \tag{9.1}$$

This is a mixture of the $N$ probability distributions $P(\cdot|1), \dots, P(\cdot|N)$. Drawing $X_1, \dots, X_n$ from this probability distribution $P$ can be thought of as first picking the index $I$ of an expert (according to $p_0$) and then drawing $X_1, X_2, \dots$ sequentially from $p_t(\cdot|X_{1:t-1}, I)$.

As derived before, according to Bayes' law, for any $x_1, \dots, x_t \in Y$,
$$P(x_t|x_{1:t-1}) = \frac{\sum_i P(x_t|x_{1:t-1}, i)\, P(x_{1:t-1}|i)\, p_0(i)}{\sum_i P(x_{1:t-1}|i)\, p_0(i)}.$$
We have also seen that the regret after $n$ rounds satisfies
$$R_n \le \ln N.$$
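In code, this Bayesian mixture forecaster is just a weighted average of the expert predictions followed by a multiplicative (Bayes) weight update. A minimal sketch, assuming the expert predictions are supplied as probability vectors (the function name and input format are illustrative):

```python
import numpy as np

def bayes_mixture_forecast(expert_preds, y, prior=None):
    """Bayesian mixture forecaster under log-loss.

    expert_preds[t][i] is expert i's probability vector over the alphabet at
    round t (it may depend on y[:t]); y[t] is the observed symbol index.
    Returns the total log-loss of the mixture and of each expert.
    """
    n, N = len(y), len(expert_preds[0])
    w = np.full(N, 1.0 / N) if prior is None else np.asarray(prior, dtype=float)
    loss_alg = 0.0
    loss_exp = np.zeros(N)
    for t in range(n):
        preds = np.array([expert_preds[t][i] for i in range(N)])  # N x |Y|
        mix = w @ preds                       # mixture prediction P(.|y_{<t})
        loss_alg += -np.log(mix[y[t]])
        loss_exp += -np.log(preds[:, y[t]])
        w = w * preds[:, y[t]]                # Bayes update: multiply by likelihood
        w /= w.sum()
    return loss_alg, loss_exp
```

By the bound above, for any outcome sequence the regret `loss_alg - loss_exp.min()` of this sketch is at most $\ln N$ when the prior is uniform.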

For exp-concave losses, we would get the same. If we consider the meta-experts of the previous section, this gives
$$R_n = O(B \ln(n) + B \ln(N)).$$


Hence, now the regret grows with the number of rounds, as opposed to the $\ln N$ bound above, which is independent of $n$.

Now assume that the meta-experts only use at most $M$ base experts out of the $N$ experts, where possibly $M \ll N$:
$$E_{n,B} = \Big\{ f : \{1,\dots,n\} \to \{1,\dots,N\} \,:\, \sum_{t=1}^{n-1} \mathbb{I}\{f(t) \ne f(t+1)\} \le B-1,\ |\{f(1), \dots, f(n)\}| \le M \Big\}.$$
We do not know which of the experts are to be used, but we assume that only a few experts are needed to get a good loss. This is a form of sparsity bias. The number of meta-experts in this case is at most

$$M_0(n, N, B, M) = \binom{N}{M} M(n, M, B),$$
which can be much smaller than $M(n, N, B)$. Taking the logarithm,
$$\ln M_0 \approx M \ln N + B(\ln M + \ln n).$$
If we were able to implement EWA, we would get
$$\mathbb{E}[R_n] \le \ln \binom{N}{M} + \ln \binom{n-1}{B-1} + B \ln M. \tag{9.2}$$
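As a rough numerical illustration (the problem sizes below are hypothetical), we can compare the approximation $\ln M_0 \approx M \ln N + B(\ln M + \ln n)$ with the same rough approximation applied to $M(n, N, B)$, i.e. $B(\ln N + \ln n)$:

```python
from math import log

# Hypothetical problem sizes, for illustration only.
N, M, B, n = 10**6, 5, 50, 10**5

# ln M_0(n, N, B, M) ~ M ln N + B (ln M + ln n)   (sparse meta-experts)
sparse = M * log(N) + B * (log(M) + log(n))
# ln M(n, N, B)      ~ B (ln N + ln n)            (meta-experts over all N experts)
dense = B * (log(N) + log(n))

print(f"approx ln M_0        : {sparse:.1f}")   # roughly 725
print(f"approx ln M(n, N, B) : {dense:.1f}")    # roughly 1266
```

With these sizes, the sparsity assumption roughly halves the leading term of the regret bound.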

9.3.1 Specialist Framework

A partition specialist $\pi$ is given by a subset $W \subset \{1, \dots, n\}$ of awake times and an expert index $1 \le m \le N$. If the awake times of some partition specialists $\pi_1, \dots, \pi_k$ form a partition of $\{1, \dots, n\}$, then together they represent a meta-expert: at every time $t$, the meta-expert predicts as the unique specialist that is awake at $t$.

One meta-expert can be represented by at most M specialists.

If a specialist is awake then it predicts as the underlying expert; otherwise it predicts as the algorithm.

The regret against specialist $(W, m)$ is
$$R_n(W, m) = \sum_{t \in W} \big( -\ln P(y_t|y_{<t}) + \ln P(y_t|y_{<t}, m) \big).$$
Then,
$$R_n(W, m) = -\ln P(y_{\le n}) + \ln P(y_{\le n}|m, W).$$
Take any distribution $U$ over the set of specialists. Then,
$$R_n(U) := \sum_{(W,m)} U(W, m)\, R_n(W, m) = \sum_{(W,m)} U(W, m) \big( -\ln P(y_{\le n}) + \ln P(y_{\le n}|m, W) \big).$$
$U$ will be uniform over the $M$ specialists representing a single meta-expert.

For two distributions $U, P$ over the same finite domain, define the divergence to be $D(U, P) = \sum_i U(i) \ln \frac{U(i)}{P(i)}$. Let $P$ be the initial distribution over the set of specialists and suppose we predict using Bayes' rule. Applying Bayes' rule, we get
$$R_n(U) = D(U, P) - D(U, P(\cdot|y_{\le n})).$$


But can we calculate the Bayes update? By our convention that the specialists predict as the algorithm when they are asleep,

$$P(y_t|y_{<t}) = \sum_{(W,m):\, t \in W} P(y_t|y_{<t}, m)\, P((W, m)|y_{<t}) + \sum_{(W,m):\, t \notin W} P(y_t|y_{<t})\, P((W, m)|y_{<t}).$$
Solving this equation for $P(y_t|y_{<t})$ we get
$$P(y_t|y_{<t}) = \frac{\sum_{(W,m):\, t \in W} P(y_t|y_{<t}, m)\, P((W, m)|y_{<t})}{1 - \sum_{(W,m):\, t \notin W} P((W, m)|y_{<t})} = \frac{\sum_{(W,m):\, t \in W} P(y_t|y_{<t}, m)\, P((W, m)|y_{<t})}{\sum_{(W,m):\, t \in W} P((W, m)|y_{<t})}.$$
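When the specialists can be enumerated explicitly, this prediction rule can be implemented directly: predict with the mixture of the awake specialists renormalized over their total weight, and, since asleep specialists predict as the algorithm, the Bayes update leaves their posterior weight unchanged. A minimal sketch, assuming a hypothetical interface in which each specialist returns either a probability vector or None when asleep:

```python
import numpy as np

def specialist_forecaster(specialists, y, prior=None):
    """Bayesian mixture over specialists under log-loss.

    specialists[s](t, y_past) returns either None (asleep at time t) or a
    numpy probability vector over the alphabet; y[t] is the observed symbol.
    """
    K = len(specialists)
    w = np.full(K, 1.0 / K) if prior is None else np.asarray(prior, dtype=float)
    total_loss = 0.0
    for t, yt in enumerate(y):
        preds = [spec(t, y[:t]) for spec in specialists]
        awake = [s for s, p in enumerate(preds) if p is not None]
        # Renormalized prediction: mixture of the awake specialists only.
        pred = sum(w[s] * preds[s] for s in awake) / sum(w[s] for s in awake)
        total_loss += -np.log(pred[yt])
        # Bayes update: awake specialists are reweighted by their likelihood
        # ratio; asleep specialists (predicting as the algorithm) keep their weight.
        for s in awake:
            w[s] *= preds[s][yt] / pred[yt]
    return total_loss
```

For the tracking application the number of specialists grows with the number of wake-sleep patterns, so this direct form is only conceptual; the efficient implementation at the end of the section collapses it to $O(N)$ weights.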

================== Lecture on Oct 28: ==============

Prediction with log-loss, $N$ experts; the prediction of the $m$th expert is $p(y_t|y_{<t}, m)$. The losses are
$$\hat{L}_n = -\ln p(y_{\le n}), \qquad L_{n,m} = -\ln p(y_{\le n}|m).$$

Choose $U(m)$, a distribution over the experts. Define the weighted regret
$$R_n(U) = \sum_{m=1}^{N} U(m) \big( -\ln p(y_{\le n}) + \ln p(y_{\le n}|m) \big).$$
By Bayes' theorem,
$$p(y_{\le n}|m) = \frac{p(y_{\le n})\, p(m|y_{\le n})}{p(m)},$$
and by substitution,
$$\begin{aligned}
R_n(U) &= \sum_{m=1}^{N} U(m) \big( -\ln p(y_{\le n}) + \ln p(y_{\le n}) - \ln p(m) + \ln p(m|y_{\le n}) \big)\\
&= \sum_{m=1}^{N} \left( U(m) \ln \frac{U(m)}{p(m)} - U(m) \ln \frac{U(m)}{p(m|y_{\le n})} \right)\\
&= \mathrm{KL}(U \,\|\, p) - \mathrm{KL}(U \,\|\, p(m = \cdot\,|y_{\le n}))\\
&\le \mathrm{KL}(U \,\|\, p)\,,
\end{aligned}$$
where in the last line we used that KL divergences are nonnegative (cf. Exercise 9.1).

Now, on to the specialist framework.

A partition specialist $\pi$ is given by a subset $W \subset \{1, \dots, n\}$ of awake times and an expert index $1 \le m \le N$. If the awake times of some partition specialists $\pi_1, \dots, \pi_k$ form a partition of $\{1, \dots, n\}$, then together they represent a meta-expert: at every time $t$, the meta-expert predicts as the unique specialist that is awake at $t$.

Let $W_t$ be the set of specialists who are awake at time $t$. Put a prior $p_0$ over the specialists. Define
$$\mathrm{Pred}_t(y_t) = \frac{\sum_{s \in W_t} p(y_t|y_{<t}, s)\, p(s|y_{<t})}{p(W_t|y_{<t})}.$$


Let the extended specialist based on specialist $s$ predict $\mathrm{Pred}_t$ when $s$ is sleeping at time $t$ and predict like specialist $s$ otherwise. Let $\mathrm{Pred}(y_t|y_{<t}, s)$ denote the prediction of such an extended specialist. Define the joint distribution $p'$ from $p_0$ and these extended specialists, as in (9.1). Then, one can show that
$$\mathrm{Pred}_t(y_t) = p'(y_t|y_{<t}).$$
For simplicity, we will use $p$ in place of $p'$. From now on we encode the awake set $W$ as a wake-sleep pattern $x = (x_1, \dots, x_n) \in \{s, w\}^n$, where $x_t = w$ means that the specialist is awake at time $t$. The regret against $(x, m)$ is

$$R_n(x, m) = \sum_{t:\, x_t = w} \big( -\ln \mathrm{Pred}_t(y_t) + \ln p(y_t|y_{<t}, m) \big) = \sum_{t=1}^{n} \big( -\ln \mathrm{Pred}_t(y_t) + \ln \mathrm{Pred}(y_t|y_{<t}, (x, m)) \big).$$
The weighted regret satisfies

$$\sum_{(x,m)} U((x, m))\, R_n(x, m) \le \mathrm{KL}(U \,\|\, p_0).$$

Going back to tracking a small, but unknown number of experts: time is partitioned into at most $B$ segments. A meta-expert is defined by $b-1$ switch points with $b \le B$ and expert indices $i_1, \dots, i_b \in \{1, \dots, N\}$, where $i_j \ne i_{j+1}$ for $1 \le j \le b-1$.

We consider only the case where $i_1, \dots, i_b$ belong to a fixed subset $E \subset \{1, \dots, N\}$ of cardinality at most $M \le N$.

Solving tracking by specialists!

Each meta-expert $e$ can be equivalently represented by at most $M$ partition specialists, whose set we denote by $S_e$: to each expert index in $\{i_1, \dots, i_b\}$ we define a specialist who will be awake exactly when the corresponding expert is used by the meta-expert $e$. Choose $p_0$ to be a distribution over the set $\cup_e S_e$.

The cumulative loss of this meta-expert is the same as the sum of the cumulative losses of the specialists defined for it: $L_{n,e} = \sum_{s \in S_e} L_n(s)$.

Define $U_e(\cdot)$ to be the uniform distribution over $S_e$. The algorithm that uses the extended specialists will have a regret
$$\hat{L}_n - L_{n,e} = |S_e| \sum_{s \in S_e} U_e(s)\, R_n(s) \le |S_e|\, \mathrm{KL}(U_e \,\|\, p_0).$$
Choose
$$p_0((x, m)) = p_0(x)\, p_0(m) = p_0(x)/N.$$
For simplicity assume $|S_e| = M$. Then,
$$M\, \mathrm{KL}(U_e \,\|\, p_0) = M \ln \frac{N}{M} + \sum_{(x,m) \in S_e} \ln \frac{1}{p_0(x)}.$$

It remains to select $p_0$ over the wake-sleep patterns so that the regret is small, yet we can calculate the updates efficiently. We will choose $p_0 = p_0(x)$ to be a Markovian prior:
$$p_0(x) = \theta(x_1)\, \theta(x_2|x_1) \cdots \theta(x_n|x_{n-1})$$
(recall that $x = (x_1, \dots, x_n) \in \{s, w\}^n$). We have
$$\sum_{(x,m) \in S_e} \ln \frac{1}{p_0(x)} = -\ln \prod_{(x,m) \in S_e} p_0(x).$$
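For concreteness, the prior probability that such a Markovian $p_0$ assigns to a single wake-sleep pattern can be computed directly from the transition parameters. A small sketch (the parameter names are illustrative; they correspond to $\theta(w)$, $\theta(s|w)$ and $\theta(w|s)$):

```python
from math import log

def markov_prior_logprob(x, theta_w, theta_sw, theta_ws):
    """ln p_0(x) for a wake-sleep pattern x (a string or list over {'w','s'})
    under the Markovian prior with P(x_1 = w) = theta_w and transition
    probabilities P(s|w) = theta_sw, P(w|s) = theta_ws."""
    init = {'w': theta_w, 's': 1.0 - theta_w}
    trans = {('w', 's'): theta_sw, ('w', 'w'): 1.0 - theta_sw,
             ('s', 'w'): theta_ws, ('s', 's'): 1.0 - theta_ws}
    logp = log(init[x[0]])
    for prev, cur in zip(x, x[1:]):
        logp += log(trans[(prev, cur)])
    return logp
```

Summing $-\ln p_0(x)$ over the $M$ patterns of an $S_e$ reproduces the product formula derived next.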

We need a lower bound on $\prod_{(x,m) \in S_e} p_0(x)$. Reordering the terms in $p_0$, and exploiting that, by the definition of $S_e$, exactly one specialist is awake at any time (so across the $M$ patterns there are $n-1$ transitions leaving the awake state, $B-1$ of them $w \to s$, and $(M-1)(n-1)$ transitions leaving the asleep state, $B-1$ of them $s \to w$),
$$\prod_{(x,m) \in S_e} p_0(x) = \theta(w)\, \theta(s)^{M-1}\, \theta(s|s)^{(M-1)(n-1)-(B-1)}\, \theta(w|w)^{(n-1)-(B-1)}\, \theta(w|s)^{B-1}\, \theta(s|w)^{B-1}.$$

Now, optimizing $\theta$ gives $\theta(w) = 1/M$, $\theta(s|w) = \frac{B-1}{n-1}$ and $\theta(w|s) = \frac{B-1}{(n-1)(M-1)}$.

Then,
$$R_n \le M \ln \frac{N}{M} + M\, h\!\left(\frac{1}{M}\right) + (n-1)\, h\!\left(\frac{B-1}{n-1}\right) + (M-1)(n-1)\, h\!\left(\frac{B-1}{(n-1)(M-1)}\right),$$
where
$$h(p) = -p \ln p - (1-p) \ln(1-p)$$

is the binary entropy function. We can simplify this by using $n\, h(k/n) \le k \ln(n/k) + k$:
$$R_n \le M \ln \frac{N}{M} + 2(B-1) \ln \frac{n-1}{B-1} + B \ln M + 2B.$$
Compare this with (9.2): we essentially lose a factor of 2.

Implementation: let $b \in \{w, s\}$ and define
$$v_t(b, m) = p(b, m|y_{<t}) = \sum_{x:\, x_t = b} p((x, m)|y_{<t}), \qquad v_1(b, m) = \frac{\theta(b)}{N}.$$
Round $t$: receive $p(y_t|y_{<t}, m)$ for each $m$. Predict
$$\mathrm{Pred}'_t(y_t) = \sum_{m} p(y_t|y_{<t}, m)\, v_t(m|w), \qquad \text{where} \qquad v_t(m|w) = \frac{v_t(w, m)}{\sum_{m'} v_t(w, m')}.$$

Let us show that $v_t$ can be efficiently computed. A little calculation shows
$$v_{t+1}(b, m) = \theta(b|w)\, \frac{p(y_t|y_{<t}, m)}{\mathrm{Pred}_t(y_t)}\, v_t(w, m) + \theta(b|s)\, v_t(s, m).$$

One can show that $\mathrm{Pred}_t = \mathrm{Pred}'_t$ holds for all $t = 1, \dots, n$. The proof of this hinges upon the fact that
$$p((x, m)|y_{\le t}) = \begin{cases} \dfrac{p(y_t|y_{<t}, (x, m))\, p((x, m)|y_{<t})}{\mathrm{Pred}_t(y_t)}, & \text{if } x_t = w;\\[4pt] p((x, m)|y_{<t}), & \text{otherwise.} \end{cases}$$

Notice that the $v_t$ update has complexity $O(N)$ and storing these values takes $O(N)$ memory.
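Putting the pieces together, the tracking forecaster can be run using only the collapsed weights $v_t(b, m)$. A minimal sketch (the input format and function name are illustrative; `theta_w`, `theta_sw`, `theta_ws` correspond to $\theta(w)$, $\theta(s|w)$, $\theta(w|s)$ and can be set to the optimized values above):

```python
import numpy as np

def track_sparse_experts(expert_preds, y, theta_w, theta_sw, theta_ws):
    """Tracking forecaster under log-loss using the collapsed weights v_t(b, m).

    expert_preds[t][m] is expert m's probability vector over the alphabet at
    round t (it may depend on y[:t]); y[t] is the observed symbol index.
    Row 0 of v holds v_t(w, m), row 1 holds v_t(s, m).
    """
    n, N = len(y), len(expert_preds[0])
    v = np.empty((2, N))
    v[0] = theta_w / N          # v_1(w, m) = theta(w) / N
    v[1] = (1.0 - theta_w) / N  # v_1(s, m) = theta(s) / N
    total_loss = 0.0
    for t in range(n):
        preds = np.array([expert_preds[t][m] for m in range(N)])  # N x |Y|
        # Predict with the awake weights, renormalized:
        # Pred_t(y) = sum_m p(y|y_{<t}, m) v_t(m|w).
        pred = (v[0] / v[0].sum()) @ preds
        yt = y[t]
        total_loss += -np.log(pred[yt])
        # Collapsed Bayes + Markov transition update:
        # v_{t+1}(b, m) = theta(b|w) [p(y_t|m)/Pred_t(y_t)] v_t(w, m) + theta(b|s) v_t(s, m).
        awake_post = (preds[:, yt] / pred[yt]) * v[0]
        asleep_post = v[1]
        v = np.vstack([
            (1.0 - theta_sw) * awake_post + theta_ws * asleep_post,   # b = w
            theta_sw * awake_post + (1.0 - theta_ws) * asleep_post,   # b = s
        ])
    return total_loss
```

Each round touches every expert once, which gives the $O(N)$ time and memory noted above.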


9.4 Bibliographic Remarks
