DRAFT
9.3 Log-loss and Tracking
In Section 5.2 we considered the problem of assigning probabilities to symbols coming from a finite set Y under the log-loss. The decision set is
\[
D = \Big\{ p \in [0,1]^Y : \sum_{y \in Y} p(y) = 1 \Big\}.
\]
In every round t, the environment picks a symbol y_t \in Y. The loss of prediction p \in D is \ell_t(p) = -\ln p(y_t). Hence,
\[
\mathcal{L} = \big\{ \ell^{(y)} \,:\, \ell^{(y)} \colon D \to [0,\infty),\ \ell^{(y)}(p) = -\ln p(y) \big\}.
\]
Assume we have N experts to help us with this prediction problem, each expert predicting some probability distribution f_{t,i} \in D over Y in every time step. The probability distribution f_{t,i} chosen by expert i at round t is allowed to depend on the past symbols y_1, \dots, y_{t-1}. To emphasize this, we write p_t(\cdot|y_{1:t-1}, i) instead of f_{t,i}. Thus, the experts are identified by these maps (p_t(\cdot|\cdot, i))_{t,i}. This notation helped us to make a connection to Bayesian prediction. As before, the goal is to compete with the best expert in hindsight: after n rounds, the regret of a forecaster that predicts p_t \in D in round t is
\[
R_n = \sum_{t=1}^n \ell_t(p_t) - \min_{1 \le i \le N} \sum_{t=1}^n \ell_t(f_{t,i}).
\]
Select a probability mass function p0 over {1, . . . , N}. We can think of p0 as assigning the “prior” probability p0(i) to expert i. The maps (pt(·|·, i))t,i together with p0 define a probability distribution over Yn, the space of sequences of length n:
\[
P(x_1, \dots, x_n) = \sum_{i=1}^N p_0(i) \underbrace{\prod_{t=1}^n p(x_t|x_{1:t-1}, i)}_{P(x_1, \dots, x_n|i)}, \qquad x_1, \dots, x_n \in Y. \tag{9.1}
\]
This is a mixture of the N probability distributions P(\cdot|1), \dots, P(\cdot|N). Drawing X_1, \dots, X_n from this probability distribution P can be thought of as first picking the index I of an expert and then drawing X_1, X_2, \dots sequentially from p_t(\cdot|X_{1:t-1}, I).
As derived before, according to Bayes' law, for any x_1, \dots, x_t \in Y,
\[
P(x_t|x_{1:t-1}) = \frac{\sum_i P(x_t|x_{1:t-1}, i) P(x_{1:t-1}|i) p_0(i)}{\sum_i P(x_{1:t-1}|i) p_0(i)}.
\]
We have also seen that the regret after n rounds satisfies
\[
R_n \le \ln N.
\]
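The mixture forecaster and its \ln N regret bound are easy to check numerically. The following is a minimal sketch, assuming constant experts on a binary alphabet and an i.i.d. Bernoulli environment; the experts, the environment, and the function name are illustrative choices, not from the notes:

```python
import math
import random

def bayes_mixture_regret(n=200, N=5, seed=0):
    """Bayesian mixture over N constant experts under log-loss.

    Each expert i always assigns probability q_i to the symbol 1;
    the forecaster predicts with the posterior mixture.  Returns the
    regret against the best expert, which is always in [0, ln N].
    """
    rng = random.Random(seed)
    q = [(i + 1) / (N + 1) for i in range(N)]  # expert i's fixed prediction
    w = [1.0 / N] * N                          # prior p0(i), becomes posterior
    loss_alg = 0.0
    loss_exp = [0.0] * N
    for _ in range(n):
        y = rng.random() < 0.7                 # environment: Bernoulli(0.7)
        # Mixture prediction P(y_t | y_{<t}) = sum_i w_i p_i(y_t)
        p_y = sum(w[i] * (q[i] if y else 1 - q[i]) for i in range(N))
        loss_alg += -math.log(p_y)
        for i in range(N):
            p_i = q[i] if y else 1 - q[i]
            loss_exp[i] += -math.log(p_i)
            w[i] *= p_i                        # unnormalized Bayes update
        s = sum(w)
        w = [wi / s for wi in w]               # normalize to the posterior
    return loss_alg - min(loss_exp)
```

Whatever the sequence, the returned regret stays below \ln N, matching the bound above.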
For exp-concave losses, we would get the same bound. If we consider the meta-experts of the previous section, this gives
\[
R_n = O(B \ln(n) + B \ln(N)).
\]
Hence, the regret now grows with the number of rounds, as opposed to the \ln N bound above, which is independent of n.
Now assume that the meta-experts only use at most M base experts out of the N experts, where possibly M \ll N:
\[
E_{n,B} = \Big\{ f \colon \{1,\dots,n\} \to \{1,\dots,N\} : \sum_{t=1}^{n-1} \mathbb{I}\{f(t) \ne f(t+1)\} \le B-1,\ |\{f(1), \dots, f(n)\}| \le M \Big\}.
\]
We do not know which of the experts are to be used, but we assume that only a few experts are needed to get a good loss. This is a form of sparsity bias. The number of meta-experts in this case is at most
\[
M'(n, N, B, M) = \binom{N}{M} M(n, M, B),
\]
which can be much smaller than M(n, N, B). Taking the log,
\[
\ln M' \approx M \ln N + B(\ln M + \ln n).
\]
If we were able to implement EWA, we would get
\[
\mathbb{E}[R_n] \le \ln \binom{N}{M} + \ln \binom{n-1}{B-1} + B \ln M. \tag{9.2}
\]
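The gain from sparsity can be inspected numerically. A small sketch, where we take the comparable bound without the sparsity restriction to be \ln\binom{n-1}{B-1} + B \ln N (an assumption based on counting roughly N choices per segment; the function names are made up):

```python
import math

def sparse_tracking_bound(n, N, B, M):
    """Right-hand side of the regret bound (9.2):
    ln C(N, M) + ln C(n-1, B-1) + B ln M."""
    return (math.log(math.comb(N, M))
            + math.log(math.comb(n - 1, B - 1))
            + B * math.log(M))

def dense_tracking_bound(n, N, B):
    """Comparable bound when any of the N experts may be used in each
    of the B segments: ln C(n-1, B-1) + B ln N (illustrative)."""
    return math.log(math.comb(n - 1, B - 1)) + B * math.log(N)
```

For example, with n = 1000, N = 10000, B = 10, M = 3, the sparse bound replaces the B \ln N \approx 92 term by \ln\binom{N}{M} + B \ln M \approx 37, a substantial saving when M \ll N.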
9.3.1 Specialist Framework
A partition specialist \pi is given by a subset W \subset \{1, \dots, n\} of awake times and an expert index 1 \le m \le N. If the awake times of some partition specialists \pi_1, \dots, \pi_k form a partition of \{1, \dots, n\}, then together they represent a single meta-expert: in each round, predict as the expert of the unique specialist that is awake.
One meta-expert can be represented by at most M specialists. If a specialist is awake, then it predicts as the underlying expert; otherwise it predicts as the algorithm.
The regret against specialist (W, m) is
\[
R_n(W, m) = \sum_{t \in W} \big( -\ln P(y_t|y_{<t}) + \ln P(y_t|y_{<t}, m) \big).
\]
Then,
\[
R_n(W, m) = -\ln P(y_{\le n}) + \ln P(y_{\le n}|m, W).
\]
Take any distribution U over the set of specialists. Then,
\[
R_n(U) \doteq \sum_{(W,m)} U(W, m) R_n(W, m) = \sum_{(W,m)} U(W, m) \big( -\ln P(y_{\le n}) + \ln P(y_{\le n}|m, W) \big).
\]
U will be uniform over the M specialists representing a single meta-expert.
For two distributions U, P over the same finite domain, define the divergence to be
\[
D(U, P) = \sum_i U(i) \ln \frac{U(i)}{P(i)}.
\]
Let P be the initial distribution over the set of specialists, and predict using Bayes' rule. Applying Bayes' rule, we get
\[
R_n(U) = D(U, P) - D(U, P(\cdot|y_{\le n})).
\]
But can we calculate the Bayes update? By our convention that the specialists predict as the algorithm when they are asleep,
\[
P(y_t|y_{<t}) = \sum_{(W,m):\, t \in W} P(y_t|y_{<t}, m) P((W, m)|y_{<t}) + \sum_{(W,m):\, t \notin W} P(y_t|y_{<t}) P((W, m)|y_{<t}).
\]
Solving this equation for P(y_t|y_{<t}) we get
\[
P(y_t|y_{<t}) = \frac{\sum_{(W,m):\, t \in W} P(y_t|y_{<t}, m) P((W, m)|y_{<t})}{1 - \sum_{(W,m):\, t \notin W} P((W, m)|y_{<t})}
= \frac{\sum_{(W,m):\, t \in W} P(y_t|y_{<t}, m) P((W, m)|y_{<t})}{\sum_{(W,m):\, t \in W} P((W, m)|y_{<t})}.
\]
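The last expression says that the prediction is the posterior mixture of the awake specialists, renormalized over the awake set. A minimal sketch of this computation (variable names are illustrative):

```python
def specialist_prediction(preds, posterior, awake):
    """Prediction P(y_t | y_{<t}) from awake specialists only.

    preds[s]     -- probability specialist s assigns to the observed
                    symbol (only used for awake specialists);
    posterior[s] -- current posterior weight P(s | y_{<t});
    awake        -- indices of specialists awake at time t.
    """
    num = sum(posterior[s] * preds[s] for s in awake)
    den = sum(posterior[s] for s in awake)  # = P(awake set | y_{<t})
    return num / den
```

In particular, when only one specialist is awake, the algorithm simply predicts as that specialist, as the convention requires.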
================== Lecture on Oct 28: ==============
Prediction with log-loss, N experts; predictions of the mth expert: p(y_t|y_{<t}, m). The losses are
\[
\hat{L}_n = -\ln p(y_{\le n}), \qquad L_{n,m} = -\ln p(y_{\le n}|m).
\]
Choose U(m), a distribution over the experts. Define the weighted regret:
\[
R_n(U) = \sum_{m=1}^N U(m) \big( -\ln p(y_{\le n}) + \ln p(y_{\le n}|m) \big).
\]
By Bayes' theorem,
\[
p(y_{\le n}|m) = \frac{p(y_{\le n}) \, p(m|y_{\le n})}{p(m)},
\]
and by substitution,
\begin{align*}
R_n(U) &= \sum_{m=1}^N U(m) \big( -\ln p(y_{\le n}) + \ln p(y_{\le n}) - \ln p(m) + \ln p(m|y_{\le n}) \big) \\
&= \sum_{m=1}^N \Big( U(m) \ln \frac{U(m)}{p(m)} - U(m) \ln \frac{U(m)}{p(m|y_{\le n})} \Big) \\
&= \mathrm{KL}(U \| p) - \mathrm{KL}(U \| p(m = \cdot \,|y_{\le n})) \\
&\le \mathrm{KL}(U \| p),
\end{align*}
where in the last line we used that the KL divergences are nonnegative (cf. Exercise 9.1).
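The identity R_n(U) = KL(U\|p) - KL(U\|p(\cdot|y_{\le n})) can be verified numerically for any prior, per-expert sequence probabilities, and comparator U; a sketch (function names are made up):

```python
import math

def kl(u, p):
    """KL divergence sum_i u_i ln(u_i / p_i), with 0 ln 0 = 0."""
    return sum(ui * math.log(ui / pi) for ui, pi in zip(u, p) if ui > 0)

def weighted_regret_identity(prior, seq_probs, u):
    """Return both sides of R_n(U) = KL(U||prior) - KL(U||posterior).

    prior[m]     -- p(m), the prior over experts;
    seq_probs[m] -- p(y_{<=n} | m), expert m's sequence probability;
    u[m]         -- the comparator distribution U(m).
    """
    mix = sum(p0 * pm for p0, pm in zip(prior, seq_probs))  # p(y_{<=n})
    posterior = [p0 * pm / mix for p0, pm in zip(prior, seq_probs)]
    lhs = sum(um * (-math.log(mix) + math.log(pm))
              for um, pm in zip(u, seq_probs))
    rhs = kl(u, prior) - kl(u, posterior)
    return lhs, rhs
```

The two returned values agree up to floating-point error, for any choice of the three distributions.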
Now, on to the specialist framework.
A partition specialist \pi is given by a subset W \subset \{1, \dots, n\} of awake times and an expert index 1 \le m \le N. If the awake times of some partition specialists \pi_1, \dots, \pi_k form a partition of \{1, \dots, n\}, then together they represent a single meta-expert: in each round, predict as the expert of the unique specialist that is awake.
Let W_t be the set of specialists who are awake at time t. Put a prior p_0 over the specialists. Define
\[
\mathrm{Pred}_t(y_t) = \frac{\sum_{s \in W_t} p(y_t|y_{<t}, s) \, p(s|y_{<t})}{p(W_t|y_{<t})}.
\]
Let the extended specialist based on specialist s predict \mathrm{Pred}_t when s is sleeping at time t and predict like specialist s otherwise. Let \mathrm{Pred}(y_t|y_{<t}, s) denote the prediction of such an extended specialist, and define p' using these extended specialists. Then, one can show that
\[
\mathrm{Pred}_t(y_t) = p'(y_t|y_{<t}).
\]
For simplicity, we will use p in place of p'. Regret against (x, m):
\[
R_n(x, m) = \sum_{t:\, x_t = w} \big( -\ln \mathrm{Pred}_t(y_t) + \ln p(y_t|y_{<t}, m) \big)
= \sum_{t=1}^n \big( -\ln \mathrm{Pred}_t(y_t) + \ln \mathrm{Pred}(y_t|y_{<t}, (x, m)) \big).
\]
Weighted regret:
\[
\sum_{(x,m)} U((x, m)) R_n(x, m) \le \mathrm{KL}(U \| p_0).
\]
Going back to tracking a small but unknown number of experts. Time is partitioned into at most B segments. A meta-expert is defined by b-1 switch points with b \le B and expert indices i_1, \dots, i_b \in \{1, \dots, N\}, where i_j \ne i_{j+1} for 1 \le j \le b-1.
We consider only the case where i_1, \dots, i_b belong to a fixed subset E \subset \{1, \dots, N\} of cardinality at most M \le N.
Solving tracking by specialists!
Each meta-expert e can be equivalently represented by at most M partition specialists, whose set we denote by S_e: to each expert index in \{i_1, \dots, i_b\} we assign a specialist who will be awake exactly when the corresponding expert is used by the meta-expert e. Choose p_0 to be a distribution over the set \cup_e S_e.
The cumulative loss of this meta-expert is the same as the sum of the cumulative losses of the specialists defined for it:
\[
L_{n,e} = \sum_{s \in S_e} L_n(s).
\]
Define U_e(\cdot) to be the uniform distribution over S_e. The algorithm that uses the extended specialists will have regret
\[
\hat{L}_n - L_{n,e} = |S_e| \sum_{s \in S_e} U_e(s) R_n(s) \le |S_e| \, \mathrm{KL}(U_e \| p_0).
\]
Choose
\[
p_0((x, m)) = p_0(x) p_0(m) = p_0(x)/N.
\]
For simplicity assume |S_e| = M. Then,
\[
M \, \mathrm{KL}(U_e \| p_0) = M \ln \frac{N}{M} + \sum_{(x,m) \in S_e} \ln \frac{1}{p_0(x)}.
\]
It remains to select p_0 over the wake-sleep patterns so that the regret is small, yet we can calculate the updates efficiently. We will choose p_0 = p_0(x) to be a Markovian prior:
\[
p_0(x) = \theta(x_1) \theta(x_2|x_1) \cdots \theta(x_n|x_{n-1})
\]
(recall that x = (x_1, \dots, x_n) \in \{s, w\}^n). We have
\[
\sum_{(x,m) \in S_e} \ln \frac{1}{p_0(x)} = -\ln \prod_{(x,m) \in S_e} p_0(x).
\]
We need a lower bound on \prod_{(x,m) \in S_e} p_0(x). Reordering the terms in p_0, and exploiting that, due to the definition of S_e, exactly one specialist is awake at any time,
\[
\prod_{(x,m) \in S_e} p_0(x) = \theta(w) \theta(s)^{M-1} \theta(s|s)^{(M-1)(n-1)-(B-1)} \theta(w|w)^{(n-1)-(B-1)} \theta(w|s)^{B-1} \theta(s|w)^{B-1}.
\]
Now, optimizing \theta gives \theta(w) = 1/M, \theta(s|w) = \frac{B-1}{n-1} and \theta(w|s) = \frac{B-1}{(M-1)(n-1)}. Then,
\[
R_n \le M \ln \frac{N}{M} + \ln M + (M-1) \ln \frac{M}{M-1} + (n-1) \, h\Big(\frac{B-1}{n-1}\Big) + (M-1)(n-1) \, h\Big(\frac{B-1}{(n-1)(M-1)}\Big),
\]
where
\[
h(p) = -p \ln p - (1-p) \ln(1-p)
\]
is the binary entropy function. We can simplify this by using n h(k/n) \le k \ln(n/k) + k:
\[
R_n \le M \ln \frac{N}{M} + 2(B-1) \ln \frac{n-1}{B-1} + B \ln M + 2B.
\]
Compare this with (9.2): we lose essentially a factor of 2.
Implementation: let b \in \{w, s\} and define
\[
v_t(b, m) = p(x_t = b, m|y_{<t}) = \sum_{x:\, x_t = b} p((x, m)|y_{<t}),
\]
where
\[
v_1(b, m) = \frac{\theta(b)}{N}.
\]
Round t: receive p(y_t|y_{<t}, m) for each m. Predict:
\[
\mathrm{Pred}'_t(y_t) = \sum_m p(y_t|y_{<t}, m) \, v_t(m|w), \qquad \text{where} \quad v_t(m|w) = \frac{v_t(w, m)}{\sum_{m'} v_t(w, m')}.
\]
Let us show that v_t can be efficiently computed. A little calculation shows
\[
v_{t+1}(b, m) = \theta(b|w) \frac{p(y_t|y_{<t}, m)}{\mathrm{Pred}_t(y_t)} v_t(w, m) + \theta(b|s) v_t(b, m).
\]
One can show that \mathrm{Pred}_t = \mathrm{Pred}'_t holds for all t = 1, \dots, n. The proof of this hinges upon
\[
p((x, m)|y_{\le t}) =
\begin{cases}
\dfrac{p(y_t|y_{<t}, (x, m)) \, p((x, m)|y_{<t})}{\mathrm{Pred}_t(y_t)}, & \text{if } x_t = w; \\[1ex]
p((x, m)|y_{<t}), & \text{otherwise}.
\end{cases}
\]
Notice that the v_t update has complexity O(N) and storing these values takes memory O(N).
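The O(N) update can be sketched as follows; a minimal implementation assuming the Markovian prior with parameters \theta(w), \theta(s|w), \theta(w|s) as above (the class and parameter names are made up for illustration):

```python
class SleepingTracker:
    """O(N)-per-round tracking with the Markovian wake/sleep prior.

    v["w"][m], v["s"][m] hold v_t(w, m) and v_t(s, m), initialized to
    theta(b)/N and updated by the recursion from the notes.
    """

    def __init__(self, N, theta_w, theta_sw, theta_ws):
        self.N = N
        self.v = {"w": [theta_w / N] * N, "s": [(1 - theta_w) / N] * N}
        self.t_ww = 1 - theta_sw   # theta(w | w)
        self.t_sw = theta_sw       # theta(s | w)
        self.t_ws = theta_ws       # theta(w | s)
        self.t_ss = 1 - theta_ws   # theta(s | s)

    def predict(self, expert_probs):
        """Pred'_t: expert_probs[m] = p(y_t | y_{<t}, m) for observed y_t."""
        ww = sum(self.v["w"])
        return sum(p * vw for p, vw in zip(expert_probs, self.v["w"])) / ww

    def update(self, expert_probs):
        """Apply the v_{t+1} recursion; returns the round's prediction."""
        pred = self.predict(expert_probs)
        w_new, s_new = [], []
        for m in range(self.N):
            awake = expert_probs[m] / pred * self.v["w"][m]  # Bayes step
            asleep = self.v["s"][m]                          # unchanged
            w_new.append(self.t_ww * awake + self.t_ws * asleep)
            s_new.append(self.t_sw * awake + self.t_ss * asleep)
        self.v = {"w": w_new, "s": s_new}
        return pred
```

Each round costs O(N) time and O(N) memory, as claimed; the total mass \sum_{b,m} v_t(b,m) stays 1 across updates, and the awake weight drifts toward whichever expert currently predicts well.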