


The regret bound for the Exponential Weights Algorithm corresponds to Theorem 2.2 in the book of Cesa-Bianchi and Lugosi (2006), who call the Exponential Weights Algorithm the Exponentially Weighted Average Forecaster.

The Prod algorithm is due to Cesa-Bianchi et al. (2007). The proof of Theorem 5.7 follows from their Lemma 2. Arora et al. (2012) call Prod the Multiplicative Weights Algorithm.

5.7 Exercises

Exercise 5.1. For $A, B > 0$, find $\arg\min_{\eta>0}\left(\frac{A}{\eta} + \eta B\right)$ and also $\min_{\eta>0}\left(\frac{A}{\eta} + \eta B\right)$.

Exercise 5.2. (Prediction under Log-Loss in a Probabilistic Environment) Let $p$ be a probability mass function on $\{1, \ldots, N\}^n$ and let $Y_{1:n} \sim p$. Recall that the $N$-dimensional simplex is $\Delta_N = \{ q \in [0,1]^N : \sum_{i=1}^N q_i = 1 \}$. Let $\ell^{(y)} : \Delta_N \to \mathbb{R}$, $\ell^{(y)}(q) = -\ln q(y)$, be the log-loss under outcome $y \in \{1, \ldots, N\}$. Show that for any $1 \le t \le n$ and $y_{1:t-1} \in \{1, \ldots, N\}^{t-1}$,
\[
q \;\doteq\; \operatorname*{arg\,min}_{q \in \Delta_N} \mathbb{E}\left[\, \ell^{(Y_t)}(q) \,\middle|\, Y_{1:t-1} = y_{1:t-1} \right]
\]
is the conditional distribution of $Y_t$ on the event $Y_{1:t-1} = y_{1:t-1}$: $q_y = \Pr\left(Y_t = y \mid Y_{1:t-1} = y_{1:t-1}\right)$ for any $1 \le y \le N$.

Exercise 5.3. Let $f : U \to \mathbb{R}$ be twice differentiable over $U \subset \mathbb{R}^d$, $U$ open, and let $D \subset U$ be convex (hence we can talk about the derivatives of $f$ over $D$). Let $\eta > 0$. Prove the following:

(a) If f is η-exp-concave then f is convex on D.

(b) $f$ is $\eta$-exp-concave if and only if for any $x \in D$, $\eta \nabla f(x) \nabla f(x)^\top \preceq \nabla^2 f(x)$.

Hint: Take derivatives.

Exercise 5.4. (Exp-Concave Losses) Determine the set of non-negative $\eta$ values (if any) for which the following functions are $\eta$-exp-concave functions of $p$ for any $y$ over the domains specified in each case:

(i) the logarithmic loss $\ell(p, y) = -\mathbb{I}\{y = 1\}\ln p - \mathbb{I}\{y = 0\}\ln(1-p)$ for $p \in (0,1)$, $y \in \{0,1\}$;

(ii) the relative entropy loss $\ell(p, y) = y\ln(y/p) + (1-y)\ln((1-y)/(1-p))$ for $p \in (0,1)$, $y \in (0,1)$;

(iii) the squared loss $\ell(p, y) = \frac{1}{2}(p-y)^2$ for $p, y \in [0,1]$;

(iv) the absolute loss $\ell(p, y) = |p-y|$ for $p, y \in [0,1]$.
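
The questions above call for analytic arguments, but a quick numerical check can help form a conjecture before attempting a proof. The sketch below is illustrative only; the function names, the grid, and the tolerance are choices made here, not part of the exercise. It tests midpoint concavity of $p \mapsto \exp(-\eta\,\ell(p,y))$ on a finite grid.

```python
import numpy as np

def looks_exp_concave(loss, eta, ys, ps=None, tol=1e-9):
    """Midpoint-concavity check of p -> exp(-eta * loss(p, y)) on a grid.

    A finite-grid sanity check only: it can refute a conjectured eta,
    but it cannot prove exp-concavity.
    """
    if ps is None:
        ps = np.linspace(0.01, 0.99, 99)
    for y in ys:
        g = np.exp(-eta * np.array([loss(p, y) for p in ps]))
        # Compare g at midpoints of neighbouring grid points with the
        # average of the endpoint values.
        mid = np.exp(-eta * np.array([loss(0.5 * (p1 + p2), y)
                                      for p1, p2 in zip(ps[:-1], ps[1:])]))
        if np.any(mid + tol < 0.5 * (g[:-1] + g[1:])):
            return False
    return True

# Example: the squared loss of part (iii), tried with a few candidate etas.
sq_loss = lambda p, y: 0.5 * (p - y) ** 2
for eta in [0.5, 1.0, 2.0, 4.0]:
    print(eta, looks_exp_concave(sq_loss, eta, ys=[0.0, 0.5, 1.0]))
```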


Exercise 5.5. (Regret Bound of EWA for Exp-Concave Loss Functions) Prove Theorem 5.6 following the proof of Theorem 5.1 and using inequality (5.13).

Exercise 5.6. (Alternative Proof for the Regret Bound of EWA for Exp-Concave Loss Functions) Let $D \subset \mathbb{R}^d$ be a convex set and let $\mathcal{L}$ be a set of $\eta$-exp-concave functions over $D$ for some $\eta > 0$.

(i) Show that the regret $R_{t,i}$ of EWA on the continuous PEA problem specified by $(D, \mathcal{L})$, against any expert $i$ and any horizon $t \ge 1$, satisfies
\[
\sum_{i=1}^N e^{\eta R_{t,i}} \;\le\; \sum_{i=1}^N e^{\eta R_{t-1,i}}, \tag{5.14}
\]
where $R_{0,i} = 0$ for all $i$ by definition.

(ii) Based on (5.14), prove that the regret bound
\[
R_n = \max_{1 \le i \le N} R_{n,i} \;\le\; \frac{\ln N}{\eta}
\]
holds for any $n \ge 1$.

Exercise 5.7. (Loss-Range-Sensitive Regret Bound for EWA) Consider a continuous PEA problem specified by $(D, \mathcal{L})$ and a horizon $n$.

(a) Fix the learning rate $\eta > 0$ of EWA. Prove that its regret $R_n$ satisfies (5.15). That is, if $B_t = \max_i \ell_t(f_{t,i}) - \min_i \ell_t(f_{t,i})$, then
\[
R_n \;\le\; \frac{\ln N}{\eta} + \eta n \left( \frac{1}{8n} \sum_{t=1}^n B_t^2 \right). \tag{5.15}
\]

(b) Let $B$ be an upper bound on $\max_{1 \le t \le n} B_t$. Prove that with an appropriately selected learning rate $\eta$, the regret of EWA scales linearly with $B$; in particular,
\[
R_n \;\le\; B \sqrt{\tfrac{1}{2}\, n \ln N}.
\]
Specify the value of $\eta$ with which EWA can achieve this bound.

(c) Is the linear scaling of the bound with B inevitable? Is it inevitable for an arbitrary algorithm?

(d) The choice of $\eta$ that gives the best possible bound depends on $s^2 = \frac{1}{n} \sum_{t=1}^n \frac{B_t^2}{8}$. Consider the learning rate of Part (b). Calculate an upper bound on the regret of EWA as a function of $B$ and $s^2$. Consider the ratio of this upper bound to the optimal bound obtained when $\eta$ is tuned using the knowledge of $s^2$: one could say that this ratio is the price of choosing a fixed value of $\eta$ (based on $B$) instead of tuning $\eta$ optimally to the “data” (i.e., to $s^2$). Show an explicit expression for this ratio. How does this ratio behave when $B \to \infty$ (i.e., $B$ overestimates $s$)? When $B \to 0$ (i.e., when $B$ underestimates $s$)? Which type of deviation (up or down) is more expensive?

(e) * This part is harder than the rest: Can you design an adaptive algorithm?


Exercise 5.8. (Adaptivity of Prod) Answer the analogues of Parts (d) and (e) of Exercise 5.7 for Prod.

Exercise 5.9. Prove that $\ln(1+x) \ge x - x^2$ holds for all $x \ge -1/2$.

Exercise 5.10. (Alternate Regret Bound for EWA) In this problem you prove a regret bound for EWA in an alternate manner and compare the result to those of Theorems 5.1 and 5.7.

(a) Prove that for $|x| \le 1$, $\exp(-x) \le 1 - x + x^2$.

(b) Using this inequality, and assuming that $D$ is convex and $\mathcal{L}$ is a set of convex functions with values restricted to $[-1,1]$, show that for any $0 \le \eta \le 1$, EWA enjoys the regret bound
\[
R_n \;\le\; \frac{\ln N}{\eta} + \eta n \left( \frac{1}{n} \sum_{t=1}^n \langle \bar w_{t-1}, s_t \rangle \right),
\]
where $\bar w_{t-1,i} = w_{t-1,i}/W_{t-1}$ and $s_{t,i} = \ell_t^2(f_{t,i})$.

(c) Conclude that
\[
R_n \;\le\; \frac{\ln N}{\eta} + \eta n \left( \frac{1}{n} \sum_{t=1}^n \langle \bar w_{t-1}, a_t \rangle \right),
\]
where $a_{t,i} = |\ell_t(f_{t,i})|$.

(d) Compare these bounds to those obtained in this chapter (in particular, Theorems 5.1 and 5.7). Which bound is “the best”? When? Discuss the choice of η.

(e) Suggest an adaptive method to choose the learning rate η and study it either empirically, or analytically, or both empirically and analytically.

Exercise 5.11. (Improved Bounds for Small Losses for EWA) (This exercise is based on Theorem 2.4 and Corollary 2.4 of Cesa-Bianchi and Lugosi (2006).) In this exercise we will explore whether EWA can be expected to improve its regret as the loss of the best expert becomes smaller, the very issue addressed by Prod. The idea is to replace Hoeffding's inequality in proving the upper bound for $W_t/W_{t-1}$.

(a) Show that for a random variable $X$ taking values in the range $[0,1]$ and for any $s \in \mathbb{R}$, it holds that $\mathbb{E}\left[e^{sX}\right] \le \exp\left((e^s - 1)\,\mathbb{E}[X]\right)$. Note that for fixed $s > 0$ and as $\mathbb{E}[X] \to 0$, this bound is tighter than that of Hoeffding's lemma: the bound here converges to $1$, while the bound in Hoeffding's lemma converges to $e^{s^2/2} > 1$.

(b) Use the inequality of the previous part in place of Hoeffding's lemma in the proof of Theorem 5.1 to prove that if the loss functions are convex, their range is a subset of the interval $[0,1]$ and $\eta > 0$, then for any $i$,
\[
\hat L_n - L_{n,i} \;\le\; \ln N + \frac{\ln N}{e^\eta - 1} + \frac{e^\eta - 1}{2}\, L_{n,i}.
\]


(c) Note that if we define $\hat\eta = e^\eta - 1$, the above inequality can be written as
\[
\hat L_n - L_{n,i} \;\le\; \ln N + \frac{\ln N}{\hat\eta} + \frac{\hat\eta}{2}\, L_{n,i}.
\]
Describe in words the “meaning” of this. Compare this bound to the previously obtained regret bounds for EWA and Prod. What are the respective virtues of the various bounds and algorithms?

(d) Similarly to Part (b), use the inequality of Part (a) to prove the bound
\[
\hat L_n - L_{n,i} \;\le\; \frac{\ln N}{\eta} + \frac{\eta}{2}\, \hat L_n.
\]
Compare this to the bound of Exercise 5.10, Part (c).

Hint: For Part (b), first prove that $\eta \le \sinh(\eta) = (e^\eta - e^{-\eta})/2$ holds for all $\eta > 0$. Then use this and the bound of the previous part.

Exercise 5.12. (Simulation Study: Comparing EWA and Prod) Design a simulation study to compare EWA and Prod. Study the tightness of the various bounds derived.

Hint: Read the paper of Johnson (2002) first and then look at the previous exercises. That paper discusses how to design and run meaningful experiments in computing science, while the previous exercises in this chapter will give you the context.
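
As a possible starting point for such a study, here is a minimal sketch (not a prescribed design) that runs EWA and Prod on a synthetic problem with linear losses over the probability simplex; the Bernoulli environment, the horizon, and the learning-rate choice are illustrative assumptions only.

```python
import numpy as np

def run(losses, eta, update):
    """Play a weight vector over N experts against an (n x N) loss matrix.

    The learner incurs the mixture loss <w_t, loss_t>, the linear-loss
    special case of the continuous PEA problem; returns the regret
    against the best fixed expert.
    """
    n, N = losses.shape
    w = np.ones(N) / N
    alg_loss, expert_loss = 0.0, np.zeros(N)
    for t in range(n):
        alg_loss += w @ losses[t]
        expert_loss += losses[t]
        w = update(w, losses[t], eta)
        w /= w.sum()
    return alg_loss - expert_loss.min()

ewa_update = lambda w, l, eta: w * np.exp(-eta * l)    # Exponential Weights
prod_update = lambda w, l, eta: w * (1.0 - eta * l)    # Prod (needs eta * l < 1)

rng = np.random.default_rng(0)
n, N = 10_000, 10
mu = rng.uniform(0.3, 0.7, size=N)                 # illustrative environment
losses = (rng.random((n, N)) < mu).astype(float)   # Bernoulli losses in {0, 1}

eta = np.sqrt(8 * np.log(N) / n)   # a standard tuning; adjust to match the bound you study
print("EWA  regret:", run(losses, eta, ewa_update))
print("Prod regret:", run(losses, eta, prod_update))
```

From here one can vary the environment (for instance, make the best expert's loss small) and overlay the regret bounds derived in the chapter to study their tightness.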

Exercise 5.13. (Hierarchical EWA) Consider a continuous prediction problem with decision space $D$, outcome space $Y$, loss function $\ell$ and $N$ experts. As usual, let $\ell$ be convex in its first argument. Fix some $\eta > 0$. Let EWA($\eta$) denote the EWA algorithm when it is used with the fixed $N$ experts and the learning rate $\eta$. Now, consider a finite set of possible values of $\eta$, say, $E = \{\eta_1, \ldots, \eta_K\}$. Imagine using these $K$ instances of EWA in parallel. Each of them will predict in its own way. These new algorithms can themselves be considered as $K$ new, compound experts, giving rise to a “hierarchical EWA algorithm” with two layers.

Which of these compound experts is the best? This might be difficult to decide ahead of time (and in fact, it will depend on the outcome sequence), but we can just use EWA to combine the predictions of the $K$ compound experts to arrive at an interesting algorithm with a hyperparameter $\eta > 0$ (a minimal implementation sketch is given after this exercise).

(a) Invent a specific prediction problem $(D, Y, \ell)$ with the required properties and a fixed set of experts such that for some outcome sequences the smallest regret is obtained when $\eta$ is very close to zero, whereas for some other outcome sequences the smallest regret is obtained when $\eta$ takes on a large value.

(b) Implement the EWA algorithm and test it in the environment that you have described above, both for large and small learning rates. Do your experimental results support your answer to the first part? Hopefully they do, in which case you can consider the next part.


(c) Implement the hierarchical EWA algorithm described above and test it in the environment you have used above. Select η1, . . . , ηK in such a way that you can get interesting results (include the values of the learning rate used in the previous part). Describe your findings.

As to the value of the top-level $\eta$, use the value specified in Theorem 5.1.

(d) Is it worth defining yet another layer of the hierarchy to “learn” the best value of $\eta$? How about yet another layer on top of this? Justify your answer!
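
The following is a minimal sketch of the two-layer construction of Exercise 5.13, specialized to $D = [0,1]$ with the squared loss; the experts, the outcome sequence, the grid $\eta_1, \ldots, \eta_K$, and the top-level learning rate are all illustrative assumptions, not part of the exercise.

```python
import numpy as np

class EWA:
    """Exponential Weights over a fixed set of predictors (experts)."""
    def __init__(self, n_experts, eta):
        self.eta = eta
        self.w = np.ones(n_experts) / n_experts

    def predict(self, preds):
        # Convex combination of the predictions (valid since D is convex here).
        return float(self.w @ preds)

    def update(self, losses):
        self.w *= np.exp(-self.eta * np.asarray(losses))
        self.w /= self.w.sum()

# Illustrative setup: D = [0,1], squared loss, two constant experts.
rng = np.random.default_rng(1)
n = 5000
loss = lambda p, y: (p - y) ** 2
experts = np.array([0.0, 1.0])
etas = [0.01, 0.1, 1.0, 10.0]                 # the grid E = {eta_1, ..., eta_K}

base = [EWA(len(experts), e) for e in etas]   # first layer: one EWA(eta_k) per eta
top = EWA(len(base), np.sqrt(8 * np.log(len(base)) / n))  # second layer over the K compound experts

total = 0.0
for t in range(n):
    y = float(rng.random() < 0.5 + 0.2 * np.sin(t / 500))  # an arbitrary outcome sequence
    base_preds = np.array([b.predict(experts) for b in base])
    p_t = top.predict(base_preds)             # hierarchical prediction
    total += loss(p_t, y)
    top.update([loss(q, y) for q in base_preds])
    for b in base:
        b.update([loss(f, y) for f in experts])

print("average loss of hierarchical EWA:", total / n)
```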

Exercise 5.14. Let $f_{t,i} \in D$ be the experts' predictions. Assume that $\ell_t : D \to [0,1]$ is convex. Assume that the cumulative loss of expert $i$ after $t$ rounds, $L_{t,i} = \sum_{s=1}^t \ell_s(f_{s,i})$, is positive. For $\eta > 0$ define
\[
g_t(\eta) \;=\; \ell_t\!\left( \frac{\sum_{i=1}^N \exp(-\eta L_{t,i})\, f_{t,i}}{\sum_{j=1}^N \exp(-\eta L_{t,j})} \right).
\]
If this function were convex as a function of $\eta$, could you use it for tuning $\eta$? How? Is $g_t$ convex if $\ell_t$ is linear?


Chapter 6

The Randomized Exponential Weights Algorithm

In Chapter 4 we proved that the Weighted Majority Algorithm makes at most
\[
\frac{\ln\!\big(\tfrac{1}{\beta}\big)\, L_n + \ln N}{\ln\!\big(\tfrac{2}{1+\beta}\big)}
\]
mistakes, where $L_n$ is the number of mistakes of the best expert. Note that here the multiplier of $L_n$ is lower bounded by $2$, and hence it is not possible to derive a sublinear regret bound for WMA from this bound. In contrast, for the Exponential Weights Algorithm from the last chapter we proved an $O(\sqrt{n \log N})$ upper bound on the regret. The reason for this discrepancy is the lack of convexity: while in the last chapter the predictions of the experts could be averaged and the loss was assumed to be convex, this was not the case for the setting that WMA operates in. However, there is a way to fix this problem with the help of randomization. This chapter explains how to do this.

We consider an online prediction problem which is a common generalization of both Discrete Prediction with Expert Advice and Continuous Prediction with Expert Advice. In each round $t$, we have to choose a prediction $p_t$ from a set $D$ based on the predictions $f_{t,1}, f_{t,2}, \ldots, f_{t,N} \in D$ of $N$ experts. The set $D$ is an arbitrary non-empty set. It can be finite or infinite and, in particular, it does not have to be convex. At the end of the round, we receive a loss function $\ell_t$ chosen from a set $\mathcal{L}$. The loss function is used to score our prediction $p_t$ as well as the experts' predictions.

Prediction with Expert Advice

In round $t = 1, 2, \ldots$:

• Receive the predictions $f_{t,1}, f_{t,2}, \ldots, f_{t,N} \in D$ of the experts.

• Predict $p_t \in D$ using past information.

• Receive a loss function $\ell_t : D \to \mathbb{R}$, $\ell_t \in \mathcal{L}$.

• Incur the loss $\ell_t(p_t)$.
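
To make the protocol concrete, here is a minimal sketch of the interaction loop written as code; the names (`expert_predictions`, `loss_function`, `learner`) are placeholders introduced here and are not notation used in the text.

```python
from typing import Callable, Sequence

def prediction_with_expert_advice(n_rounds: int,
                                  expert_predictions: Callable[[int], Sequence],
                                  loss_function: Callable[[int], Callable],
                                  learner) -> float:
    """Run the protocol above and return the learner's cumulative loss.

    `learner` is any object with predict(expert_preds) -> p_t and
    update(expert_losses); EWA-style learners only need the experts'
    losses, although the protocol reveals the whole loss function.
    """
    total = 0.0
    for t in range(1, n_rounds + 1):
        experts = expert_predictions(t)     # receive f_{t,1}, ..., f_{t,N} in D
        p_t = learner.predict(experts)      # predict p_t in D using past information
        ell_t = loss_function(t)            # the loss function ell_t is revealed
        total += ell_t(p_t)                 # incur the loss ell_t(p_t)
        learner.update([ell_t(f) for f in experts])
    return total
```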
