
Proof. Let $X_t = \ell_t(P_t)$ be the loss incurred by REWA in round $t$. Note that the cumulated loss of REWA for the $n$ rounds is $\hat{L}_n = \sum_{t=1}^n X_t$. Since $(I_t)_{1\le t\le n}$ are independent, $(X_t)_{1\le t\le n}$ is a sequence of independent random variables and $0 \le X_t \le 1$ with probability one for all $t$. Hence, Theorem C.7 gives that with probability at least $1-\delta$,
\[
\hat{L}_n \le \mathbb{E}[\hat{L}_n] + \sqrt{\frac{n}{2}\ln\frac{1}{\delta}}\,.
\]
This, together with the choice $\eta = \sqrt{\frac{8\ln N}{n}}$ and Theorem 6.1, gives that with probability at least $1-\delta$,
\[
\hat{L}_n \le \min_{1\le i\le N} L_{i,n} + \sqrt{\frac{n}{2}\ln N} + \sqrt{\frac{n}{2}\ln\frac{1}{\delta}}\,.
\]
Since $R_n = \hat{L}_n - \min_{1\le i\le N} L_{i,n}$, this gives a high-probability regret bound and in fact shows that the tail of the regret is subgaussian.
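To complement the bound just proved, here is a minimal runnable sketch of REWA (a Python illustration; the function name and array-based interface are our own, not the book's notation), using the tuning $\eta = \sqrt{8\ln N/n}$ from the proof:

```python
import numpy as np

def rewa(losses, seed=None):
    """Randomized EWA on an (n, N) array of losses in [0, 1].

    Returns REWA's cumulated loss and the experts' cumulated losses.
    """
    rng = np.random.default_rng(seed)
    n, N = losses.shape
    eta = np.sqrt(8 * np.log(N) / n)      # the tuning used in the proof above
    L = np.zeros(N)                       # cumulated expert losses L_{i,t}
    total = 0.0
    for t in range(n):
        w = np.exp(-eta * (L - L.min()))  # shift by the min for numerical stability
        p = w / w.sum()
        i = rng.choice(N, p=p)            # the randomized prediction I_t ~ p_t
        total += losses[t, i]             # X_t = ell_t(P_t)
        L += losses[t]                    # full-information update
    return total, L
```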

6.3 Exercises

Exercise 6.1. Consider the setting of Exercise 5.13, now for a discrete prediction problem. That is, the goal is to find the best learning rate for randomized EWA from a finite pool $\{\eta_1, \ldots, \eta_K\}$. One possibility is to run a randomized EWA on top of $K$ randomized EWA forecasters, each using some learning rate $\eta_k$, $k = 1, \ldots, K$. The randomized EWA "on the top" would thus randomly select one of the randomized EWA forecasters in the "base", which would in turn select an expert at random. Another possibility is the following: if $p^{(0)}_t = (w^{(0)}_{t,1}, \ldots, w^{(0)}_{t,K}) \in [0,1]^K$ is the probability vector of the randomized EWA on the top for round $t$, and $p^{(k)}_t = (w^{(k)}_{t,1}, \ldots, w^{(k)}_{t,N}) \in [0,1]^N$ is likewise the probability vector of the $k$th randomized EWA in the base, then select expert $i$ with probability $\sum_{k=1}^K w^{(0)}_{t,k} w^{(k)}_{t,i}$, i.e., combine the "votes" across the two layers before selecting the expert. Which method would you prefer? Why? Design an experiment to validate your claim.
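To make the two schemes concrete, here is a hedged sketch of the corresponding sampling rules (the function names are ours; `p0` is the top-level probability vector over the $K$ base forecasters and the rows of `P` are the base forecasters' probability vectors over the $N$ experts):

```python
import numpy as np

rng = np.random.default_rng(0)

def hierarchical_sample(p0, P):
    """Option 1: sample a base forecaster k ~ p0, then an expert i ~ P[k]."""
    k = rng.choice(len(p0), p=p0)
    return rng.choice(P.shape[1], p=P[k])

def collapsed_sample(p0, P):
    """Option 2: combine the votes first, then sample i ~ sum_k p0[k] P[k]."""
    q = p0 @ P                   # (K,) @ (K, N) -> mixture distribution over experts
    return rng.choice(P.shape[1], p=q)
```

Given the same weight vectors, both rules select expert $i$ with the same probability $\sum_k w^{(0)}_{t,k} w^{(k)}_{t,i}$; whether the weights themselves evolve in the same way under the two schemes is the crux of the exercise.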

Exercise 6.2. (WMA vs. EWA in i.i.d. Environments)

(a) Compare EWA and WMA in simulation in i.i.d. environments. One idea for the core of these experiments is as follows (see also the sketch after part (b)): choose $N$ probabilities $(p_i)_i$, where $p_i$ represents the probability that expert $i$ makes a mistake in a given round (for simplicity, you can order these numbers so that $p_1 \le p_2 \le \cdots \le p_N$). Generate i.i.d. loss vectors $(\ell_t)_{1\le t\le n}$, where the components of each loss vector are also selected independently and $\Pr(\ell_{t,i} = 1) = p_i$. Feed these loss vectors to both EWA and WMA and count the number of mistakes each makes.

What can you conclude about the respective performances of EWA and WMA?

(b) Consider WMA, but assume that the outcome sequence is i.i.d. Prove a bound on the expected cumulated loss of WMA. Compare the bound to that of EWA. What do you conclude?
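A minimal sketch of the experiment core described in part (a) (the range of the mistake probabilities and the problem sizes are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(42)
n, N = 1000, 10
p = np.sort(rng.uniform(0.2, 0.8, size=N))       # p_1 <= p_2 <= ... <= p_N
losses = (rng.random((n, N)) < p).astype(float)  # i.i.d. with Pr(ell_{t,i} = 1) = p_i
# Feed `losses` row by row to both EWA and WMA and compare mistake counts.
```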


Exercise 6.3. (REWA with Discretization) Consider a decision space $D \subset X$, but this time only assume that $X$ is a metric space and the loss functions are $L$-Lipschitz. (Let $A = (A, d_A)$ and $B = (B, d_B)$ be metric spaces. A function $f : A \to B$ between $A$ and $B$ is called $L$-Lipschitz if $d_B(f(a), f(a')) \le L\, d_A(a, a')$ holds for any $a, a' \in A$. For Euclidean spaces, unless otherwise mentioned, the metric is the Euclidean distance.) Assume that for any given $\varepsilon > 0$, $D$ can be covered by $N(\varepsilon)$ balls ($N(\varepsilon)$ is called the covering number of $D$ w.r.t. the metric of $X$ at scale $\varepsilon$).

(a) Prove a regret bound for the Regret Minimization Game specified using (D,L) and horizon n. The regret bound may depend on N(·).

(b) Calculate the covering numbers $N(\cdot)$ for the $d$-dimensional unit ball in the $d$-dimensional Euclidean space. What regret bound do you get for $L$-Lipschitz functions over the unit ball in this case? What is the pitfall of this approach? What is this approach good for?

Hint: Use a discretization argument and REWA. Choose the free parameter in the bound to minimize the bound.
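For part (b), one possible grid construction is sketched below (the function name is ours, and the cover is built on the enclosing cube, which matches the true covering numbers only up to dimension-dependent constants). The exponential growth of the printed count is exactly the pitfall the exercise asks about:

```python
import itertools
import numpy as np

def epsilon_net_unit_ball(d, eps):
    """Grid-based eps-cover of the unit ball in R^d (up to constants)."""
    h = 2 * eps / np.sqrt(d)   # spacing such that every point is within eps of the grid
    axis = np.arange(-1.0, 1.0 + h, h)
    pts = np.array(list(itertools.product(axis, repeat=d)))
    return pts[np.linalg.norm(pts, axis=1) <= 1.0 + eps]  # keep points near the ball

centers = epsilon_net_unit_ball(d=3, eps=0.25)
print(len(centers))            # grows roughly like (c * sqrt(d) / eps)^d
```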

Exercise 6.4. (Efficient Implementation of REWA for the Online Shortest Path Problem) Let $G = (V, E)$ be a directed acyclic graph, where $V$ denotes the set of vertices and $E$ denotes the set of edges. Consider the online shortest path problem for a source node $u \in V$ and a destination node $v \in V$. Let $\mathcal{P}$ denote the set of paths from $u$ to $v$, and assume $\mathcal{P}$ is nonempty. In each round $t$ of the game, the environment selects a loss function $\ell_t : E \to [0,1]$. The cost of any path $P \in \mathcal{P}$ from $u$ to $v$ is defined as $\ell_t(P) = \sum_{e \in P} \ell_t(e)$, where $e \in P$ denotes that edge $e \in E$ belongs to path $P$. The game proceeds as follows:

In round $t = 1, 2, \ldots$

• Simultaneously

– Environment selects a loss function $\ell_t : E \to [0,1]$.

– Forecaster selects a path $P_t \in \mathcal{P}$.

• Forecaster suffers loss $\ell_t(P_t)$ and observes the loss function $\ell_t$.

The goal of the forecaster is to minimize its regret relative to any fixed path $p \in \mathcal{P}$ in $n$ rounds.

(a) Rephrase the problem as an instance of the prediction with expert advice problem.

(b) Apply REWA and analyze its regret.

(c) The vanilla implementation of REWA requires updating one weight for each path in $\mathcal{P}$, which is typically exponentially large in $|V|$ and $|E|$. Give an efficient implementation of REWA that requires only $O(|E|)$ computations in each round.

Hint: Select the edges of the path one by one, starting from u. Use dynamic programming to compute the required distributions efficiently; keep one weight for each edge. Prove that the distribution of the selected random path is identical to that in REWA.
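As a hedged illustration of the hint, one possible dynamic-programming sampler is sketched below (the data layout and names are ours: the DAG is a successor dictionary, and `cum_loss` stores one cumulated loss per edge). It samples a path $P$ with probability proportional to $\exp(-\eta \sum_{e \in P} \mathrm{cum\_loss}[e])$, which is exactly REWA's distribution over $\mathcal{P}$, at a cost of $O(|E|)$ per round:

```python
import math
import random

def sample_path(succ, u, v, cum_loss, eta, rng=random):
    """Sample a u->v path with probability proportional to
    exp(-eta * cumulated path loss), keeping one weight per edge.

    succ: dict mapping node -> list of successor nodes (the DAG).
    cum_loss: dict mapping edge (x, y) -> its cumulated loss so far.
    """
    # W[x] = total exp-weight of all x->v paths, by backward recursion.
    W = {v: 1.0}
    def weight(x):
        if x not in W:
            W[x] = sum(math.exp(-eta * cum_loss[(x, y)]) * weight(y)
                       for y in succ.get(x, []))
        return W[x]
    weight(u)
    # Walk from u, picking each edge with its conditional probability; the
    # product of these probabilities telescopes to REWA's path probability
    # prod_{e in P} exp(-eta * cum_loss[e]) / W[u].
    path, x = [], u
    while x != v:
        nodes = succ[x]
        probs = [math.exp(-eta * cum_loss[(x, y)]) * W[y] / W[x] for y in nodes]
        y = rng.choices(nodes, weights=probs)[0]
        path.append((x, y))
        x = y
    return path
```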


Chapter 7

Lower Bounds for Prediction with Expert Advice

Upper bounds on the worst-case regret of prediction algorithms provide strong guarantees on their performance. However, on their own, they are not entirely satisfactory: Could it be that the analysis of the regret was loose and that much better bounds are possible? Or could it be that an algorithm that enjoys a much better upper bound on its worst-case regret exists, just that we have not found it yet? Ideally, we would like to know, for a given Online Optimization Game, what the best possible forecaster is and what its worst-case expected regret is. A step towards this goal is a lower bound on the worst-case regret of any algorithm for the game. If the lower bound equals (or is close to) an upper bound of a forecaster, we can declare the forecaster to be (close to) worst-case optimal.

In this chapter, we prove such a lower bound for the problem of Prediction with Expert Advice. We show that the worst-case expected regret of any algorithm is at least $\Omega(\sqrt{n \log N})$, where $N$ is the number of experts and $n$ is the number of rounds of the game.

The same lower bound holds for Continuous Prediction with Expert Advice; it can be obtained by a simple modification of the argument presented in this chapter, which we leave as an exercise (Exercise 7.5). These two lower bounds, together with the $O(\sqrt{n \log N})$ upper bounds from Chapters 5 and 6, imply that the (Randomized) Exponentially Weighted Forecaster is worst-case optimal to within a constant factor.

What is more, in this chapter we also show that it is in fact optimal in an asymptotic sense as the number of rounds and experts go to infinity.

To prove the lower bound, we view Prediction with Expert Advice as an Online Optimization Game with decision set $D = \{1, 2, \ldots, N\}$ and set of loss functions $\mathcal{L} = [0,1]^D$. We can think of an element $i \in D$ as the $i$th expert, and we assume that it predicts $f_{t,i} = i$. In each round $t$, the forecaster chooses an expert $p_t \in D$ and receives a loss function $\ell_t \in \mathcal{L}$. The protocol, from the point of view of the forecaster, thus looks as follows:


Prediction with Expert Advice

In round $t = 1, 2, \ldots$:

• Predict $p_t \in \{1, 2, \ldots, N\}$.

• Receive loss function $\ell_t \in [0,1]^{\{1,2,\ldots,N\}}$.

• Incur loss $\ell_t(p_t)$.

The key to proving a lower bound is to cancel the effect of the forecaster. The easiest way of achieving this is to eliminate all "patterns" from the losses. Formally, this can be done by choosing the losses assigned to the experts randomly, independently of the past. Then, all forecasters achieve the same expected loss, which grows linearly with time, and the question becomes how this compares to the expected loss of the best expert. Since the expectation of a (discrete-valued) random variable is at most the largest value that the random variable can take with positive probability, if the expected regret against a randomized environment is large, we also get that there exists a sequence of losses (chosen obliviously, knowing the forecaster's strategy) that, for the given forecaster, gives rise to a large regret. This idea of choosing random losses to cancel the forecaster will be used multiple times in the book to prove lower bounds for the worst-case regret.
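A quick empirical sketch of this idea, using i.i.d. Bernoulli($1/2$) losses (a natural instance of such a randomization; the constants below are illustrative choices of ours): every forecaster then has expected loss $n/2$, so the expected regret is $n/2 - \mathbb{E}[\min_i L_{i,n}]$, which the simulation compares to the $\sqrt{n\ln(N)/2}$ scale.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, reps = 10_000, 32, 200

gaps = []
for _ in range(reps):
    # i.i.d. Bernoulli(1/2) losses; L holds the cumulated expert losses L_{i,n}
    L = rng.integers(0, 2, size=(n, N)).sum(axis=0)
    gaps.append(n / 2 - L.min())

print(np.mean(gaps))               # empirical expected regret of any forecaster
print(np.sqrt(n * np.log(N) / 2))  # the sqrt(n ln(N) / 2) scale it should track
```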
