
physics, where the Lyapunov function often measures the “energy” left in the system, or the state’s “potential”. Hence, the progress measure is also called a potential function and the method of proof the “potential function technique”. Exercise 4.7 explores the relationship to this classical technique. In control theory this technique was used to derive bounds on the performance of various adaptive and robust controllers (French et al., 2003).

In Exercise 4.8, as an illustration of how the proof technique of Theorem 4.2 implies other interesting results, we consider the simple greedy algorithm for the well-known Set-Cover Problem. It is shown there that the greedy algorithm is an instance of WMA, and the reader is then asked to prove that this algorithm, subject to reasonable complexity-theoretic assumptions and up to a constant factor, achieves the best possible approximation guarantee for this problem.

4.5 Exercises

Exercise 4.1. (Elimination Strategies) Consider the Prediction With Expert Advice problem with zero-one loss under the extra assumption that there exists an expert that never makes a mistake. Let $S_t$ denote the set of experts that did not make any mistake in rounds $1, 2, \ldots, t$.

(a) Prove that any algorithm that in round $t$ chooses an arbitrary expert $i \in S_{t-1}$ and predicts $f_{t,i}$ makes at most $N-1$ mistakes.

(b) Consider the algorithm that in every round chooses the smallest index $i \in S_{t-1}$ and predicts $f_{t,i}$. Show that this algorithm can make $N-1$ mistakes in the worst case. More precisely, construct sequences $\{y_t\}_{t=1}^\infty$, $\{f_{t,1}\}_{t=1}^\infty$, $\{f_{t,2}\}_{t=1}^\infty$, $\ldots$, $\{f_{t,N}\}_{t=1}^\infty$ on which the algorithm makes $N-1$ mistakes in the first $N-1$ rounds, while the sequences satisfy the assumption that for at least one $i$, $f_{1,i} = y_1$, $f_{2,i} = y_2$, $\ldots$.
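The strategy of part (a) is easy to state in code. Here is a minimal Python sketch (our own illustration; the input conventions, `advice[t][i]` for expert $i$'s prediction and `outcomes[t]` for the outcome of round $t+1$, are hypothetical):

```python
def eliminate(advice, outcomes):
    """Elimination strategy: follow an arbitrary expert that has not erred
    yet, then drop all experts that erred in the current round. Assumes at
    least one expert never makes a mistake."""
    N = len(advice[0])
    S = set(range(N))                      # experts with no mistakes so far
    mistakes = 0
    for f_t, y_t in zip(advice, outcomes):
        i = next(iter(S))                  # an arbitrary surviving expert
        mistakes += int(f_t[i] != y_t)     # predict f_{t,i}
        S = {j for j in S if f_t[j] == y_t}
    return mistakes
```

Every mistake of this strategy removes at least the followed expert from `S`, while the perfect expert always survives; this observation is the heart of part (a).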

Exercise 4.2. (Mistake Lower Bound with a Perfect Expert) Consider the Prediction With Expert Advice problem with zero-one loss under the extra assumption that there exists an expert that never makes a mistake. Prove that in the worst case the expected number of mistakes of any (possibly randomized) prediction algorithm is at least $c\lfloor \log_2 N \rfloor$, where $c > 0$ is some constant.

Exercise 4.3. (Mistake Bound for Deterministic FTL) Consider the Prediction With Expert Advice problem with zero-one loss. Show that the cumulated loss of the deterministic FTL algorithm, which breaks ties in a deterministic way, after $n$ rounds is bounded by $NL_n + (N-1)$, where $L_n = \min_{1\le j\le N} L_{n,j}$. Prove that the bound is tight, that is, for any $N > 0$ and any $L_n \le n/N - 1$, there exist sequences $\{y_t\}_{t=1}^\infty$, $\{f_{t,1}\}_{t=1}^\infty$, $\ldots$, $\{f_{t,N}\}_{t=1}^\infty$ such that the deterministic FTL algorithm makes $NL_n + (N-1)$ mistakes.

Hint: Consider how many times it can happen that the FTL algorithm incurs loss but the minimum loss of the reference predictors does not increase.
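For concreteness, the prediction rule of the deterministic FTL variant considered here can be written as follows (a sketch with our own naming; ties are broken toward the smallest index):

```python
def ftl_predict(cum_loss, f_t):
    """Follow The Leader: predict as an expert with the smallest cumulated
    loss does, breaking ties deterministically by the smallest index."""
    leader = min(range(len(cum_loss)), key=lambda i: (cum_loss[i], i))
    return f_t[leader]
```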


Exercise 4.4. (From WMA to Halving) Show that by choosing $\beta$ as a function of $L_n$ and $N$ to optimize the bound of Theorem 4.2, and letting $L_n \to 0$, the mistake bound of Theorem 4.1 can be recovered from the bound of Theorem 4.2.

Exercise 4.5. (Simplifying the Bound of WMA) Consider the Weighted Majority Algorithm.

(a) Show that the upper bound on the number of mistakes of WMA can be further upper bounded by
$$\frac{2}{\beta}\, L_n + \frac{2}{1-\beta} \ln N\,.$$

(b) Sometimes it is more convenient to use $\alpha = 1-\beta$ (i.e., writing the weight update of the experts $i$ who made a mistake in the form $w_{t+1,i} = (1-\alpha)w_{t,i}$). Assuming that $\alpha < 1/2$ (the update is not too aggressive), show that the number of mistakes of WMA can be upper bounded by
$$2(1+\alpha)\, L_n + \frac{2}{\alpha} \ln N\,.$$

(c) Assuming that $L_n$ is known before the game starts, how would you set $\alpha$ in the previous part to get the smallest upper bound there? When is this bound valid (check the earlier conditions)?

(d) Design an adaptive version of WMA that selects $\alpha$ based on the observed data. Design and run some experiments to validate your algorithm (one possible design is sketched after the hints below).

Note that, as $L_n \to \infty$, it can be shown that any deterministic algorithm makes at least $2L_n$ mistakes asymptotically; that is, the leading constant of the bound is unimprovable.

Hint: Part (a): First prove that $\ln(1+x) \le x$ holds when $x > -1$, with equality iff $x = 0$. Then, using this inequality, show that $\ln\frac{2}{1+\beta} \ge \frac{1-\beta}{2}$ and also that $\ln(1/\beta) \le \frac{1-\beta}{\beta}$.

Hint: Part (b): One way to do this is to prove that $\beta \mapsto \frac{\ln(1/\beta)}{2\ln\frac{2}{1+\beta}}$ is convex on $(0,1)$ and then use convexity to show that on $(1/2,1)$ this ratio can be upper bounded by $2-\beta = 1+(1-\beta) = 1+\alpha$.

A second way to prove this part is to first show that $-\ln(1-x) \le x + x^2$ holds for $x < 1/2$, and then use this inequality to upper bound $\ln(1/\beta)$ by $(1-\beta) + (1-\beta)^2 = (1-\beta)(2-\beta)$. Notice the major improvement compared to the bound of Part (a) on the same quantity over $\beta \in (1/2, 1)$. Combine the bound obtained with the upper bound on $\big(\ln\frac{2}{1+\beta}\big)^{-1}$ from Part (a) to finish the proof.
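Below is a sketch of WMA in the $\alpha$-parametrization of part (b), together with one possible adaptive scheme for part (d): a doubling trick that guesses a budget $B$ on the best expert's loss, tunes $\alpha$ accordingly, and restarts with a doubled budget whenever the guess is exceeded. All names, and the particular tuning $\alpha \approx \sqrt{\ln N / B}$, are our own choices rather than the unique answer:

```python
import math

def wma(advice, outcomes, alpha):
    """Weighted Majority: multiply the weight of every erring expert by
    (1 - alpha) and predict with the weighted majority vote (binary labels)."""
    N = len(advice[0])
    w = [1.0] * N
    mistakes = 0
    for f_t, y_t in zip(advice, outcomes):
        vote_one = sum(wi for wi, fi in zip(w, f_t) if fi == 1)
        p_t = 1 if vote_one >= sum(w) / 2 else 0
        mistakes += int(p_t != y_t)
        w = [wi * (1 - alpha) if fi != y_t else wi for wi, fi in zip(w, f_t)]
    return mistakes

def wma_doubling(advice, outcomes):
    """Adaptive WMA via the doubling trick: restart with alpha tuned to a
    doubled loss budget B whenever every expert has lost more than B."""
    N = len(advice[0])
    B = max(1.0, math.log(N))
    alpha = min(0.25, math.sqrt(math.log(N) / B))
    w, L = [1.0] * N, [0] * N
    mistakes = 0
    for f_t, y_t in zip(advice, outcomes):
        vote_one = sum(wi for wi, fi in zip(w, f_t) if fi == 1)
        p_t = 1 if vote_one >= sum(w) / 2 else 0
        mistakes += int(p_t != y_t)
        for i, fi in enumerate(f_t):
            if fi != y_t:
                L[i] += 1
                w[i] *= 1 - alpha
        if min(L) > B:                       # the budget guess failed
            B *= 2
            alpha = min(0.25, math.sqrt(math.log(N) / B))
            w, L = [1.0] * N, [0] * N        # restart the algorithm
    return mistakes
```

Validating such a scheme experimentally, as part (d) asks, amounts to comparing `wma_doubling` against `wma` run with the best fixed $\alpha$ in hindsight on simulated sequences.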

Exercise 4.6. (Lower Bounds Without a Perfect Expert) Consider the Prediction With Expert Advice problem with zero-one loss. We allow any expert to make any number of mistakes.

(a) Prove that for any deterministic algorithm there exist sequences $\{y_t\}_{t=1}^\infty$, $\{f_{t,1}\}_{t=1}^\infty$, $\{f_{t,2}\}_{t=1}^\infty$, $\ldots$, $\{f_{t,N}\}_{t=1}^\infty$ on which the algorithm makes a mistake in every round. Conclude that in the first $n$ rounds the algorithm makes $n$ mistakes.


(b) Prove that for any, possibly randomized, algorithm there exist sequences $\{y_t\}_{t=1}^\infty$, $\{f_{t,1}\}_{t=1}^\infty$, $\{f_{t,2}\}_{t=1}^\infty$, $\ldots$, $\{f_{t,N}\}_{t=1}^\infty$ such that in any round the probability (with respect to the algorithm's internal randomization) that the algorithm makes a mistake is at least $1/2$. Conclude that the expected number of mistakes the algorithm makes on these sequences in the first $n$ rounds is at least $n/2$.

(c) Let $y_1, y_2, \ldots$ be any sequence of outcomes. Consider two experts: in every round $t$, the first expert predicts $f_{t,1} = 0$ and the second expert predicts $f_{t,2} = 1$. Show that for any $n$, at least one of the experts makes at most $n/2$ mistakes in the first $n$ rounds.

(d) Combine parts (a) and (c) to show that for any deterministic algorithm there exist sequences $\{y_t\}_{t=1}^\infty$, $\{f_{t,1}\}_{t=1}^\infty$, $\{f_{t,2}\}_{t=1}^\infty$, $\ldots$, $\{f_{t,N}\}_{t=1}^\infty$ such that the regret after $n$ rounds is at least $n/2$.

Exercise 4.7. (Relationship to Lyapunov's Stability Theory) Let $f: \mathbb{R}^d \to \mathbb{R}^d$ be a sufficiently smooth function, $x_0 \in \mathbb{R}^d$ a vector, and consider the autonomous system
$$\dot{x} = f(x) \tag{4.4}$$
with initial condition
$$x(0) = x_0. \tag{4.5}$$

A mapping $x: [0,\infty) \to \mathbb{R}^d$ is called a solution to (4.4) if $x$ is differentiable on $(0,\infty)$ and $\frac{d}{dt}x(t) = f(x(t))$ holds for any $t > 0$. The mapping $x$ is called a solution to (4.4)–(4.5) if it is a solution to (4.4) and if $x(0) = x_0$ is also satisfied. Given $f$, will all solutions satisfy $\limsup_{t\to\infty} \|x(t)\| \le B$ with some $B < \infty$? Assuming that zero is an equilibrium point of $f$, i.e., $f(0) = 0$, will all solutions converge to zero eventually? The main idea of Lyapunov's theory is to find a function $V: \mathbb{R}^d \to \mathbb{R}$ (a “Lyapunov function” of (4.4)) and then show that for any solution $x = x(\cdot)$ of (4.4), $v(t) = V(x(t))$ does not increase as a function of $t$, as this immediately implies that the solution stays within the level set $\{u \in \mathbb{R}^d : V(u) \le V(x(0))\}$. Then, for example, if this level set is bounded, the solution will stay bounded. Boundedness of the level sets is usually shown by proving that $V(x)$ increases without limit as $\|x\| \to \infty$ and that $V$ is lower bounded by, say, zero. The main tool to show that $V$ does not increase along the solutions of (4.4) is to study the derivative of $v$. Since, assuming that $V$ is smooth, $\frac{d}{dt}v(t) = \langle V'(x(t)), x'(t)\rangle = \langle V'(x(t)), f(x(t))\rangle$, the derivative of $v$ can be studied without knowing any specific solution $x(\cdot)$, just by studying the function $\dot{V}(u) \doteq \langle V'(u), f(u)\rangle$. For example, if this function is always negative, we immediately see that $v$ cannot increase no matter what solution $x(\cdot)$ is used to define it. Further information about the properties of the solutions can often be read out by bounding $\dot{V}(u)$. For the sake of a simple example, assume that $V$ is lower bounded by zero and we are interested in bounding $\int_0^T \|x(t)\|^2\,dt$. Assume further that we know that $\dot{V}(u) \le -\|u\|^2$. This gives $\dot{V}(x(t)) \le -\|x(t)\|^2$ for any solution $x(\cdot)$ and any $t > 0$. Now, $0 \le V(x(T)) = V(x(0)) + \int_0^T v'(t)\,dt = V(x(0)) + \int_0^T \dot{V}(x(t))\,dt \le V(x(0)) - \int_0^T \|x(t)\|^2\,dt$, which gives $\int_0^T \|x(t)\|^2\,dt \le V(x(0))$. This type of reasoning was used by French et al. (2003) to derive bounds on the performance of various adaptive controllers.
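As a numerical illustration of the last calculation (our own toy example, not part of the exercise), take $f(x) = -x$ and $V(x) = \|x\|^2$, so that $\dot{V}(u) = \langle 2u, -u\rangle = -2\|u\|^2 \le -\|u\|^2$; the bound $\int_0^T \|x(t)\|^2\,dt \le V(x(0))$ can then be checked by forward-Euler simulation:

```python
import numpy as np

# Forward-Euler check of int_0^T ||x(t)||^2 dt <= V(x(0)) for the system
# xdot = f(x) = -x with V(x) = ||x||^2, for which
# Vdot(u) = <2u, -u> = -2||u||^2 <= -||u||^2.
x0 = np.array([1.0, -2.0, 0.5])
dt, T = 1e-4, 20.0
x = x0.copy()
integral = 0.0
for _ in range(int(T / dt)):
    integral += float(np.dot(x, x)) * dt
    x += dt * (-x)                         # one Euler step of xdot = -x
print(integral, "<=", float(np.dot(x0, x0)))   # about 2.62 <= 5.25
```

For this particular system the integral converges to $\|x_0\|^2/2$, i.e., half of the bound.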


In this exercise you are asked to put the result of this chapter concerning WMA into a form that closely resembles the above sketched reasoning. Fix $\beta \in (0,1)$ and an arbitrary expert index $i$, and let $\ell_{t,i} = \mathbb{I}\{f_{t,i} \ne y_t\}$. Define the “Lyapunov function candidate” $Q: \mathbb{R}^N \to \mathbb{R}$ as $Q(s) = s_i \ln(1/\beta) + \ln \sum_{j=1}^N \beta^{s_j}$. Let $L_t = (L_{t,1}, \ldots, L_{t,N})^\top$ be the vector of cumulated losses of the $N$ experts. The claim is that $Q$ acts as a Lyapunov function associated with the “trajectories” $t \mapsto L_t$.

(a) Show that $0 \le Q(s)$ for any $s \in \mathbb{R}^N$.

(b) The analogue of the statement that $v$ decreases along the trajectories is that $t \mapsto Q(L_t)$ does not increase “by much”: derive $Q(L_{t+1}) \le Q(L_t) + a_t$, where $a_t = \ell_{t,i} \ln\frac{1}{\beta} - \mathbb{I}\{p_t \ne y_t\} \ln\frac{2}{1+\beta}$ and $p_t$ is the prediction of WMA at time step $t$.

(c) It follows then that $0 \le Q(L_n) \le Q(L_0) + \sum_{t=1}^n a_t \le Q(0) + \sum_{t=1}^n a_t$. Plug in the definition of $a_t$ and solve for the loss of WMA.
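The inequality of part (b) is easy to test numerically before trying to prove it. The following small self-contained script (our own; the comparator index $i$ and the random data are arbitrary) checks $Q(L_{t+1}) \le Q(L_t) + a_t$ along a simulated run of WMA:

```python
import math, random

beta, N, n, i = 0.6, 5, 200, 0             # i is the fixed comparator expert
random.seed(0)

def Q(s):
    """Q(s) = s_i ln(1/beta) + ln sum_j beta^{s_j}."""
    return s[i] * math.log(1 / beta) + math.log(sum(beta ** sj for sj in s))

L = [0.0] * N                              # cumulated losses L_t
for t in range(n):
    f_t = [random.randint(0, 1) for _ in range(N)]
    y_t = random.randint(0, 1)
    w = [beta ** Lj for Lj in L]           # WMA weights at time t
    vote_one = sum(wj for wj, fj in zip(w, f_t) if fj == 1)
    p_t = 1 if vote_one >= sum(w) / 2 else 0
    ell = [int(fj != y_t) for fj in f_t]   # zero-one losses of the experts
    a_t = ell[i] * math.log(1 / beta) - int(p_t != y_t) * math.log(2 / (1 + beta))
    L_next = [Lj + lj for Lj, lj in zip(L, ell)]
    assert Q(L_next) <= Q(L) + a_t + 1e-12
    L = L_next
```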

Note that all the later results of the book can also be put into this form.

Exercise 4.8. (Set Cover Problem) (Arora et al., 2012) Let $X = \{1, 2, \ldots, N\}$ for some integer $N \ge 2$. A set system $\mathcal{S}' \subseteq 2^X$ is said to cover $X$ if $\cup_{A \in \mathcal{S}'} A = X$. In what follows, for brevity, and following common practice, we will just write $\cup\mathcal{S}'$ instead of $\cup_{A \in \mathcal{S}'} A$.

Consider a set system $\mathcal{S} \subseteq 2^X$ that covers $X$. We are interested in finding a minimum cover of $X$ from $\mathcal{S}$, that is,
$$\mathcal{S}^* = \mathop{\mathrm{argmin}}_{\mathcal{S}' \subseteq \mathcal{S},\, \cup\mathcal{S}' = X} |\mathcal{S}'|\,,$$
where $|\mathcal{S}'|$ denotes the cardinality of the set $\mathcal{S}'$. Note that the corresponding decision problem is known to be NP-hard.

The goal in this exercise is to show that there exists an efficient algorithm that finds a covering $\hat{\mathcal{S}} \subseteq \mathcal{S}$ whose size is not much larger than that of the optimal covering, $k \doteq |\mathcal{S}^*|$. In particular, we will consider the “greedy” algorithm that builds a covering $\hat{\mathcal{S}}$ incrementally by always selecting a set from $\mathcal{S}$ containing the most uncovered elements of $X$. For any $A \subseteq X$ let $\mathbb{1}_A$ denote the indicator vector of $A$, that is, the $N$-dimensional binary vector whose $i$th coordinate is $\mathbb{I}\{i \in A\}$. We can write the greedy algorithm as follows:

Set Cover Algorithm

Initialization: $w_{1,i} = 1$ for $1 \le i \le N$, $\hat{\mathcal{S}}_0 = \emptyset$. For $t = 1, 2, \ldots$:

(a) Select $A_t \in \mathop{\mathrm{argmax}}_{A \in \mathcal{S}} \langle \mathbb{1}_A, w_t \rangle$, breaking ties arbitrarily.

(b) Let $w_{t+1,i} = w_{t,i}(1 - \ell_{t,i})$ where $\ell_t = \mathbb{1}_{A_t}$.

(c) Set $\hat{\mathcal{S}}_t = \hat{\mathcal{S}}_{t-1} \cup \{A_t\}$.

(d) Return $\hat{\mathcal{S}}_t$ if $\hat{\mathcal{S}}_t$ covers $X$.


Note that this is indeed the classical “greedy” algorithm: $w_{t,i} = 1$ means that at the beginning of round $t$, element $i$ still needs to be covered. The algorithm is written in the above form to drive home the message that it is just an instance of the WMA algorithm.
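A direct implementation may help to see this correspondence (a Python sketch; representing the elements of $\mathcal{S}$ as Python sets and breaking argmax ties via `max` are our own choices):

```python
def greedy_set_cover(N, S):
    """Greedy set cover in the weight form above: w[i] == 1 iff element
    i + 1 of X = {1, ..., N} is still uncovered, so that <1_A, w> counts
    the uncovered elements of A. `S` must be a list of sets covering X."""
    w = [1] * N                            # w_{1,i} = 1 for all i
    cover = []                             # the covering built so far
    while sum(w) > 0:
        A = max(S, key=lambda B: sum(w[i - 1] for i in B))
        cover.append(A)
        for i in A:
            w[i - 1] = 0                   # w_{t+1,i} = w_{t,i}(1 - l_{t,i})
    return cover

print(greedy_set_cover(4, [{1, 2}, {3, 4}, {1, 3}, {2}]))  # [{1, 2}, {3, 4}]
```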

Assume that the above algorithm stops at time $T$ (i.e., $\hat{\mathcal{S}}_T$ covers $X$, but $\hat{\mathcal{S}}_{T-1}$ does not).

Using the proof technique that was used to analyze WMA, show that $T \le \lceil k \ln N \rceil$.

Hint: For any $t$ let $W_t = \sum_{i=1}^N w_{t,i}$, and for $1 \le t \le T$, let $p_{t,i} = w_{t,i}/W_t$. Show that for $1 \le t \le T$, $W_{t+1} = W_t(1 - \langle \ell_t, p_t \rangle)$.

Hint: Show that for any $1 \le t < T$, $\langle \ell_t, p_t \rangle \ge 1/k$. Note that $\langle \ell_t, p_t \rangle = \max_{A \in \mathcal{S}} \langle \mathbb{1}_A, p_t \rangle$ and use the fact that the maximum of a set of values upper bounds their average. To get the final bound, use the inequality $1 - x \le e^{-x}$.

Assuming reasonable complexity-theoretic conjectures, the $\ln N$-approximation result proved here is, up to constant factors, the best possible for this problem (Feige, 1998).

Exercise 4.9. (WMA with Non-Uniform Weights) Assume that WMA is initialized with non-uniform weights. Specifically, assume that the $w_{0,i}$ are positive numbers such that $\sum_{i=1}^N w_{0,i} = 1$ (i.e., $w_0 = (w_{0,i})_i$ lies in the $N$-dimensional probability simplex). The weights can be used to express prior beliefs about the quality of the experts.

(a) Prove a generalization of Theorem 4.2 for this case that lets you bound the loss of WMA against the $i$th expert, where $i$ is arbitrary. Interpret the bound that you obtain.

(b) Extend the results to the case when there are countably infinitely many experts.
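For part (a), only the initialization of WMA changes; the following sketch (ours) makes the role of the prior explicit. One then expects $\ln N$ in the bound of Theorem 4.2 to be replaced by a term like $\ln(1/w_{0,i})$, which is what the exercise asks you to make precise:

```python
def wma_with_prior(advice, outcomes, beta, w0):
    """WMA started from prior weights w0 (positive, summing to one);
    everything else is as in the uniformly initialized algorithm."""
    w = list(w0)
    mistakes = 0
    for f_t, y_t in zip(advice, outcomes):
        vote_one = sum(wi for wi, fi in zip(w, f_t) if fi == 1)
        p_t = 1 if vote_one >= sum(w) / 2 else 0
        mistakes += int(p_t != y_t)
        w = [wi * beta if fi != y_t else wi for wi, fi in zip(w, f_t)]
    return mistakes
```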


Chapter 5

Continuous Predictions and Convex Losses

Oftentimes predictions are continuous valued. In Chapter 1 we saw two problems of this type: predicting stock prices and predicting click-through rates. Other examples include predicting the pose of a robot's body given an initial pose and the torques applied, predicting the load on an electricity network, or predicting wind. In this chapter we consider the continuous analog of the previous chapter's problem. In particular, we will assume that the predictions lie in a subset $D$ of a $d$-dimensional Euclidean space. In the click-through rate prediction problem, $D$ would be the $[0,1]$ interval; in the stock-price prediction problem with $d$ stocks we would choose $D = (0,\infty)^d$, etc.

With this, the interaction between the environment and the forecaster happens as follows: in each round $t$, before it chooses the prediction $p_t \in D$, the forecaster observes the predictions $f_{t,1}, f_{t,2}, \ldots, f_{t,N} \in D$ of $N$ experts. The predictions are evaluated by a loss function $\ell_t: D \to \mathbb{R}$, chosen by the environment from a set $\mathcal{L}$ of loss functions, which is revealed to the forecaster after it has chosen its prediction. This then repeats in round $t+1$, etc. As in the discrete case, the goal of the forecaster is still to produce predictions whose long-term cumulated loss is comparable to that of the best expert in hindsight.


Continuous Prediction with Expert Advice (PEA)

In round $t = 1, 2, \ldots$:

• Receive the predictions $f_{t,1}, f_{t,2}, \ldots, f_{t,N} \in D$ of the experts.

• Predict $p_t \in D$ based on the received predictions.

• Receive a convex loss function $\ell_t: D \to \mathbb{R}$ belonging to $\mathcal{L}$.

• Incur the loss $\ell_t(p_t)$.

The goal of the forecaster: minimize the regret
$$R_n = \sum_{t=1}^n \ell_t(p_t) - \min_{1 \le i \le N} \sum_{t=1}^n \ell_t(f_{t,i}) = \hat{L}_n - \min_{1 \le i \le N} L_{n,i}\,.$$
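To fix ideas, here is a schematic of this interaction on $D = [0,1]$ (a sketch with made-up names; the exponentially weighted average used to aggregate the experts is only one natural choice, not something the protocol prescribes):

```python
import numpy as np

def continuous_pea(experts, losses, n, eta=0.5):
    """Run n rounds of continuous PEA on D = [0, 1] and return the regret.
    `experts[i]` maps the round index to a prediction in D; `losses[t]` is
    the convex loss function revealed at the end of round t."""
    N = len(experts)
    cum = np.zeros(N)                      # cumulated losses of the experts
    forecaster_loss = 0.0
    for t in range(n):
        f_t = np.array([f(t) for f in experts])   # experts' predictions
        w = np.exp(-eta * cum)                     # one possible aggregation
        p_t = float(np.dot(w, f_t) / w.sum())      # forecaster's prediction
        ell = losses[t]                            # loss revealed after p_t
        forecaster_loss += ell(p_t)
        cum += np.array([ell(fi) for fi in f_t])
    return forecaster_loss - cum.min()             # the regret R_n

# Two constant experts, squared loss against a slowly drifting target.
experts = [lambda t: 0.2, lambda t: 0.8]
losses = [(lambda s: (lambda p: (p - (0.3 + 0.001 * s)) ** 2))(t) for t in range(100)]
print(continuous_pea(experts, losses, n=100))
```

Since $p_t$ is a convex combination of the experts' predictions, it automatically lies in $D$ whenever $D$ is convex.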

It is assumed that $D \subset \mathbb{R}^d$ is convex and $\mathcal{L}$ contains only convex functions. Recall that a set is convex if the points of the segment connecting any two points of the set also belong to it, while a real-valued function $f$ with domain $D \subset \mathbb{R}^d$ is convex if the epigraph $\mathrm{epi}(f) = \{(x, y) : y \ge f(x),\, x \in D\}$ of $f$ is convex. (In words, $\mathrm{epi}(f)$ is the set of points $(x, y)$ that lie on or above the function's graph, $\mathrm{graph}(f) = \{(x, f(x)) : x \in D\}$. Clearly, the domain of any convex function has to be convex, and this definition of convexity is equivalent to the more widely used definition that for any two points $x, y$ in the domain of $f$ and any $0 < \alpha < 1$, $f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1-\alpha)f(y)$.) The convexity assumptions made will play a crucial role in this and later chapters. In Chapter 12 we will review the basic definitions and some key ideas in convex analysis.