Online learning: Algorithms for Big Data

András György, Dávid Pál, Csaba Szepesvári

Branch: master @ 963d4dd

Head tags: No tags

Date: 2018-07-27 16:59:19 +0100

Last changed by: Szepi


Contents

I Introduction

1 Introduction
1.1 Minimax Strategies and Learning
1.1.1 Loss Minimization Games
1.1.2 Regret
1.1.3 Unknown Horizons and Other Forms of Adaptivity
1.1.4 Generalizations
1.2 Examples
1.2.1 Prediction of Stock Prices
1.2.2 Click-Through Rate Prediction
1.2.3 Maximizing Click-Through Rates Interactively
1.2.4 Network Routing
1.2.5 Compression
1.2.6 Supervised Learning
1.2.7 Unsupervised Learning
1.3 Summary
1.4 Bibliographic References
1.5 Exercises

2 Shoot High, Aim Low
2.1 Follow the Leader Algorithm
2.2 Lower Bound for the Minimax Expected Regret
2.3 Summary
2.4 Bibliographic Remarks
2.5 Exercises

3 Full Information Regret Minimization Games
3.1 Strategies
3.2 Worst-Case Regret and Minimax Regret
3.3 Deterministic vs. Randomized Strategies
3.4 Deterministic Strategies and the Minimax Regret
3.5 Oblivious and Stochastic Environments
3.6 Adaptive Environments and Independently Randomizing Forecasters
3.7 Summary
3.8 Bibliographic Remarks


3.9 Exercises

II Expert Framework

4 Mistake Bounds for the Zero-One Loss
4.1 The Halving Algorithm
4.2 The Weighted-Majority Algorithm
4.3 Summary
4.4 Bibliographic Remarks
4.5 Exercises

5 Continuous Predictions and Convex Losses
5.1 The Exponential Weights Algorithm
5.2 Probability Prediction under the Log-loss
5.3 Exp-Concave Loss Functions
5.4 The Prod Algorithm
5.5 Summary
5.6 Bibliographic Remarks
5.7 Exercises

6 Randomized EWA
6.1 Randomized Exponential Weights Algorithm
6.2 High Probability Bound
6.3 Exercises

7 Lower Bounds for Prediction with Expert Advice
7.1 Optimal Asymptotic Lower Bound
7.2 A Non-Asymptotic Lower Bound
7.3 Bibliographic Remarks
7.4 Exercises

8 Follow the Perturbed Leader
8.1 Bibliographic Remarks
8.2 Exercises

9 Prediction with Many Experts
9.1 Linear Combinatorial Prediction Problems and FPL
9.2 Time Varying Environments
9.3 Log-loss and Tracking
9.3.1 Specialist Framework
9.4 Bibliographic Remarks
9.5 Exercises


10 Tracking
10.1 The Problem of Tracking
10.2 Fixed-Share Forecaster
10.3 Analysis
10.4 Variable-Share Forecaster
10.5 Exercises

11 Applications of the Experts Framework: Boosting
11.1 Boosting
11.2 Summary
11.3 Bibliographic Notes
11.4 Exercises

III Online Convex Optimization

12 Background on Convexity
12.1 Convexity
12.2 Subdifferentials
12.3 First-Order Optimality Conditions for Convex Minimization Problems
12.4 Duality
12.5 Norms, Dual Norms and the Polar Transformation
12.6 Degrees of Convexity
12.7 A Characterization of the Set of Convex, Differentiable Functions
12.8 Algorithmic Ideas From Convex Optimization
12.9 Summary
12.10 Bibliographical Notes
12.11 Exercises

13 Continuous Exponential Weights
13.1 Continuous Exponential Weights Algorithm
13.2 Regret in the Non-convex Case
13.2.1 Unbounded Domains
13.3 Regret Bounds for Convex Problems
13.3.1 Regret Bounds for Exp-Concave Losses
13.4 Binary Prediction with the Log-loss
13.5 An Application to the Stock Market
13.6 Bibliographic Remarks
13.7 Exercises

14 Follow the Regularized Leader
14.1 Online Convex Optimization
14.1.1 Linearization
14.2 The Follow the Regularized Leader Algorithm
14.3 Analysis of FTRL


14.3.1 Strongly Convex Regularizer
14.4 Legendre Functions and Bregman Projections
14.4.1 Implementation of FTRL with Legendre Regularizer
14.5 Analysis using Convex Conjugates
14.6 Exercises

15 Mirror Descent Algorithm
15.1 Implementation with Conjugates and Bregman Projections
15.2 Regret Bounds for Mirror Descent
15.3 Strongly Convex Regularizers
15.4 Strongly Convex Losses
15.5 Inexact Implementation
15.6 Exercises
15.7 Bibliographic Notes

16 Lower Bounds for Online Convex Optimization
16.1 Minimax Regret for Linear Losses

17 Linear Classification
17.1 Convex Upper Bounds for the Zero-One Loss
17.2 Online Support Vector Machine
17.3 The Perceptron Algorithm
17.4 The p-norm Perceptron
17.5 Matching Loss Functions
17.6 Pegasos Algorithm
17.7 Exercises

18 Online Learnability of Classification Problems
18.1 Realizable Setting
18.2 The Non-Realizable Setting
18.3 Exercises

19 Linear Least-Squares Predictions
19.1 Potential Approaches
19.2 Analysis
19.3 An Improved Forecaster
19.4 Ridge Regression with Projections
19.4.1 Analysis of Regret
19.5 Directional Strong Convexity
19.6 Exercises

20 Exp-Concave Functions
20.1 Exercises


21 p-Norm Regularizers and Legendre Duals
21.1 Legendre Dual
21.2 p-Norms and Norm-Like Divergences
21.3 Regret for Various Regularizers
21.4 Exercises

22 Exponentiated Gradient Algorithm
22.1 Exercises

23 Connections to Statistical Learning Theory
23.1 Goals of Statistical Learning Theory
23.2 Online-To-Batch Conversions
23.3 High-Probability Bounds for Averaging

24 Tracking

IV Partial Monitoring

25 Multi-Armed Bandit Problem
25.1 Exponential-Weight Algorithm for Exploration and Exploitation
25.2 Bound on the Expected Regret
25.3 Old Stuff
25.4 Exp3-γ Algorithm
25.5 A High Probability Bound for the Exp3.P Algorithm

26 Lower Bounds for Bandits
26.1 Proof of Theorem 26.1
26.2 Summary
26.3 Bibliographic Remarks
26.4 Exercises

27 Exp3-γ As FTRL
27.1 Black-Box Use of Full-Information Algorithms
27.2 Analysis of Exp3-γ
27.2.1 Local Norms
27.3 Avoiding Local Norms
27.3.1 Relaxing the Nonnegativity of Losses
27.3.2 An Alternative Method
27.4 Exercises

28 Notation


V Appendices

A Real Analysis

B Additional Results in Convex Analysis

C A Minimal Subset of Probability Theory
C.1 An Intuitive Approach to Probability Theory
C.2 Various Definitions
C.3 Limit Theorems for I.i.d. Random Variables
C.4 Hoeffding's Inequality
C.5 Martingales and Azuma's Inequality
C.6 Shortest Path Problem
C.7 Probabilistic Method
C.8 Bibliographical References

D Miscellanea
D.1 Application to Learning to Combine Many Features

E Game Theory


Part I

Introduction


This part introduces the framework used in the book, provides some illustrative examples and some technical results that will be used later.

In particular, the first chapter introduces the online learning problem, together with several illustrative examples. We adopt a game-theoretic view of learning: A forecaster and an environment play a zero-sum sequential game, where the payoff to the forecaster is defined as its negated regret. A learning algorithm is then a strategy for the forecaster and a good learning algorithm is a computationally efficient strategy which is minimax optimal, or near-minimax optimal. The implications of the game theoretic viewpoint are discussed, together with the choices we made and some variants of the basic framework.

The second chapter is an illustration of how the definitions of the first chapter can be used to design learning algorithms. This chapter considers a deliberately oversimplified setting to keep the derivations short. As we will see, the tools, techniques, results and reasoning of this chapter will reappear later in more interesting learning problems.

The third chapter introduces the game-theoretic framework in a formal way with "all the gory details". The purpose of this chapter is to provide a solid foundation. We also give some general results of independent interest, concerning the power of various classes of strategies. Questions investigated include whether randomization helps a forecaster, or whether the worst-case environments should be reactive. This chapter can be skipped on a first reading or just skimmed through briefly. The reader can then come back to this chapter later if and when the results presented here are used for the first time.


Chapter 1 Introduction

The online prediction problem can be formulated as finding a good strategy for a forecaster participating in a two-player sequential game. We will call this game a Regret Minimization Game and we call the two players the environment and the forecaster. We will interchangeably use some other names for the two players.

The environment will often be called nature or an adversary, while we may call the forecaster an algorithm or a learner.

A Regret Minimization Game is a triple (D, L, n), where D is a non-empty decision set, L is a set of real-valued functions defined over D, and n is the number of rounds of the game.

The elements of D are called predictions and the elements of L are called loss functions. The game proceeds in n rounds. In each round t = 1, 2, . . . , n, the forecaster chooses a prediction p_t from the set D and simultaneously the environment chooses a loss function ℓ_t from L. At the end of the round, the loss function ℓ_t is revealed to the forecaster and the decision p_t is revealed to the environment.

The payoff to the forecaster is −R_n, where
\[
R_n \;=\; \sum_{t=1}^{n} \ell_t(p_t) \;-\; \inf_{u \in D} \sum_{t=1}^{n} \ell_t(u) \qquad (1.1)
\]

is the n-step cumulated regret. The payoff to the environment is R_n, that is, the game is zero-sum: whatever the environment wins, the forecaster loses. Clearly, for any given sequence of loss functions ℓ_1, ℓ_2, . . . , ℓ_n, minimizing the regret R_n is the same as minimizing the cumulated loss

\[
\widehat{L}_n \;=\; \sum_{t=1}^{n} \ell_t(p_t) .
\]

However, since the sequence (ℓ_t) is not known ahead of time, a forecaster interested in keeping its total loss small does not have the option to minimize its total loss directly. One option then is to minimize the worst-case (or minimax) regret, treating the environment adversarially. The relation between loss and regret minimization is further discussed in Section 1.1.1.


Regret Minimization Game (D, L, n)

In each round t = 1, 2, . . . , n:

• Forecaster chooses a prediction p_t ∈ D.

• Environment, simultaneously, chooses a loss function ℓ_t ∈ L.

• Forecaster receives the function ℓ_t, while the environment receives p_t.

The payoff to the forecaster and the environment after n rounds is −R_n and R_n, respectively, where R_n, the n-step regret, is defined by (1.1).
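To make the protocol above concrete, the following minimal Python sketch (not from the book; all identifiers and the uniformly random players are placeholders invented for illustration) plays one Regret Minimization Game with a finite decision set and returns the regret defined in (1.1).

```python
import random

def play_game(D, L, n, forecaster, environment):
    """Play a Regret Minimization Game (D, L, n) and return the regret R_n.

    forecaster(history) returns a prediction p_t in D; environment(history)
    returns a loss function l_t from L.  Both see only the past rounds, which
    mimics the simultaneous choices of the protocol.
    """
    history = []        # list of revealed (p_t, l_t) pairs
    total_loss = 0.0    # the forecaster's cumulated loss
    for _ in range(n):
        p_t = forecaster(history)    # forecaster moves ...
        l_t = environment(history)   # ... environment moves at the same time
        total_loss += l_t(p_t)
        history.append((p_t, l_t))   # both choices are revealed
    # cumulated loss of the best fixed prediction in hindsight
    best_fixed = min(sum(l(u) for _, l in history) for u in D)
    return total_loss - best_fixed   # R_n as in (1.1)

# Example: matching pennies with uniformly random players.
D = [0, 1]
L = [lambda u, y=y: float(u != y) for y in (0, 1)]
print(play_game(D, L, n=100,
                forecaster=lambda h: random.choice(D),
                environment=lambda h: random.choice(L)))
```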

In game theory, both players are equally important. However, our main goal in this book is to design strategies for the forecaster, and so we will mostly study games from the forecaster’s perspective. When designing a forecaster-strategy, we do not want to assume anything about the sequence of losses seen by the forecaster. In fact, we are interested in forecaster strategies that achieve the best possible payoff even against the worst possible environment. Thus, unlike in statistical learning theory, we make no probabilistic assumptions about the sequence of loss functions seen by the forecaster.

1.1 Minimax Strategies and Learning

If the environment’s strategy was known, the best choice for the forecaster’s strategy would be the one that minimizes the forecaster’s total loss against that strategy. However, the key characteristics of a learning problem is that the environment’s strategy is not known when the game starts. What makes a reasonable forecaster strategy then? A minimax strategy is guaranteed to achieve the best possible payoff that can be guaranteed against any environment. In the lack of further a priori knowledge of the environment’s strategy, a forecaster can adapt a minimax strategy. At times, the minimax strategy is hard to compute.

In those cases we will be interested in finding near-minimax strategies, that is, strategies whose payoff at the end is close to that of a minimax strategy.1 A (near-)minimax strategy will guarantee a small regret even against the worst possible environment.

In a nutshell, online learning is the art of designing computationally efficient near-minimax strategies for various instances of the Regret Minimization Game.

Why is the payoff pair defined using the regret? How about simply considering the cumulated loss? What is a regret bound, and what is its meaning? What if n, the horizon of the game, is unknown, that is, when learning is ongoing? Can the minimax (or near-minimax) strategy depend on n? Could the (near-)minimax strategies be overly conservative and thus perform poorly in practice when the environment exhibits strong regularities (i.e., when the environment deviates from its minimax optimal strategy)?

1 In this book, we will see some examples where the minimax strategy can in fact be computed efficiently, but this is more the exception than the rule.


In addition to illustrating the concepts developed so far with some specific examples, answering these questions will make up the rest of the chapter.

1.1.1 Loss Minimization Games

As mentioned before, minimizing the regret is equivalent to minimizing the cumulated loss for a fixed sequence of loss functions. However, when designing a general forecaster strategy (with no assumptions on the environment), a game-theoretic approach to optimizing the cumulated loss is a dead end.

To see why, consider for a moment the game where the payoffs are defined with the cumulated loss, that is, the payoff of the forecaster is −L̂_n, while the payoff of the environment is L̂_n. Let us call the resulting games loss-minimization games. What are good strategies for the forecaster in loss-minimization games? The first question is whether there exists a strategy that uniformly dominates any other strategy.2 For most choices of (D, L), the answer is no, though this is not surprising: if there were a dominating strategy for some game, this would mean that there would exist an overall best prediction independently of the choice of the loss functions (cf. Exercises 1.2 and 1.3), making the underlying learning problem dull.

The next question is whether minimax strategies are more interesting. As it happens, even for quite sensible games, the minimax strategies can be utterly unreasonable for loss-minimization games. Consider for example the so-called matching pennies problem, where D = {0, 1} and L = {ℓ^(0), ℓ^(1)} with ℓ^(y)(u) = I{y ≠ u}. The symbol I here denotes the indicator function; I{X} equals 1 whenever the predicate X inside the indicator is true, and it equals 0 otherwise.

If the environment tends to use one of the loss functions more often than the other, we would expect a sensible forecaster strategy to pick up this pattern and start to predict the corresponding bit. Yet, as Exercise 1.5 asks you to show, there are minimax strategies for loss-minimization games over (D, L) that fail to do this. In fact, one minimax strategy for this game is to select between the two decisions uniformly at random, regardless of what the past observations have been! Thus, the minimax strategy for loss minimization fails to capture what we expect from an algorithm that "learns".
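The following hypothetical Python snippet (identifiers and numbers are ours) illustrates the point on a very regular loss sequence: a uniformly random forecaster keeps its expected loss at 1/2 per round, just as the minimax loss-minimizing strategy would, yet its regret grows linearly, while a simple pattern-seeking forecaster keeps the regret negligible. The next paragraphs explain why moving from loss to regret rewards exactly this pattern-seeking behavior.

```python
import random

n = 1000
ys = [0] * n                        # a very regular environment: y_t = 0 every round
loss = lambda u, y: float(u != y)   # matching-pennies loss l^(y)(u) = I{y != u}

rand_loss = pattern_loss = 0.0
for t, y in enumerate(ys):
    rand_loss += loss(random.choice([0, 1]), y)          # uniformly random forecaster
    # pattern-seeking forecaster: predict the bit that was best on past rounds
    p = 0 if ys[:t].count(0) >= ys[:t].count(1) else 1
    pattern_loss += loss(p, y)

best_fixed = min(sum(loss(u, y) for y in ys) for u in (0, 1))   # = 0 here
print("regret of random play:   ", rand_loss - best_fixed)      # about n/2
print("regret of pattern seeker:", pattern_loss - best_fixed)   # 0 on this sequence
```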

The problem with the cumulated loss is that no matter what the forecaster’s strategy is, the environment can force a large cumulated loss on the forecaster by just playing randomly.

As a result, the forecaster might as well resort to a random strategy, since the bar is already quite low. To fix this, one should change the payoff structure so as to "raise the bar": the environment should be limited in how low a payoff it can force on the forecaster (e.g., by playing randomly), while, on loss sequences that exhibit some pattern, random play should earn the forecaster a significantly lower payoff than a clever strategy that looks for those patterns (see Exercise 1.4 for details). Changing the forecaster's payoff to the negative of its regret is one way of achieving this goal.

2 Strategy A is said to dominate strategy A′ when, for all environments, the expected payoff of A is at least as large as that of A′. A uniformly dominating strategy dominates all the other strategies.


1.1.2 Regret

The idea underlying the concept of regret is simple. For a prediction u ∈ D, define its cumulated loss in n rounds as
\[
L_n(u) \;=\; \sum_{t=1}^{n} \ell_t(u) .
\]

As a modest goal, we expect a good learning algorithm to perform almost as well as the best fixed prediction in hindsight, which is exactly what is captured by the regret:
\[
R_n \;=\; \widehat{L}_n \;-\; \inf_{u \in D} L_n(u) .
\]

Note that if the losses are shifted by a constant, the regret is unaffected. If we further restrict the loss values to span some fixed interval such as [0, 1], the regret can be meaningfully used to compare different choices for (D, L). The sequential game over the pair (D, L), where the payoff for the forecaster is −R_n and the payoff for the environment is R_n, is called the regret minimization game over (D, L) and horizon n.

Note that in the regret definition, the forecaster's predictions do not change the second term. Thus, the forecaster's goal is still to minimize its cumulated loss. What the second term does, however, is penalize the environment for choosing loss functions where the best prediction in hindsight has a large cumulated loss; as a result, the environment is limited in how much it can control the forecaster's payoff. The forecaster is then forced to use a pattern-seeking strategy, since a random strategy on a loss sequence where the best prediction in hindsight achieves a small loss would result in a payoff that is too low (because then the first term will be large, while the second term will be small). This makes the (near-)minimax strategies of the forecaster meaningful learning strategies, while such strategies for the environment will still be simple randomizing strategies. The latter should not be entirely surprising, as random environments are intuitively the least predictable.

When the losses lie in a bounded interval, say, [0,1], the range of the regret is [−n, n].

If the average per-round regret R_n/n approaches zero as n → ∞, the forecaster's per-round average loss, L̂_n/n, approaches the best per-round average loss inf_{u∈D} L_n(u)/n, in which case one can say that, in the long run, the forecaster does as well as the best fixed prediction in hindsight. Algorithms with this property are said to have zero regret. Alternatively, they are called no-regret algorithms. An equivalent way of stating that an algorithm is zero-regret is that R_n grows sublinearly; symbolically, R_n = o(n).

For a given pair (D, L) and integer n > 0, the minimax value of the Regret Minimization Game G = G(D, L, n) with decision set D and loss set L is defined as the regret of the minimax forecaster against the worst environment over a horizon of n steps, that is, the smallest regret achievable by any forecaster, in n rounds, against the environment that is the worst for that strategy. The minimax value of G tells us the difficulty of the game G. At a deeper level, the goal of studying online learning is to understand how the choice of D and L influences the difficulty of learning. Let V_n = V_n(D, L) denote this minimax value. As noted earlier, it is quite rare that a computationally efficient strategy achieving V_n is known. As a second-best goal, one then resorts to designing computationally efficient strategies whose regret is comparable to V_n. Of particular interest are strategies that are within a constant factor of V_n.
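Anticipating the formal treatment of strategies and game values in Chapter 3, the minimax value can be written informally as
\[
V_n(D, L) \;=\; \inf_{\text{forecaster strategies}} \;\; \sup_{\text{environment strategies}} \; R_n ,
\]
where R_n is the regret that results when the two strategies are played against each other for n rounds.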


A regret bound is an upper bound on the regret of a forecaster in a particular environment.

To evaluate how good the forecaster is, we can compare the bound obtained on its regret to V_n. If the regret bound is within a (hopefully small) constant factor of V_n, the forecaster is deemed to be "near-optimal". It is important to realize that a larger-than-ideal regret bound does not necessarily mean that the forecaster is "bad," as the bound itself might be loose. Hence, when the bound is large, the next question is whether the bound is tight, and if this is the case, we may look for a better algorithm for the particular problem instance.

Nevertheless, we can always expect a regret bound to be a non-decreasing function of the time horizon, since the worst-case regret of any algorithm, as well as the minimax regret, can be shown to be non-decreasing functions of the time horizon (see Exercise 1.9). Occasionally, there exist no efficient algorithms that are near-optimal. Figuring out the best regret that can be achieved with bounded computational resources is an active area of research.

As we will need it later, we also define R_n(u), the regret with respect to some fixed prediction u ∈ D:
\[
R_n(u) \;=\; \widehat{L}_n - L_n(u) .
\]

We will use the notation L̂_n, L_n(u), R_n and R_n(u) throughout the whole book.

1.1.3 Unknown Horizons and Other Forms of Adaptivity

A strategy that does not use the horizon as input and can be applied in games of any length is called an anytime strategy.3 Quite often, the minimax strategy will depend on n, the number of interactions. However, in practice, learning can be an “ongoing” process and thus we are not given the value of n.

One immediate difficulty when considering such an ongoing learning process is that there is no guarantee that there exists a single strategy that is simultaneously optimal for each game G_n = G(D, L, n) (cf. Exercise 1.11). Thus, we resort to the second-best goal and look for anytime strategies whose worst-case regret is within a constant multiple of the minimax value of G_n on any horizon n. When a strategy A achieves this goal, we will say that A "adapts" to the unknown horizon n.

In fact, if one is given a sequence of strategies (A_n) such that for each n, the regret of A_n is below B(m) for m = 1, . . . , n for some non-decreasing function B(·), then it is usually not hard to design a strategy which "merges" these strategies into a single anytime strategy A in such a manner that the regret of A for any n is at most a constant multiple of B(n) (note that the assumption that B(m) is non-decreasing is natural, as the worst-case regret of any algorithm, as well as the minimax regret, are non-decreasing functions of the time horizon, see Exercise 1.9). Thus, under appropriate conditions, if B(n)/V_n(G_n) ≤ C for some C > 0, then the worst-case regret of A for any n will also be within a factor C′ > C of V_n(G_n).

The trick is to partition the time into a sequence of intervals (T_k) of increasing length and use the strategy A_{|T_k|}, started at the beginning of the interval, for all time steps t when t ∈ T_k (here |T_k| denotes the length of the interval T_k). This method of merging (A_n) into a single anytime strategy A is called the "doubling trick," where the adjective "doubling" comes from

3 Our use of the term "anytime" is slightly different from the standard use in Artificial Intelligence, where anytime denotes an algorithm which can be stopped at any time and will still return some result. Usually, it is expected that the result "improves" if the algorithm is given more time.


that in a typical scenario the cardinality of T_k is doubled as the index k is incremented. See Exercise 1.12 for details. Since at the beginning of each interval T_k, A starts a new algorithm from scratch, it essentially forgets all the data seen beforehand, and therefore the regret of A exhibits an unpleasant jump after each interval. Although these bumps are guaranteed to be small compared to the regret already suffered, there is still a strong desire to develop anytime strategies that do not need to forget data and whose regret thus behaves in a smoother fashion.
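The construction just described can be sketched in Python as follows (not from the book; the per-horizon strategy interface, make_strategy(m) returning an object with predict() and update(loss_fn), is an assumption made purely for the illustration).

```python
def run_with_doubling(make_strategy, environment, n):
    """Play n rounds anytime-style via the doubling trick.

    make_strategy(m): returns a fresh strategy object tuned for an m-round
        game, exposing predict() and update(loss_fn) (an invented interface).
    environment(t): returns the loss function of round t.
    Interval T_k has length 2**k, so |T_k| doubles as k is incremented.
    """
    losses = []
    k, t = 0, 0
    while t < n:
        horizon = 2 ** k                    # length of interval T_k
        strategy = make_strategy(horizon)   # restart from scratch on T_k
        for _ in range(min(horizon, n - t)):
            p_t = strategy.predict()
            loss_fn = environment(t)
            losses.append(loss_fn(p_t))
            strategy.update(loss_fn)        # reveal l_t to the strategy
            t += 1
        k += 1
    return sum(losses)                      # the forecaster's cumulated loss
```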

A related concern is the following: if the designer of the forecaster believes that the environment can select losses from a set L, while the environment will only select losses from a "much smaller" set L′ ⊂ L, would a minimax (or near-minimax) forecaster for (D, L) lose much compared to a minimax forecaster designed to work for (D, L′)? Notice that V_n(D, L′) could be much smaller than V_n(D, L). More generally, given a parametric collection (L_θ)_θ of loss sets, we may ask whether there exists a forecaster that is simultaneously near-minimax for all games (D, L_θ). Such a forecaster is said to be able to adapt to the family (L_θ)_θ. We will see several examples of forecasters that exhibit such adaptivity properties.

1.1.4 Generalizations

In the definition of regret, the performance is compared to that of the best prediction in hindsight. This best prediction is defined with the help of the set D, which is also the domain of the loss functions. At first sight, one might think that this is a serious limitation.

However, in many examples the decision set D itself contains complicated elements (such as functions or algorithms). Thus, even though previously we said that the goal of competing with a single element of D is modest, in fact this modest goal may result in very powerful learning algorithms. Nevertheless, there are times when it is better to separate the domain of the loss functions and the set of competitors.

One such situation is when we consider environments which are "inherently nonstationary" in the sense that we expect no single element of D to be able to achieve a small cumulated loss. In such a case, it makes more sense to modify the regret so that the goal becomes to compete with sequences of decisions:
\[
R_n \;=\; \widehat{L}_n \;-\; \min_{u_{1:n} \in S} \sum_{t=1}^{n} \ell_t(u_t) . \qquad (1.2)
\]

Here u_{1:n} = (u_1, . . . , u_n), where u_i ∈ D, while S ⊂ D^n is a subset of sequences of predictions of length n: the set S should contain those sequences which are expected to achieve a small cumulated loss. The regret defined by (1.2) is called the tracking regret. Similarly to the previous case, one can then study the resulting game, though, naturally, the answers will change. Of particular interest is the issue of adaptivity to structures defined using parametric sequence sets (S_θ)_θ, where θ could describe the complexity of the sequences in S_θ, such as the number of times the sequences in S_θ are allowed to change.

Besides the tracking regret, there are a number of other interesting regret concepts, such as the interval regret (closely related to the tracking regret), the internal regret, the policy regret and others.

Another direction in which to generalize the framework is to reconsider what information is available to each of the players in each round.


The first (slight) departure from the above model is that in some applications, in every round, some information about the loss function for that round may be revealed to the forecaster before the forecaster has to make a decision.

As we will see, this is the case when supervised learning problems are considered. Naturally, this can only make the job of the forecaster easier (cf. Exercise 1.17). Going in the opposite direction, sometimes only partial information about the loss function may be revealed to the forecaster. The problem is most interesting when the information revealed depends on the choice made by the forecaster. Such problems, known as partial monitoring problems, will be studied in Part IV. The crux of these problems is that, when selecting a decision, the forecaster has to consider not only the losses but also how much information will be revealed as a result, leading to the so-called exploration-exploitation dilemma. A special case is when the information revealed is the loss ℓ_t(p_t), leading to what are called bandit problems in the literature.

1.2 Examples

Many practical problems can be naturally phrased as Regret Minimization Games. In the rest of this chapter, we will look at some examples. Further examples are discussed in Exercises 1.20 to 1.25.

1.2.1 Prediction of Stock Prices

Assume that every day, in the morning, we are given information about the stock of a particular company. Our goal will be to predict the closing price of the stock on that day. For this purpose we can use, among other things, the past prices of the selected stock, past prices of stocks of companies from the same industry, financial data about the company, and news articles about and press releases of the company. The prediction error will be measured as the squared difference between the predicted and actual price (Exercise 1.13 deals with what changes if the prediction error is measured differently).

Assume that the information for day t (t ∈ N) is encoded as a vector x_t ∈ [−1, 1]^d, describing the "features" of the day, and let y_t ∈ [0, y_max) denote the closing price to be predicted for the same day, where y_max > 0 is the largest possible price.

One simple way of predicting y_t is to compute a linear combination of the components of x_t using some real-valued weights p ∈ R^d:
\[
\hat{y}_t \;=\; \langle x_t, p \rangle \;=\; \sum_{i=1}^{d} p_i x_{t,i} .
\]

The Regret Minimization Game is then set up by taking D to be some subset of R^d that is expected to contain a weight vector which makes the predictions accurate. One option is to choose D = R^d, but if one has some a priori information it may be better to restrict D to a smaller subset of R^d (naturally, it is easier to be competitive with a smaller set of predictors). The set of loss functions L is taken to be the set of functions ℓ : D → R of the form ℓ(p) = (y − ⟨x, p⟩)^2, where y ∈ [0, y_max] and x ∈ [−1, 1]^d. With this, the loss function picked by the environment in round t is ℓ_t(p) = (y_t − ⟨x_t, p⟩)^2.
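As a small illustrative sketch (the function name and the numbers below are made up, not from the text), the loss function ℓ_t handed to the forecaster can be built directly from the day's data (x_t, y_t):

```python
import numpy as np

def make_squared_loss(x, y):
    """Return the loss function l(p) = (y - <x, p>)^2 induced by a data pair."""
    return lambda p: (y - np.dot(x, p)) ** 2

# One round of the stock-price game with d = 3 features (made-up numbers):
x_t = np.array([0.2, -0.5, 1.0])    # features of the day, in [-1, 1]^3
y_t = 12.3                          # closing price to be predicted
loss_t = make_squared_loss(x_t, y_t)

p = np.array([10.0, 0.0, 2.0])      # a candidate weight vector in D
y_hat = np.dot(x_t, p)              # linear prediction <x_t, p>
print(y_hat, loss_t(p))             # the prediction and its squared error
```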


Presenting the forecaster with ℓ_t instead of (x_t, y_t) is essentially the same, as the forecaster should only care about its prediction loss (Exercise 1.15 explores this question). Thus, it is the same to say that (x_t, y_t) is chosen by the environment or that the environment picks ℓ_t ∈ L.

We can give a specific meaning to achieving low regret in this setting: if a forecaster has low regret for this problem, then on average the accuracy of its predictions is close to the best accuracy that would have been possible if all the prices were known ahead of time but the predictions were restricted to be made using a linear predictor. The regret measures the difference between the forecaster's cumulated loss and the cumulated loss of the best linear predictor in hindsight. Note that this problem is an instance of an online learning problem where some information about the loss function, namely the feature vector x_t, is revealed before the forecaster needs to make a decision.

If we disregard the details of where (x_t, y_t) are coming from, we arrive at the so-called online linear regression problem. The adjective "linear" comes from noting that for any fixed x the mapping p ↦ ⟨x, p⟩ is linear; predicting with such mappings is called linear prediction and the map itself is called a linear predictor. An attractive feature of linear prediction is that it is simple. Since computing a single prediction costs O(d) floating point operations, prediction is cheap, at least when d, the number of features (i.e., the dimension of x_t), is not "large." Moreover, computing the prediction can be sped up if the feature vectors are "sparse," that is, if they have many zeroes. Further, learning algorithms that use linear predictors are often low cost, too (i.e., many algorithms require O(d) floating point operations per round). On the other hand, at least at first sight, the requirement that the features be linearly combined may look like a serious limitation. Despite this, usually after tinkering with what features to use, linear prediction can be made quite powerful. Since in practice it is often hard to know a priori which features to use, it is of significant interest to design algorithms that can tolerate a large number of irrelevant features. Online linear regression will be used throughout the book to motivate the development of several algorithms, while Chapter 19 is entirely devoted to studying this problem. A key feature of this problem is that the losses ℓ ∈ L are convex. This is explored in Exercise 1.18.

1.2.2 Click-Through Rate Prediction

Large web sites are often successful because they optimize what content they show to their users based on how often people click on certain pieces of content (news stories, links, advertisement, search results, etc.). When a user goes to the web site, his browser sends a request to a web server, which then sends back the content to the user’s browser. In the background, an automated system predicts, for the given user and a piece of content, the probability of a click, called the click-through rate.

Suppose that the automated system receives a stream of requests x_1, x_2, . . .. Each request encodes a user-content pair, e.g., the geographic location of the user, type of browser used, previous interaction with the website, type of content delivered. The goal of the automated system is to predict the click-through rate for each query. After the prediction is made, the system receives binary feedback, say y_t = +1 if the user clicked and y_t = −1 if the user did not click on the content shown. To maximize the revenue, it makes sense to predict the probability that the user will click on the presented content.


Figure 1.1: The logistic function, σ(z) = 1/(1 + e^{−z}).

We assume that a query x_t is represented as a vector in (say) [−1, 1]^d, just like in the previous section. One common way of predicting the click-through probabilities then is to choose a weight vector p_t ∈ R^d and predict ŷ_t = σ(⟨p_t, x_t⟩), where σ(z) = 1/(1 + e^{−z}) is the logistic function (see Fig. 1.1).

The logistic function arises once the prediction problem is transformed into predicting odds instead of predicting probabilities. Recall that for some event E that happens with probability q, the odds in favor of E are the ratio q/(1 − q), the ratio of the probability that the event will happen to the probability that E will not happen. Since odds have a multiplicative nature, it is natural to predict their logarithm (which is "additive"), that is, the "logit", logit(q) = ln(q/(1 − q)), of the probability q. Since this is an "additive" quantity and can be any real number, positive or negative, it makes sense to predict it as a linear function ⟨p_t, x_t⟩ of the query x_t. The logistic function σ(z) = 1/(1 + e^{−z}) is the inverse of the logit function; it maps the logarithm of the odds back to probabilities. Hence the prediction method above.
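In symbols, writing q for a click probability, the two maps are indeed inverse to each other:
\[
\operatorname{logit}(q) = \ln\frac{q}{1-q}, \qquad \sigma(z) = \frac{1}{1+e^{-z}}, \qquad \sigma\big(\operatorname{logit}(q)\big) = \frac{1}{1+\frac{1-q}{q}} = q .
\]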

As is common in statistics when fitting models to predict probabilities, we define the loss of a prediction to be the negative logarithm of the probabilities assigned to the outcomes (thus, penalizing harshly assigning a small probability to the actual outcome):

\[
\ell_t(u) \;=\;
\begin{cases}
-\ln\big(\sigma(\langle u, x_t\rangle)\big), & \text{if } y_t = +1 ;\\
-\ln\big(1 - \sigma(\langle u, x_t\rangle)\big), & \text{if } y_t = -1 .
\end{cases}
\]
After simplification, we can write the loss function as ℓ_t(u) = ln(1 + e^{−y_t⟨u, x_t⟩}) (cf. Exercise 1.19).

Click-through rate prediction can be viewed as a Regret Minimization Game. As in the case of linear regression, the main trick is to encapsulate x_t and y_t into the loss function. The decision set D is a subset of R^d. The set of loss functions is L = { ℓ^{(x,y)} : x ∈ [−1, 1]^d, y ∈ {+1, −1} }, where ℓ^{(x,y)} : D → R is defined as ℓ^{(x,y)}(u) = ln(1 + e^{−y⟨u, x⟩}). The resulting problem is called online logistic regression.
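A minimal Python sketch of these loss functions (identifiers and numbers are ours, not from the text):

```python
import numpy as np

def sigma(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def make_logistic_loss(x, y):
    """Loss l^{(x,y)}(u) = ln(1 + exp(-y <u, x>)) for a label y in {-1, +1}."""
    return lambda u: np.log1p(np.exp(-y * np.dot(u, x)))

x_t = np.array([1.0, -0.3, 0.7])        # features of a request (made-up numbers)
y_t = +1                                # the user clicked
u = np.array([0.5, 0.1, -0.2])          # a candidate weight vector
print(sigma(np.dot(u, x_t)))            # predicted click-through rate
print(make_logistic_loss(x_t, y_t)(u))  # negative log-likelihood of the click
```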

1.2.3 Maximizing Click-Through Rates Interactively

Note that in the previous example we assumed that the content being served is chosen by some other service in the system. Let us now consider how this service could be optimized. For simplicity, let us assume that there are finitely many different page contents to be shown to the user and that the choices do not change over time (clearly, this is a major oversimplification). Index these contents with the integers from 1 to K. Let the user information presented to the service before the decision about what to serve must be made come from the set X.


Figure 1.2: Illustration of the network routing problem. The goal is to send data from node a to node b with the minimum latency, where in each time step, the global network traffic determines the latency. A possible route from a to b through the network is shown with thick lines.

The problem can be phrased as that of finding the best predictor from D ⊂ {p : p : X → {1, . . . , K}} for some choice of D so as to maximize the number of clicks. Since the predictors in D map the user information x ∈ X to one of finitely many values, a predictor p ∈ D can be said to classify the users into one of the K categories according to which content they prefer. If the forecaster chooses the predictor p_t, the prediction of what to serve is ŷ_t = p_t(x_t), where x_t ∈ X is the user information for round t. If the user clicks on the element represented by ŷ_t on the webpage, the forecaster incurs no loss; otherwise it incurs a loss of one. If C_t ⊂ {1, . . . , K} is the list of contents that the user of round t would have clicked on, the loss function takes the form ℓ_t(p) = I{p(x_t) ∉ C_t}; that is, we can take L = { ℓ : D → {0, 1} : ℓ(p) = I{p(x) ∉ C}, x ∈ X, C ⊂ {1, . . . , K} }. If the set C_t were revealed at the end of round t, the problem would be an instance of online classification.

However, in realistic applications the user cannot be queried about their preferences. In other words, C_t will remain hidden and the forecaster will only learn whether the user has clicked on the particular content presented. That is, the only feedback to the forecaster is the loss of its choice, ℓ_t(p_t). Thus, this is an instance of a bandit problem; more specifically, it is an instance of the contextual bandit problem. We will consider bandit and other partial monitoring problems in Part IV.

1.2.4 Network Routing


Suppose we are operating a computer network and we would like to send packets between two specific computers. We have to choose a route, that is, a set of nodes and links through which we send the data. Since it is hard to know the quality (e.g., the latency or throughput) of the links in the network in advance, we are looking for an adaptive algorithm that learns to pick the best route.

We can think of the computer network as a finite (undirected) graph G = (V, E), where V is the set of vertices and E is the set of edges. The vertices represent the computers, routers and other nodes in the network, and the edges represent direct links between nodes.


Let a, b ∈ V be the two specific computers between which we want to send the data (see Fig. 1.2).

A path from a to b can be viewed as a subset of E. Thus, the decision set D ⊆ 2^E is the set of all paths from a to b. In round t, we choose a path p_t ∈ D, and for each edge e ∈ E, the environment assigns a cost w_t(e) ∈ [0, 1] to e, representing the "cost" of sending data along edge e. For example, w_t(e) can be a function of the latency or throughput of link e. The cost of sending data along a path p ∈ D is the sum of the costs on its edges; that is, the loss function ℓ_t : D → R for round t is

\[
\ell_t(p) \;=\; \sum_{e \in p} w_t(e) .
\]
Thus,
\[
L \;=\; \Big\{ \ell^{(w)} \;:\; w : E \to [0,1] \Big\}, \qquad \text{where } \ell^{(w)}(p) \;=\; \sum_{e \in p} w(e) .
\]

Note that every path p can be represented as a vector u ∈ {0, 1}^E such that u(e) = 1 if e ∈ p and u(e) = 0 otherwise. For a given path p ∈ D, let us denote this vector by u(p); then the loss function ℓ_t can be written in the linear form ℓ_t(p) = ⟨u(p), w_t⟩. In other words, the problem can be reformulated so that the decision set becomes J = { u(p) ∈ {0, 1}^E : p ∈ D }, the set of all E-vectors associated with the paths in G, and for any u ∈ J, the loss function ℓ^{(w)} is defined as ℓ^{(w)}(u) = ⟨u, w⟩.
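A small hypothetical sketch of this reformulation (the toy network and all identifiers are ours): a path is mapped to its incidence vector u(p) ∈ {0, 1}^E, and its loss is the inner product with the vector of edge costs.

```python
import numpy as np

edges = [("a", "c"), ("c", "b"), ("a", "d"), ("d", "b")]   # a made-up network
index = {e: i for i, e in enumerate(edges)}

def incidence_vector(path):
    """Map a path (given as a collection of edges) to u(path) in {0,1}^E."""
    u = np.zeros(len(edges))
    for e in path:
        u[index[e]] = 1.0
    return u

w_t = np.array([0.3, 0.9, 0.5, 0.2])    # per-edge costs in [0, 1] for round t
path = [("a", "d"), ("d", "b")]         # one route from a to b
print(incidence_vector(path) @ w_t)     # loss l_t(path) = <u(path), w_t> = 0.7
```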

If we allow the algorithms to choose a path randomly, we can represent the decisions as elements of a convex set K ⊂ [0, 1]^E which is obtained by taking convex combinations of the elements of J:
\[
K \;=\; \Big\{ \sum_{p' \in D} \alpha(p')\, u(p') \;:\; \alpha(p') \ge 0,\ \sum_{p' \in D} \alpha(p') = 1 \Big\} .
\]
From this it is evident that K is indeed a convex set. Then, given w : E → [0, 1], we extend ℓ^{(w)} to K: for p = ∑_{p'∈D} α(p') u(p') ∈ K we define ℓ^{(w)}(p) = ∑_{p'∈D} α(p') ⟨u(p'), w⟩ = ⟨p, w⟩. It is easy to see that ℓ^{(w)}(p) is the expected cost of routing through the network when the costs of the links are given by w, while path u(p') is selected with probability α(p'). Further, ℓ^{(w)} : K → [0, 1] is well-defined (i.e., if p has different decompositions, the above expression will still give the same value; this follows from the second form of ℓ^{(w)}(p)). In summary, with this extension we see that an online routing problem can be specified using a pair (K, L′), where K is a convex polytope and
\[
L' \;=\; \Big\{ \ell^{(w)} : K \to [0,1] \;:\; \ell^{(w)}(p) = \langle p, w\rangle,\ w \in [0,1]^E \Big\}
\]
is a set of linear loss functions. Online learning problems of this structure are called linear prediction problems with linear losses. These problems form the cornerstone of online learning, as many other problems can be reduced to them. They will be extensively studied in Part III.

In a networking application, the traffic through the network is expected to go through several changes. To deal with such an inherently nonstationary setting, it may make more sense to study these problems under the tracking regret (cf. (1.2)).

1.2.5 Compression

Suppose we want to compress a stream of data. For example, we might want to compress the sound of a live concert and send it over the internet. Or we might want to compress a very big file, but the file is so big that we cannot afford to read it twice and we must compress it on the fly.

Let us assume that the data stream is a sequence a_1 a_2 . . . of symbols from a finite alphabet Σ. We encode each symbol of the stream as a binary string;


different symbols are possibly encoded by strings of different length. To encode the tth symbol a_t of the stream, we choose an encoding scheme p_t : Σ → {0, 1}^* (before a_t is revealed) and encode a_t as the binary (or bit) string p_t(a_t). The compressed stream will be the binary string p_1(a_1)p_2(a_2) . . . resulting from the concatenation of the codes p_t(a_t) of the individual symbols. The goal is to ensure that the length of the compressed data is as small as possible. To be able to decompress, we assume that p_t is prefix-free, which means that for any two distinct symbols a, b ∈ Σ, p_t(a) is not a prefix of p_t(b) (string s is a prefix of string r if the first |s| characters of r are identical to the characters of s, where |s| gives the length of s).4

But is it enough for correct decompression that each encoding scheme on its own is prefix-free? As long as the compression algorithm is deterministic, or it shares its sequence of random bits with the decoding algorithm, the answer is yes. The key is that the decompression algorithm can emulate the choices that the compression algorithm makes; thus, at any given time it knows what encoding scheme is being used. Knowing what the encoding scheme p_t is, the decompression algorithm, upon processing the remaining string s_t = p_t(a_t)p_{t+1}(a_{t+1}) . . ., can proceed as follows. Suppose that b_1 b_2 . . . is the sequence of bits of s_t. To find the next symbol a_t, the decompressor tries all possible symbols a ∈ Σ and tests whether p_t(a) is a prefix of b_1 b_2 . . .. Because p_t is prefix-free, precisely one symbol will satisfy the test. Suppose p_t(a_t) = b_1 b_2 . . . b_k. The decompressor then removes the first k bits from the beginning of the compressed stream and defines s_{t+1} = b_{k+1} b_{k+2} . . .. Next, knowing a_t, the decompressor can emulate the behavior of the compressor to determine p_{t+1}. The process then continues by decoding s_{t+1}.
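The decoding loop just described can be sketched as follows (a hypothetical illustration; the interface next_scheme(decoded), which reproduces the compressor's deterministic choice p_t from the symbols decoded so far, is an assumption made for the example).

```python
def decompress(bits, next_scheme, num_symbols):
    """Invert the online compressor described above.

    bits: the concatenated bit string p_1(a_1)p_2(a_2)...
    next_scheme(decoded): returns the prefix-free code p_t (a dict mapping
        symbols to bit strings) used by the compressor in round t, computed
        from the symbols decoded so far -- possible because both sides can
        emulate the same deterministic choices.
    """
    decoded, pos = [], 0
    for _ in range(num_symbols):
        p_t = next_scheme(decoded)
        # Exactly one codeword is a prefix of the remaining stream,
        # because p_t is prefix-free.
        for a, code in p_t.items():
            if bits.startswith(code, pos):
                decoded.append(a)
                pos += len(code)
                break
    return decoded

# Toy usage with a fixed (non-adaptive) scheme over the alphabet {x, y, z}:
fixed = {"x": "0", "y": "10", "z": "11"}
print(decompress("010011", lambda decoded: fixed, 4))   # ['x', 'y', 'x', 'z']
```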

Online compression can be phrased as a Regret Minimization Game. We define D to be the set of all prefix-free functions of the form u : Σ → {0, 1}^*. A natural loss function is the length of the compressed data. Formally, the loss function is defined as ℓ_t(u) = |u(a_t)|. The set of loss functions to which ℓ_t belongs is L = { ℓ^{(a)} : a ∈ Σ }, where ℓ^{(a)} : D → N is defined by ℓ^{(a)}(u) = |u(a)|.

1.2.6 Supervised Learning

Logistic regression and linear regression are both supervised learning problems. In supervised learning, for a given context (sometimes also called input, or covariate) x_t ∈ X, the goal is to predict an outcome y_t ∈ Y. The algorithm predicts ŷ_t ∈ Ŷ and incurs the loss ℓ(y_t, ŷ_t), where ℓ : Y × Ŷ → R is a fixed loss function. For example, in regression problems, both Y and Ŷ are subsets of R and the loss is quadratic: ℓ(y_t, ŷ_t) = (y_t − ŷ_t)^2, while in (binary) logistic regression, Y = {0, 1}, Ŷ = [0, 1] and the loss is the negative logarithm of the predicted

4 More generally, one only needs to assume that the coding scheme is uniquely decodable, that is, if p_1(a_1) . . . p_m(a_m) = p′_1(a′_1) . . . p′_n(a′_n), where p_1, . . . , p_m are the codes selected to encode a_1 . . . a_m and p′_1, . . . , p′_n are the codes selected to encode a′_1 . . . a′_n, then m = n and a_i = a′_i for 1 ≤ i ≤ n. However, for non-adaptive encoding schemes, where the coding scheme uses the same code for each symbol, that is, p = p_1 = · · · = p_m for any data stream a_1 . . . a_m, one can show that p is uniquely decodable if and only if there exists a prefix-free code p′ such that p and p′ assign codewords of the same length to each symbol in Σ, that is, |p(a)| = |p′(a)| for all a ∈ Σ; this is known as the Kraft–McMillan theorem, see, e.g., (Cover and Thomas, 2006, Theorem 5.5.1). This observation motivates us to consider prefix-free codes only.


likelihood
\[
\ell(y_t, \hat{y}_t) \;=\;
\begin{cases}
-\ln(\hat{y}_t), & \text{if } y_t = 1 ;\\
-\ln(1 - \hat{y}_t), & \text{if } y_t = 0 .
\end{cases}
\]

The forecaster’s prediction is based on some model p from a class of models D. The class D is often referred to as a hypothesis class, a concept class, or a function class. Formally, D is a class of functions from X to Yb, and the algorithm chooses a model pt∈ D and predicts ybt=pt(xt).

A model is often, but not always, represented by a weight vector or a parameter vector, and thus we can think of D as a subset of Rd and p as a vector and not a function; this includes logistic and linear regression. On the other hand, some model sets, such as that of the sets of decision trees or decision lists have a more complicated structure. When Y is discrete, we talk about classification problems. The base case when Y has two elements is calledbinary classification, and the case when the cardinality of Y is unrestricted is called multiclass classification.

Supervised learning can also be phrased as a Regret Minimization Game. The class of models D is the set from which the forecaster chooses its predictions, and the loss function ℓ_t : D → R in round t is defined as ℓ_t(p) = ℓ(y_t, p(x_t)). The loss function ℓ_t encodes both the feature vector x_t and the correct label y_t. Formally, the class of loss functions L is the set { ℓ^{(x,y)} : x ∈ X, y ∈ Y }, where ℓ^{(x,y)} : D → R is defined as ℓ^{(x,y)}(p) = ℓ(y, p(x)).
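A tiny sketch of this encapsulation (identifiers are ours): a fixed per-example loss ℓ and a data pair (x, y) induce a loss function on the model class.

```python
def make_supervised_loss(base_loss, x, y):
    """Return l^{(x,y)}(p) = base_loss(y, p(x)) defined on the model class D."""
    return lambda p: base_loss(y, p(x))

# Example: binary classification with the zero-one loss and a threshold model.
zero_one = lambda y, y_hat: float(y != y_hat)
model = lambda x: 1 if sum(x) > 0 else 0             # a model p: X -> {0, 1}
loss_t = make_supervised_loss(zero_one, x=[0.4, -0.1], y=1)
print(loss_t(model))    # 0.0: the model classifies this example correctly
```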

While we have mentioned several places where supervised learning with a continuous label set Y will be studied, classification problems will be studied in Chapters 17 and 18.

As before, phrasing supervised learning problems as a regret minimization problem is natural when data arrives in a stream and predictions have to be made in a sequential order.

However, the online learning formulation is also useful for standard batch learning problems with a large amount of data. In these big data problems it is often computationally infeasible to process data points more than once, and the sequential algorithms resulting from the regret minimization framework can achieve the performance of standard learning methods in the stochastic setting, with substantially smaller computational complexity (cf. Chapter 23).

In particular, as of today, the state-of-the-art training method for support vector machines is based on such algorithms (cf. Chapter 17).

1.2.7 Unsupervised Learning

In statistical applications, one is often interested in discovering regularities, patterns, or the structure of a sample (x_1, x_2, . . . , x_n). There are many methods for such unsupervised discovery, such as clustering, subspace estimation, manifold estimation, independent component analysis, or density estimation, to name a few.

Consider, for example, density estimation. The online variant of this problem is to choose in round t a probability density function p_t from a class of densities D over X ⊂ R^d. Recall that a probability density function over X ⊂ R^d is a nonnegative-valued function p : X → [0, ∞) such that p integrates to 1 over X: ∫_X p(x) dx = 1 (if X is a finite set, D would be a subset of all probability mass functions over X, i.e., for any p ∈ D, p : X → [0, 1] and ∑_{x∈X} p(x) = 1). In round t, the environment chooses a point x_t ∈ X and the algorithm incurs a loss which is the negative logarithm of the density value assigned to x_t by p_t, namely −ln(p_t(x_t)).


Formally, the set of loss functions is L = { ℓ^{(x)} : x ∈ X }, where ℓ^{(x)} : D → R is defined by ℓ^{(x)}(p) = −ln(p(x)).
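A minimal sketch of the resulting log-loss, using a hypothetical class of one-dimensional Gaussian densities (names and numbers are ours):

```python
import math

def gaussian_density(mu, sigma):
    """A density p(x) from a simple parametric class D (mean mu, std sigma)."""
    return lambda x: math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def make_log_loss(x):
    """Loss l^{(x)}(p) = -ln p(x) incurred by the density p on the point x."""
    return lambda p: -math.log(p(x))

p_t = gaussian_density(mu=0.0, sigma=1.0)   # the forecaster's density for round t
x_t = 0.5                                   # the point chosen by the environment
print(make_log_loss(x_t)(p_t))              # about 1.04
```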

1.3 Summary

In this chapter we have defined Regret Minimization Games. These are sequential games played by a forecaster and an environment and specified by a pair (D, L), where D is a decision set and L is a set of loss functions with domain D. In each round of the game, the forecaster and the environment simultaneously choose respective elements of D and L. Next, they are told each other's choices. If p_t is the choice of the forecaster and ℓ_t is the choice of the environment, ℓ_t(p_t) is called the loss of the forecaster for round t. If the game is stopped after n rounds, the payoff to the environment is the regret of the forecaster, while the payoff to the forecaster is the negative regret. The regret is defined as the excess cumulated loss of the forecaster over that of the best decision in hindsight. A minimax strategy for the forecaster chooses the predictions such that the regret of the forecaster is the smallest possible even against the worst possible environment. In online learning the goal is to design computationally efficient near-minimax strategies, as well as to understand the limitations of such strategies.

We have discussed the rationale of defining online learning based on the notion of regret and we contrasted regret-minimization games with cumulated-loss minimization games. We discussed the role of the horizon n, anytime forecasters and, more generally, forecasters that are able to adapt to some unknown parameter. We also considered some generalizations of the basic framework, including problems where the goal is to compete with a good (low-complexity) sequence of predictions, or problems with an altered information structure.

Several examples borrowed from finance, operations research and information theory were used to illustrate the concepts and we have briefly touched upon relationships to standard machine learning problems, such as supervised and unsupervised learning. Further relationships to optimization, stochastic approximation and risk-minimization algorithms are explored in the exercises.

A formal definition of the concepts that appeared in this chapter, such as the strategy of a forecaster or the value of a game, will be given in Chapter 3.

1.4 Bibliographic References

The idea of comparing the performance of an online algorithm to that of the best offline solution is called competitive analysis. The book by Borodin and El-Yaniv (1998) is a comprehensive introduction to the subject. Recent work by Buchbinder et al. (2012); Andrew et al. (2013) builds a bridge between online learning and competitive analysis. A good short book on the basic concepts of game theory is that of Leyton-Brown and Shoham (2008), while those who want to delve further into the subject can read the "full" version (Shoham and Leyton-Brown, 2009).

The first use of regret to evaluate the performance of predictors goes back at least to the 1950s. In particular, the concept of regret has appeared in various fields, such as


statistical decision theory, game theory, control theory and others. The minimax criterion is central to the decision theory of Wald (1949), who also played a pioneering role in developing sequential methods for decision making (Wald, 1947). The sequential prediction of individual (binary) sequences, building on the theory of compound sequential decision problems due to Robbins (1951); Hannan and Robbins (1955) and Blackwell's approachability (Blackwell, 1956a,b), was studied by Hannan (1957a); Cover (1965); Cover and Shanhar (1977). Compression of individual sequences was considered by Ziv and Lempel (1977); Ziv (1978, 1980), who developed a universal compression method that compresses any sequence almost as well as the best finite automaton. Feder et al. (1992) showed how the Lempel-Ziv compressor can be used for prediction, significantly extending the scope of the earlier results concerning sequence prediction. Foster (1991) considered the problem of competing with the best predictor in a finite set where the predictions are real numbers and the accuracy is measured using the squared loss. In the machine learning literature, the study of these models started with the work of Santis et al. (1988); Vovk (1990); Littlestone and Warmuth (1994a). The last few years have seen a true explosion of work in this area, as witnessed by the plethora of papers devoted to this subject in the recent proceedings of the flagship machine learning conferences such as NIPS, ICML, or the theory-oriented COLT and ALT.5 The recent book of Cesa-Bianchi and Lugosi (2006) is a good overview of the subject up until its publication in 2006. Earlier, in their book, Shafer and Vovk (2001) demonstrate how probability theory, widely considered as providing the foundation for statistics and machine learning, can be based on game theory, while the book of Vovk et al. (2005) considers online learning both in a standard probabilistic sequential framework, as well as in more general frameworks where the probabilistic assumptions are gradually relaxed and replaced by other assumptions.

The game theoretic viewpoint of online learning is thoroughly explored in recent works by Abernethy et al. (2008, 2009); Rakhlin et al. (2012b, 2011b,a, 2012a).

Two excellent recent books devoted to the classical, probability based approach to machine learning are by Abu-Mostafa et al. (2012) and Mohri et al. (2012).

1.5 Exercises

Exercise 1.1. (Regret is Invariant under Shifting the Loss Functions) For any 1 ≤ t ≤ n, let ℓ_t : D → R be loss functions for some domain D, let b_t ∈ R be arbitrary constants, and let ℓ′_t = ℓ_t + b_t. Then the regret of any sequence of predictions p_1, . . . , p_n ∈ D measured with ℓ_t is the same as the regret measured with ℓ′_t, that is, for any u ∈ D,
\[
\sum_{t=1}^{n} \big( \ell_t(p_t) - \ell_t(u) \big) \;=\; \sum_{t=1}^{n} \big( \ell'_t(p_t) - \ell'_t(u) \big) .
\]

5NIPS stands for “Neural Information Processing Systems,” ICML stands for the “International Conference on Machine Learning,” COLT stands for “Computational Learning Theory,” while ALT stands for “Algorithmic Learning Theory.”


Exercise 1.2. (Dominating Strategies in Loss-Minimization Games: Part I) Fix a pair (D, L), where D is finite. Show that if there exists a dominating strategy for the loss-minimization game specified by (D, L), then there exists an overall best prediction p^* ∈ D independently of the choice of the loss function (i.e., ℓ(p^*) = min_{p∈D} ℓ(p) for any ℓ ∈ L).

* Exercise 1.3. (Dominating Strategies in Loss-Minimization Games: Part II) Extend the result of Exercise 1.2 to the case when D = [0, 1].

Exercise 1.4. (Loss Minimization vs. Regret Minimization) Consider the so-called matching pennies problem, where D = {0, 1} and L = {ℓ^(0), ℓ^(1)} with ℓ^(y)(u) = I{y ≠ u}.

(a) Show that if the environment selects ℓ_t uniformly at random from L, independently of the previous choices of both the environment and the forecaster, the expected loss of any forecaster is 1/2 in every round.

(b) Show that if the forecaster chooses its predictions uniformly at random from D, independently of the previous choices of both the environment and the forecaster, its expected loss is 1/2 in every round for any environment. Conclude that such a forecaster (which does not learn anything) is minimax optimal for the loss-minimization game.

(c) * Are there any other minimax strategies?

(d) Show that the worst-case expected regret of any deterministic forecaster is between n/2 and n.

(e) Show that if the environment is i.i.d., that is, it selects the loss functions from the same distribution in every round, independently of both the forecaster's and its own previous choices, the forecaster that selects p_t = argmin_{p∈{0,1}} ∑_{s=1}^{t−1} ℓ_s(p) for all t > 1 incurs sublinear regret, that is, R_n/n → 0 with high probability, in expectation, and almost surely. (In Chapter 6 we will see algorithms whose expected regret on this problem grows sublinearly with n without any assumptions on the environment.)

Exercise 1.5. (Linear Growth of the Cumulated Loss: Part I) Consider the setting of Exercise 1.4.

(a) Show that for any deterministic forecaster there exists a sequence of loss functions in L such that after any number of rounds n, the forecaster's loss L̂_n is exactly n.

(b) Show that for any (possibly randomized) forecaster and any n there exists a sequence of loss functions such that E[L̂_n] ≥ n/2.

Exercise 1.6. (Linear Growth of the Cumulated Loss: Part II) Assume that L contains two loss functions ℓ^(a), ℓ^(b) such that c = inf_{p∈D} ( ℓ^(a)(p) + ℓ^(b)(p) ) > 0. Show that for any (possibly randomized) forecaster and any n there exists a sequence of loss functions such that E[L̂_n] ≥ cn/2. Assuming that L has no other loss functions than ℓ^(a) and ℓ^(b) and that the minimizers of ℓ^(a) and ℓ^(b) exist, show that the expected cumulative loss of the forecaster


that chooses between the minimizers of ℓ^(a) and ℓ^(b) uniformly at random satisfies E[L̂_n] ≤ c′n for some c′ > 0 which only depends on L. The upper bound must be valid against any environment. Make the constant c′ as tight as possible and show that further tightening of the bound is not possible.

Exercise 1.7. (Some Problems Are Too Hard)

(a) Unbounded loss functions: Construct an example where the decision set is D = {0, 1}, yet the worst-case regret of any forecaster grows linearly with the horizon n.

(b) Unbounded decision sets: Construct an example where the losses are restricted to take on values in a bounded interval (say, [0,1]), yet the worst-case regret of any forecaster grows linearly with the horizon n.

Exercise 1.8. (Negative Regret) Give an example where the regret of a forecaster is negative. The worst-case regret of a forecaster is the regret of the forecaster against an environment that plays a strategy maximizing the forecaster's expected regret. Can the worst-case regret of a forecaster be negative (note that both players may randomize, in which case the worst-case regret, as opposed to the worst-case expected regret, is random)? Can the worst-case expected regret of a forecaster be negative?

Exercise 1.9. (Non-Decreasing Regret)

(a) Show that the worst-case regret R_n of any fixed algorithm is a non-decreasing function of the time horizon n.

(b) Using Part (a), show that the minimax regret V_n is also a non-decreasing function of the time horizon n.

** Exercise 1.10. (Minimax Expected Regret) Fix a decision set D and a set of loss functions L over D. Consider the regret-minimization games G_n = G(D, L, n), n ∈ N. Can the minimax expected regret V_n = V_n(D, L) decrease as a function of n? Why or why not? Prove your claim.

* Exercise 1.11. (Unknown Horizons) Fix a decision set D and a set of loss functions L over D. Consider the regret-minimization games G_n = G(D, L, n), n ∈ N. Give an example where there is no anytime strategy that is simultaneously minimax for all the games (G_n)_{n∈N}. What if we consider finitely many games only (i.e., 1 ≤ n ≤ N for some N > 0)? Give a nontrivial example where there exists a simultaneously minimax strategy for all the games (G_n)_{n∈N} (in the context of this problem, a game is considered trivial if there is a dominating strategy for the forecaster).

Exercise 1.12. (Doubling Trick) Fix a pair (D, L) of decision and loss sets, and let A_n be a strategy for the n-horizon game G_n = G(D, L, n). Assume that the worst-case regret of A_n when used in G_n for m ≤ n rounds is bounded by B(m), where B is a non-decreasing function, while nothing is known about the regret of A_n when used for m > n rounds. In this problem
