

with probability at least $1-\delta/L$. Putting everything together, the union bound implies that, with probability at least $1-2\delta$, simultaneously for all $l=1,\dots,L$,
$$
\begin{aligned}
\sum_{t=1}^{T}\sum_{x\in X_l}\bigl(\tilde\mu_t(x)-\mu_t(x)\bigr)
&\le \bigl(\sqrt{2}+1\bigr)\sum_{k=0}^{l-1}\sqrt{T|X_k||X_{k+1}||A|\ln\frac{T|X||A|}{\delta}}
  +\sum_{k=0}^{l-1}|X_k|\sqrt{2T\ln\frac{L}{\delta}} \\
&\le \bigl(\sqrt{2}+1\bigr)\,L\sum_{k=0}^{L-1}\frac{1}{L}\sqrt{T|X_k||X_{k+1}||A|\ln\frac{T|X||A|}{\delta}}
  +\sum_{k=0}^{l-1}|X_k|\sqrt{2T\ln\frac{L}{\delta}} \\
&\le \bigl(\sqrt{2}+1\bigr)\,L\sqrt{T|A|\left(\frac{|X|}{L}\right)^{2}\ln\frac{T|X||A|}{\delta}}
  +|X|\sqrt{2T\ln\frac{L}{\delta}} \\
&= \bigl(\sqrt{2}+1\bigr)\,|X|\sqrt{T|A|\ln\frac{T|X||A|}{\delta}}
  +|X|\sqrt{2T\ln\frac{L}{\delta}},
\end{aligned}
\tag{5.10}
$$
where in the last step we used Jensen's inequality for the concave function $f(x,y)=\sqrt{xy(a+\ln x)}$ with parameter $a>0$, and the fact that $\sum_{k=0}^{L-1}|X_k|=|X|-1<|X|$.

Summing up for all $l=0,1,\dots,L-1$ and taking expectation, using that $v_t(\pi_t)-\tilde v_t\le L$ and that (5.10) holds with probability at least $1-2\delta$, finishes the proof.

The theorem can be obtained by a straightforward combination of Lemmas 5.2, 5.3, and 5.5, applying the bound
$$
\sum_{l=0}^{L-1}\ln\bigl(|X_l||A|\bigr)\le L\ln\frac{|X||A|}{L}
$$
in the last term of Lemma 5.2.
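For completeness, this bound follows from Jensen's inequality applied to the concave logarithm, together with $\sum_{l=0}^{L-1}|X_l|\le|X|$ (as noted after (5.10)):
$$
\sum_{l=0}^{L-1}\ln\bigl(|X_l||A|\bigr)
= L\cdot\frac{1}{L}\sum_{l=0}^{L-1}\ln\bigl(|X_l||A|\bigr)
\le L\ln\!\left(\frac{|A|}{L}\sum_{l=0}^{L-1}|X_l|\right)
\le L\ln\frac{|X||A|}{L}.
$$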

Algorithm 5.2 Extended dynamic programming for finding an optimistic policy and transition model for a given confidence set of transition functions and given rewards.

Input: empirical estimate $\hat P$ of the transition functions, $L_1$ bound $b\in(0,1]^{|X||A|}$, reward function $r\in[0,1]^{|X||A|}$.

Initialization: Set $w(x_L)=0$.

For $l=L-1,L-2,\dots,0$:

1. Let $k=|X_{l+1}|$ and $(x_1,x_2,\dots,x_k)$ be a sorting of the states in $X_{l+1}$ such that $w(x_1)\ge w(x_2)\ge\dots\ge w(x_k)$.

2. For all $(x,a)\in X_l\times A$:
   (a) $P(x_1|x,a)=\min\bigl\{\hat P(x_1|x,a)+b(x,a)/2,\,1\bigr\}$.
   (b) $P(x_i|x,a)=\hat P(x_i|x,a)$ for all $i=2,3,\dots,k$.
   (c) Set $j=k$.
   (d) While $\sum_i P(x_i|x,a)>1$ do
       i. Set $P(x_j|x,a)=\max\bigl\{0,\,1-\sum_{i\ne j}P(x_i|x,a)\bigr\}$.
       ii. Set $j=j-1$.

3. For all $x\in X_l$:
   (a) Let $w(x)=\max_a\bigl\{r(x,a)+\sum_{x'}P(x'|x,a)w(x')\bigr\}$.
   (b) Let $\pi(x)=\arg\max_a\bigl\{r(x,a)+\sum_{x'}P(x'|x,a)w(x')\bigr\}$.

Return: optimistic transition function $P$, optimistic policy $\pi$.
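To make the control flow concrete, here is a minimal Python sketch of this backward pass. It is an illustrative reimplementation, not the implementation used in the thesis; the layer list, the dictionary-based representations of $\hat P$, $b$, and $r$, and all function names are our own assumptions.

```python
def extended_dp(layers, P_hat, b, r, actions):
    """Sketch of Algorithm 5.2 (extended dynamic programming).

    layers : list of lists of states; layers[l] corresponds to X_l and
             layers[L] contains the single terminal state x_L.
    P_hat  : dict mapping (x, a) to a dict {next_state: estimated probability}.
    b      : dict mapping (x, a) to the L1 confidence width b(x, a).
    r      : dict mapping (x, a) to a reward in [0, 1].
    actions: list of actions A.
    Returns the optimistic transition function P, policy pi, and values w.
    """
    L = len(layers) - 1
    w = {layers[L][0]: 0.0}               # initialization: w(x_L) = 0
    P, pi = {}, {}

    for l in range(L - 1, -1, -1):
        # Step 1: sort the states of the next layer by decreasing value w.
        nxt = sorted(layers[l + 1], key=lambda x: w[x], reverse=True)
        k = len(nxt)
        for x in layers[l]:
            for a in actions:
                # Step 2: move probability mass towards the best next state,
                # then remove mass from the worst states until normalized.
                p = [P_hat[x, a].get(xi, 0.0) for xi in nxt]
                p[0] = min(p[0] + b[x, a] / 2.0, 1.0)
                j = k - 1
                while sum(p) > 1.0:
                    p[j] = max(0.0, 1.0 - (sum(p) - p[j]))
                    j -= 1
                P[x, a] = dict(zip(nxt, p))
        for x in layers[l]:
            # Step 3: optimistic Bellman backup and greedy policy.
            vals = {a: r[x, a] + sum(P[x, a][xn] * w[xn] for xn in nxt)
                    for a in actions}
            pi[x] = max(vals, key=vals.get)
            w[x] = vals[pi[x]]
    return P, pi, w
```

The while loop mirrors steps 2(c)-(d): the mass $b(x,a)/2$ added to the highest-valued successor is compensated by removing mass from the lowest-valued successors until the distribution sums to one.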

Chapter 6

Online learning with switching costs

In this chapter we study a special case of online learning in reactive environments where switching between actions is subject to an additional cost. The precise protocol of the prediction problem is identical to the online prediction protocol shown on Figure 2.1, with the additional assumption that every time the learner selects an action $a_t\ne a_{t-1}$, a known cost of $K>0$ is deducted from the earnings of the learner. We are interested in algorithms that can minimize regret under this additional assumption or, equivalently, minimize the regret while keeping the number of action switches low. However, the usual forecasters with small regret, such as the exponentially weighted average forecaster or the FPL forecaster with i.i.d. perturbations, may switch actions a large number of times, typically $\Theta(n)$. Therefore, the design of special forecasters with small regret and a small number of action switches is called for. In this chapter, we consider this problem in the full information setting, where reward functions are revealed after each time step. Our results are summarized in the following thesis.

Thesis 4. Proposed an efficient algorithm for online learning with switching costs. Proved performance guarantees for Settings 3a and 3b. The proved bounds for Setting 3a are optimal in all problem parameters. Our algorithm is the first known efficient algorithm for Setting 3b. [30, 31]

While this learning problem is very interesting in its own right, one can imagine numerous applications where one would like to define forecasters that do not change their predictions too often. Examples of such problems include the online buffering problem described by Geulen et al. [39] and the online lossy source coding problem to be discussed in Chapter 7. Furthermore, as seen in Chapter 4, abrupt policy switches can also be harmful in the online MDP problem. While the core idea of the analysis presented in Chapter 4 is guaranteeing that the learner's policies change slowly over time, it is very easy to see that one can achieve similar results by ensuring that the learner changes its policies rarely. The main reason that we took the first route for the analysis is that there are currently no known algorithms that can guarantee that the number of action switches is $O(\sqrt{T})$. Preliminary results [24] suggest that even the existence of such prediction algorithms is nontrivial.

As mentioned before in Chapter 1, learning in this problem can be regarded as a special case of the online MDP problem. The Markovian environment describing the current setting is deterministic and it is easy to find policies that induce periodic state transitions. In fact, the only policies that admit stationary distributions are the ones that satisfy $\pi(x)=x$ for some $x\in A$. This implies that $\{X,A,P\}$ does not satisfy Assumption M1, so we cannot guarantee that the MDP-E algorithm (Algorithm 4.1) performs well in this problem by a straightforward application of Theorem 4.1. Recently, Arora et al. [3] proposed an algorithm for regret minimization in deterministic MDPs with non-stationary rewards. Since they consider a significantly more complicated problem where the optimal policy is allowed to induce periodic behavior, they can only guarantee an expected average regret of $\tilde O(T^{-1/4})$.

Since our general tools introduced in the previous chapters are not directly applicable to this specific setting, we turn to algorithms that are directly tailored to deal with the above problem of switch-constrained online prediction. The first paper to explicitly attack this problem is by Geulen et al. [39], who propose a variant of the exponentially weighted average forecaster called the "Shrinking Dartboard" (SD) algorithm and prove that it provides an expected regret of $O(\sqrt{T\ln|A|})$, while guaranteeing that the expected number of switches is at most $O(\sqrt{T\ln|A|})$. A less explicit attempt to solve the problem is due to Kalai and Vempala [63] (see also [64]); they show that the simplified version of the Follow-the-Perturbed-Leader (FPL) algorithm with identical perturbations (as described above) guarantees an $O(\sqrt{T\ln|A|})$ bound on both the expected regret and the expected number of switches.

In the first half of this chapter, we present a modified version of the SD algorithm that enjoys optimal bounds on both the standard regret and the number of switches. Our contribution, presented in György and Neu [48] and György and Neu [49], is a minor modification of the SD algorithm that allows using adaptive step-size parameters. More importantly, in the second half of the chapter, representing our works Devroye et al. [30, 31], we propose a method based on FPL in which perturbations are defined by independent symmetric random walks. We show that this intuitively appealing forecaster has similar regret and switch-number guarantees as SD and FPL with identical perturbations. A further important advantage of the new forecaster is that it can be used directly in the more general problem of online combinatorial (or, more generally, linear) optimization. We postpone the definitions and the statement of the results to Section 6.2.3 below.

Before presenting our algorithms, we set the stage for Chapter 7 by slightly changing our notation. For a number of practical problems, including the online lossy source coding problem, it is more suitable to regard the learner's task as having to minimize losses instead of having to maximize rewards. In accordance with the notation used by Cesa-Bianchi and Lugosi [25], we identify the elements of the action set $A$ with the natural numbers $\{1,2,\dots,N\}$ and denote the loss incurred for choosing action $i\in\{1,2,\dots,N\}$ at time $t$ by $\ell_{i,t}$. We assume that $\ell_{i,t}\in[0,1]$. The goal of the learner is to choose its actions $I_1,I_2,\dots,I_T$ so as to minimize its total expected regret

Algorithm 6.1 The modified Shrinking Dartboard algorithm

1. Set $\eta_t>0$ with $\eta_{t+1}\le\eta_t$ for all $t=1,2,\dots$, $\eta_0=\eta_1$, and $L_{i,0}=0$ and $w_{i,0}=1/N$ for all actions $i\in\{1,2,\dots,N\}$.

2. For $t=1,\dots,T$ do
   (a) Set $w_{i,t}=\frac{1}{N}e^{-\eta_t L_{i,t-1}}$ for all $i\in\{1,2,\dots,N\}$.
   (b) Set $p_{i,t}=\frac{w_{i,t}}{\sum_{j=1}^{N}w_{j,t}}$ for all $i\in\{1,2,\dots,N\}$.
   (c) Set $c_t=e^{(\eta_t-\eta_{t-1})(t-2)}$.
   (d) If $t\ge 2$, then with probability $c_t\frac{w_{I_{t-1},t}}{w_{I_{t-1},t-1}}$ set $I_t=I_{t-1}$, that is, do not change expert; otherwise choose $I_t$ randomly according to the distribution $\{p_{1,t},\dots,p_{N,t}\}$.
   (e) Observe the losses $\ell_{i,t}$ and set $L_{i,t}=L_{i,t-1}+\ell_{i,t}$ for all $i\in\{1,2,\dots,N\}$.

   end for

defined as
$$
\widehat L_T=\sum_{t=1}^{T}\ell_{I_t,t}-\min_{i\in\{1,\dots,N\}}\sum_{t=1}^{T}\ell_{i,t}.^{1}
$$
Further, define the number of action switches up to time $n$ by
$$
C_n=\bigl|\{1<t\le n: I_{t-1}\ne I_t\}\bigr|.
$$

We are interested in defining randomized forecasters that achieve a regret $\widehat L_T$ of the order $O(\sqrt{T\ln N})$ while keeping the number of action switches $C_n$ as small as possible.
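To make these definitions concrete, the following small helper computes $\widehat L_T$ and $C_T$ from a matrix of losses and a sequence of chosen actions. It is purely illustrative; the array layout (one row per time step) and the function name are our own conventions.

```python
import numpy as np

def regret_and_switches(losses, actions):
    """losses: array of shape (T, N) with entries in [0, 1];
    actions: length-T sequence of chosen action indices I_1, ..., I_T (0-based).
    Returns the regret \hat L_T and the number of switches C_T."""
    losses = np.asarray(losses)
    T = len(actions)
    learner_loss = sum(losses[t, actions[t]] for t in range(T))
    best_fixed = losses[:T].sum(axis=0).min()        # min_i sum_t loss_{i,t}
    switches = sum(actions[t] != actions[t - 1] for t in range(1, T))
    return learner_loss - best_fixed, switches
```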

6.1 The Shrinking Dartboard algorithm revisited

In this section, we present a modified version of the Shrinking Dartboard (SD) algorithm of Geulen et al. [39], which we call the modified SD (mSD) algorithm; it is shown as Algorithm 6.1. The difference between the SD and mSD algorithms is that mSD is horizon independent, which is achieved by introducing the constant $c_t$ in the algorithm (setting $\eta_t=\eta$ for all $t$, the mSD algorithm reduces to SD).

To see that the mSD algorithm is well-defined, we have to show that $c_t\frac{w_{i,t}}{w_{i,t-1}}\le 1$ for all $t$ and $i$. For $t=1$, the statement follows from the definitions, since $c_1=1$. For $t\ge 2$ it follows since
$$
\frac{w_{i,t}}{w_{i,t-1}}
=\exp\bigl(\eta_{t-1}L_{i,t-2}-\eta_t L_{i,t-1}\bigr)
\le\exp\bigl((\eta_{t-1}-\eta_t)L_{i,t-2}-\eta_t\ell_{i,t-1}\bigr)
\le\exp\bigl((\eta_{t-1}-\eta_t)(t-2)\bigr)
=1/c_t.
$$

^1 One can easily go back and forth between this notation and the one used in previous chapters by using the transformation $\ell_{i,t}=1-r_t(i)$ for all time steps and actions. Note that the regret is invariant under this transformation.

Note that the only difference between the mSD and EWA prediction algorithms is the presence of the first random choice in step 2(d) of mSD: while the EWA algorithm chooses a new action in each time step $t$ according to the distribution $\{p_{1,t},\dots,p_{N,t}\}$, the mSD algorithm sticks with the previously chosen action with some probability. By precise tuning of this probability, the method guarantees that actions are changed at most $O(\sqrt{T})$ times in $T$ time steps, while maintaining the same marginal distributions over the actions as the EWA algorithm. The latter fact guarantees that the expected regrets of the two algorithms are the same.
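The following Python sketch implements the sampling mechanism of Algorithm 6.1 for illustration, assuming an oblivious loss sequence supplied up front as an array; the function name and interface are our own, and the step labels in the comments refer to Algorithm 6.1.

```python
import numpy as np

def msd_play(losses, eta, rng=None):
    """Run the mSD forecaster on a (T, N) array of losses in [0, 1].

    eta: function t -> eta_t (1-based), assumed non-increasing, with eta_0 = eta_1.
    Returns the list of chosen actions I_1, ..., I_T (0-based)."""
    rng = np.random.default_rng() if rng is None else rng
    T, N = losses.shape
    L = np.zeros(N)                     # cumulative losses L_{i,t-1}
    w_prev = np.full(N, 1.0 / N)        # w_{i,0} = 1/N
    I, chosen = None, []
    for t in range(1, T + 1):
        w = np.exp(-eta(t) * L) / N                          # step (a)
        p = w / w.sum()                                      # step (b)
        c = np.exp((eta(t) - eta(max(t - 1, 1))) * (t - 2))  # step (c)
        stick = I is not None and rng.random() < c * w[I] / w_prev[I]
        if not stick:                                        # step (d)
            I = int(rng.choice(N, p=p))
        chosen.append(I)
        L += losses[t - 1]                                   # step (e)
        w_prev = w
    return chosen
```

For instance, the horizon-independent tuning $\eta_t=2\sqrt{\ln N/t}$ appearing in Remark 6.1 and Lemma 6.3 below corresponds to `eta = lambda t: 2 * np.sqrt(np.log(N) / t)`.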

In the following we formalize the above statements concerning the mSD algorithm. The next lemma shows that the marginal distributions generated by the mSD and EWA algorithms are the same. The lemma is obtained by a slight modification of the proof of Lemma 1 in [39].

Lemma 6.1. Assume the mSD algorithm is run with $\eta_{t+1}\le\eta_t$ for all $t=1,2,\dots,T$. Then the probability of selecting action $i$ at time $t$ satisfies $\mathbb{P}[I_t=i]=p_{i,t}$ for all $t=1,2,\dots$ and $i\in\{1,2,\dots,N\}$.

Proof. We will use the notation $W_t=\sum_{i=1}^{N}w_{i,t}$. We prove the lemma by induction on $1\le t\le T$. For $t=1$, the statement follows from the definition of the algorithm. Now assume that $t\ge 2$ and the hypothesis holds for $t-1$. For all $i\in\{1,2,\dots,N\}$, we have
$$
\begin{aligned}
\mathbb{P}[I_t=i]
&=\mathbb{P}[I_{t-1}=i]\,c_t\frac{w_{i,t}}{w_{i,t-1}}
  +p_{i,t}\sum_{j=1}^{N}\mathbb{P}[I_{t-1}=j]\left(1-c_t\frac{w_{j,t}}{w_{j,t-1}}\right)\\
&=p_{i,t-1}\,c_t\frac{w_{i,t}}{w_{i,t-1}}
  +p_{i,t}\sum_{j=1}^{N}p_{j,t-1}\left(1-c_t\frac{w_{j,t}}{w_{j,t-1}}\right)\\
&=c_t\frac{w_{i,t-1}}{W_{t-1}}\cdot\frac{w_{i,t}}{w_{i,t-1}}
  +\frac{w_{i,t}}{W_t}\sum_{j=1}^{N}\frac{w_{j,t-1}}{W_{t-1}}\left(1-c_t\frac{w_{j,t}}{w_{j,t-1}}\right)\\
&=c_t\frac{w_{i,t}}{W_{t-1}}+\frac{w_{i,t}}{W_t}-c_t\frac{w_{i,t}}{W_t}\cdot\frac{W_t}{W_{t-1}}
 =\frac{w_{i,t}}{W_t}=p_{i,t}.
\end{aligned}
$$
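As a quick empirical sanity check of Lemma 6.1 (using the illustrative `msd_play` sketch given earlier and synthetic losses), the frequency of $I_t=i$ over many independent runs should match the EWA weights $p_{i,t}$ up to sampling noise:

```python
import numpy as np

T, N, runs = 5, 3, 20000
losses = np.random.default_rng(1).random((T, N))
eta = lambda t: 2 * np.sqrt(np.log(N) / t)

counts = np.zeros((T, N))
for seed in range(runs):
    for t, i in enumerate(msd_play(losses, eta, rng=np.random.default_rng(seed))):
        counts[t, i] += 1

# EWA marginals p_{i,t}, computed directly from the cumulative losses L_{i,t-1}.
cum = np.vstack([np.zeros(N), np.cumsum(losses, axis=0)[:-1]])
w = np.exp(-np.array([eta(t) for t in range(1, T + 1)])[:, None] * cum)
p = w / w.sum(axis=1, keepdims=True)
print(np.abs(counts / runs - p).max())   # should be small (sampling noise only)
```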

As a consequence of this result, the expected regret of mSD matches that of EWA, so the performance bound of EWA holds for the mSD algorithm as well [39, Lemma 2]. That is, the following result can be obtained by a slight modification of the proof of [41, Lemma 1]. We include the proof for completeness.

Lemma 6.2. Assume $\eta_{t+1}\le\eta_t$ for all $t=1,2,\dots,T$. Then the total expected regret of the mSD algorithm can be bounded as
$$
\mathbb{E}\bigl[\widehat L_T\bigr]\le\sum_{t=1}^{T}\frac{\eta_t}{8}+\frac{\ln N}{\eta_T}.
$$

Proof. Introduce the following notation:
$$
w'_{i,t}=\frac{1}{N}e^{-\eta_{t-1}L_{i,t-1}},
$$
where $L_{i,t-1}=\sum_{s=1}^{t-1}\ell_{i,s}$. Note that the difference between $w_{i,t}$ and $w'_{i,t}$ is that $\eta_t$ is replaced by $\eta_{t-1}$ in the latter. We will also use $W'_t=\sum_{i=1}^{N}w'_{i,t}$. First, we have
$$
\frac{1}{\eta_t}\ln\frac{W'_{t+1}}{W_t}
=\frac{1}{\eta_t}\ln\frac{\sum_{i=1}^{N}w_{i,t}e^{-\eta_t\ell_{i,t}}}{W_t}
=\frac{1}{\eta_t}\ln\sum_{i=1}^{N}p_{i,t}e^{-\eta_t\ell_{i,t}}
\le-\sum_{i=1}^{N}p_{i,t}\ell_{i,t}+\frac{\eta_t}{8}
\le-\mathbb{E}\bigl[\ell_{I_t,t}\bigr]+\frac{\eta_t}{8},
$$
where the second to last step follows from Hoeffding's inequality (see, e.g., [25, Lemma A.1]) and the fact that $\ell_{i,t}\in[0,1]$. After rearranging, we get
$$
\mathbb{E}\bigl[\ell_{I_t,t}\bigr]\le-\frac{1}{\eta_t}\ln\frac{W'_{t+1}}{W_t}+\frac{\eta_t}{8}. \tag{6.1}
$$
Rewriting the first term on the right-hand side, we obtain
$$
\frac{\ln W_t-\ln W'_{t+1}}{\eta_t}
=\left(\frac{\ln W_t}{\eta_t}-\frac{\ln W_{t+1}}{\eta_{t+1}}\right)
+\left(\frac{\ln W_{t+1}}{\eta_{t+1}}-\frac{\ln W'_{t+1}}{\eta_t}\right).
$$
The first term can be telescoped as
$$
\sum_{t=1}^{T}\left(\frac{\ln W_t}{\eta_t}-\frac{\ln W_{t+1}}{\eta_{t+1}}\right)
=\frac{\ln W_1}{\eta_1}-\frac{\ln W_{T+1}}{\eta_{T+1}}
\le-\frac{\ln w_{i,T+1}}{\eta_{T+1}}
=-\frac{1}{\eta_{T+1}}\ln\left(\frac{1}{N}e^{-\eta_{T+1}L_{i,T}}\right)
=L_{i,T}+\frac{\ln N}{\eta_{T+1}},
$$
for any $1\le i\le N$, where we used that $W_1=1$ and $W_{T+1}\ge w_{i,T+1}$. Now to deal with the second term, observe that
$$
W_{t+1}=\sum_{i=1}^{N}\frac{1}{N}e^{-\eta_{t+1}L_{i,t}}
=\sum_{i=1}^{N}\frac{1}{N}\bigl(e^{-\eta_t L_{i,t}}\bigr)^{\frac{\eta_{t+1}}{\eta_t}}
\le\left(\sum_{i=1}^{N}\frac{1}{N}e^{-\eta_t L_{i,t}}\right)^{\frac{\eta_{t+1}}{\eta_t}}
=\bigl(W'_{t+1}\bigr)^{\frac{\eta_{t+1}}{\eta_t}},
$$
where we have used that $\eta_{t+1}\le\eta_t$ and thus we can apply Jensen's inequality for the concave function $x^{\eta_{t+1}/\eta_t}$, $x\ge 0$. Thus we have
$$
\frac{\ln W_{t+1}}{\eta_{t+1}}-\frac{\ln W'_{t+1}}{\eta_t}
\le\frac{\frac{\eta_{t+1}}{\eta_t}\ln W'_{t+1}}{\eta_{t+1}}-\frac{\ln W'_{t+1}}{\eta_t}=0.
$$
Substituting into (6.1) and summing for all $t=1,2,\dots,T$, we obtain
$$
\sum_{t=1}^{T}\mathbb{E}\bigl[\ell_{I_t,t}\bigr]\le L_{i,T}+\frac{1}{8}\sum_{t=1}^{T}\eta_t+\frac{\ln N}{\eta_{T+1}}.
$$
Finally, since the losses $\ell_{i,t}$, $i\in\{1,2,\dots,N\}$, and $\ell_{I_t,t}$ do not depend on $\eta_{T+1}$ for $t\le T$, we can choose, without loss of generality, $\eta_{T+1}=\eta_T$, and the statement of the lemma follows.

Remark 6.1. Setting $\eta_t=\sqrt{\ln N/T}$ optimally (as a function of the time horizon $T$), the bound becomes $\sqrt{T\ln N/2}$, while setting $\eta_t=2\sqrt{\ln N/t}$, independently of $T$, we have $\mathbb{E}\bigl[\widehat L_T\bigr]\le\sqrt{T\ln N}$.

Now let us consider the total number of action switches $C_T$. The next lemma, which is a slightly improved and generalized version of Lemma 2 from [39], gives an upper bound on the expectation of $C_T$.

Lemma 6.3. Let $C_T$ denote the number of times the mSD algorithm switches between different actions. Then
$$
\mathbb{E}[C_T]\le\eta_T L_{T-1}+\ln N+\sum_{t=1}^{T-1}(\eta_t-\eta_T),
$$
where $L_{T-1}=\min_{i\in\{1,2,\dots,N\}}L_{i,T-1}$.

Proof. The probability of switching experts in step $t\ge 2$ is
$$
\begin{aligned}
\alpha_t\;&\stackrel{\mathrm{def}}{=}\;\mathbb{P}[I_{t-1}\ne I_t]
=\sum_{i=1}^{N}\mathbb{P}[I_{t-1}=i]\left(1-c_t\frac{w_{i,t}}{w_{i,t-1}}\right)\bigl(1-p_{i,t}\bigr)\\
&\le\sum_{i=1}^{N}\mathbb{P}[I_{t-1}=i]\left(1-c_t\frac{w_{i,t}}{w_{i,t-1}}\right)
=1-\sum_{i=1}^{N}\frac{w_{i,t-1}}{W_{t-1}}\,c_t\frac{w_{i,t}}{w_{i,t-1}}
=1-c_t\frac{W_t}{W_{t-1}},
\end{aligned}
$$
where in the last line we used Lemma 6.1. Reordering gives $W_t\le\frac{1-\alpha_t}{c_t}W_{t-1}$ and thus
$$
W_T\le W_1\prod_{t=2}^{T}\frac{1-\alpha_t}{c_t}=\prod_{t=2}^{T}\frac{1-\alpha_t}{c_t}.
$$
On the other hand,
$$
W_T\ge w_{j,T}\ge\frac{1}{N}e^{-\eta_T L_{j,T-1}}
$$
for arbitrary $j\in\{1,2,\dots,N\}$. Taking logarithms of both inequalities and putting them together, we get
$$
-\ln N-\eta_T L_{T-1}=-\ln N-\eta_T\min_{j\in\{1,2,\dots,N\}}L_{j,T-1}
\le\sum_{t=2}^{T}\ln(1-\alpha_t)-\sum_{t=2}^{T}\ln c_t.
$$
Now using $\ln(1-x)\le-x$ for all $x\in[0,1)$, we obtain
$$
\mathbb{E}[C_T]=\sum_{t=2}^{T}\alpha_t\le\eta_T L_{T-1}+\ln N-\sum_{t=2}^{T}\ln c_t.
$$
Now the statement of the lemma follows since
$$
-\sum_{t=2}^{T}\ln c_t=\sum_{t=2}^{T}(\eta_{t-1}-\eta_t)(t-2)\le\sum_{t=1}^{T-1}(\eta_t-\eta_T).
$$

In particular, for $\eta_t=\sqrt{\ln N/T}$ we have $\mathbb{E}[C_T]\le\sqrt{T\ln N}+\ln N$, while setting $\eta_t=2\sqrt{\ln N/t}$, we obtain $\mathbb{E}[C_T]\le 4\sqrt{T\ln N}+\ln N$.
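A rough empirical illustration of these guarantees, again using the illustrative `msd_play` and `regret_and_switches` sketches above with synthetic losses: the bounds of Lemmas 6.2 and 6.3 apply to the expectations, so the empirical averages should typically fall below them.

```python
import numpy as np

T, N, runs = 300, 5, 200
losses = np.random.default_rng(3).random((T, N))
eta = lambda t: 2 * np.sqrt(np.log(N) / t)

regrets, switches = [], []
for seed in range(runs):
    I = msd_play(losses, eta, rng=np.random.default_rng(seed))
    R, C = regret_and_switches(losses, I)
    regrets.append(R)
    switches.append(C)

regret_bound = sum(eta(t) / 8 for t in range(1, T + 1)) + np.log(N) / eta(T)
L_min = losses[:T - 1].sum(axis=0).min()              # min_i L_{i,T-1}
switch_bound = eta(T) * L_min + np.log(N) + sum(eta(t) - eta(T) for t in range(1, T))
print(np.mean(regrets), "<=", regret_bound)
print(np.mean(switches), "<=", switch_bound)
```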