
Online learning in non-stationary Markov decision processes

Gergely Neu

neu.gergely@gmail.com

Thesis submitted to the

Budapest University of Technology and Economics

in partial fulfilment for the award of the degree of

Doctor of Philosophy in Informatics

Supervised by Dr. László Györfi, Dr. András György, and Dr. Csaba Szepesvári

Department of Computing Science and Information Technology
Magyar tudósok körútja 2.

Budapest, HUNGARY March 2013


Abstract

This thesis studies the problem of online learning in non-stationary Markov decision processes where the reward function is allowed to change over time. In every time step of this sequential decision problem, a learner has to choose one of its available actions after observing some part of the current state of the environment. The chosen action influences the observable state of the environment in a stochastic fashion and earns the learner some reward. However, the entire state (be it observed or not) also influences the reward. The goal of the learner is to maximize the total (non-discounted) reward that it receives. In this work, we assume that the unobserved part of the state evolves autonomously, uninfluenced by the observed part of the state or the actions chosen by the learner, and thus corresponds to a state sequence generated by an oblivious adversary such as nature. Otherwise, absolutely no statistical assumption is made about the mechanism generating the unobserved state variables. This setting fuses two important paradigms of learning theory: online learning and reinforcement learning. In this thesis, we propose and analyze a number of algorithms designed to work under various assumptions about the dynamics of the stochastic process characterizing the evolution of observable states.

For all of these algorithms, we provide bounds on the regret, defined as the performance gap between the total reward gathered by the learner and the total reward of the best available fixed policy.


Acknowledgments

First and foremost, I would like to thank my supervisors Csaba Szepesvári and András György. I cannot be thankful enough to Csaba, who introduced me to the exciting field of reinforcement learning and guided my first steps as a researcher. Later on, Andris initiated me into the subject of online prediction, which became my main interest through the years. Working with the two of them greatly inspired me both professionally and personally, and I'm deeply grateful for all their help. I would also like to thank László Györfi, who efficiently supported my work by encouraging me whenever I was uncertain about the next step.

I would also like to thank all my colleagues, ex-colleagues and related folks from the Department of Computer Science and Information Theory at the Budapest University of Technology and Economics (especially András Temesváry, Márk Horváth, Márta Pintér, László Ketskeméty and András Telcs for teaming up with me during the Great Frankfurt Snowstorm of 2010), the Reinforcement Learning and Artificial Intelligence Lab at the University of Alberta (especially Gábor Bartók along with his lovely family, Gábor Balázs, István Szita and Réka Mihalka, Dávid Pál, and Yasin Abbasi-Yadkori) and MTA SZTAKI (especially András Antos, the coauthor of the results presented in Chapter 4). Once again, I am very grateful to Csaba and his amazing family for welcoming me several times in Edmonton. I am also thankful to the unstoppable statistician/cyclists Gábor Lugosi and Luc Devroye, also known as the coauthors of the results presented in Chapter 6.

Finally, I would like to express my gratitude to the people who obviously had the greatest impact on my entire life: my parents, brother and grandmother. Most of all, I am eternally grateful to Éva Kóczi for her endless love and support through the ups and downs of the past years.


In loving memory of József Lancz


Contents

1 Introduction
  1.1 The learning model
  1.2 Contributions of the thesis
  1.3 Related work
  1.4 Applications

2 Background
  2.1 Online prediction of arbitrary sequences
  2.2 Stochastic multi-armed bandits
  2.3 Markov decision processes
    2.3.1 Loop-free stochastic shortest path problems
    2.3.2 Unichain MDPs

3 Online learning in known stochastic shortest path problems
  3.1 Problem setup
  3.2 A decomposition of the regret
  3.3 Full Information O-SSP
  3.4 Bandit O-SSP using black-box algorithms
    3.4.1 Expected regret against stationary policies
    3.4.2 Tracking
  3.5 Bandit O-SSP using Exp3
    3.5.1 Expected regret against stationary policies
    3.5.2 Tracking
    3.5.3 The case of α = 0
    3.5.4 A bound that holds with high probability
  3.6 Simulations
  3.7 The proof of Lemma 3.2
  3.8 A technical lemma

4 Online learning in known unichain Markov decision processes
  4.1 Problem setup
  4.2 A decomposition of the regret
  4.3 Full information O-MDP
  4.4 Bandit O-MDP
  4.5 The proofs of Propositions 4.1 and 4.2
    4.5.1 The change rate of the learner's policies
    4.5.2 Proof of Proposition 4.1
    4.5.3 Proof of Proposition 4.2

5 Online learning in unknown stochastic shortest path problems
  5.1 Problem setup
  5.2 Rewriting the regret
  5.3 Follow the perturbed optimistic policy
    5.3.1 The confidence set for the transition function
    5.3.2 Extended dynamic programming
  5.4 Regret bound
  5.5 Extended dynamic programming: technical details

6 Online learning with switching costs
  6.1 The Shrinking Dartboard algorithm revisited
  6.2 Prediction by random-walk perturbation
    6.2.1 Regret and number of switches
    6.2.2 Bounding the number of switches
    6.2.3 Online combinatorial optimization

7 Online lossy source coding
  7.1 Related work
  7.2 Limited-delay limited-memory sequential source codes
  7.3 The Algorithm
  7.4 Sequential zero-delay lossy source coding
  7.5 Extensions

8 Conclusions
  8.1 The gap between the lower and upper bounds
  8.2 Outlook
  8.3 The complexity of our algorithms
  8.4 Online learning with switching costs
  8.5 Closing the gap for online lossy source coding


Chapter 1

Introduction

The work in this thesis generalizes two major fields of sequential machine learning theory: online learning [25, 20] and reinforcement learning [86, 17, 92, 85]. The characteristics of these learning models can be summarized as follows:

Online learning: In each time step t = 1, 2, . . . , T of the standard online learning (or online prediction) problem, the learner selects an action a_t from a finite action space A, and consequently earns some reward r_t(a_t). The goal of the learner is to maximize its total expected reward. This problem can be easily treated by standard statistical machinery if the sequence of reward functions is generated in an i.i.d. fashion (that is, the rewards (r_t(a))_{t=1}^T are independent and identically distributed). However, this assumption does not account for dynamic data, let alone acting in a reactive environment. The power of the online learning framework lies in the fact that it does not require any statistical assumptions to be made about the data generation process: it is assumed that the sequence of reward functions (r_t)_{t=1}^T, r_t : A → [0, 1], is an arbitrary fixed sequence chosen by an external mechanism referred to as the environment or the adversary. Of course, by dropping the strong statistical assumptions on the reward sequence, we can no longer hope to explicitly maximize the total cumulative reward ∑_{t=1}^T r_t(a_t), and thus have to settle for a less ambitious goal. This goal is to minimize the performance gap between our algorithm and the strategy that selects the best action fixed in hindsight. This performance gap is called the regret and is defined formally as

$$L_T = \max_{a \in A} \sum_{t=1}^{T} r_t(a) - \sum_{t=1}^{T} r_t(a_t).$$

It is important to note that the best fixed action in the above expression can only be computed in full knowledge of the sequence of reward functions. While minimizing the regret intuitively seems to be very difficult, it is by now a very well understood problem, even in the significantly more challenging bandit setting where the learner only observes r_t(a_t) after making its decision. In recent years, numerous algorithms have been proposed for different versions of the online learning problem that consider different assumptions on the action space A and the amount of information revealed to the learner. The main shortcoming of this problem formulation is that it does not adequately account for the influence of the previous actions a_1, . . . , a_{t-1} on the reward function r_t; that is, it assumes that the decisions of the learner do not influence the mechanism generating the rewards. The formalism presented in this thesis provides a way of modeling and coping with such effects.
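As a concrete illustration of the full-information problem just described, the following sketch implements the exponentially weighted average forecaster, one standard algorithm for this setting (discussed further in Chapter 2), and evaluates its realized regret on a toy reward sequence. The learning-rate value and the alternating reward sequence are illustrative choices made only for this example.

```python
import math
import random

def ewa_regret(rewards, eta=0.1):
    """Run the exponentially weighted average forecaster on a reward sequence
    and return its realized regret against the best fixed action.

    rewards: list of lists; rewards[t][a] in [0, 1] is the reward of action a
             at round t (full-information feedback).
    eta:     learning rate (illustrative choice).
    """
    T, K = len(rewards), len(rewards[0])
    weights = [1.0] * K
    total_learner = 0.0
    for t in range(T):
        # Sample an action from the exponential-weights distribution.
        norm = sum(weights)
        probs = [w / norm for w in weights]
        a_t = random.choices(range(K), weights=probs)[0]
        total_learner += rewards[t][a_t]
        # Full information: every action is updated with its observed reward.
        for a in range(K):
            weights[a] *= math.exp(eta * rewards[t][a])
    best_fixed = max(sum(rewards[t][a] for t in range(T)) for a in range(K))
    return best_fixed - total_learner  # realized regret L_T

# Example: two actions with adversarially alternating rewards.
rewards = [[1.0, 0.0] if t % 2 == 0 else [0.0, 1.0] for t in range(1000)]
print(ewa_regret(rewards))
```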

Reinforcement learning: In every time step t = 1, 2, . . . , T of the standard reinforcement learning (RL) problem, the learner (or agent) observes the state x_t of the environment, selects an action a_t from a finite action space A, and consequently earns some reward r(x_t, a_t). Finally, the next state x_{t+1} of the environment is drawn from the distribution P(·|x_t, a_t). It is assumed that the state space X of the environment is finite, and that the reward function r : X × A → [0, 1] and the transition function P : X × X × A → [0, 1] are fixed but unknown. The goal of the learner is to maximize its total reward in the Markov decision process (MDP) described above by the tuple (X, A, P, r). It is commonplace to consider learning algorithms that map the states of the environment to actions with a stationary state-feedback policy (or, in short, a policy) π : X → A. A policy can be evaluated in terms of its total expected reward after T time steps in the MDP:

$$R_T^{\pi} = \mathbb{E}\left[\sum_{t=1}^{T} r\bigl(x'_t, \pi(x'_t)\bigr)\right],$$

where x'_{t+1} ∼ P(·|x'_t, π(x'_t)). Writing the total expected reward of the learner as

$$\widehat{R}_T = \mathbb{E}\left[\sum_{t=1}^{T} r(x_t, a_t)\right],$$

we can then define a natural performance measure of a reinforcement learning algorithm using the following notion of regret:

$$\widehat{L}_T = \max_{\pi} R_T^{\pi} - \widehat{R}_T.$$

The regret measures the amount of “lost reward” during the learning process, that is, the price of not knowing r and P before the learning process starts.

The main limitation of the MDP formalism is that it does not account for non-random state dynamics that cannot be captured by the Markovian model P. In this thesis, we present a formalism that goes beyond the standard stochastic assumptions made in the reinforcement learning literature and provide algorithms with good theoretical performance guarantees.

In this thesis we study complex reinforcement learning problems where the performance of learning algorithms is measured by the total reward they collect during learning, and the assumption that the state is completely Markovian is relaxed. Our formalism is based on principles of reinforcement learning and online learning: we regard these complex problems as Markov decision problems where the reward functions are allowed to change over time, and propose algorithms with theoretically guaranteed bounds on their performance. The performance criterion that we address is the worst-case regret, which is typically considered in the online learning literature. Learning in this model is called the online MDP or O-MDP problem.

The main idea of our approach relies on the observation that in a number of practical problems, the hard-to-model, complex part of the environment influences only the rewards that the learner receives. In the rest of this chapter, we describe our general model and show how it precisely relates to the two learning paradigms described above. In Section 1.2, we summarize the contributions of the present thesis. The most important other results from the literature are discussed in Section 1.3. We conclude this chapter with Section 1.4, where we briefly describe some practical problems where our formalism can be applied.

The rest of the thesis is organized as follows: In Chapter 2, we review some relevant concepts from online learning and reinforcement learning. The purpose of the chapter is to set up the formal definitions later to be used in the thesis. In Chapters 3–5, we discuss different versions of the online MDP problem and propose algorithms for the specified learning problems. We provide a rigorous theoretical analysis for each of the proposed algorithms. In Chapter 6, we describe a special case of our setting called online learning with switching costs. For this problem, we provide two algorithms with optimal performance guarantees. In Chapter 7, we apply the methods described in Chapter 6 to construct adaptive coding schemes for the problem of online lossy source coding. We show that the proposed algorithm enjoys near-optimal performance guarantees. Since it is difficult to evaluate the contributions of Chapters 3–5 separately, we present conclusions for all chapters in Chapter 8.

1.1 The learning model

The interaction between the learner and the environment is shown in Figure 1.1. The environment is split into two parts: one part that has Markovian dynamics and another with unrestricted, autonomous dynamics. In each discrete time step t, the agent receives the state x_t of the Markovian environment and possibly the previous state y_{t-1} of the autonomous dynamics. The learner then makes a decision about the next action a_t, which is sent to the environment. The environment then makes a transition: the next state of the Markovian environment depends stochastically on the current state and the chosen action as x_{t+1} ∼ P(·|x_t, a_t), while y_{t+1} is generated by an autonomous dynamic that is influenced neither by the learner's actions nor by the state of the Markovian environment. After this transition, the agent receives a reward depending on the complete state of the environment and the chosen action, and then the process continues. The goal of the learner is to collect as much reward as possible. The modeling philosophy is that whatever information about the environment's dynamics can be modeled should be modeled in the Markovian part; the remaining “unmodeled dynamics” is what constitutes the autonomous part of the environment.
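The interaction protocol described above can be summarized by the following schematic loop. All of the objects appearing here (the transition sampler, the reward function, the autonomous dynamics, and the learner interface) are hypothetical placeholders introduced only to fix the order of events; they do not refer to any specific algorithm analyzed in the thesis.

```python
def run_online_mdp(sample_next_state, reward, autonomous_step, learner, x1, y0, T):
    """Schematic interaction loop of the online MDP model (all arguments are
    hypothetical placeholders, not objects defined in the thesis).

    sample_next_state(x, a): draws the next Markovian state from P(.|x, a)
    reward(x, a, y):         reward depending on the complete state (x, y) and action a
    autonomous_step(y):      next state of the uncontrolled dynamics (ignores x and a)
    learner:                 exposes act(x, y_prev) and feedback(r)
    """
    x, y_prev = x1, y0
    total_reward = 0.0
    for t in range(1, T + 1):
        a = learner.act(x, y_prev)        # observe x_t and possibly y_{t-1}, choose a_t
        y = autonomous_step(y_prev)       # y_t evolves autonomously
        r = reward(x, a, y)               # r_t(x_t, a_t) = r(x_t, a_t, y_t)
        learner.feedback(r)
        total_reward += r
        x = sample_next_state(x, a)       # x_{t+1} ~ P(.|x_t, a_t)
        y_prev = y
    return total_reward
```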

A large number of practical operations research and control problems have the above outlined structure. These problems include production and resource allocation problems, where the major source of difficulty is to model prices, various problems in computer science, such as the k-server problem, paging problems, or web-optimization problems, such as ad-allocation problems with delayed information [see, e.g., 33, 105].

Figure 1.1: The interaction between the learner and the environment. At time t, the agent's action is a_t, the state of the Markovian dynamics is x_t, and the state of the uncontrolled dynamics is y_t; q^{-1} is a one-step delay operator.

In the rest of the thesis, for simplicity, by slightly abusing terminology, we call the state x_t of the Markovian part “the state” and regard dependency on y_t as dependency on t by letting r_t(·,·) = r(·,·,y_t). The goal of the learner is to maximize its total expected reward

$$\widehat{R}_T = \mathbb{E}\left[\sum_{t=1}^{T} r_t(x_t, a_t) \,\middle|\, P\right],$$

where the notation E[·|P] is used to emphasize that the state sequence (x_t)_{t=1}^T is generated by the transition function P. Controllers of the form π : X → A are called stationary deterministic policies, where X is the state space of the Markovian part of the environment and A is the set of actions. The performance of policy π is measured by the total expected reward

$$R_T^{\pi} = \mathbb{E}\left[\sum_{t=1}^{T} r_t\bigl(x'_t, \pi(x'_t)\bigr) \,\middle|\, P, \pi\right],$$

where (x'_t)_{t=1}^T is the state sequence obtained by following policy π in the MDP described by P. The learner's goal is to perform nearly as well as the best fixed stationary policy in hindsight in terms of the total reward collected, that is, to minimize the following quantity:

$$\widehat{L}_T = \max_{\pi} R_T^{\pi} - \widehat{R}_T. \tag{1.1}$$

In other words, we are interested in constructing algorithms that minimize the total expected regret defined as the gap between the total accumulated reward of the learner and the best fixed controller.
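To make the benchmark in (1.1) concrete, the following sketch estimates R_T^π for every stationary deterministic policy of a small finite MDP by Monte Carlo simulation; the regret of a learner is then the gap between the best such value and the learner's accumulated reward. The data structures and the brute-force enumeration are illustrative assumptions, feasible only for tiny state and action spaces.

```python
import itertools
import random

def rollout(P, rewards, policy, x0, T, n_runs=200):
    """Monte Carlo estimate of R_T^pi: the expected total reward collected by
    following a fixed stationary policy for T steps under transition function P.

    P[x][a]    : list of next-state probabilities
    rewards[t] : dict with rewards[t][(x, a)] in [0, 1]
    policy     : tuple, policy[x] is the action taken in state x
    """
    total = 0.0
    for _ in range(n_runs):
        x = x0
        for t in range(T):
            a = policy[x]
            total += rewards[t][(x, a)]
            x = random.choices(range(len(P)), weights=P[x][a])[0]
    return total / n_runs

def best_stationary_reward(P, rewards, n_actions, x0, T):
    """Maximum of R_T^pi over all stationary deterministic policies (brute force)."""
    n_states = len(P)
    best = float("-inf")
    for policy in itertools.product(range(n_actions), repeat=n_states):
        best = max(best, rollout(P, rewards, policy, x0, T))
    return best

# The regret (1.1) of a learner that collected `learner_total` reward is then
#   regret = best_stationary_reward(P, rewards, n_actions, x0, T) - learner_total
```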

Naturally, no assumptions can be made about the autonomous part of the environment, as it is assumed that modeling this part of the environment lies outside the capabilities of the learner. Guaranteeing a low regret is equivalent to a robust control guarantee: the guarantee on the performance must hold no matter how the autonomous state sequence (y_t), or equivalently, the reward sequence (r_t), is chosen. The potential benefit is that the results will be more generally applicable and the algorithms will enjoy added robustness, while, generalizing from results available for supervised learning [23, 62, 87], the algorithms can also avoid being too pessimistic despite the strong worst-case guarantees.¹

¹Sometimes, robustness is associated with conservative choices and thus poor “average” performance. Although we do not study this question here, we note in passing that the algorithms we build upon have “adaptive variants” that are known to adapt to the environment in the sense that their performance improves when the environment is “less adversarial”.

1.2 Contributions of the thesis

We have studied the above problem under various assumptions on the structure of the underlying MDP and the feedback provided to the learner. To be able to state our contributions, we first describe our assumptions informally in the following list. The precise assumptions are presented in Chapter 2.

1. Loop-free episodic environments are episodic MDPs where transitions are only possible in a “forward” manner. Episodic MDPs capture learning problems where the learner has to repeatedly perform similar tasks consisting of multiple state transitions. At the beginning of each episode, the learner starts from a fixed state x_0 and the episode ends when the goal state x_L is reached. We assume that all other states in the state space X can only be visited once, thus the transition structure does not allow loops. The reward function r_t : X × A → [0, 1] remains fixed during each episode t = 1, 2, . . . , T, but can change arbitrarily between consecutive episodes. In each time step l = 0, 1, . . . , L - 1 of episode t, the learner observes its state x_l^{(t)} and has to decide about its action a_l^{(t)}. The total expected reward of the learner is defined as

$$\widehat{R}_T = \mathbb{E}\left[\sum_{t=1}^{T} \sum_{l=0}^{L-1} r_t\bigl(x_l^{(t)}, a_l^{(t)}\bigr) \,\middle|\, P\right],$$

and the total expected reward of policy π is defined as

$$R_T^{\pi} = \mathbb{E}\left[\sum_{t=1}^{T} \sum_{l=0}^{L-1} r_t\bigl(x'_l, a'_l\bigr) \,\middle|\, P, \pi\right],$$

where we used the notation E[·|P, π] to emphasize that the trajectory (x'_l, a'_l)_{l=0}^{L-1} is generated by the transition model P and policy π. The minimal visitation probability α is defined as

$$\alpha = \min_{x \in X} \, \min_{\pi \in A^{X}} \mathbb{P}\left[\exists l : x'_l = x \,\middle|\, P, \pi\right].$$

This problem will often be referred to as the online stochastic shortest path (O-SSP) problem.

(a) Full feedback with known transitions: We assume that the transition function P is fully known before the first episode and the reward function r_t is entirely revealed after episode t.

(b) Bandit feedback with known transitions: We assume that the transition function P is fully known before the first episode, but the reward function r_t is only revealed along the trajectory traversed by the learner in episode t. In other words, the feedback provided to the learner after episode t is

$$\Bigl(x_l^{(t)}, a_l^{(t)}, r_t\bigl(x_l^{(t)}, a_l^{(t)}\bigr)\Bigr)_{l=0}^{L-1}.$$

(c) Full feedback with unknown transitions: We assume that P is unknown to the learner, but the reward function r_t is entirely revealed after episode t. The layer structure of the state space and the action space are assumed to be known, and the traversed trajectory is also revealed to the learner. In other words, the feedback provided to the learner after episode t is

$$\Bigl(\bigl(x_l^{(t)}, a_l^{(t)}\bigr)_{l=0}^{L-1},\, r_t\Bigr).$$

2. Unichain environments are continuing MDPs where no episodes are specified. In each time step t = 1, 2, . . . , T, the learner observes the state x_t and has to decide about its action a_t, while the reward function r_t : X × A → [0, 1] is also allowed to change after each time step. For any stationary policy π : X → A, we define the elements of the transition kernel P^π as

$$P^{\pi}(x|y) = P\bigl(x \mid y, \pi(y)\bigr)$$

for all x, y ∈ X. We assume that for each policy π, there exists a unique probability distribution μ^π over the state space that satisfies

$$\mu^{\pi}(x) = \sum_{y \in X} \mu^{\pi}(y)\, P^{\pi}(x|y)$$

for all x ∈ X. The distribution μ^π is called the stationary distribution corresponding to policy π. The minimal stationary visitation probability α' is defined as

$$\alpha' = \min_{x \in X} \, \min_{\pi \in A^{X}} \mu^{\pi}(x).$$

We assume that every policy π has a finite mixing time τ^π > 0 that specifies the speed of convergence to the stationary distribution μ^π. The total expected reward of the learner is defined as

$$\widehat{R}_T = \mathbb{E}\left[\sum_{t=1}^{T} r_t(x_t, a_t) \,\middle|\, P\right],$$

and the total expected reward of any policy π is defined as

$$R_T^{\pi} = \mathbb{E}\left[\sum_{t=1}^{T} r_t(x'_t, a'_t) \,\middle|\, P, \pi\right],$$

where the trajectory (x'_t, a'_t) is generated by following policy π in the MDP specified by P.

(a) Full feedback with known transitions: We assume that the transition function P is fully known before the first time step and the reward function r_t is entirely revealed after time step t.

(b) Bandit feedback with known transitions: We assume that the transition function P is fully known before the first time step, but the reward function r_t is only revealed in the state-action pair visited by the learner in time step t. In other words, the feedback provided to the learner after time step t is

$$\bigl(r_t(x_t, a_t),\, x_{t+1}\bigr).$$

3. Online learning with switching costs is a special version of the online prediction problem where switching between actions is subject to some cost K > 0. Alternatively, this problem can be regarded as a special case of our general setting, as we can construct a simple online MDP that can be used to model all online learning problems where switching between experts is expensive (see the construction sketch after this list). The online MDP (X, A, P, (r_t)_{t=1}^T) in question is specified as follows: the state x_{t+1} of the environment is identical to the previously selected action a_t. In other words, X = A and the transition function P is such that for all x, y, z ∈ A, P(y|x, y) = 1 and P(z|x, y) = 0 if z ≠ y. The reward function in this online MDP is defined using the original reward function g_t : A → [0, 1] of the prediction problem and the switching cost K as

$$r_t(x, a) = g_t(a) - K\, \mathbb{I}\{a \neq x\}$$

for all (x, a) ∈ A². Note that K is allowed to be much larger than the maximal reward of 1. We consider two subclasses of online learning problems with switching costs.

(a) Online prediction with expert advice: We assume that the action set A is relatively small and the environment is free to choose the rewards for each different action. Actions in this setting are often referred to as “experts”.

(b) Online combinatorial optimization: We assume that each action can be represented by a d-dimensional binary vector and the environment can only choose the rewards given for selecting each of the d components. Formally, the learner has access to the action space A ⊆ {0, 1}^d and in each round t, the environment specifies a vector of rewards g_t ∈ R^d. The reward given for selecting action a ∈ A is the inner product g_t(a) = g_t^⊤ a.

4. The online lossy source coding problem is a special case of Setting 3 where a learner has to encode a sequence of source symbols z_1, z_2, . . . , z_T on a noiseless channel and produce a sequence of reproduction symbols ẑ_1, ẑ_2, . . . , ẑ_T. A coding scheme consists of an encoder f mapping source symbols (z_t)_{t=1}^T to channel symbols (y_t)_{t=1}^T and a decoder g mapping channel symbols (y_t)_{t=1}^T to reconstruction symbols (ẑ_t)_{t=1}^T. We assume that the learner has access to a fixed pool of coding schemes F. The goal of the learner is to select coding schemes (f_t, g_t) ∈ F that minimize the cumulative distortion between the source sequence and the reproduction sequence, defined as

$$\widehat{D}_T = \sum_{t=1}^{T} d(z_t, \hat{z}_t),$$

where the sequence (ẑ_t)_{t=1}^T is produced by the sequence of applied coding schemes and d is a given distortion measure. We make no statistical assumptions about the sequence of source symbols. Denoting the cumulative distortion of a fixed coding scheme (f, g) ∈ F as D_T(f, g), the goal of the learner can be formulated as minimizing the expected normalized distortion redundancy

$$\widehat{R}_T = \frac{1}{T}\left(\widehat{D}_T - \min_{(f,g) \in F} D_T(f, g)\right),$$

which is in turn equivalent to regret minimization in the online learning problem where rewards correspond to negative distortions. Additionally, in each time step t, the learning algorithm has to ensure that the receiving entity is informed of the identity of the decoder g_t to be used for decoding the t-th channel symbol. We assume that transmitting the decoder g_t is only possible on the same channel that is used for transmitting the source sequence. This gives rise to a cost for switching between coding schemes, making this problem an instance of the problems described in Setting 3a.
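The following sketch, referred to in the description of Setting 3 above, builds the online MDP (X, A, P, (r_t)) that models a prediction problem with switching cost K: the state records the previously chosen action, transitions are deterministic, and the reward is the prediction reward minus the switching penalty. The dictionary-based representation is an illustrative choice made only for this example.

```python
def switching_cost_mdp(actions, g_sequence, K):
    """Construct the online MDP of Setting 3 from a prediction problem.

    actions    : finite action set A; the state space is X = A
    g_sequence : list of reward functions, g_sequence[t][a] in [0, 1]
    K          : switching cost (may exceed the maximal reward of 1)

    Returns (P, r_sequence), where P[(x, a)] = a gives the deterministic next
    state and r_sequence[t][(x, a)] = g_t(a) - K * I{a != x}.
    """
    # Deterministic transitions: the next state is the action just taken.
    P = {(x, a): a for x in actions for a in actions}
    r_sequence = []
    for g_t in g_sequence:
        r_t = {(x, a): g_t[a] - (K if a != x else 0.0)
               for x in actions for a in actions}
        r_sequence.append(r_t)
    return P, r_sequence
```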

The contributions of this thesis for each setting are listed in Figure 1.2 below. All performance guarantees proved in Settings 1a through 2b concern learning schemes that have not been covered by previous works, with the exception of Setting 2a, where our contribution is improving the regret bounds given by Even-Dar et al. [33]. Our main results concerning online MDPs are also presented in Table 1.1, along with the most important other results in the literature. For Setting 3a, we propose a new prediction algorithm with optimal performance guarantees. The same approach can be used for learning in Setting 3b, a problem that has not yet been addressed in the literature. The results for Setting 4 significantly improve a number of previous performance guarantees known for the problem: our performance guarantees on the expected regret match the best known lower bound for the problem up to a logarithmic factor.


Setting 1a
• Guarantee on the expected regret against the pool of all stationary policies (Proposition 3.1).

Setting 1b
• Guarantees on the expected regret against the pool of all stationary policies assuming α > 0 (Theorems 3.1, 3.2, 3.4).
• Guarantees on the expected regret against the pool of all non-stationary policies assuming α > 0 (Theorems 3.3, 3.5).
• Guarantee on the expected regret against the pool of all stationary policies allowing α = 0 (Theorem 3.6).
• High-confidence guarantee on the regret against the pool of all stationary policies assuming α > 0 (Theorem 3.7).

Setting 1c
• Guarantee on the expected regret against the pool of all stationary policies (Theorem 5.1).

Setting 2a
• Improved guarantee on the expected regret against the pool of all stationary policies (Theorem 4.1).

Setting 2b
• Guarantee on the expected regret against the pool of all stationary policies assuming α' > 0 (Theorem 4.2).

Setting 3a
• Optimal guarantees on both the expected regret against the pool of the fixed actions in A and the number of action switches (Theorem 6.1).

Setting 3b
• Near-optimal guarantees on both the expected regret against the pool of the fixed actions in A ⊆ {0, 1}^d and the number of action switches (Theorem 6.2).

Setting 4
• Optimal guarantee on the expected regret against any finite pool of reference classes F (Theorem 7.1).
• Optimal guarantee on the expected regret against the pool of all quantizers with quadratic distortion measure (Theorem 7.3).

Figure 1.2: The contributions of the thesis.


P known, r_t observed:
• Even-Dar et al. [33]: unichain environment, $\widehat{L}_T = O(\tau^2 \sqrt{T \log |A|})$

P known, only r_t(x_t, a_t) observed:
• Neu et al. [78] (this thesis): SSP environment, $\widehat{L}_T = O(L^2 \sqrt{T|A|/\alpha})$
• Neu et al. [81] (this thesis): unichain environment, $\widehat{L}_T = O(\tau^{3/2} \sqrt{T|A|/\alpha'})$

P unknown, r_t observed:
• Jaksch et al. [59]: stochastic rewards, connected environment, $\widehat{L}_T = \tilde{O}(D|X| \sqrt{T|A|})$
• Neu et al. [79] (this thesis): SSP environment, $\widehat{L}_T = O(L|X||A| \sqrt{T})$

P unknown, only r_t(x_t, a_t) observed:
• Future work

Table 1.1: Upper bounds on the regret for different feedback assumptions. Results obtained in this thesis are marked “(this thesis)”. Rewards are assumed to be adversarial unless otherwise stated explicitly.

1.3 Related work

As noted earlier, our work is closely related to the field of reinforcement learning. Most works in the field consider the case when the learner controls a finite Markov decision process (MDP; see [19, 60, 15, 59], or Section 4.2.4 of [93] for a summary of available results and the references therein). While there exist a few works that extend the theoretical analysis beyond finite MDPs, these come with strong assumptions on the MDP (e.g., [61, 91, 1]).

The first work to address the theoretical aspects of online learning in non-stationary MDPs is due to Even-Dar et al. [32, 33], who consider the case when the reward function is fully observable. They propose an algorithm, MDP-E, which uses some (optimized) experts algorithm in every state, fed with the action-values of the policy used in the last round. Assuming that the MDP is unichain and that the worst mixing time τ over all policies is uniformly small, the regret of their algorithm is shown to be $\tilde{O}(\tau^2\sqrt{T})$. Part of our work recycles the core idea underlying MDP-E: it uses black-box bandit algorithms at every state. An alternative Follow-the-Perturbed-Leader-type algorithm was introduced by Yu et al. [105] for the same full-information unichain MDP problem. The algorithm comes with improved computational complexity but an increased $\tilde{O}(T^{3/4+\epsilon})$ regret bound. Concerning bandit information and unichain MDPs, Yu et al. [105] introduced an algorithm with vanishing regret (i.e., the algorithm is Hannan consistent). Yu and Mannor [103, 104] considered the problem of online learning in MDPs where the transition probabilities may also change arbitrarily after each transition. This problem is significantly more difficult than the case where only the reward function is changed arbitrarily. Accordingly, the algorithms proposed in these papers fail to achieve sublinear regret. Yu and Mannor [104] also considered the case when rewards are only observed along the trajectory traversed by the agent. However, this paper seems to have gaps: if the state space consists of a single state, the problem becomes identical to the non-stochastic multi-armed bandit problem. Yet, from Theorem IV.1 of Yu and Mannor [104] it follows that the expected regret of their algorithm is $O(\sqrt{T \log|A|})$, which contradicts the known $\Omega(\sqrt{T|A|})$ lower bound on the regret (see Auer et al. [10]).²

²To show this contradiction, note that the condition T > N in the bound of Theorem IV.1 of Yu and Mannor [104] can be traded for an extra O(1/T) term in the regret bound. Then the said contradiction can be arrived at by letting ε, δ converge to zero such that ε/δ³ → 0.

Some parts of our work can be viewed as stochastic extensions of works that considered online shortest path problems in deterministic settings. Here, the closest to our ideas and algorithm is the paper by György et al. [47]. They implement a modified version of the Exp3 algorithm of Auer et al. [9] over all paths using dynamic programming, estimating the reward-to-go via estimates of the immediate rewards. The resulting algorithm is shown to achieve $O(\sqrt{T})$ regret that scales polynomially in the size of the problem, and can be implemented with linear complexity. A conceptually harder version of the shortest path problem is when only the rewards of whole paths are received, and the rewards corresponding to the individual edges are not revealed. Dani et al. [28] showed that this problem is actually not harder, by proposing a generalization of Exp3 to linear bandit problems, which can be applied to this setting and which gives an expected regret of $O(\sqrt{T})$ (again, scaling polynomially in the size of the problem), improving earlier results of Awerbuch and Kleinberg [12], McMahan and Blum [76], and György et al. [47]. Bartlett et al. [14] showed that the algorithm can be extended so that the bound holds with high probability, while Cesa-Bianchi and Lugosi [26] improved the method of [28] for bandit problems with some special “combinatorial” structure. While the above very similar approaches (an efficient implementation of a centralized Exp3 variant) are also appealing for our problem, they cannot be applied directly: the random transitions in the MDP structure preclude the dynamic programming-based approach of György et al. [47]. Furthermore, although Dani et al. [28] suggest that their algorithm can be implemented efficiently for the MDP setting, this does not seem to be straightforward at all. This is because the application of their approach requires representing policies via the distributions (or occupancy measures) they induce over the state space, but the non-linearity of the latter dependence makes dynamic programming highly non-trivial and causes difficulties in describing linear combinations of policies.

The contextual bandit setting considered by Lazaric and Munos [70] can also be regarded as a simplified version of our model, with the restriction that the states are generated in an i.i.d. fashion. More recently, Arora et al. [3] gave an algorithm for MDPs with deterministic transitions, arbitrary reward sequences and bandit information. Following the work of Ortner [83], who studied the same problem with i.i.d. rewards, they note that following any policy in a deterministic MDP leads to periodic behavior, and thus finding an optimal policy is equivalent to finding an “optimal cycle” in the transition graph. While this optimal cycle is well defined for stationary rewards, Arora et al. observe that it can be ill-defined for non-stationary reward sequences: in particular, it is easy to construct an example where the same policy can incur an average reward of either 0 or 1, depending on the state where we start to run the policy. A meaningful goal in this setting is to compete with the best meta-policy that can run any stationary policy starting from any state as its initial state. Arora et al. give an algorithm that enjoys a regret bound of $O(T^{3/4})$ against the pool of such meta-policies.

1.4 Applications

In this section, we outline how some real-world problems fit into our framework. The common feature of the examples to be presented is that the state space of the environment is a product of two parts: a controlled part X and an uncontrolled part Y. In all of these problems, the evolution of the controlled part of the state can be modeled as a Markov decision process described by {X, A, P}, while no statistical assumptions can be made about the sequence of uncontrolled state variables (y_t)_{t=1}^T. We assume that interaction between the controlled and uncontrolled parts of the environment is impossible, that is, the stochastic transitions of (x_t)_{t=1}^T cannot be influenced by the irregular transitions of (y_t)_{t=1}^T, and vice versa.

Inventory management

Consider the problem of controlling an inventory so as to maximize the revenue. This is an optimal control problem, where the state of the controlled system is the stock x_t, and the action a_t is the amount of stock ordered. The evolution of the stock is also influenced by the demand, which is assumed to be stochastic. Further, the revenue depends on the prices at which products are bought and sold. By assumption, the prices are not available at the time when the decisions are made. Since the prices can depend on many external, often unobserved factors y_t, their evolution is often hard to model. We assume that the influence of our purchases on the prices is negligible. Since y_t is unobserved, this problem is covered by our Settings 1b and 2b.

Controlling the engine of a hybrid car

Consider the problem of switching between the electric motor and the internal combustion engine of a hybrid car so as to optimize fuel consumption (see, e.g., [13]). In this setting, the favorable engine depends on the road conditions and the intentions of the driver: for example, driving downhill can be used to recharge the batteries of the electric motor, while picking up higher speeds under normal road conditions can be more efficient with the internal combustion engine. We assume that we can observe the partial state x_t of the engines and can decide to initiate a switching procedure from one engine to the other. The execution of the switching procedure depends on the state x_t of the engines in a well-understood stochastic fashion. Other parts of the engine state, y_t, may or may not be observed. External conditions only influence this state variable and do not interfere with x_t. That is, the state y_t can be seen as the uncontrolled state variable influencing the rewards given for high fuel efficiency. Since y_t is not entirely observed, this problem is covered by our Settings 1b and 2b.


Storage control of wind plants

Hungarian regulations require wind plants to produce schedules of their actual production on a 15-minute basis, one month in advance. If production falls outside a certain range of the schedule, the producer has to pay a penalty tariff depending on the deviation. As discussed by Hartmann and Dán [52], a possible way of meeting the schedule under adverse wind conditions is using energy storage units: the excess energy accumulated when production would exceed the schedule can be fed into the power system when wind energy stays below the desired level. In this setting, we can assume that the state x_t of the energy storage unit can be captured by a possibly unknown Markovian model, while the evolution of the wind speed over time, (y_t)_{t=1}^T, is clearly uncontrolled. Since the schedules are fixed well in advance, their influence can be incorporated into the Markovian model as well. This way, the uncontrolled state variable only influences the rewards (or negative penalties), and thus this problem also fits into our framework. Since y_t is observed, this problem is covered by our Settings 1a and 2a. The case when the exact dynamics of the energy storage unit are unknown is covered by Setting 1c.

Adaptive routing in computer networks

Consider the problem of routing in a computer network where the goal is transferring packets from a source node x_0 to a designated drain node x_L with minimal delay (see, e.g., [50]). The delays can be influenced by external events such as malfunction of some of the internal nodes. These external events are captured by the uncontrolled state y_t. Assume that in each node x_l^{(t)}, we can choose the next node using some interface a_l^{(t)} ∈ A of the network layer. Assuming that this interface decides about the actual next state x_{l+1}^{(t)} using a simple randomized algorithm, we can cast our problem as an online learning problem in SSPs, covered by our Setting 1. If the delays are only observed on the actual path traversed by the packet and the algorithm implemented by the interfaces a ∈ A is known, the problem is covered by Setting 1b. Assuming that y_t is revealed by some oracle after sending each packet, our algorithms for Setting 1c can be used even if the randomized algorithms used in the network layer are unknown.

Growth optimal portfolio selection with transaction cost

Consider the problem of constructing sequential investment strategies for financial markets where at each time step the investor distributes its capital among d assets (see, e.g., Györfi and Walk [42]). Formally, the investor's decision at time t is to select a portfolio vector a_t ∈ [0, 1]^d such that ∑_{i=1}^d (a_t)_i = 1. The i-th component of a_t gives the proportion of the investor's capital N_t invested in asset i at time t. The evolution of the market in time is represented by the sequence of market vectors s_1, s_2, . . . , s_T ∈ [0, ∞)^d, where the i-th component of s_t gives the price of the i-th asset at time t. It is practical to define the return vector y_t ∈ [0, ∞)^d at time t with components (y_t)_i = (s_t)_i / (s_{t-1})_i. Furthermore, we assume that switching between portfolios is subject to some additional cost proportional to the price of the assets being bought or sold. The goal of the investor is to maximize its capital N_T, or, equivalently, to maximize its average growth rate (1/T) log N_T. The problem of maximizing the growth rate under transaction costs can be formalized as an online MDP where the state at time t is given by the previous portfolio vector a_{t-1}, and the reward given for choosing action a_t at state x_t = a_{t-1} is

$$r(x_t, y_t, a_t) = \log\left(\frac{N_t - c_t}{N_t}\right) + \log\bigl(a_t^{\top} y_t\bigr),$$

where c_t is the transaction cost arising at time t. Since the relation between c_t and the state x_t, the action a_t and the capital N_t is well-defined, and the rewards are influenced by the uncontrolled sequence (y_t)_{t=1}^T in a transparent way, we can assume full feedback. After discretizing the space of portfolios and prices, we can directly apply learning algorithms devised for Setting 2a to construct sequential investment strategies. The problem can also be approximately modeled by Setting 3a when assuming that c_t ≤ αN_t holds for some constant α ∈ (0, 1): upper bounding the regret of the online learning problem with rewards g_t(a) = log(a^⊤ y_t) and switching cost K = -log(1 - α), we obtain a crude upper bound on the regret of the resulting sequential investment strategy.
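As a small numerical illustration of the reward defined above, the following sketch evaluates r(x_t, y_t, a_t) for a given return vector under a simple proportional transaction-cost rule. This particular cost model is an assumption made only for the example, since the thesis only requires that c_t be a well-defined function of x_t, a_t and N_t.

```python
import math

def portfolio_reward(a_prev, a, y, capital, cost_rate=0.01):
    """Log-growth reward with transaction costs for one trading round.

    a_prev   : previous portfolio vector x_t = a_{t-1} (proportions, sums to 1)
    a        : new portfolio vector a_t (proportions, sums to 1)
    y        : return vector y_t with y[i] = s_t[i] / s_{t-1}[i]
    capital  : current capital N_t
    cost_rate: proportional cost charged on the rebalanced fraction
               (illustrative assumption, not part of the thesis)
    """
    turnover = sum(abs(ai - bi) for ai, bi in zip(a, a_prev)) / 2.0
    cost = cost_rate * turnover * capital            # c_t
    growth = sum(ai * yi for ai, yi in zip(a, y))    # a_t^T y_t
    return math.log((capital - cost) / capital) + math.log(growth)

# Example: keep a 50/50 portfolio vs. switch fully into the second asset.
print(portfolio_reward([0.5, 0.5], [0.5, 0.5], [1.02, 0.99], 1000.0))
print(portfolio_reward([0.5, 0.5], [0.0, 1.0], [1.02, 0.99], 1000.0))
```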


Chapter 2

Background

In this chapter, we precisely describe the learning models outlined in Chapter 1. In the first half of the chapter, we review some concepts of online learning that are relevant for our work. The rest of the chapter discusses important tools for Markov decision processes that will be useful in the later chapters.

Throughout the thesis, we will use boldface letters (such as x, a, u, . . .) to denote random variables. We use ‖v‖_p to denote the L_p-norm of a function or a vector. In particular, for p = ∞ the maximum norm of a function v : S → R is defined as ‖v‖_∞ = sup_{s∈S} |v(s)|, and for 0 < p < ∞ and for any vector v = (v_1, . . . , v_d) ∈ R^d, ‖v‖_p = (∑_{i=1}^d |v_i|^p)^{1/p}. We will use ln to denote the natural logarithm function. For a logical expression A, the indicator function I{A} is defined as

$$\mathbb{I}\{A\} = \begin{cases} 0, & \text{if } A \text{ is false,} \\ 1, & \text{if } A \text{ is true.} \end{cases}$$

2.1 Online prediction of arbitrary sequences

The general protocol of online prediction (also called online learning) is shown in Figure 2.1. Learning algorithms for online prediction problems are often referred to as experts algorithms.

Parameters: finite set of actions A, upper bound H on rewards, feedback alphabet Σ, feedback function f_t : [0, H]^A × A → Σ.
For all t = 1, 2, . . . , T, repeat:
1. The environment chooses rewards r_t(a) ∈ [0, H] for all a ∈ A.
2. The learner chooses action a_t.
3. The environment gives feedback f_t(r_t, a_t) ∈ Σ to the learner.
4. The learner earns reward r_t(a_t).

Figure 2.1: The online prediction protocol.

Formally, an experts algorithm E with action set A_E and an adversary interact in the following way: at each round t, algorithm E chooses a distribution p_t over the actions A_E and picks an action a_t according to this distribution. The adversary selects a reward function r_t : A_E → [0, H], where H > 0 is known to the learner. Then, the adversary receives the action a_t and gives a reward of r_t(a_t) to E. In the most general model, E only gets to observe some feedback f_t(r_t, a_t), where f_t is a fixed and known mapping from reward functions and actions to some alphabet Σ. Under full-information feedback, the algorithm E receives f_t(r_t, a_t) = r_t, that is, it gets to observe the entire reward function. When only f_t(r_t, a_t) = r_t(a_t) is sent to the algorithm, we say that the algorithm works under bandit feedback. The goal of the learner in both settings is to minimize the total expected regret defined as

$$\widehat{L}_T = \max_{a \in A_E} \mathbb{E}\left[\sum_{t=1}^{T} r_t(a)\right] - \mathbb{E}\left[\sum_{t=1}^{T} r_t(a_t)\right],$$

where the expectation is taken over the internal randomization of the learner.

It will be customary to consider two types of adversaries: an oblivious adversary chooses a fixed sequence of reward functions r_1, . . . , r_T, while a non-oblivious, or adaptive, adversary is defined by T possibly random functions such that the t-th function acts on the past actions a_1, . . . , a_{t-1} and the history of the adversary (including the previous reward functions selected, as well as any variable describing the earlier evolution of the inner state of the adversary) and returns the random reward function r_t.

So far we have considered experts algorithms for a given horizon and action set. However, experts algorithms in fact come as meta-algorithms that can be specialized to a given horizon and action set (i.e., the meta-algorithms take as input T and the action set A_E). In the future, at the price of abusing terminology, we will call such meta-experts algorithms experts algorithms, too.

Definition 2.1. Let C be a class of adversaries and consider an experts algorithm E. The function B_E : {1, 2, . . .} × {1, 2, . . .} → [0, ∞) is said to be a regret bound for E against C if the following hold: (i) B_E is non-decreasing in its second argument; and (ii) for any adversary from C, time horizon T and action set A_E, the inequality

$$\widehat{L}_T = \max_{a \in A_E} \mathbb{E}\left[\sum_{t=1}^{T} r_t(a) - \sum_{t=1}^{T} r_t(a_t)\right] \le B_E(T, |A_E|) \tag{2.1}$$

holds true, where (a_1, r_1), . . . , (a_T, r_T) is the sequence of actions and reward functions that arise from the interaction of E and the adversary when E is initialized with T and A_E.

Note that both terms of the regret involve the same sequence of rewards. Although this may look unnatural when the opponent is not oblivious, the definition is standard and will be useful in this form for our purposes.¹ In what follows, to simplify the notation, we will also use B_E(T, A_E) to mean B_E(T, |A_E|). As usual, for the regret bound to make sense it has to be a sublinear function of T. Some algorithms do not need T as their input (all algorithms need to know the action set, of course). Such algorithms are called universal, whereas in the opposite case the algorithm is called non-universal. Oftentimes, a non-universal method can be turned into a universal one, at the price of deteriorating the bound B_E by at most a constant factor, by either adaptively changing the algorithm's internal parameters or by resorting to the so-called “doubling trick” (cf. Exercises 2.8 and 2.9 in [25]).

¹In Chapters 3 and 4, we decompose the online MDP problem into smaller online prediction problems where the reward sequences are generated by a non-oblivious adversary. The global decision problem then reduces to the problem of controlling the above notion of regret against this non-oblivious adversary.

Typically, algorithms are developed for the case when the adversaries generate rewards belonging to the [0, 1] interval. Given such an algorithm E, if the adversaries generate rewards in [0, H] instead of [0, 1] for some H > 0, one can still use E such that in each round, the reward that is fed to E is first divided by H. We shall refer to such a “composite” algorithm as E tuned to the maximum reward H. Clearly, if E enjoyed a regret bound B_E against the class of all oblivious (respectively, adaptive) adversaries with rewards in [0, 1], then E tuned to the maximum reward H will enjoy the regret bound H·B_E against the class of all oblivious (respectively, adaptive) adversaries that generate rewards in [0, H]. Obviously, this tuning requires prior knowledge of H.
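The rescaling described above can be written as a thin wrapper around any base algorithm designed for rewards in [0, 1]; the act/update interface below is a hypothetical convention chosen only for this illustration.

```python
class TunedToMaxReward:
    """Wrap an experts algorithm designed for rewards in [0, 1] so that it can
    be run on rewards in [0, H]: every observed reward is divided by H."""

    def __init__(self, base_algorithm, H):
        self.base = base_algorithm   # assumed to expose act() and update(reward_vector)
        self.H = H

    def act(self):
        return self.base.act()

    def update(self, reward_vector):
        # Rescale the full-information feedback into [0, 1] before passing it on.
        self.base.update([r / self.H for r in reward_vector])
```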

Note that typical full-information experts algorithms enjoy the same regret bounds for oblivious and non-oblivious adversaries. The reason for this is that if the distribution p_t selected by the experts algorithm E is fully determined by the previous reward functions r_1, . . . , r_{t-1}, then any regret bound that holds against an oblivious adversary also holds against a non-oblivious one by Lemma 4.1 of Cesa-Bianchi and Lugosi [25]. To our knowledge, all best experts algorithms that enjoy a regret bound of optimal order against an adaptive adversary satisfy this condition on p_t.

An optimized best experts algorithm in the full-information case is an algorithm whose expected regret can be bounded by $B_{OE}(T, A) = O(\sqrt{T \ln |A|})$, and similarly, an optimized |A|-armed bandit algorithm is one with a bound $B_{OB}(T, A) = O(\sqrt{T |A|})$ on its expected regret (in the case of bandit algorithms, actions are also called arms). Optimized best experts algorithms include the exponentially weighted average forecaster (EWA) (a variant of the weighted majority algorithm of Littlestone and Warmuth [72] and the aggregating strategies of Vovk [96], also known as Hedge by Freund and Schapire [36]) and the Follow-the-Perturbed-Leader (FPL) algorithm of Kalai and Vempala [63] (see also Hannan [51]). There exist a number of algorithms for the bandit case that attain regrets of $O(\sqrt{T |A| \ln |A|})$, such as Exp3 by Auer et al. [10] and Green by Allenberg et al. [2], while the algorithm presented by Audibert and Bubeck [4] achieves the optimal rate $O(\sqrt{T |A|})$. Although these papers prove the regret bound (2.1) only in the case of oblivious adversaries, that is, when the actions taken by the algorithm have no effect on the next rewards chosen by the adversary, it is not hard to see that the bounds proved in these papers continue to hold in the non-oblivious case, too. This is because the only non-algebraic step in the regret bound proofs uses the fact that the reward estimates constructed by the respective algorithms are unbiased estimates of the actual rewards, and this property continues to hold true even in the non-oblivious setting. In addition to the above algorithms, Poland [84] describes Follow-the-Perturbed-Leader-type algorithms that also satisfy these requirements.
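For concreteness, here is a minimal sketch of an Exp3-style bandit algorithm in the spirit of Auer et al. [10]: actions are drawn from an exponential-weights distribution mixed with uniform exploration, and the observed reward is turned into an importance-weighted estimate before the update. The parameter choice and the reward-oracle interface are illustrative and do not reproduce the exact tuning used in the cited paper.

```python
import math
import random

def exp3(reward_oracle, K, T, gamma=0.1):
    """Minimal Exp3-style sketch for adversarial bandits with rewards in [0, 1].

    reward_oracle(t, a) returns the reward r_t(a); only the reward of the
    chosen arm is ever queried, as in bandit feedback.
    """
    weights = [1.0] * K
    total = 0.0
    for t in range(T):
        norm = sum(weights)
        # Mix the exponential-weights distribution with uniform exploration.
        probs = [(1 - gamma) * w / norm + gamma / K for w in weights]
        a_t = random.choices(range(K), weights=probs)[0]
        r = reward_oracle(t, a_t)      # bandit feedback: only r_t(a_t) is observed
        total += r
        r_hat = r / probs[a_t]         # importance-weighted (unbiased) estimate
        weights[a_t] *= math.exp(gamma * r_hat / K)
    return total
```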

All the algorithms discussed above were developed to deal with rewards in the [0, 1] interval. For rewards in [0, H], using the tuning method described above would yield a regret bound that
