Markov Decision Processes and Stochastic Games with Total Effective Payoff

Endre Boros1, Khaled Elbassioni2, Vladimir Gurvich1, and Kazuhisa Makino3

1 Rutgers University, NJ, USA

2 Masdar Institute, Abu Dhabi, UAE

3 RIMS, Kyoto, Japan

eboros@business.rutgers.edu

Abstract. We consider finite state Markov decision processes with undiscounted total effective payoff. We show that there exist uniformly optimal pure and stationary strategies that can be computed by solving a polynomial number of linear programs. This implies that in a two-player zero-sum stochastic game with perfect information and with total effective payoff there exists a stationary best response to any stationary strategy of the opponent. From this, we derive the existence of a uniformly optimal pure and stationary saddle point. Finally, we show that the traditional mean payoff can be viewed as a special case of total payoff.

Keywords: Markov processes · Stationary strategies.

We consider finite state, finite action Markov decision processes with undiscounted total effective payoff. We denote by $V$ the set of states, and by $v_t \in V$ the state of the system at time $t$. The controller (Max) chooses an action, which results in a transition to state $v_{t+1}$. Note that this transition is stochastic, and thus $v_t$, $t = 0, 1, \ldots$, are random variables. Every transition $v_t \to v_{t+1}$ results in a local reward $r(v_t, v_{t+1})$ that is known in advance and depends only on the pair of states.

A policy (strategy) of Max is a mapping that assigns to every time moment $t$ and state $v_t$ a choice of actions. Such a policy may be stochastic, may depend on the history of previous choices, etc. We say that a policy is positional if this choice depends only on the current state. We say that a policy is deterministic if actions are chosen with 0/1 probabilities. Finally, we say that a policy is uniformly optimal if it is optimal for all possible initial states.

Once an initial state $v_0 \in V$ is fixed and Max chooses a strategy $s \in S$, the above process produces a sequence of states $v_t(s) \in V$, $t = 0, 1, \ldots$, which in general are random variables. We associate to such a process the sequence of expected local rewards and consider two payoff functions that measure the quality of such an infinite process:

\[
\varphi_s(v_0) \;=\; \liminf_{T \to \infty} \; \frac{1}{T+1} \sum_{t=0}^{T} \mathbb{E}_s\bigl[r(v_t, v_{t+1})\bigr], \tag{1}
\]

\[
\psi_s(v_0) \;=\; \liminf_{T \to \infty} \; \frac{1}{T+1} \sum_{t=0}^{T} \sum_{j=0}^{t} \mathbb{E}_s\bigl[r(v_j, v_{j+1})\bigr]. \tag{2}
\]

The first one, called mean payoff, is classic [4], [1]. The second one, called total payoff or total reward, was introduced by [8] as a “refinement” of the mean payoff. For instance, if local rewards represent rates of return on our investment, then maximizing $\varphi_s(v_0)$ provides us with an optimal investment policy. If, however, local rewards are transactions in and out of our account, then $\psi_s(v_0)$ is related to the current account balance, and maximizing it makes perfect sense.
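To make definitions (1) and (2) concrete, here is a minimal Python sketch (ours, not from the paper; the two-state chain and its rewards are invented for illustration) that evaluates both payoffs under a fixed stationary policy, in which case the controlled process is simply a Markov chain. The instance is chosen so that the mean payoff is 0, the regime in which the total payoff is most informative; if the mean payoff were nonzero, the partial sums in (2) would drift linearly and $\psi$ would be $\pm\infty$.

import numpy as np

# Transition matrix of the chain induced by a fixed stationary policy, and the
# local rewards r(v, w); both are hypothetical, chosen so that the mean payoff is 0.
P = np.array([[0.5, 0.5],
              [0.3, 0.7]])
r = np.array([[ 2.0, 0.0],
              [-2.0, 0.0]])
mu = np.array([1.0, 0.0])          # start deterministically in state v0 = 0

T = 10_000                         # finite horizon approximating the liminf in (1)-(2)
expected_rewards = []              # E_s[r(v_t, v_{t+1})] for t = 0, ..., T
for _ in range(T + 1):
    expected_rewards.append(float(mu @ (P * r).sum(axis=1)))
    mu = mu @ P                    # distribution of the state at the next time step

e = np.array(expected_rewards)
phi = e.mean()                     # Cesaro average of expected rewards, cf. (1)
psi = np.cumsum(e).mean()          # Cesaro average of expected partial sums, cf. (2)
print(f"mean payoff  (1) ~ {phi:.4f}")   # ~ 0.0
print(f"total payoff (2) ~ {psi:.4f}")   # ~ 1.25

For this chain the expected rewards decay geometrically to 0, so $\varphi_s(v_0) = 0$ while the partial sums stabilize around 1.25; the total payoff thus distinguishes policies that the mean payoff cannot.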

We also consider the two-person game version, in which Min is an adversary of Max and tries to minimize the same objective. In this version it is assumed that some states are controlled by Max, while the other states are controlled by Min.

Our first result counters the intuitive heuristic view of [8] cited above:

Theorem 1. Mean payoff is a special case of total payoff, in the sense that to every system with payoff function $\varphi$ one can associate another system (with roughly twice as many states) with payoff function $\psi$ such that there is a one-to-one correspondence between policies, and $\psi_s(v_0) = \varphi_{s'}(v_0')$ holds for corresponding policies $s$ and $s'$.
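The following back-of-the-envelope computation (ours; it is only meant to convey the flavor of such a reduction and is not the construction used in [2]) shows why an embedding of this kind is plausible. Suppose that each local reward $r_t$ earned in the mean payoff system is routed, via an auxiliary copy state, through two consecutive transitions with rewards $2r_t$ and $-2r_t$. The partial sums $S'_k$ of the new reward stream $2r_0, -2r_0, 2r_1, -2r_1, \ldots$ then satisfy

\[
S'_{2t} = 2r_t, \qquad S'_{2t+1} = 0, \qquad\text{and hence}\qquad
\frac{1}{2T}\sum_{k=0}^{2T-1} S'_k \;=\; \frac{1}{T}\sum_{t=0}^{T-1} r_t ,
\]

so the Cesàro averages of the partial sums in the new system track the running averages of the rewards in the old one, i.e., the total payoff of the new system reproduces the mean payoff of the original. The actual proof in [2] sets up this correspondence for arbitrary (possibly history-dependent and randomized) policies and handles the liminf carefully.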

Our main result is about the existence and efficient computability of optimal policies:

Theorem 2. In every MDP with total effective payoff $\psi$, Max possesses a uniformly optimal deterministic positional strategy. Moreover, such a strategy, together with the optimal value, can be found in polynomial time.

For mean payoff MDPs, the analogous result is well known, see, e.g., [5], [1], [3], [7]. In fact, there are several known approaches to constructing optimal stationary strategies. For instance, a polynomial-time algorithm to solve mean payoff MDPs is based on solving two associated linear programs, see, e.g., [3].
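For comparison, here is a minimal Python sketch (ours, not from [3] or from the paper) of the classical occupation-measure linear program for mean payoff MDPs, under the simplifying assumption that the MDP is unichain; the instance itself is hypothetical. The general multichain case requires the richer formulation with two associated linear programs referenced above. The variables x[v, a] are stationary state-action frequencies, and an optimal pure stationary policy can be read off from an optimal basic solution.

import numpy as np
from scipy.optimize import linprog

# Hypothetical 2-state, 2-action instance. P[v, a, w] is the probability of moving
# from state v to state w under action a; r[v, w] is the local reward of v -> w.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
])
r = np.array([[3.0, -1.0],
              [2.0,  0.5]])
nV, nA = P.shape[0], P.shape[1]
rbar = np.einsum('vaw,vw->va', P, r)     # expected one-step reward of (state, action)

# maximize  sum_{v,a} x[v,a] * rbar[v,a]
# s.t.      sum_a x[w,a] - sum_{v,a} P[v,a,w] * x[v,a] = 0  for every state w  (balance)
#           sum_{v,a} x[v,a] = 1,  x >= 0                                      (normalization)
A_eq = np.zeros((nV + 1, nV * nA))
for w in range(nV):
    for v in range(nV):
        for a in range(nA):
            A_eq[w, v * nA + a] = (1.0 if v == w else 0.0) - P[v, a, w]
A_eq[nV, :] = 1.0
b_eq = np.zeros(nV + 1)
b_eq[nV] = 1.0

res = linprog(c=-rbar.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
x = res.x.reshape(nV, nA)
print("optimal mean payoff:", -res.fun)
# In states whose frequencies are all zero (transient under the optimal policy),
# the argmax below is arbitrary; any action leading to the recurrent class will do.
print("optimal action per state:", x.argmax(axis=1))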

Our approach for proving Theorem 2 is inspired by a result of [9]. We extend their result to characterize the existence of pure and stationary optima within all possible strategies by the feasibility of an associated system of equations and inequalities. Next, we show that this system is always feasible and that a solution can be obtained by solving a polynomial number of linear programming problems.

Theorem 3. Every two-person game with total effective payoff $\psi$ has a value and a uniformly optimal deterministic positional equilibrium.

For mean payoff games with perfect information, the analogous result is well known [4], [6].

The full version of our paper, containing these and additional results, can be found in [2].

References

1. Blackwell, D.: Discrete dynamic programming. Ann. Math. Statist. 33, 719–726 (1962)

2. Boros, E., Elbassioni, K., Gurvich, V., Makino, K.: Markov decision processes and stochastic games with total effective payoff. Annals of Operations Research (2018). https://doi.org/10.1007/s10479-018-2898-8

3. Derman, C.: Finite State Markovian Decision Processes. Academic Press, New York and London (1970)

4. Gillette, D.: Stochastic games with zero stop probabilities. In: Dresher, M., Tucker, A.W., Wolfe, P. (eds.) Contributions to the Theory of Games III, Annals of Mathematics Studies, vol. 39, pp. 179–187. Princeton University Press (1957)

5. Howard, R.A.: Dynamic Programming and Markov Processes. Technology Press and Wiley, New York (1960)

6. Liggett, T.M., Lippman, S.A.: Stochastic games with perfect information and time-average payoff. SIAM Review 11(4), 604–607 (1969)

7. Mine, H., Osaki, S.: Markovian Decision Processes. American Elsevier Publishing Co., New York (1970)

8. Thuijsman, F., Vrieze, O.J.: The bad match, a total reward stochastic game. Operations Research Spektrum 9, 93–99 (1987)

9. Thuijsman, F., Vrieze, O.J.: Total reward stochastic games and sensitive average reward strategies. Journal of Optimization Theory and Applications 98, 175–196 (1998). http://dx.doi.org/10.1023/A:1022697100194
