
6.2 Prediction by random-walk perturbation

6.2.3 Online combinatorial optimization

In this section we study the case of online linear optimization (see, among others, [38], [66], [40], [94], [64], [99], [56], [53], [68], [27], [5]). This is a prediction problem similar to the one described in Figure 2.1, but here each action $i$ is represented by a vector $v_i \in \mathbb{R}^d$. The loss corresponding to action $i$ at time $t$ equals $v_i^\top \ell_t$, where $\ell_t \in [0,1]^d$ is the so-called loss vector. Thus, given a set of actions $S = \{v_i : i = 1, 2, \ldots, N\} \subseteq \mathbb{R}^d$, at every time instant $t$ the forecaster chooses, in a possibly randomized way, a vector $V_t \in S$ and suffers loss $V_t^\top \ell_t$. Using this notation, the regret becomes
$$
\sum_{t=1}^T V_t^\top \ell_t - \min_{v \in S} v^\top L_T,
$$
where $L_t = \sum_{s=1}^t \ell_s$ is the cumulative loss vector. While the results of the previous section still hold when treating each $v_i \in S$ as a separate action, one may gain an important computational advantage by taking the structure of the action set into account. In particular, as [64] emphasize, FPL-type forecasters may often be computed efficiently. In this section we propose such a forecaster, which adds independent random-walk perturbations to the individual components of the loss vector. To simplify the presentation, we restrict our attention to the case of online combinatorial optimization, in which $S \subset \{0,1\}^d$, that is, each action is represented by a binary vector. This special case arguably contains the most important applications, such as the (non-stochastic) online shortest path problem. In this example, a fixed directed acyclic graph of $d$ edges is given with two distinguished vertices $u$ and $w$. The forecaster, at every time instant $t$, chooses a directed path from $u$ to $w$. Such a path is represented by its binary incidence vector $v \in \{0,1\}^d$. The components of the loss vector $\ell_t \in [0,1]^d$ represent losses assigned to the $d$ edges, and $v^\top \ell_t$ is the total loss assigned to the path $v$. Another (non-essential) simplifying assumption is that every action $v \in S$ has the same number of 1's: $\|v\|_1 = m$ for all $v \in S$. The value of $m$ plays an important role in the bounds below.

The proposed prediction algorithm is defined as follows. Let $X_1, \ldots, X_T$ be independent Gaussian random vectors taking values in $\mathbb{R}^d$ such that the components of each $X_t$ are i.i.d. normal, $X_{i,t} \sim \mathcal{N}(0, \eta^2)$, for some fixed $\eta > 0$ whose value will be specified later. Denote
$$
Z_t = \sum_{s=1}^t X_s.
$$

The forecaster at time $t$ chooses the action
$$
V_t = \mathop{\arg\min}_{v \in S} \left\{ v^\top \left(L_{t-1} + Z_t\right) \right\},
$$
where $L_t = \sum_{s=1}^t \ell_s$ for $t \ge 1$ and $L_0 = (0, \ldots, 0)^\top$.
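To make the procedure concrete, the following minimal Python sketch implements the forecaster under the assumptions of this section; the function name `random_walk_fpl` and the linear optimization oracle `oracle` (a callable computing $\arg\min_{v \in S} v^\top z$) are illustrative choices of ours, not part of the original text.

```python
import numpy as np

def random_walk_fpl(losses, eta, oracle, rng=None):
    """FPL with random-walk perturbations: V_t = argmin_v v.(L_{t-1} + Z_t).

    losses: array of shape (T, d), the loss vectors ell_t in [0, 1]^d
    oracle: callable z -> argmin_{v in S} v.z, returning a 0/1 vector in S
    """
    rng = np.random.default_rng() if rng is None else rng
    T, d = losses.shape
    L = np.zeros(d)                  # cumulative loss vector L_{t-1}
    Z = np.zeros(d)                  # random walk Z_t = X_1 + ... + X_t
    total_loss, switches = 0.0, 0
    prev = None
    for t in range(T):
        Z += rng.normal(0.0, eta, size=d)  # fresh perturbation X_t ~ N(0, eta^2 I)
        V = oracle(L + Z)                  # leader on perturbed cumulative losses
        total_loss += V @ losses[t]
        if prev is not None and not np.array_equal(V, prev):
            switches += 1                  # count action switches
        prev = V
        L += losses[t]
    return total_loss, switches
```

Note that, in contrast with standard FPL, only the increment $X_t$ is drawn afresh in each round while the rest of the perturbation is carried over, which is what keeps the number of switches small.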

The next theorem bounds the performance of the proposed forecaster. Again, we are interested not only in the regret but also in the number of switches $\sum_{t=1}^T \mathbb{I}\{V_{t+1} \neq V_t\}$. The regret is of similar order as that of the standard FPL forecaster, roughly $m\sqrt{dT}$, up to a logarithmic factor. Moreover, the expected number of switches is $O\left(m(\ln d)^{5/2}\sqrt{T}\right)$. Remarkably, the dependence on $d$ is only polylogarithmic, and it is the weight $m$ of the actions that plays an important role.

We note in passing that the SD algorithm presented in Section 6.1 can be used for simultaneously guaranteeing that the expected regret is $O(m^{3/2}\sqrt{T\ln d})$ and the expected number of switches is $\sqrt{mT\ln d}$. However, as this algorithm requires explicit computation of the exponentially weighted forecaster, it can only be implemented efficiently for some special decision sets $S$; see [68] and [27] for some examples. On the other hand, our algorithm can be implemented efficiently whenever there exists an efficient implementation of the static optimization problem of finding $\arg\min_{v \in S} v^\top \ell$ for any $\ell \in \mathbb{R}^d$.
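For the online shortest path example above, such an oracle is a single shortest-path computation. Below is a sketch, under the assumptions that the DAG is given as an edge list ordered so that every edge into a vertex precedes every edge out of it, and that a path from `source` to `target` exists; all names are ours.

```python
import numpy as np

def shortest_path_oracle(n_vertices, edges, ell, source, target):
    """Return the 0/1 incidence vector of a minimum-loss source->target path.

    edges: list of (tail, head) pairs over vertices 0..n_vertices-1, ordered
           topologically by tail vertex
    ell:   loss vector; ell[e] is the loss assigned to edge e
    """
    dist = np.full(n_vertices, np.inf)
    last_edge = np.full(n_vertices, -1, dtype=int)  # last edge on best path
    dist[source] = 0.0
    for e, (tail, head) in enumerate(edges):
        if dist[tail] + ell[e] < dist[head]:        # relax edge e
            dist[head] = dist[tail] + ell[e]
            last_edge[head] = e
    v = np.zeros(len(edges))                        # incidence vector of the path
    node = target
    while node != source:                           # walk back along best edges
        e = last_edge[node]
        v[e] = 1.0
        node = edges[e][0]
    return v
```

Plugging such an oracle into the forecaster sketched above gives a per-round cost of one shortest-path computation plus drawing $d$ Gaussian variables.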

Theorem 6.2. The expected regret and the expected number of action switches satisfy (under the oblivious adversary model)
$$
\widehat{L}_T \le m\sqrt{T}\left(\frac{2d}{\eta} + \eta\sqrt{2\ln d}\right) + \frac{md(\ln T + 1)}{\eta^2}
$$
and
$$
\mathbb{E}\left[\sum_{t=1}^{T} \mathbb{I}\{V_{t+1} \neq V_t\}\right]
\le \sum_{t=1}^{T} \frac{m\left(1 + 2\eta\left(2\ln d + \sqrt{2\ln d} + 1\right) + \eta^2\left(2\ln d + \sqrt{2\ln d} + 1\right)^2\right)}{4\eta^2 t}
+ \sum_{t=1}^{T} \frac{m\left(1 + \eta\left(2\ln d + \sqrt{2\ln d} + 1\right)\right)\sqrt{2\ln d}}{\eta\sqrt{t}}.
$$
In particular, setting $\eta = \sqrt{\frac{2d}{\sqrt{2\ln d}}}$ yields
$$
\widehat{L}_T \le 4m\sqrt{dT}\,\sqrt[4]{\ln d} + m(\ln T + 1)\sqrt{\ln d}
$$
and
$$
\mathbb{E}\left[\sum_{t=1}^{T} \mathbb{I}\{V_{t+1} \neq V_t\}\right] = O\left(m(\ln d)^{5/2}\sqrt{T}\right).
$$
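The choice of $\eta$ balances the two leading regret terms; as a quick sketch of the tuning arithmetic (our addition, not part of the theorem statement):
$$
\frac{2d}{\eta} = \eta\sqrt{2\ln d} \quad\Longleftrightarrow\quad \eta^2 = \frac{2d}{\sqrt{2\ln d}},
$$
so that
$$
m\sqrt{T}\left(\frac{2d}{\eta} + \eta\sqrt{2\ln d}\right) = 2m\sqrt{T}\sqrt{2d\sqrt{2\ln d}} = 2^{7/4}\, m\sqrt{dT}\,(\ln d)^{1/4} \le 4m\sqrt{dT}\,\sqrt[4]{\ln d},
\qquad
\frac{md(\ln T + 1)}{\eta^2} = \frac{m(\ln T + 1)\sqrt{2\ln d}}{2} \le m(\ln T + 1)\sqrt{\ln d}.
$$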

The proof of the regret bound is quite standard, similar to the proof of Theorem 3 in [6], and is deferred to the end of this section. The more interesting part is the bound for the expected number of action switches $\mathbb{E}\left[\sum_{t=1}^{T} \mathbb{I}\{V_{t+1} \neq V_t\}\right] = \sum_{t=1}^{T} \mathbb{P}[V_{t+1} \neq V_t]$. The upper bound on this quantity follows from the lemma below and the well-known fact that the expected value of the maximum of the squares of $d$ independent standard normal random variables is at most $2\ln d + \sqrt{2\ln d} + 1$ (see, e.g., [18]). Thus, it suffices to prove the following:

Lemma 6.6. For each $t = 1, 2, \ldots, T$,
$$
\mathbb{P}[V_{t+1} \neq V_t \mid X_{t+1}] \le \frac{m\|\ell_t + X_{t+1}\|^2}{2\eta^2 t} + \frac{m\|\ell_t + X_{t+1}\|\sqrt{2\ln d}}{\eta\sqrt{t}}.
$$

Proof. We use the notation $\mathbb{P}_t[\cdot] = \mathbb{P}[\cdot \mid X_{t+1}]$ and $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid X_{t+1}]$. Also, let
$$
h_t = \ell_t + X_{t+1} \qquad\text{and}\qquad H_t = \sum_{s=0}^{t-1} h_s
$$
(with the convention $h_0 = X_1$, so that $H_t = L_{t-1} + Z_t$ and $V_t = \arg\min_{v \in S} v^\top H_t$). Furthermore, we will use the shorthand notation $c = \|h_t\|$. Define the set $A_t$ as the lead pack:
$$
A_t = \left\{ w \in S : (w - V_t)^\top H_t \le \|w - V_t\|_1\, c \right\}.
$$

Observe that the choice of $c$ guarantees that no action outside $A_t$ can take the lead at time $t+1$, since if $w \notin A_t$, then
$$
(w - V_t)^\top H_t > \|w - V_t\|_1\, c \ge \left|(w - V_t)^\top h_t\right|,
$$
so $(w - V_t)^\top H_{t+1} = (w - V_t)^\top H_t + (w - V_t)^\top h_t \ge 0$ and $w$ cannot be the new leader. It follows that we can upper bound the probability of switching as
$$
\mathbb{P}_t[V_{t+1} \neq V_t] \le \mathbb{P}_t[|A_t| > 1],
$$
which leaves us with the problem of upper bounding $\mathbb{P}_t[|A_t| > 1]$. Similarly to the proof of Lemma 6.5, we start by analyzing $\mathbb{P}_t[|A_t| = 1]$:

$$
\mathbb{P}_t[|A_t| = 1] = \sum_{v \in S} \mathbb{P}_t\left[\forall w \neq v : (w - v)^\top H_t \ge \|w - v\|_1\, c\right]
= \sum_{v \in S} \int_{y \in \mathbb{R}} f_v(y)\, \mathbb{P}_t\left[\forall w \neq v : w^\top H_t \ge y + \|w - v\|_1\, c \;\middle|\; v^\top H_t = y\right] \mathrm{d}y, \qquad (6.3)
$$

where $f_v$ is the density of $v^\top H_t$. Next we crucially use the fact that the conditional distributions of correlated Gaussian random variables are also Gaussian. In particular, defining $k(w,v) = m - \|w - v\|_1$, the covariances are given as
$$
\operatorname{cov}\left(w^\top H_t,\, v^\top H_t\right) = \eta^2\left(m - \|w - v\|_1\right)t = \eta^2 k(w,v)\, t.
$$

Let us organize all actions $w \in S \setminus \{v\}$ into a matrix $W = (w_1, w_2, \ldots, w_{N-1})$. Given that $v^\top H_t = y$, the conditional distribution of $W^\top H_t$ is an $(N-1)$-variate Gaussian distribution with mean
$$
\mu_v(y) = \left( w_1^\top L_{t-1} + y\,\frac{k(w_1,v)}{m},\; w_2^\top L_{t-1} + y\,\frac{k(w_2,v)}{m},\; \ldots,\; w_{N-1}^\top L_{t-1} + y\,\frac{k(w_{N-1},v)}{m} \right)^\top
$$
and covariance matrix $\Sigma_v$. Defining $K = \left(k(w_1,v), \ldots, k(w_{N-1},v)\right)^\top$ and using the notation $\varphi(x) = \frac{1}{\sqrt{(2\pi)^{N-1}|\Sigma_v|}}\exp\left(-x^2/2\right)$, we get that

w6=v:w>Hty+ kwvk1c¯

¯v>Ht=y¤

= Z

· · · Z

zi=y+(mk(wi,v))c

φ³q

(z−µv(y))>Σy1(z−µv(y))´ d z

=

Z

· · · Z

zi=y+(m−k(wi,v))c+k(wi,v)c

φµq

¡zµv(y)−cK¢>Σ−1y

¡zµv(y)−cK¢

d z

= Z

· · · Z

zi=y+mc

φµq

¡zµv(y+mc>

Σ−1y

¡zµv(y+mc)¢

d z

=Pt£

∀w6=v:w>Hty+mc¯

¯v>Ht=y+mc¤ , where we usedµy+mc=µy+cK. Using this, we rewrite (6.3) as

$$
\begin{aligned}
\mathbb{P}_t[|A_t| = 1] &= \sum_{v \in S} \int_{y \in \mathbb{R}} f_v(y)\, \mathbb{P}_t\left[\forall w \neq v : w^\top H_t \ge y \;\middle|\; v^\top H_t = y\right] \mathrm{d}y \\
&\qquad - \sum_{v \in S} \int_{y \in \mathbb{R}} \left(f_v(y) - f_v(y - mc)\right) \mathbb{P}_t\left[\forall w \neq v : w^\top H_t \ge y \;\middle|\; v^\top H_t = y\right] \mathrm{d}y \\
&= 1 - \sum_{v \in S} \int_{y \in \mathbb{R}} \left(f_v(y) - f_v(y - mc)\right) \mathbb{P}_t\left[\forall w \neq v : w^\top H_t \ge y \;\middle|\; v^\top H_t = y\right] \mathrm{d}y.
\end{aligned}
$$

To treat the remaining term, we use that $v^\top H_t$ is Gaussian with mean $v^\top L_{t-1}$ and standard deviation $\eta\sqrt{mt}$, and obtain
$$
f_v(y) - f_v(y - mc) = f_v(y)\left(1 - \frac{f_v(y - mc)}{f_v(y)}\right) \le f_v(y)\left(\frac{mc^2}{2\eta^2 t} - \frac{c\left(y - v^\top L_{t-1}\right)}{\eta^2 t}\right).
$$

Thus,
$$
\begin{aligned}
\mathbb{P}_t[|A_t| > 1] &\le \sum_{v \in S} \int_{y \in \mathbb{R}} \left(f_v(y) - f_v(y - mc)\right) \mathbb{P}_t\left[\forall w \neq v : w^\top H_t \ge y \;\middle|\; v^\top H_t = y\right] \mathrm{d}y \\
&\le \sum_{v \in S} \int_{y \in \mathbb{R}} f_v(y)\left(\frac{mc^2}{2\eta^2 t} - \frac{c\left(y - v^\top L_{t-1}\right)}{\eta^2 t}\right) \mathbb{P}_t\left[\forall w \neq v : w^\top H_t \ge y \;\middle|\; v^\top H_t = y\right] \mathrm{d}y \\
&= \frac{mc^2}{2\eta^2 t} - \frac{c\,\mathbb{E}_t\left[V_t^\top Z_t\right]}{\eta^2 t}
\le \frac{mc^2}{2\eta^2 t} + \frac{mc\,\mathbb{E}\left[\|Z_t\|\right]}{\eta^2 t}
\le \frac{m\|h_t\|^2}{2\eta^2 t} + \frac{m\|h_t\|\sqrt{2\ln d}}{\eta\sqrt{t}},
\end{aligned}
$$
where we used the definition of $c$ and $\mathbb{E}\left[\|Z_t\|\right] \le \eta\sqrt{2t\ln d}$ in the last step.

Proof of the first statement of Theorem 6.2. The proof is based on the proofs of Theorem 4.2 of [25] and Theorem 3 of [6]. The main difference from those proofs is that the standard deviation of our perturbations changes over time; however, this issue is very easy to treat. First, we define an infeasible "forecaster" that peeks one step into the future and uses the perturbation $\widehat{Z}_t = \sqrt{t}\,X_1$:
$$
\widehat{V}_t = \mathop{\arg\min}_{w \in S} w^\top\left(L_t + \widehat{Z}_t\right).
$$
Using Lemma 3.1 of [25], we get for any fixed $v \in S$ that
$$
\sum_{t=1}^T \widehat{V}_t^\top\left(\ell_t + \left(\widehat{Z}_t - \widehat{Z}_{t-1}\right)\right) \le v^\top\left(L_T + \widehat{Z}_T\right).
$$

After reordering, we obtain
$$
\begin{aligned}
\sum_{t=1}^T V_t^\top \ell_t &\le v^\top L_T + v^\top \widehat{Z}_T + \sum_{t=1}^T \left(V_t - \widehat{V}_t\right)^\top \ell_t - \sum_{t=1}^T \widehat{V}_t^\top\left(\widehat{Z}_t - \widehat{Z}_{t-1}\right) \\
&= v^\top L_T + v^\top \widehat{Z}_T + \sum_{t=1}^T \left(V_t - \widehat{V}_t\right)^\top \ell_t + \sum_{t=1}^T \left(\sqrt{t-1} - \sqrt{t}\right)\widehat{V}_t^\top X_1.
\end{aligned}
$$

The last term can be bounded as
$$
\sum_{t=1}^T \left(\sqrt{t-1} - \sqrt{t}\right)\widehat{V}_t^\top X_1
\le \sum_{t=1}^T \left(\sqrt{t} - \sqrt{t-1}\right)\left|\widehat{V}_t^\top X_1\right|
\le m \sum_{t=1}^T \left(\sqrt{t} - \sqrt{t-1}\right)\|X_1\|
= m\sqrt{T}\,\|X_1\|.
$$
Taking expectations, we obtain the bound

$$
\mathbb{E}\left[\sum_{t=1}^T V_t^\top \ell_t\right] \le v^\top L_T + \sum_{t=1}^T \mathbb{E}\left[\left(V_t - \widehat{V}_t\right)^\top \ell_t\right] + \eta m\sqrt{2T\ln d},
$$
where we used $\mathbb{E}\left[\|X_1\|\right] \le \eta\sqrt{2\ln d}$ and $\mathbb{E}\left[v^\top \widehat{Z}_T\right] = 0$. That is, we are left with the problem of bounding $\mathbb{E}\left[\left(V_t - \widehat{V}_t\right)^\top \ell_t\right]$ for each $t \ge 1$.

To this end, let
$$
v(z) = \mathop{\arg\min}_{w \in S} w^\top z \quad\text{for all } z \in \mathbb{R}^d, \qquad\text{and also}\qquad F_t(z) = v(z)^\top \ell_t.
$$

Further, let $f_t(z)$ be the density of $Z_t$, which coincides with the density of $\widehat{Z}_t$. We have
$$
\begin{aligned}
\mathbb{E}\left[V_t^\top \ell_t\right] &= \mathbb{E}\left[F_t(L_{t-1} + Z_t)\right]
= \int_{z \in \mathbb{R}^d} f_t(z)\, F_t(L_{t-1} + z)\, \mathrm{d}z
= \int_{z \in \mathbb{R}^d} f_t(z)\, F_t(L_t - \ell_t + z)\, \mathrm{d}z \\
&= \int_{z \in \mathbb{R}^d} f_t(z + \ell_t)\, F_t(L_t + z)\, \mathrm{d}z
= \mathbb{E}\left[F_t\left(L_t + \widehat{Z}_t\right)\right] + \int_{z \in \mathbb{R}^d} \left(f_t(z + \ell_t) - f_t(z)\right) F_t(L_t + z)\, \mathrm{d}z \\
&= \mathbb{E}\left[\widehat{V}_t^\top \ell_t\right] + \int_{z \in \mathbb{R}^d} \left(f_t(z) - f_t(z - \ell_t)\right) F_t(L_{t-1} + z)\, \mathrm{d}z.
\end{aligned}
$$
The last term can be upper bounded as

$$
\begin{aligned}
\int_{z \in \mathbb{R}^d} f_t(z) &\left(1 - \exp\left(\frac{(z - \ell_t)^\top \ell_t}{\eta^2 t}\right)\right) F_t(L_{t-1} + z)\, \mathrm{d}z \\
&\le -\int_{z \in \mathbb{R}^d} f_t(z)\left(\frac{(z - \ell_t)^\top \ell_t}{\eta^2 t}\right) F_t(L_{t-1} + z)\, \mathrm{d}z \\
&\le \mathbb{E}\left[V_t^\top \ell_t\right]\frac{\|\ell_t\|_2^2}{\eta^2 t} + \frac{m}{\eta^2 t}\int_{z \in \mathbb{R}^d} f_t(z)\left|z^\top \ell_t\right| \mathrm{d}z \\
&\le \frac{md}{\eta^2 t} + \frac{m}{\eta^2 t}\int_{z \in \mathbb{R}^d} f_t(z)\,\|z\|_1\, \mathrm{d}z
= \frac{md}{\eta^2 t} + \sqrt{\frac{2}{\pi}} \cdot \frac{md}{\eta\sqrt{t}},
\end{aligned}
$$
where we used $\mathbb{E}\left[\|Z_t\|_1\right] = \eta d\sqrt{2t/\pi}$ in the last step. Putting everything together, we obtain the statement of the theorem as

$$
\begin{aligned}
\mathbb{E}\left[\sum_{t=1}^T V_t^\top \ell_t\right] - v^\top L_T
&\le \sum_{t=1}^T \frac{md}{\eta^2 t} + \sum_{t=1}^T \sqrt{\frac{2}{\pi}} \cdot \frac{md}{\eta\sqrt{t}} + \eta m\sqrt{2T\ln d} \\
&\le \frac{2md\sqrt{T}}{\eta} + \eta m\sqrt{2T\ln d} + \frac{md(\ln T + 1)}{\eta^2}.
\end{aligned}
$$

Chapter 7

Online lossy source coding

In this chapter we consider the problem of fixed-rate sequential lossy source coding of individual sequences with limited delay. Here a source sequence $z_1, z_2, \ldots$ taking values from the source alphabet $\mathcal{Z}$ has to be transformed into a sequence $y_1, y_2, \ldots$ of channel symbols taking values in the finite channel alphabet $\{1, \ldots, M\}$, and these channel symbols are then used to produce the reproduction sequence $\hat{z}_1, \hat{z}_2, \ldots$. The rate of the scheme is defined as $\ln M$ nats (where $\ln$ denotes the natural logarithm), and the scheme is said to have $\delta_1$ encoding and $\delta_2$ decoding delay if, for any $t = 1, 2, \ldots$, the channel symbol $y_t$ depends on $z^{t+\delta_1} = (z_1, z_2, \ldots, z_{t+\delta_1})$ and $\hat{z}_t$ depends on $y^{t+\delta_2} = (y_1, \ldots, y_{t+\delta_2})$. The goal of the coding scheme is to minimize the distortion between the source sequence and the reproduction sequence. We concentrate on the individual sequence setting and aim to find methods that work uniformly well, with respect to a reference coder class, on every individual (deterministic) sequence. Thus, no probabilistic assumption is made on the source sequence, and the performance of a scheme is measured by the distortion redundancy, defined as the maximal difference, over all source sequences of a given length, between the normalized distortion of the given coding scheme and that of the best reference coding scheme matched to the underlying source sequence.
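As a concrete illustration of the zero-delay case ($\delta_1 = \delta_2 = 0$), the following Python sketch runs a fixed $M$-level uniform scalar quantizer over a $[0,1]$-valued source under squared error distortion; the uniform quantizer and the unit source interval are assumptions of this example only, not a restriction of the setting.

```python
def encode(z, M):
    """Zero-delay encoder: y_t depends only on z_t (delta_1 = 0)."""
    return min(int(z * M) + 1, M)   # channel symbol in {1, ..., M}

def decode(y, M):
    """Zero-delay decoder: z_hat_t depends only on y_t (delta_2 = 0)."""
    return (y - 0.5) / M            # midpoint of the y-th quantization cell

# Rate ln(M) nats; per-symbol squared error is at most (1/(2M))^2 here.
source = [0.12, 0.57, 0.93]
M = 4
reproduction = [decode(encode(z, M), M) for z in source]
distortion = sum((z - zh) ** 2 for z, zh in zip(source, reproduction)) / len(source)
```

An adaptive scheme in the sense of this chapter would switch among many such encoder-decoder pairs over time, which is exactly where the switching cost discussed next enters.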

As will be shown later, this problem is an instance of online learning with switching costs, where taking actions corresponds to selecting coding schemes, distortions correspond to negative rewards, and the distortion redundancy corresponds to the average expected regret $\widehat{L}_T/T$. The switching cost arises naturally from the fact that every time the coding scheme is changed on the source side, the receiver has to be informed of the new decoding scheme via the same channel that transmits the useful information. This chapter presents our works György and Neu [48] and György and Neu [49]. Our results are summarized below.

Thesis 5. Proposed an efficient algorithm for the problem of online lossy source coding. Proved performance guarantees for Setting 4. The proved bounds are optimal in the number of time steps, up to logarithmic factors. [48, 49]

7.1 Related work

The study of limited-delay (zero-delay) lossy source coding in the individual sequence setting was initiated by Linder and Lugosi [71], who showed the existence of randomized coding schemes that perform, on any bounded source sequence, essentially as well as the best scalar quantizer matched to the underlying sequence. More precisely, it was shown that the normalized squared error distortion of their scheme on any source sequence $z^T$ of length $T$ is at most $O(T^{-1/5}\ln T)$ larger than the normalized distortion of the best scalar quantizer matched to the source sequence in hindsight. The method of [71] is based on the exponentially weighted average (EWA) prediction method [96, 97, 72]: at each time instant a coding scheme (a scalar quantizer) is selected based on its "estimated" performance. A major problem with this approach is that the prediction, and hence the choice of the quantizer at each time instant, is performed based on the source sequence, which is not known exactly at the decoder. Therefore, in [71], information about the source sequence that is used in the random choice of the quantizers is also transmitted over the channel, reducing the available capacity for actually encoding the source symbols.

The coding scheme of [71] was improved and generalized by Weissman and Merhav [100]. They considered the more general case where the reference class $\mathcal{F}$ is a finite set of limited-delay and limited-memory coding schemes. To reduce the communication about the actual decoder to be used at the receiver, Weissman and Merhav introduced a coding scheme in which the source sequence is split into blocks of equal length, and in each block a fixed encoder-decoder pair is used, with its identity communicated at the beginning of each block. Similarly to [71], the code for each block is chosen using the EWA prediction method. The resulting scheme achieves an $O(T^{-1/3}\ln^{2/3}|\mathcal{F}|)$ distortion redundancy, or, in the case of the infinite class of scalar quantizers, the distortion redundancy becomes $O(T^{-1/3}\ln T)$.

The results of [100] have been extended in various ways, but all of these works are based on the block-coding procedure described above. A disadvantage of this method is that the EWA prediction algorithm keeps one weight for each code in the reference class, and so the computational complexity of the method becomes prohibitive even for relatively simple and small reference classes. Computationally efficient solutions for the zero-delay case were given by György, Linder, and Lugosi using dynamic programming [95] and EWA prediction in [43], and based on the Follow-the-Perturbed-Leader prediction method (see [51, 64]) in [44]. The first method achieves the $O(T^{-1/3}\ln T)$ redundancy of Weissman and Merhav with $O(MT^{4/3})$ computational and $O(T^{2/3})$ space complexity, and a somewhat worse $O(T^{-1/4}\sqrt{\ln T})$ distortion redundancy with linear $O(MT)$ time and $O(T^{1/2})$ space complexity, while the second method achieves $O(T^{-1/4}\ln T)$ distortion redundancy with the same $O(MT)$ linear time complexity and $O(MT^{1/4})$ space complexity.

Matloub and Weissman [75] extended the problem to allow a stochastic discrete memoryless channel between the encoder and the decoder, while Reani and Merhav [88] extended the model to the Wyner-Ziv case (i.e., when side information is also available at the decoder). The performance bounds in both cases are based on [100], while low-complexity solutions for the zero-delay case are provided based on [44] and [43], respectively. Finally, the case when the reference class is a set of time-varying limited-delay, limited-memory coding schemes was analyzed in [45], and efficient solutions were given for the zero-delay case for both traditional and network (multiple-description and multi-resolution) quantization.

Since most of the above coding schemes are based on the block-coding scheme of [100], they cannot achieve better distortion redundancy than $O(T^{-1/3})$, up to some logarithmic factors. On the other hand, the distortion redundancy is known to be bounded from below by a constant multiple of $T^{-1/2}$ in the zero-delay case [43], leaving a gap between the best known upper and lower bounds. Furthermore, if the identity of the coding scheme in use were communicated as side information (before the encoded symbol is revealed), the employed EWA prediction method would guarantee an $O(T^{-1/2}\ln|\mathcal{F}|)$ distortion redundancy for any finite reference coder class $\mathcal{F}$ (of limited delay and limited memory), in agreement with the lower bound. Thus, to improve upon the existing coding schemes, the communication overhead (describing the actually used coding schemes) between the encoder and the decoder has to be reduced, which is achievable by controlling the number of times the coding scheme changes in a better way than blockwise coding. This goal can be achieved by employing the techniques presented in Chapter 6.

In this chapter we construct a randomized coding strategy, which uses the mSD algorithm described in Section 6.1 as its prediction component, that achieves an $O(\sqrt{\ln T/T})$ average distortion redundancy with respect to a finite reference class of limited-delay and limited-memory source codes. The method can also be applied to compete with the (infinite) reference class of scalar quantizers, where it achieves an $O(\ln T/\sqrt{T})$ distortion redundancy. Note that these bounds are only logarithmic factors larger than the corresponding lower bound.

After presenting the formalism of sequential source coding in Section 7.2, we present our randomized coding strategy in Section 7.3. The strategy is applied to the problem of adaptive zero-delay lossy source coding in Section 7.4, and further extensions are given in Section 7.5.