Theorem 2 Suppose that (26) and (27) are verified. Then the kernel-based strategy defined above is universally consistent with respect to the class of all jointly stationary and ergodic processes $\{(X_n, Y_n)\}_{-\infty}^{\infty}$.
4. Universally consistent predictions: unbounded Y
4.1. Partition-based prediction strategies
Let $\hat{E}_n^{(k,\ell)}(x_1^n, y_1^{n-1}, z, s)$ be defined as in Section 3.1. Introduce the truncation function
$$T_m(z) = \begin{cases} m & \text{if } z > m, \\ z & \text{if } |z| \le m, \\ -m & \text{if } z < -m. \end{cases}$$
Define the elementary predictor $h^{(k,\ell)}$ by
$$h_n^{(k,\ell)}(x_1^n, y_1^{n-1}) = T_{n^\delta}\!\left( \hat{E}_n^{(k,\ell)}\big(x_1^n, y_1^{n-1}, G_\ell(x_{n-k}^n), F_\ell(y_{n-k}^{n-1})\big) \right),$$
where
$$0 < \delta < 1/8,$$
for $n = 1, 2, \ldots$ That is, $h_n^{(k,\ell)}$ is the truncation of the elementary predictor introduced in Section 3.1.
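As a side illustration (not part of the original construction), the truncation step has a one-line counterpart in code, since NumPy's `clip` implements exactly this clamping:

```python
import numpy as np

def truncate(z, m):
    """Truncation function T_m: clamp z to the interval [-m, m]."""
    return np.clip(z, -m, m)

# The elementary predictor truncates at level n**delta with 0 < delta < 1/8:
delta = 0.1
n = 1000
print(truncate(3.7, n ** delta))   # truncation level 1000**0.1 ~= 2.0
```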
The proposed prediction algorithm proceeds as follows: let $\{q_{k,\ell}\}$ be a probability distribution on the set of all pairs $(k,\ell)$ of positive integers such that $q_{k,\ell} > 0$ for all $k, \ell$. For a time-dependent learning parameter $\eta_t > 0$, define the weights
$$w_{t,k,\ell} = q_{k,\ell}\, e^{-\eta_t (t-1) L_{t-1}(h^{(k,\ell)})} \tag{31}$$
and their normalized values
$$p_{t,k,\ell} = \frac{w_{t,k,\ell}}{W_t}, \tag{32}$$
where
$$W_t = \sum_{i,j=1}^{\infty} w_{t,i,j}. \tag{33}$$
The prediction strategy $g$ is defined by
$$g_t(x_1^t, y_1^{t-1}) = \sum_{k,\ell=1}^{\infty} p_{t,k,\ell}\, h^{(k,\ell)}(x_1^t, y_1^{t-1}), \qquad t = 1, 2, \ldots \tag{34}$$
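For concreteness, here is a minimal Python sketch of the aggregation (31)-(34), under the simplifying assumption that the countably infinite pool of experts is replaced by a finite array; the names `expert_preds`, `cum_losses` and `prior` are ours:

```python
import numpy as np

def mix_predictions(expert_preds, cum_losses, prior, t):
    """Exponentially weighted mixture (31)-(34) at time t.

    expert_preds: the experts' predictions h^{(k,l)}_t
    cum_losses:   (t-1) * L_{t-1}(h^{(k,l)}), the experts' unnormalized
                  cumulative squared losses
    prior:        prior weights q_{k,l} (positive, summing to 1)
    t:            current time, used for the learning rate eta_t = 1/sqrt(t)
    """
    eta_t = 1.0 / np.sqrt(t)
    log_w = np.log(prior) - eta_t * cum_losses   # log w_{t,k,l}, eq. (31)
    log_w -= log_w.max()                          # stabilize the exponentials
    w = np.exp(log_w)
    p = w / w.sum()                               # p_{t,k,l}, eqs. (32)-(33)
    return np.dot(p, expert_preds)                # g_t, eq. (34)
```

Working with logarithms of the weights avoids numerical underflow when $(t-1)L_{t-1}$ is large; this is an implementation detail, not part of the scheme itself.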
Theorem 5 (GYÖRFI AND OTTUCSÁK [22]) Assume that the conditions (a), (b), (c) and (d) of Theorem 1 are satisfied. Choose $\eta_t = 1/\sqrt{t}$. Then the prediction scheme $g$ defined above is universally consistent with respect to the class of all ergodic processes $\{(X_n, Y_n)\}_{-\infty}^{\infty}$ such that
$$E\{Y_1^4\} < \infty.$$
Here we describe a result which is used in the analysis. This lemma is a modification of the analysis of Auer et al. [4], which allows handling the case where the learning parameter $\eta_t$ of the algorithm is time-dependent and the number of elementary predictors is infinite.
Lemma 7 (GYÖRFI AND OTTUCSÁK [22]) Let $h^{(1)}, h^{(2)}, \ldots$ be a sequence of prediction strategies (experts). Let $\{q_k\}$ be a probability distribution on the set of positive integers. Denote the normalized loss of the expert $h = (h_1, h_2, \ldots)$ by
$$L_n(h) = \frac{1}{n} \sum_{t=1}^{n} \ell_t(h), \qquad \ell_t(h) = \ell(h_t, Y_t),$$
where the loss function $\ell$ is convex in its first argument $h$. Define
$$w_{t,k} = q_k e^{-\eta_t (t-1) L_{t-1}(h^{(k)})},$$
where $\eta_t > 0$ is monotonically decreasing, and
$$p_{t,k} = \frac{w_{t,k}}{W_t}, \qquad W_t = \sum_{k=1}^{\infty} w_{t,k}.$$
If the prediction strategy $g$ is defined by
$$g_t = \sum_{k=1}^{\infty} p_{t,k}\, h_t^{(k)}, \qquad t = 1, 2, \ldots,$$
then, for every $n \ge 1$,
$$L_n(g) \le \inf_k \left( L_n(h^{(k)}) - \frac{\ln q_k}{n\, \eta_{n+1}} \right) + \frac{1}{2n} \sum_{t=1}^{n} \eta_t\, \ell_t(g)^2.$$
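Before turning to the proof, the bound can be checked numerically on a toy problem; the following is a minimal sketch under our own assumptions (two constant experts, squared loss, $\eta_t = 1/\sqrt{t}$, uniform prior):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
experts = np.array([0.0, 1.0])                 # two constant experts h^{(k)}
q = np.array([0.5, 0.5])                       # prior weights q_k
y = 1.0 + 0.1 * rng.standard_normal(n)         # the second expert is clearly better

cum = np.zeros(2)                              # (t-1) * L_{t-1}(h^{(k)})
loss_g = 0.0
for t in range(1, n + 1):
    eta = 1.0 / np.sqrt(t)                     # monotonically decreasing eta_t
    w = q * np.exp(-eta * cum)                 # w_{t,k}
    p = w / w.sum()                            # p_{t,k}
    g = p @ experts                            # aggregated prediction g_t
    loss_g += (g - y[t - 1]) ** 2
    cum += (experts - y[t - 1]) ** 2

print(loss_g / n, cum.min() / n)               # L_n(g) stays close to inf_k L_n(h^{(k)})
```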
Proof. Introduce some notation: the auxiliary weights
$$w'_{t,k} = q_k e^{-\eta_{t-1} (t-1) L_{t-1}(h^{(k)})} \qquad \text{and} \qquad W'_t = \sum_{k=1}^{\infty} w'_{t,k}.$$
We start the proof with a chain of bounds based on the inequality $e^{-x} \le 1 - x + x^2/2$, valid for $x \ge 0$, and on the convexity of the loss $\ell(h, y)$ in its first argument $h$; after rearranging, these yield
$$\ell_t(g) \le -\frac{1}{\eta_t} \ln \frac{W'_{t+1}}{W_t} + \frac{\eta_t}{2}\, \ell_t(g)^2.$$
Summing over $t$, the first term on the right-hand side produces a telescoping sum of the quantities $\frac{1}{\eta_t} \ln W_t$. Since $\eta_{t+1}/\eta_t \le 1$, applying Jensen's inequality for the concave function $x \mapsto x^{\eta_{t+1}/\eta_t}$ allows one to compare $W_{t+1}$ with $W'_{t+1}$. Summarizing these bounds, we obtain
$$L_n(g) \le \inf_k \left( L_n(h^{(k)}) - \frac{\ln q_k}{n\, \eta_{n+1}} \right) + \frac{1}{2n} \sum_{t=1}^{n} \eta_t\, \ell_t(g)^2,$$
which is the statement of the lemma. □

Proof of Theorem 5. Because of (1), it is enough to show that
$$\limsup_{n\to\infty} L_n(g) \le L^* \qquad \text{a.s.}$$
Because of the proof of Theorem 1, as $n \to \infty$, a.s.,
$$\hat{E}_n^{(k,\ell)}(X_1^n, Y_1^{n-1}, z, s) \to E\{Y_0 \mid G_\ell(X_{-k}^0) = z,\ F_\ell(Y_{-k}^{-1}) = s\},$$
and therefore, since the truncation level $n^\delta$ tends to infinity, for all $z$ and $s$,
$$T_{n^\delta}\!\left( \hat{E}_n^{(k,\ell)}(X_1^n, Y_1^{n-1}, z, s) \right) \to E\{Y_0 \mid G_\ell(X_{-k}^0) = z,\ F_\ell(Y_{-k}^{-1}) = s\}.$$
In the same way as in the proof of Theorem 1, we use that the square loss is convex in its first argument $h$, so Lemma 7 implies
$$L_n(g) \le \inf_{k,\ell} \left( L_n(h^{(k,\ell)}) - \frac{\ln q_{k,\ell}}{n\, \eta_{n+1}} \right) + \frac{1}{2n} \sum_{t=1}^{n} \eta_t\, \ell_t(g)^2.$$
On the one hand, almost surely,
$$\limsup_{n\to\infty} \inf_{k,\ell} \left( L_n(h^{(k,\ell)}) - \frac{\ln q_{k,\ell}}{n\, \eta_{n+1}} \right) \le L^*.$$
On the other hand,
$$\frac{1}{2n} \sum_{t=1}^{n} \eta_t\, \ell_t(g)^2 \to 0 \qquad \text{a.s.},$$
where we applied that $E\{Y_1^4\} < \infty$ and $0 < \delta < \frac{1}{8}$. Summarizing these bounds, we get that, almost surely,
$$\limsup_{n\to\infty} L_n(g) \le L^*,$$
and the proof of the theorem is finished. □
Corollary 2 (GYÖRFI AND OTTUCSÁK [22]) Under the conditions of Theorem 5,
$$\lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} \left( g_t(X_1^t, Y_1^{t-1}) - E\{Y_t \mid X_{-\infty}^t, Y_{-\infty}^{t-1}\} \right)^2 = 0 \qquad \text{almost surely.}$$
Proof. By Theorem 5,
$$\lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} \left( g_t(X_1^t, Y_1^{t-1}) - Y_t \right)^2 = L^* \qquad \text{a.s.}, \tag{40}$$
and by the ergodic theorem we have
$$\lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} \left( E\{Y_t \mid X_{-\infty}^t, Y_{-\infty}^{t-1}\} - Y_t \right)^2 = L^* \qquad \text{a.s.} \tag{41}$$
Abbreviate $\hat{Y}_t = E\{Y_t \mid X_{-\infty}^t, Y_{-\infty}^{t-1}\}$ and expand the square:
$$\frac{1}{n} \sum_{t=1}^{n} \left( g_t(X_1^t, Y_1^{t-1}) - \hat{Y}_t \right)^2 = \frac{1}{n} \sum_{t=1}^{n} \left( g_t(X_1^t, Y_1^{t-1}) - Y_t \right)^2 - \frac{1}{n} \sum_{t=1}^{n} \left( \hat{Y}_t - Y_t \right)^2 + \frac{2}{n} \sum_{t=1}^{n} \left( g_t(X_1^t, Y_1^{t-1}) - \hat{Y}_t \right)\left( Y_t - \hat{Y}_t \right), \tag{42}$$
where the difference of the first two sums on the right-hand side of (42) tends to $L^* - L^* = 0$ because of (40) and (41). The second part of the cross term satisfies
$$\frac{1}{n} \sum_{t=1}^{n} \hat{Y}_t \left( Y_t - \hat{Y}_t \right) \to 0 \qquad \text{a.s.}$$
by the ergodic theorem. Put
$$Z_t = g_t(X_1^t, Y_1^{t-1}) \left( Y_t - E\{Y_t \mid X_{-\infty}^t, Y_{-\infty}^{t-1}\} \right).$$
In order to finish the proof it suffices to show that
$$\lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} Z_t = 0. \tag{43}$$
Note that
$$E\{Z_t \mid X_{-\infty}^t, Y_{-\infty}^{t-1}\} = 0$$
for all $t$, so the $Z_t$'s form a martingale difference sequence. By the strong law of large numbers for martingale differences due to Chow [11], one has to verify (25). By the construction of $g_n$,
$$E\{Z_n^2\} = E\left\{ g_n(X_1^n, Y_1^{n-1})^2 \left( Y_n - E\{Y_n \mid X_{-\infty}^n, Y_{-\infty}^{n-1}\} \right)^2 \right\} \le E\left\{ g_n(X_1^n, Y_1^{n-1})^2\, Y_n^2 \right\} \le n^{2\delta}\, E\{Y_1^2\},$$
therefore (25) is verified, (43) is proved, and the proof of the corollary is finished. □
4.2. Kernel-based prediction strategies
Apply the notations of Section 3.2. Then the elementary expert $h_n^{(k,\ell)}$ at time $n$ is defined by
$$h_n^{(k,\ell)}(x_1^n, y_1^{n-1}) = T_{\min\{n^\delta, \ell\}}\!\left( \frac{\sum_{\{t \in J_n^{(k,\ell)}\}} y_t}{|J_n^{(k,\ell)}|} \right), \qquad n > k + 1,$$
where $0/0$ is defined to be $0$ and $0 < \delta < 1/8$. The pool of experts is mixed the same way as in the case of the partition-based strategy (cf. (31), (32), (33) and (34)).
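For intuition, here is a minimal Python sketch of such an elementary expert, restricted to the time-series part for brevity (the side information $x$ is matched in the same way). We assume, as in the usual naive-kernel construction of Section 3.2, that $J_n^{(k,\ell)}$ collects the past time indices whose preceding length-$k$ windows lie within some radius $r_\ell$ of the current window; the matching rule and the names `radius` and `trunc` are our assumptions, with `trunc` standing for $\min\{n^\delta, \ell\}$:

```python
import numpy as np

def kernel_expert(y, k, radius, trunc):
    """Elementary (naive) kernel expert: average y_t over the past times t
    whose preceding length-k window is within `radius` of the current
    window, then truncate; 0/0 is taken to be 0."""
    y = np.asarray(y, dtype=float)       # past observations y_1^{n-1}
    n = len(y) + 1
    if n <= k + 1:
        return 0.0
    current = y[-k:]                     # the k most recent observations
    matches = [y[t] for t in range(k, n - 1)
               if np.linalg.norm(y[t - k:t] - current) <= radius]
    avg = float(np.mean(matches)) if matches else 0.0
    return float(np.clip(avg, -trunc, trunc))
```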
Theorem 6 (BIAU ET AL. [6]) Choose $\eta_t = 1/\sqrt{t}$ and suppose that (26) and (27) are verified. Then the kernel-based strategy defined above is universally consistent with respect to the class of all jointly stationary and ergodic processes $\{(X_n, Y_n)\}_{-\infty}^{\infty}$ such that
$$E\{Y_0^4\} < \infty.$$

4.3. Nearest neighbor-based prediction strategy
Apply the notations of Section 3.3. Then the elementary expert $h_n^{(k,\ell)}$ at time $n$ is defined by
$$h_n^{(k,\ell)}(x_1^n, y_1^{n-1}) = T_{\min\{n^\delta, \ell\}}\!\left( \frac{\sum_{\{t \in J_n^{(k,\ell)}\}} y_t}{|J_n^{(k,\ell)}|} \right), \qquad n > k + 1,$$
if the sum is nonvoid, and $0$ otherwise, with $0 < \delta < 1/8$. The pool of experts is mixed the same way as in the case of the histogram-based strategy (cf. (31), (32), (33) and (34)).
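The nearest neighbor expert admits the same kind of sketch; here $J_n^{(k,\ell)}$ is replaced by a fixed number of nearest past windows, which is our simplification of the $\ell$-dependent neighborhood size of Section 3.3:

```python
import numpy as np

def nn_expert(y, k, num_neighbors, trunc):
    """Elementary nearest neighbor expert: average y_t over the
    num_neighbors past times t whose preceding length-k window is
    closest to the current one, then truncate."""
    y = np.asarray(y, dtype=float)
    n = len(y) + 1
    if n <= k + 1:
        return 0.0
    current = y[-k:]
    dists = np.array([np.linalg.norm(y[t - k:t] - current)
                      for t in range(k, n - 1)])
    nearest = np.argsort(dists)[:num_neighbors]
    avg = float(np.mean(y[k + nearest]))          # y_t for the selected t
    return float(np.clip(avg, -trunc, trunc))
```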
Theorem 7 (BIAU ET AL. [6]) Choose $\eta_t = 1/\sqrt{t}$ and suppose that (29) is verified. Suppose also that for each vector $s$ the random variable
$$\left\| (X_1^{k+1}, Y_1^k) - s \right\|$$
has a continuous distribution function. Then the nearest neighbor strategy defined above is universally consistent with respect to the class of all jointly stationary and ergodic processes $\{(X_n, Y_n)\}_{-\infty}^{\infty}$ such that
$$E\{Y_0^4\} < \infty.$$

4.4. Generalized linear estimates
Apply the notations of Section 3.4. The elementary predictor $h_n^{(k,\ell)}$ generates a prediction of the form
$$h_n^{(k,\ell)}(x_1^n, y_1^{n-1}) = T_{\min\{n^\delta, \ell\}}\!\left( \sum_{j=1}^{\ell} c_{n,j}\, \phi_j^{(k)}(x_{n-k}^n, y_{n-k}^{n-1}) \right),$$
with $0 < \delta < 1/8$. The pool of experts is mixed the same way as in the case of the histogram-based strategy (cf. (31), (32), (33) and (34)).
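A minimal sketch of this expert in the time-series-only case, with the coefficients $c_{n,j}$ fitted by least squares over the past; the concrete basis functions below (lagged values, as in Section 4.5) stand in for the general $\phi_j^{(k)}$ and are purely illustrative:

```python
import numpy as np

def glm_expert(y, k, l, trunc):
    """Generalized linear expert: fit l coefficients by least squares on
    the past, predict, then truncate at level trunc = min(n**delta, l)."""
    y = np.asarray(y, dtype=float)
    n = len(y) + 1
    if n <= k + 1:
        return 0.0
    # Illustrative basis: the j-th feature of a window is its j-th lag.
    def phi(window):
        return np.array([window[-j] if j <= k else 0.0
                         for j in range(1, l + 1)])
    A = np.array([phi(y[t - k:t]) for t in range(k, n - 1)])
    b = y[k:n - 1]
    c, *_ = np.linalg.lstsq(A, b, rcond=None)
    pred = float(phi(y[-k:]) @ c)
    return float(np.clip(pred, -trunc, trunc))
```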
Theorem 8 (BIAU ET AL. [6]) Choose $\eta_t = 1/\sqrt{t}$, suppose that $|\phi_j^{(k)}| \le 1$ and, for any fixed $k$, suppose that the set
$$\left\{ \sum_{j=1}^{\ell} c_j \phi_j^{(k)} \; ; \; (c_1, \ldots, c_\ell), \; \ell = 1, 2, \ldots \right\}$$
is dense in the set of continuous functions of $d(k+1) + k$ variables. Then the generalized linear strategy defined above is universally consistent with respect to the class of all jointly stationary and ergodic processes $\{(X_n, Y_n)\}_{-\infty}^{\infty}$ such that
$$E\{Y_0^4\} < \infty.$$

4.5. Prediction of gaussian processes
We consider in this section the classical problem of gaussian time series prediction (cf. Brockwell and Davis [8]). In this context, parametric models based on distribution assumptions and structural conditions such as AR(p), MA(q), ARMA(p,q) and ARIMA(p,d,q) are usually fitted to the data (cf. Gerencsér and Rissanen [16], Gerencsér [14, 15], Goldenshluger and Zeevi [17]). However, in the spirit of modern nonparametric inference, we try to avoid such restrictions on the process structure. Thus, we only assume that we observe a string realization $y_1^{n-1}$ of a zero mean, stationary and ergodic, gaussian process $\{Y_n\}_{-\infty}^{\infty}$, and try to predict $y_n$, the value of the process at time $n$. Note that there is no side information vector $x_1^n$ in this purely time series prediction framework.
It is well known for gaussian time series that the best predictor is a linear function of the past:
$$E\{Y_n \mid Y_{n-1}, Y_{n-2}, \ldots\} = \sum_{j=1}^{\infty} c_j^* Y_{n-j},$$
where the $c_j^*$ minimize the criterion
$$E\left\{ \left( \sum_{j=1}^{\infty} c_j Y_{n-j} - Y_n \right)^{\!2} \right\}.$$
Following Györfi and Lugosi [20], we extend the principle of generalized linear estimates to the prediction of gaussian time series by considering the special case
$$\phi_j^{(k)}(y_{n-k}^{n-1}) = y_{n-j}\, I_{\{1 \le j \le k\}},$$
i.e.,
$$\tilde{h}_n^{(k)}(y_1^{n-1}) = \sum_{j=1}^{k} c_{n,j}\, y_{n-j}.$$
Once again, the coefficients $c_{n,j}$ are calculated according to the past observations $y_1^{n-1}$ by minimizing the criterion
$$\sum_{t=k+1}^{n-1} \left( \sum_{j=1}^{k} c_j y_{t-j} - y_t \right)^{\!2}$$
if $n > k$, and they are set to the all-zero vector otherwise.
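This minimization is an ordinary least-squares problem on lagged values; a minimal sketch (the function name `ar_coefficients` is ours):

```python
import numpy as np

def ar_coefficients(y, k):
    """Coefficients (c_1, ..., c_k) minimizing
    sum_{t=k+1}^{n-1} (sum_j c_j y_{t-j} - y_t)^2 over the past y_1^{n-1};
    the all-zero vector is returned when the criterion is empty."""
    y = np.asarray(y, dtype=float)       # past observations y_1^{n-1}
    n = len(y) + 1                       # we are predicting y_n
    if n <= k + 1:
        return np.zeros(k)
    # Row for target y[t] holds the k preceding values in reverse order,
    # i.e. (y_{t-1}, ..., y_{t-k}) in the notation of the criterion.
    A = np.array([y[t - 1::-1][:k] for t in range(k, n - 1)])
    b = y[k:n - 1]
    c, *_ = np.linalg.lstsq(A, b, rcond=None)
    return c
```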
With respect to the combination of the elementary experts $\tilde{h}^{(k)}$, Györfi and Lugosi applied in [20] the so-called "doubling trick", which means that the time axis is segmented into exponentially increasing epochs and at the beginning of each epoch the forecaster is reset.
In this section we propose a much simpler procedure which, in particular, avoids the doubling trick. To begin, we set
$$h_n^{(k)}(y_1^{n-1}) = T_{\min\{n^\delta, k\}}\!\left( \tilde{h}_n^{(k)}(y_1^{n-1}) \right),$$
where $0 < \delta < \frac{1}{8}$, and combine these experts as before. Precisely, let $\{q_k\}$ be an arbitrary probability distribution over the positive integers such that $q_k > 0$ for all $k$, and for $\eta_n > 0$, define the weights
$$w_{k,n} = q_k e^{-\eta_n (n-1) L_{n-1}(h^{(k)})}$$
and their normalized values
$$p_{k,n} = \frac{w_{k,n}}{\sum_{i=1}^{\infty} w_{i,n}}.$$
The prediction strategy $g$ at time $n$ is defined by
$$g_n(y_1^{n-1}) = \sum_{k=1}^{\infty} p_{k,n}\, h_n^{(k)}(y_1^{n-1}), \qquad n = 1, 2, \ldots$$
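Putting the pieces together, the whole forecaster fits in a few lines. This sketch reuses the `ar_coefficients` helper above, caps the expert orders at a finite `K` (our simplification of the countable pool), and assumes the caller maintains the experts' cumulative losses:

```python
import numpy as np

def gaussian_forecast(y, cum_losses, K=10, delta=0.1):
    """One-step forecast g_n(y_1^{n-1}) mixing the truncated linear
    experts h^{(k)} of order k = 1, ..., K; cum_losses[k-1] holds
    (n-1) * L_{n-1}(h^{(k)})."""
    y = np.asarray(y, dtype=float)
    n = len(y) + 1
    preds = np.empty(K)
    for k in range(1, K + 1):
        c = ar_coefficients(y, k)                 # defined in the sketch above
        raw = float(c @ y[-1:-k - 1:-1]) if len(y) >= k else 0.0
        level = min(n ** delta, k)                # truncation level min{n^delta, k}
        preds[k - 1] = np.clip(raw, -level, level)
    eta_n = 1.0 / np.sqrt(n)
    w = np.full(K, 1.0 / K) * np.exp(-eta_n * cum_losses)  # uniform prior q_k
    p = w / w.sum()
    return float(p @ preds)
```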
Theorem 9 (BIAU ET AL. [6]) Choose $\eta_t = 1/\sqrt{t}$. Then the prediction strategy $g$ defined above is universally consistent with respect to the class of all jointly stationary and ergodic zero-mean gaussian processes $\{Y_n\}_{-\infty}^{\infty}$.
The following corollary shows that the strategy $g$ asymptotically provides a good estimate of the regression function in the following sense:
Corollary 3 (BIAU ET AL. [6]) Under the conditions of Theorem 9,
$$\lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} \left( E\{Y_t \mid Y_1^{t-1}\} - g_t(Y_1^{t-1}) \right)^2 = 0 \qquad \text{almost surely.}$$
Corollary 3 is expressed in terms of an almost sure Cesàro consistency. It is an open problem to know whether there exists a prediction rule $g$ such that
$$\lim_{n\to\infty} \left( E\{Y_n \mid Y_1^{n-1}\} - g_n(Y_1^{n-1}) \right) = 0 \qquad \text{almost surely} \tag{44}$$
for all stationary and ergodic gaussian processes. Schäfer [31] proved that, under some additional mild conditions on the gaussian time series, the consistency (44) holds.