4. Universally consistent predictions: unbounded Y

4.1. Partition-based prediction strategies

Let $\hat{E}_n^{(k,\ell)}(x_1^n, y_1^{n-1}, z, s)$ be defined as in Section 3.1. Introduce the truncation function

$$T_m(z) = \begin{cases} m & \text{if } z > m,\\ z & \text{if } |z| \le m,\\ -m & \text{if } z < -m. \end{cases}$$

Define the elementary predictor $h^{(k,\ell)}$ by

$$h_n^{(k,\ell)}(x_1^n, y_1^{n-1}) = T_{n^\delta}\Big(\hat{E}_n^{(k,\ell)}\big(x_1^n, y_1^{n-1}, G_\ell(x_{n-k}^n), F_\ell(y_{n-k}^{n-1})\big)\Big),$$

where $0 < \delta < 1/8$, for $n = 1, 2, \ldots$. That is, $h_n^{(k,\ell)}$ is the truncation of the elementary predictor introduced in Section 3.1.
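As an illustration (not from the original text; the function names and the placeholder estimate argument are mine), the truncation step can be written in a couple of lines of Python:

```python
import numpy as np

def truncate(z, m):
    """Truncation function T_m(z): clip z to the interval [-m, m]."""
    return float(np.clip(z, -m, m))

def truncated_elementary_prediction(raw_estimate, n, delta=0.1):
    """Truncate a raw elementary estimate (e.g. the partition-based regression
    estimate of Section 3.1) at level n**delta, with 0 < delta < 1/8."""
    return truncate(raw_estimate, n ** delta)
```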

The proposed prediction algorithm proceeds as follows: let $\{q_{k,\ell}\}$ be a probability distribution on the set of all pairs $(k,\ell)$ of positive integers such that $q_{k,\ell} > 0$ for all $k, \ell$. For a time-dependent learning parameter $\eta_t > 0$, define the weights

$$w_{t,k,\ell} = q_{k,\ell}\, e^{-\eta_t (t-1) L_{t-1}(h^{(k,\ell)})} \tag{31}$$

and their normalized values

$$p_{t,k,\ell} = \frac{w_{t,k,\ell}}{W_t}, \tag{32}$$

where

$$W_t = \sum_{i,j=1}^{\infty} w_{t,i,j}. \tag{33}$$

The prediction strategy $g$ is defined by

$$g_t(x_1^t, y_1^{t-1}) = \sum_{k,\ell=1}^{\infty} p_{t,k,\ell}\, h_t^{(k,\ell)}(x_1^t, y_1^{t-1}), \qquad t = 1, 2, \ldots \tag{34}$$
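The mixing step (31)-(34) is an exponentially weighted average of the experts. A minimal Python sketch for a finite pool of experts (the infinite family of pairs $(k,\ell)$ would be enumerated and cut off in practice; the function and argument names are illustrative):

```python
import numpy as np

def mix_predict(expert_preds, cumulative_losses, prior, eta):
    """Exponentially weighted average of expert predictions, cf. (31)-(34).

    expert_preds      : current expert predictions h^(k,l)(x_1^t, y_1^{t-1})
    cumulative_losses : (t-1) * L_{t-1}(h^(k,l)) for each expert
    prior             : positive probabilities q_{k,l} summing to 1
    eta               : learning parameter eta_t, e.g. 1/sqrt(t)
    """
    w = prior * np.exp(-eta * np.asarray(cumulative_losses))   # weights (31)
    p = w / w.sum()                                            # normalized weights (32)-(33)
    return float(np.dot(p, expert_preds))                      # mixture prediction (34)
```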

Theorem 5 (GYÖRFI AND OTTUCSÁK [22]) Assume that the conditions (a), (b), (c) and (d) of Theorem 1 are satisfied. Choose $\eta_t = 1/\sqrt{t}$. Then the prediction scheme $g$ defined above is universally consistent with respect to the class of all ergodic processes $\{(X_n, Y_n)\}_{n=-\infty}^{\infty}$ such that $E\{Y_1^4\} < \infty$.

Here we describe a result that is used in the analysis. This lemma is a modification of the analysis of Auer et al. [4]; it allows handling the case when the learning parameter of the algorithm ($\eta_t$) is time-dependent and the number of elementary predictors is infinite.

Lemma 7 (GYÖRFI AND OTTUCSÁK [22]) Let $h^{(1)}, h^{(2)}, \ldots$ be a sequence of prediction strategies (experts). Let $\{q_k\}$ be a probability distribution on the set of positive integers. Denote the normalized loss of the expert $h = (h_1, h_2, \ldots)$ by

$$L_n(h) = \frac{1}{n}\sum_{t=1}^{n} \ell_t(h), \qquad \ell_t(h) = \ell(h_t, Y_t),$$

where the loss function $\ell$ is convex in its first argument $h$. Define

$$w_{t,k} = q_k\, e^{-\eta_t (t-1) L_{t-1}(h^{(k)})},$$

where $\eta_t > 0$ is monotonically decreasing, and

$$p_{t,k} = \frac{w_{t,k}}{W_t}, \qquad W_t = \sum_{k=1}^{\infty} w_{t,k}.$$

If the prediction strategy $g$ is defined by $g_t = \sum_{k=1}^{\infty} p_{t,k}\, h_t^{(k)}$, $t = 1, 2, \ldots$, then for every $n \ge 1$,

$$L_n(g) \le \inf_k\Big(L_n(h^{(k)}) - \frac{\ln q_k}{n\,\eta_{n+1}}\Big) + \frac{1}{2n}\sum_{t=1}^{n}\eta_t\sum_{k=1}^{\infty} p_{t,k}\,\ell_t^2(h^{(k)}).$$
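To see why a time-dependent learning parameter still yields a meaningful bound, assume the bound takes the form reconstructed above and that the per-round losses are bounded by a constant $B$ (a simplification; in the unbounded setting the truncation exponent $\delta$ plays this role). Then the choice $\eta_t = 1/\sqrt{t}$ used in Theorem 5 makes both remainder terms vanish:

$$-\frac{\ln q_k}{n\,\eta_{n+1}} = \frac{-\ln q_k\,\sqrt{n+1}}{n} = O\!\Big(\frac{1}{\sqrt{n}}\Big), \qquad \frac{1}{2n}\sum_{t=1}^{n}\eta_t\sum_{k}p_{t,k}\,\ell_t^2(h^{(k)}) \le \frac{B^2}{2n}\sum_{t=1}^{n}\frac{1}{\sqrt{t}} \le \frac{B^2}{\sqrt{n}},$$

so for each fixed expert $k$, $L_n(g) \le L_n(h^{(k)}) + O(1/\sqrt{n})$, with the constant depending on $q_k$ and $B$.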

Proof. Introduce the notations

$$w'_{t,k} = q_k\, e^{-\eta_{t-1}(t-1)L_{t-1}(h^{(k)})}, \qquad W'_t = \sum_{k=1}^{\infty} w'_{t,k}.$$

We start the proof with the following chain of bounds:

$$\frac{1}{\eta_t}\ln\frac{W'_{t+1}}{W_t} = \frac{1}{\eta_t}\ln\sum_{k=1}^{\infty} p_{t,k}\, e^{-\eta_t \ell_t(h^{(k)})} \le -\sum_{k=1}^{\infty} p_{t,k}\,\ell_t(h^{(k)}) + \frac{\eta_t}{2}\sum_{k=1}^{\infty} p_{t,k}\,\ell_t^2(h^{(k)}), \tag{37}$$

because of $e^{-x} \le 1 - x + x^2/2$ for $x \ge 0$ and $\ln(1+u) \le u$. Moreover, by the convexity of the loss $\ell(h, y)$ in its first argument $h$, $\ell_t(g) \le \sum_{k=1}^{\infty} p_{t,k}\,\ell_t(h^{(k)})$. From (37), after rearranging, we obtain

$$\ell_t(g) \le -\frac{1}{\eta_t}\ln\frac{W'_{t+1}}{W_t} + \frac{\eta_t}{2}\sum_{k=1}^{\infty} p_{t,k}\,\ell_t^2(h^{(k)}).$$

Then write a telescope formula:

$$-\frac{1}{\eta_t}\ln\frac{W'_{t+1}}{W_t} = \frac{1}{\eta_t}\ln W_t - \frac{1}{\eta_{t+1}}\ln W_{t+1} + \Big(\frac{1}{\eta_{t+1}}\ln W_{t+1} - \frac{1}{\eta_t}\ln W'_{t+1}\Big).$$

Since $\eta_{t+1}/\eta_t \le 1$, applying Jensen's inequality for the concave function $x \mapsto x^{\eta_{t+1}/\eta_t}$ we get that

$$W_{t+1} = \sum_{k=1}^{\infty} q_k\Big(e^{-\eta_t t L_t(h^{(k)})}\Big)^{\eta_{t+1}/\eta_t} \le \Big(\sum_{k=1}^{\infty} q_k\, e^{-\eta_t t L_t(h^{(k)})}\Big)^{\eta_{t+1}/\eta_t} = \big(W'_{t+1}\big)^{\eta_{t+1}/\eta_t},$$

hence $\frac{1}{\eta_{t+1}}\ln W_{t+1} \le \frac{1}{\eta_t}\ln W'_{t+1}$. Summing over $t = 1, \ldots, n$, using $W_1 = 1$ and $W_{n+1} \ge q_k\, e^{-\eta_{n+1} n L_n(h^{(k)})}$ for every $k$, we can summarize the bounds:

$$L_n(g) \le \inf_k\Big(L_n(h^{(k)}) - \frac{\ln q_k}{n\,\eta_{n+1}}\Big) + \frac{1}{2n}\sum_{t=1}^{n}\eta_t\sum_{k=1}^{\infty} p_{t,k}\,\ell_t^2(h^{(k)}),$$

which is the statement of the lemma.

Proof of Theorem 5. Because of (1), it is enough to show that

$$\limsup_{n\to\infty} L_n(g) \le L^* \qquad \text{a.s.}$$

Because of the proof of Theorem 1, as $n \to \infty$, almost surely,

$$\hat{E}_n^{(k,\ell)}(X_1^n, Y_1^{n-1}, z, s) \to E\{Y_0 \mid G_\ell(X_{-k}^0) = z,\ F_\ell(Y_{-k}^{-1}) = s\},$$

and therefore, for all $z$ and $s$,

$$T_{n^\delta}\Big(\hat{E}_n^{(k,\ell)}(X_1^n, Y_1^{n-1}, z, s)\Big) \to E\{Y_0 \mid G_\ell(X_{-k}^0) = z,\ F_\ell(Y_{-k}^{-1}) = s\}.$$

In the same way as in the proof of Theorem 1, since the square loss is convex in its first argument $h$, Lemma 7 yields

$$L_n(g) \le \inf_{k,\ell}\Big(L_n(h^{(k,\ell)}) - \frac{\ln q_{k,\ell}}{n\,\eta_{n+1}}\Big) + \frac{1}{2n}\sum_{t=1}^{n}\eta_t\sum_{k,\ell} p_{t,k,\ell}\,\ell_t^2(h^{(k,\ell)}).$$

On the one hand, almost surely,

$$\limsup_{n\to\infty}\,\inf_{k,\ell}\Big(L_n(h^{(k,\ell)}) - \frac{\ln q_{k,\ell}}{n\,\eta_{n+1}}\Big) \le L^*.$$

On the other hand, the second term on the right-hand side tends to zero almost surely, where we applied that $E\{Y_1^4\} < \infty$ and $0 < \delta < \frac{1}{8}$. Summarizing these bounds, we get that, almost surely,

$$\limsup_{n\to\infty} L_n(g) \le L^*,$$

and the proof of the theorem is finished.

Corollary 2 (GYÖRFI AND OTTUCSÁK [22]) Under the conditions of Theorem 5,

$$\lim_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n}\Big(E\{Y_t \mid X_{-\infty}^t, Y_{-\infty}^{t-1}\} - g_t(X_1^t, Y_1^{t-1})\Big)^2 = 0 \qquad \text{almost surely.}$$

Proof. Write $m_t = E\{Y_t \mid X_{-\infty}^t, Y_{-\infty}^{t-1}\}$. By Theorem 5,

$$\lim_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n}\big(g_t(X_1^t, Y_1^{t-1}) - Y_t\big)^2 = L^* \qquad \text{a.s.,} \tag{40}$$

and by the ergodic theorem we have

$$\lim_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n}\big(m_t - Y_t\big)^2 = L^* \qquad \text{a.s.} \tag{41}$$

Expanding the square $(Y_t - g_t)^2 = (Y_t - m_t)^2 + (m_t - g_t)^2 + 2(Y_t - m_t)(m_t - g_t)$ and rearranging,

$$\frac{1}{n}\sum_{t=1}^{n}\big(m_t - g_t(X_1^t, Y_1^{t-1})\big)^2 = \frac{1}{n}\sum_{t=1}^{n}\big(Y_t - g_t(X_1^t, Y_1^{t-1})\big)^2 - \frac{1}{n}\sum_{t=1}^{n}(Y_t - m_t)^2 - \frac{2}{n}\sum_{t=1}^{n} m_t(Y_t - m_t) + \frac{2}{n}\sum_{t=1}^{n} g_t(X_1^t, Y_1^{t-1})(Y_t - m_t), \tag{42}$$

where the difference of the first two sums on the right-hand side of (42) tends to zero because of (40) and (41). The second sum satisfies

$$\frac{1}{n}\sum_{t=1}^{n} m_t(Y_t - m_t) \to E\big\{m_0(Y_0 - m_0)\big\} = 0$$

by the ergodic theorem. Put

$$Z_t = g_t(X_1^t, Y_1^{t-1})\big(Y_t - E\{Y_t \mid X_{-\infty}^t, Y_{-\infty}^{t-1}\}\big).$$

In order to finish the proof it suffices to show that

$$\lim_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} Z_t = 0. \tag{43}$$

Note that

$$E\{Z_t \mid X_{-\infty}^t, Y_{-\infty}^{t-1}\} = 0$$

for all $t$, so the $Z_t$'s form a martingale difference sequence. To apply the strong law of large numbers for martingale differences due to Chow [11], one has to verify (25). By the construction of $g_n$,

$$E\{Z_n^2\} = E\Big\{g_n(X_1^n, Y_1^{n-1})^2\,\big(Y_n - E\{Y_n \mid X_{-\infty}^n, Y_{-\infty}^{n-1}\}\big)^2\Big\} \le E\Big\{g_n(X_1^n, Y_1^{n-1})^2\, Y_n^2\Big\} \le n^{2\delta}\, E\{Y_1^2\},$$

since $|g_n| \le n^\delta$ by the truncation; therefore (25) is verified, (43) is proved, and the proof of the corollary is finished.
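For completeness, here is the short computation behind the last step, assuming (as in the related literature) that condition (25) is Chow's summability condition $\sum_n E\{Z_n^2\}/n^2 < \infty$:

$$\sum_{n=1}^{\infty} \frac{E\{Z_n^2\}}{n^2} \le E\{Y_1^2\} \sum_{n=1}^{\infty} n^{2\delta - 2} < \infty, \qquad \text{since } 2\delta - 2 < -\tfrac{7}{4} < -1 \text{ for } 0 < \delta < \tfrac{1}{8}.$$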

4.2. Kernel-based prediction strategies

Apply the notations of Section 3.2. Then the elementary expert $h_n^{(k,\ell)}$ at time $n$ is defined by

$$h_n^{(k,\ell)}(x_1^n, y_1^{n-1}) = T_{\min\{n^\delta,\,\ell\}}\left(\frac{\sum_{t \in J_n^{(k,\ell)}} y_t}{|J_n^{(k,\ell)}|}\right), \qquad n > k+1,$$

where $0/0$ is defined to be $0$ and $0 < \delta < 1/8$. The pool of experts is mixed in the same way as in the case of the partition-based strategy (cf. (31), (32), (33) and (34)).
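The following Python sketch illustrates the kernel-type expert. The set $J_n^{(k,\ell)}$ of matching time instants is defined in Section 3.2, which is not reproduced here, so the window-matching rule and the `radius` parameter below are illustrative assumptions:

```python
import numpy as np

def kernel_expert(x, y, k, ell, radius, n, delta=0.1):
    """Sketch of the kernel (moving-window) expert of Section 4.2.

    x : array of side-information vectors x_1, ..., x_n
    y : array of observations y_1, ..., y_{n-1}
    Matches the current window of length k against past windows; the y_t whose
    windows fall within `radius` are averaged and truncated at min(n**delta, ell).
    """
    x, y = np.asarray(x), np.asarray(y)
    if n <= k + 1:
        return 0.0
    ref_x = x[n - k - 1:n]            # x_{n-k}, ..., x_n  (0-based slices)
    ref_y = y[n - k - 1:n - 1]        # y_{n-k}, ..., y_{n-1}
    matches = []
    for t in range(k + 1, n):         # candidate times t = k+1, ..., n-1
        wx = x[t - k - 1:t]
        wy = y[t - k - 1:t - 1]
        if np.linalg.norm(wx - ref_x) + np.linalg.norm(wy - ref_y) <= radius:
            matches.append(y[t - 1])  # collect y_t
    if not matches:                   # 0/0 is defined to be 0
        return 0.0
    m = min(n ** delta, ell)
    return float(np.clip(np.mean(matches), -m, m))
```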

Theorem 6 (BIAU ET AL. [6]) Choose $\eta_t = 1/\sqrt{t}$ and suppose that (26) and (27) are verified. Then the kernel-based strategy defined above is universally consistent with respect to the class of all jointly stationary and ergodic processes $\{(X_n, Y_n)\}_{n=-\infty}^{\infty}$ such that $E\{Y_0^4\} < \infty$.

4.3. Nearest neighbor-based prediction strategy

Apply the notations of Section 3.3. Then the elementary expert $h_n^{(k,\ell)}$ at time $n$ is defined by

$$h_n^{(k,\ell)}(x_1^n, y_1^{n-1}) = T_{\min\{n^\delta,\,\ell\}}\left(\frac{\sum_{t \in J_n^{(k,\ell)}} y_t}{|J_n^{(k,\ell)}|}\right), \qquad n > k+1,$$

if the sum is nonvoid, and $0$ otherwise, with $0 < \delta < 1/8$. The pool of experts is mixed in the same way as in the case of the histogram-based strategy (cf. (31), (32), (33) and (34)).
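A sketch of the nearest-neighbor variant; here $J_n^{(k,\ell)}$ is taken to be the $\ell$ past windows closest in Euclidean distance to the current one, an illustrative stand-in for the exact rule of Section 3.3 (where the number of neighbors typically also grows with $n$):

```python
import numpy as np

def nearest_neighbor_expert(x, y, k, ell, n, delta=0.1):
    """Sketch of the nearest-neighbor expert of Section 4.3: average the y_t whose
    preceding windows are the ell nearest to the current window, then truncate at
    min(n**delta, ell)."""
    x, y = np.asarray(x), np.asarray(y)
    if n <= k + 1:
        return 0.0
    ref = np.concatenate([np.ravel(x[n - k - 1:n]), np.ravel(y[n - k - 1:n - 1])])
    candidates = []
    for t in range(k + 1, n):
        w = np.concatenate([np.ravel(x[t - k - 1:t]), np.ravel(y[t - k - 1:t - 1])])
        candidates.append((np.linalg.norm(w - ref), y[t - 1]))
    if not candidates:                                  # nonvoid check
        return 0.0
    candidates.sort(key=lambda pair: pair[0])
    neighbors = [yt for _, yt in candidates[:ell]]      # ell nearest matches
    m = min(n ** delta, ell)
    return float(np.clip(np.mean(neighbors), -m, m))
```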

Theorem 7 (BIAU ET AL. [6]) Choose $\eta_t = 1/\sqrt{t}$ and suppose that (29) is verified. Suppose also that for each vector $s$ the random variable

$$\big\|(X_1^{k+1}, Y_1^{k}) - s\big\|$$

has a continuous distribution function. Then the nearest neighbor strategy defined above is universally consistent with respect to the class of all jointly stationary and ergodic processes $\{(X_n, Y_n)\}_{n=-\infty}^{\infty}$ such that $E\{Y_0^4\} < \infty$.

4.4. Generalized linear estimates

Apply the notations of Section 3.4. The elementary predictor $h_n^{(k,\ell)}$ generates a prediction of the form

$$h_n^{(k,\ell)}(x_1^n, y_1^{n-1}) = T_{\min\{n^\delta,\,\ell\}}\left(\sum_{j=1}^{\ell} c_{n,j}\, \phi_j^{(k)}(x_{n-k}^n, y_{n-k}^{n-1})\right),$$

with $0 < \delta < 1/8$. The pool of experts is mixed in the same way as in the case of the histogram-based strategy (cf. (31), (32), (33) and (34)).
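A sketch of the generalized linear expert; the feature functions $\phi_j^{(k)}$ are supplied by the caller, and the least-squares fit of the coefficients $c_{n,j}$ assumed here mirrors the criterion written out for the Gaussian case in Section 4.5:

```python
import numpy as np

def generalized_linear_expert(features_past, targets_past, features_now, ell, n, delta=0.1):
    """Sketch of the generalized linear expert of Section 4.4.

    features_past : (T, ell) matrix; row t holds (phi_1^(k), ..., phi_ell^(k)) on a past window
    targets_past  : the corresponding y_t values
    features_now  : the same feature vector evaluated on the current window
    """
    features_past = np.asarray(features_past, dtype=float).reshape(-1, ell)
    targets_past = np.asarray(targets_past, dtype=float)
    if targets_past.size == 0:
        c = np.zeros(ell)                               # all-zero coefficients without data
    else:
        c, *_ = np.linalg.lstsq(features_past, targets_past, rcond=None)
    m = min(n ** delta, ell)                            # truncation level min(n^delta, ell)
    return float(np.clip(np.dot(c, features_now), -m, m))
```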

Theorem 8 (BIAU ET AL. [6]) Choose $\eta_t = 1/\sqrt{t}$, suppose that $|\phi_j^{(k)}| \le 1$ and, for any fixed $k$, suppose that the set

$$\left\{\sum_{j=1}^{\ell} c_j \phi_j^{(k)}\ ;\ (c_1, \ldots, c_\ell),\ \ell = 1, 2, \ldots\right\}$$

is dense in the set of continuous functions of $d(k+1) + k$ variables. Then the generalized linear strategy defined above is universally consistent with respect to the class of all jointly stationary and ergodic processes $\{(X_n, Y_n)\}_{n=-\infty}^{\infty}$ such that $E\{Y_0^4\} < \infty$.

4.5. Prediction of Gaussian processes

We consider in this section the classical problem of Gaussian time series prediction (cf. Brockwell and Davis [8]). In this context, parametric models based on distributional assumptions and structural conditions such as AR(p), MA(q), ARMA(p,q) and ARIMA(p,d,q) are usually fitted to the data (cf. Gerencsér and Rissanen [16], Gerencsér [14, 15], Goldenshluger and Zeevi [17]). However, in the spirit of modern nonparametric inference, we try to avoid such restrictions on the process structure. Thus, we only assume that we observe a string realization $y_1^{n-1}$ of a zero-mean, stationary and ergodic Gaussian process $\{Y_n\}_{n=-\infty}^{\infty}$, and try to predict $y_n$, the value of the process at time $n$. Note that there are no side information vectors $x_1^n$ in this purely time series prediction framework.

It is well known that for Gaussian time series the best predictor is a linear function of the past:

$$E\{Y_n \mid Y_{n-1}, Y_{n-2}, \ldots\} = \sum_{j=1}^{\infty} c_j Y_{n-j},$$

where the $c_j$ minimize the criterion

$$E\left\{\left(\sum_{j=1}^{\infty} c_j Y_{n-j} - Y_n\right)^2\right\}.$$

Following Györfi and Lugosi [20], we extend the principle of generalized linear estimates to the prediction of Gaussian time series by considering the special case

$$\phi_j^{(k)}(y_{n-k}^{n-1}) = y_{n-j}\, \mathbb{I}_{\{1 \le j \le k\}},$$

i.e.,

$$\tilde{h}_n^{(k)}(y_1^{n-1}) = \sum_{j=1}^{k} c_{n,j}\, y_{n-j}.$$

Once again, the coefficients $c_{n,j}$ are calculated according to the past observations $y_1^{n-1}$ by minimizing the criterion

$$\sum_{t=k+1}^{n-1} \left(\sum_{j=1}^{k} c_j y_{t-j} - y_t\right)^2$$

if $n > k$, and are taken to be the all-zero vector otherwise.
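A minimal Python sketch of this least-squares fit (the function name is mine):

```python
import numpy as np

def fit_ar_coefficients(y, k):
    """Least-squares fit of c_{n,1}, ..., c_{n,k} from y_1, ..., y_{n-1} (Section 4.5).

    Minimizes sum_{t=k+1}^{n-1} (sum_j c_j * y_{t-j} - y_t)^2; returns the all-zero
    vector when there is no past data to regress on. y is (y_1, ..., y_{n-1}), 0-indexed."""
    y = np.asarray(y, dtype=float)
    n = len(y) + 1
    if n <= k + 1:                        # empty regression: fall back to the zero vector
        return np.zeros(k)
    rows, targets = [], []
    for t in range(k + 1, n):             # 1-based t = k+1, ..., n-1
        rows.append([y[t - j - 1] for j in range(1, k + 1)])   # (y_{t-1}, ..., y_{t-k})
        targets.append(y[t - 1])                               # y_t
    c, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(targets), rcond=None)
    return c
```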

With respect to the combination of the elementary experts $\tilde{h}^{(k)}$, Györfi and Lugosi applied in [20] the so-called "doubling trick", which means that the time axis is segmented into exponentially increasing epochs and the forecaster is reset at the beginning of each epoch.

In this section we propose a much simpler procedure which, in particular, avoids the doubling trick. To begin, we set

$$h_n^{(k)}(y_1^{n-1}) = T_{\min\{n^\delta,\,k\}}\big(\tilde{h}_n^{(k)}(y_1^{n-1})\big),$$

where $0 < \delta < \frac{1}{8}$, and combine these experts as before. Precisely, let $\{q_k\}$ be an arbitrary probability distribution over the positive integers such that $q_k > 0$ for all $k$, and for $\eta_n > 0$ define the weights

$$w_{k,n} = q_k\, e^{-\eta_n (n-1) L_{n-1}(h^{(k)})}$$

and their normalized values

$$p_{k,n} = \frac{w_{k,n}}{\sum_{i=1}^{\infty} w_{i,n}}.$$

The prediction strategy $g$ at time $n$ is defined by

$$g_n(y_1^{n-1}) = \sum_{k=1}^{\infty} p_{k,n}\, h_n^{(k)}(y_1^{n-1}), \qquad n = 1, 2, \ldots$$
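Putting the pieces of Section 4.5 together, here is a sketch of the combined strategy at time $n$. The prior $q_k \propto 1/k^2$, the finite cut-off `max_k`, and the reuse of `fit_ar_coefficients` from the previous sketch are illustrative choices; the cumulative expert losses would be tracked across time in a full implementation:

```python
import numpy as np

def gaussian_predict(y, max_k, n, cum_losses=None, delta=0.1):
    """Combine the truncated AR experts h^(k), k = 1..max_k, by exponential weighting.

    y          : observed realization (y_1, ..., y_{n-1})
    cum_losses : (n-1) * L_{n-1}(h^(k)) for each expert (zeros if omitted)
    """
    y = np.asarray(y, dtype=float)
    eta = 1.0 / np.sqrt(n)                               # eta_n = 1/sqrt(n)
    q = np.array([1.0 / k**2 for k in range(1, max_k + 1)])
    q /= q.sum()                                         # prior q_k > 0, summing to 1
    if cum_losses is None:
        cum_losses = np.zeros(max_k)
    preds = []
    for k in range(1, max_k + 1):
        c = fit_ar_coefficients(y, k)                    # from the previous sketch
        raw = float(np.dot(c, y[-1:-k - 1:-1])) if len(y) >= k else 0.0  # sum_j c_j y_{n-j}
        m = min(n ** delta, k)                           # truncation level min(n^delta, k)
        preds.append(float(np.clip(raw, -m, m)))
    w = q * np.exp(-eta * np.asarray(cum_losses))        # weights w_{k,n}
    p = w / w.sum()                                      # normalized weights p_{k,n}
    return float(np.dot(p, preds))                       # g_n(y_1^{n-1})
```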

Theorem 9 (BIAU ET AL. [6]) Choose $\eta_t = 1/\sqrt{t}$. Then the prediction strategy $g$ defined above is universally consistent with respect to the class of all jointly stationary and ergodic zero-mean Gaussian processes $\{Y_n\}_{n=-\infty}^{\infty}$.

The following corollary shows that the strategy $g$ asymptotically provides a good estimate of the regression function in the following sense:

Corollary 3 (BIAU ET AL. [6]) Under the conditions of Theorem 9,

$$\lim_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n}\Big(E\{Y_t \mid Y_1^{t-1}\} - g_t(Y_1^{t-1})\Big)^2 = 0 \qquad \text{almost surely.}$$

Corollary 3 is expressed in terms of an almost sure Cesàro consistency. It is an open problem whether there exists a prediction rule $g$ such that

$$\lim_{n\to\infty} \Big(E\{Y_n \mid Y_1^{n-1}\} - g_n(Y_1^{n-1})\Big) = 0 \qquad \text{almost surely} \tag{44}$$

for all stationary and ergodic Gaussian processes. Schäfer [31] proved that, under some additional mild conditions on the Gaussian time series, the consistency (44) holds.
