Chapter 1

NONPARAMETRIC SEQUENTIAL PREDICTION OF STATIONARY TIME SERIES

László Györfi and György Ottucsák

Abstract

We present simple procedures for the prediction of a real valued time series with side information. For the squared loss (regression problem), we survey the basic principles of universally consistent estimates. The prediction algorithms are based on a combination of several simple predictors. We show that if the sequence is a realization of a stationary and ergodic random process then the average of squared errors converges, almost surely, to that of the optimum, given by the Bayes predictor. We offer an analogous result for the prediction of stationary Gaussian processes. These prediction strategies have some consequences for the $0-1$ loss (pattern recognition problem).


1. Introduction

We study the problem of sequential prediction of a real valued sequence. At each time instant $t = 1, 2, \dots$, the predictor is asked to guess the value of the next outcome $y_t$ of a sequence of real numbers $y_1, y_2, \dots$ with knowledge of the past $y_1^{t-1} = (y_1, \dots, y_{t-1})$ (where $y_1^0$ denotes the empty string) and the side information vectors $x_1^t = (x_1, \dots, x_t)$, where $x_t \in \mathbb{R}^d$. Thus, the predictor's estimate, at time $t$, is based on the value of $x_1^t$ and $y_1^{t-1}$. A prediction strategy is a sequence $g = \{g_t\}_{t=1}^{\infty}$ of functions
$$g_t: \left(\mathbb{R}^d\right)^t \times \mathbb{R}^{t-1} \to \mathbb{R}$$
so that the prediction formed at time $t$ is $g_t(x_1^t, y_1^{t-1})$.

In this study we assume that $(x_1, y_1), (x_2, y_2), \dots$ are realizations of the random variables $(X_1, Y_1), (X_2, Y_2), \dots$ such that $\{(X_n, Y_n)\}_{-\infty}^{\infty}$ is a jointly stationary and ergodic process.

After $n$ time instants, the normalized cumulative prediction error is
$$L_n(g) = \frac{1}{n}\sum_{t=1}^n \left(g_t(X_1^t, Y_1^{t-1}) - Y_t\right)^2.$$
Our aim is to achieve a small $L_n(g)$ when $n$ is large.
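To make the quantity concrete, here is a minimal sketch (in Python, with illustrative names such as `cumulative_loss` that do not appear in the chapter) of how the normalized cumulative prediction error $L_n(g)$ of a strategy would be computed from data; the strategy is any callable playing the role of $g_t$.

```python
import numpy as np

def cumulative_loss(predict, x, y):
    """Normalized cumulative squared prediction error L_n(g).

    predict(x_past, y_past) plays the role of g_t: it receives
    x_1^t (all side information up to time t) and y_1^{t-1}
    (outcomes strictly before time t) and returns a real prediction.
    x has shape (n, d), y has shape (n,).
    """
    n = len(y)
    losses = []
    for t in range(n):
        pred = predict(x[: t + 1], y[:t])   # g_t(x_1^t, y_1^{t-1})
        losses.append((pred - y[t]) ** 2)
    return np.mean(losses)

# Example: the trivial strategy predicting the average of past outcomes.
def mean_predictor(x_past, y_past):
    return y_past.mean() if len(y_past) > 0 else 0.0

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = rng.normal(size=200)
print(cumulative_loss(mean_predictor, x, y))
```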

For this prediction problem, an example is forecasting the daily relative prices $y_t$ of an asset, where the side information vector $x_t$ may contain information on other assets in the past days, the trading volume of the previous day, news related to the actual asset, etc. This is a widely investigated research problem. However, in the vast majority of the corresponding literature the side information is not included in the model; moreover, a parametric model (AR, MA, ARMA, ARIMA, ARCH, GARCH, etc.) is fitted to the stochastic process $\{Y_t\}$, its parameters are estimated, and a prediction is derived from the parameter estimates (cf. Tsay [36]). Formally, this approach means that there is a parameter $\theta$ such that the best predictor has the form

$$\mathbf{E}\{Y_t \mid Y_1^{t-1}\} = g_t(\theta, Y_1^{t-1}),$$
for a function $g_t$. The parameter $\theta$ is estimated from the past data $Y_1^{t-1}$, and the estimate is denoted by $\hat\theta$. Then the data-driven predictor is
$$g_t(\hat\theta, Y_1^{t-1}).$$

Here we do not assume any parametric model, so our results are fully nonparametric. This modelling is important for financial data, where the process is only approximately governed by stochastic differential equations, so parametric modelling can be weak; moreover, the error criterion of the parameter estimate (usually the maximum likelihood estimate) has no relation to the mean squared error of the derived prediction. The main aim of this research is to construct predictors, called universally consistent predictors, which are consistent for all stationary time series. Such a universal feature can be proven using recent principles of nonparametric statistics and machine learning algorithms.

The results below are given in an autoregressive framework, that is, the value $Y_t$ is predicted based on $X_1^t$ and $Y_1^{t-1}$. The fundamental limit for the predictability of the sequence can be determined based on a result of Algoet [2], who showed that for any prediction strategy $g$ and stationary ergodic process $\{(X_n, Y_n)\}_{-\infty}^{\infty}$,
$$\liminf_{n\to\infty} L_n(g) \ge L^* \quad \text{almost surely,} \tag{1}$$

where
$$L^* = \mathbf{E}\left\{\left(Y_0 - \mathbf{E}\{Y_0 \mid X_{-\infty}^0, Y_{-\infty}^{-1}\}\right)^2\right\}$$
is the minimal mean squared error of any prediction for the value of $Y_0$ based on the infinite past $X_{-\infty}^0, Y_{-\infty}^{-1}$. Note that it follows by stationarity and the martingale convergence theorem (see, e.g., Stout [34]) that
$$L^* = \lim_{n\to\infty} \mathbf{E}\left\{\left(Y_n - \mathbf{E}\{Y_n \mid X_1^n, Y_1^{n-1}\}\right)^2\right\}.$$

This lower bound motivates the following definition:

Definition 1 A prediction strategy $g$ is called universally consistent with respect to a class $\mathcal{C}$ of stationary and ergodic processes $\{(X_n, Y_n)\}_{-\infty}^{\infty}$ if, for each process in the class,
$$\lim_{n\to\infty} L_n(g) = L^* \quad \text{almost surely.}$$

Universally consistent strategies asymptotically achieve the best possible squared loss for all ergodic processes in the class. Algoet [1] and Morvai, Yakowitz, and Györfi [27] proved that there exists a prediction strategy universal with respect to the class of all bounded ergodic processes. However, the prediction strategies exhibited in these papers are either very complex or have an unreasonably slow rate of convergence even for well-behaved processes.

Next we introduce several simple prediction strategies which, apart from having the above-mentioned universal property of [1] and [27], promise much improved performance for "nice" processes. The algorithms build on a methodology worked out in recent years for the prediction of individual sequences; see Vovk [39], Feder, Merhav, and Gutman [13], Littlestone and Warmuth [25], Cesa-Bianchi et al. [9], Kivinen and Warmuth [24], Singer and Feder [32], and Merhav and Feder [26], and see Cesa-Bianchi and Lugosi [10] for a survey.

An approach similar to the one of this paper was adopted by Györfi, Lugosi, and Morvai [21], where prediction of stationary binary sequences was addressed. There they introduced a simple randomized predictor which predicts asymptotically as well as the optimal predictor for all binary ergodic processes. The present setup and results differ in several important points from those of [21]. On the one hand, special properties of the squared loss function considered here allow us to avoid randomization of the predictor and to define a significantly simpler prediction scheme. On the other hand, the possible unboundedness of a real-valued process requires special care, which we demonstrate on the example of Gaussian processes. We refer to Nobel [29], Singer and Feder [32], [33], and Yang [37], [38] for recent closely related work.

In Section 2 we survey the basic principles of nonparametric regression estimation. In Section 3 we introduce universally consistent strategies for bounded ergodic processes, based on a combination of partitioning, kernel, nearest neighbor, or generalized linear estimates. In Section 4 we consider the prediction of unbounded sequences, including ergodic Gaussian processes. In Section 5 we study the classification problem of time series.


2. Nonparametric regression estimation

2.1. The regression problem

For the prediction of time series, an important source of the basic principles is nonparametric regression. In regression analysis one considers a random vector $(X, Y)$, where $X$ is $\mathbb{R}^d$-valued and $Y$ is $\mathbb{R}$-valued, and one is interested in how the value of the so-called response variable $Y$ depends on the value of the observation vector $X$. This means that one wants to find a function $f: \mathbb{R}^d \to \mathbb{R}$ such that $f(X)$ is a "good approximation of $Y$," that is, $f(X)$ should be close to $Y$ in some sense, which is equivalent to making $|f(X)-Y|$ "small." Since $X$ and $Y$ are random, $|f(X)-Y|$ is random as well, therefore it is not clear what "small $|f(X)-Y|$" means. We can resolve this problem by introducing the so-called $L_2$ risk or mean squared error of $f$,
$$\mathbf{E}|f(X)-Y|^2,$$
and requiring it to be as small as possible.

So we are interested in a function $m^*: \mathbb{R}^d \to \mathbb{R}$ such that
$$\mathbf{E}|m^*(X)-Y|^2 = \min_{f:\mathbb{R}^d\to\mathbb{R}} \mathbf{E}|f(X)-Y|^2.$$
Such a function can be obtained explicitly as follows. Let
$$m(x) = \mathbf{E}\{Y \mid X = x\}$$
be the regression function. We will show that the regression function minimizes the $L_2$ risk.

Indeed, for an arbitrary $f: \mathbb{R}^d \to \mathbb{R}$, a version of the Steiner theorem implies that
$$\mathbf{E}|f(X)-Y|^2 = \mathbf{E}|f(X)-m(X)+m(X)-Y|^2 = \mathbf{E}|f(X)-m(X)|^2 + \mathbf{E}|m(X)-Y|^2,$$
where we have used
$$\begin{aligned}
\mathbf{E}\{(f(X)-m(X))(m(X)-Y)\}
&= \mathbf{E}\left\{\mathbf{E}\{(f(X)-m(X))(m(X)-Y) \mid X\}\right\}\\
&= \mathbf{E}\{(f(X)-m(X))\,\mathbf{E}\{m(X)-Y \mid X\}\}\\
&= \mathbf{E}\{(f(X)-m(X))(m(X)-m(X))\}\\
&= 0.
\end{aligned}$$
Hence,
$$\mathbf{E}|f(X)-Y|^2 = \int_{\mathbb{R}^d} |f(x)-m(x)|^2 \mu(dx) + \mathbf{E}|m(X)-Y|^2, \tag{2}$$
where $\mu$ denotes the distribution of $X$. The first term is called the $L_2$ error of $f$. It is always nonnegative and is zero if $f(x) = m(x)$. Therefore, $m^*(x) = m(x)$, i.e., the optimal approximation (with respect to the $L_2$ risk) of $Y$ by a function of $X$ is given by $m(X)$.
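As a quick numerical illustration of the decomposition (2), the following sketch uses a hypothetical model ($Y = \sin(X) + $ noise, so $m(x) = \sin(x)$) and checks by Monte Carlo that the $L_2$ risk of an arbitrary competitor $f$ equals its $L_2$ error plus the risk of the regression function, up to simulation error; all names and the model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X = rng.uniform(-2, 2, size=n)
Y = np.sin(X) + 0.3 * rng.normal(size=n)   # m(x) = sin(x), Var(Y|X) = 0.09

f = lambda x: 0.5 * x                       # an arbitrary competitor f
m = np.sin                                  # the regression function

risk_f = np.mean((f(X) - Y) ** 2)           # E|f(X)-Y|^2
l2_error = np.mean((f(X) - m(X)) ** 2)      # integral of |f-m|^2 dmu
risk_m = np.mean((m(X) - Y) ** 2)           # E|m(X)-Y|^2

print(risk_f, l2_error + risk_m)            # the two numbers nearly coincide
```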


2.2. Regression function estimation and $L_2$ error

In applications the distribution of $(X, Y)$ (and hence also the regression function) is usually unknown. Therefore it is impossible to predict $Y$ using $m(X)$. But it is often possible to observe data according to the distribution of $(X, Y)$ and to estimate the regression function from these data.

To be more precise, denote by $(X, Y), (X_1, Y_1), (X_2, Y_2), \dots$ independent and identically distributed (i.i.d.) random variables with $\mathbf{E}Y^2 < \infty$. Let $\mathcal{D}_n$ be the set of data defined by
$$\mathcal{D}_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}.$$
In the regression function estimation problem one wants to use the data $\mathcal{D}_n$ in order to construct an estimate $m_n: \mathbb{R}^d \to \mathbb{R}$ of the regression function $m$. Here $m_n(x) = m_n(x, \mathcal{D}_n)$ is a measurable function of $x$ and the data. For simplicity, we will suppress $\mathcal{D}_n$ in the notation and write $m_n(x)$ instead of $m_n(x, \mathcal{D}_n)$.

In general, estimates will not be equal to the regression function. To compare different estimates, we need an error criterion which measures the difference between the regression function and an arbitrary estimate $m_n$. One of the key points we would like to make is that the motivation for introducing the regression function leads naturally to an $L_2$ error criterion for measuring the performance of the regression function estimate. Recall that the main goal was to find a function $f$ such that the $L_2$ risk $\mathbf{E}|f(X)-Y|^2$ is small. The minimal value of this $L_2$ risk is $\mathbf{E}|m(X)-Y|^2$, and it is achieved by the regression function $m$. Similarly to (2), one can show that the $L_2$ risk $\mathbf{E}\{|m_n(X)-Y|^2 \mid \mathcal{D}_n\}$ of an estimate $m_n$ satisfies
$$\mathbf{E}\{|m_n(X)-Y|^2 \mid \mathcal{D}_n\} = \int_{\mathbb{R}^d} |m_n(x)-m(x)|^2\mu(dx) + \mathbf{E}|m(X)-Y|^2. \tag{3}$$
Thus the $L_2$ risk of an estimate $m_n$ is close to the optimal value if and only if the $L_2$ error
$$\int_{\mathbb{R}^d} |m_n(x)-m(x)|^2\mu(dx) \tag{4}$$
is close to zero. Therefore we will use the $L_2$ error (4) in order to measure the quality of an estimate, and we will study estimates for which this $L_2$ error is small.

In this section we describe the basic principles of nonparametric regression estimation: local averaging, local modelling, global modelling (or least squares estimation), and penalized modelling. (Concerning the details see Györfi et al. [19].)

Recall that the data can be written as
$$Y_i = m(X_i) + \epsilon_i,$$
where $\epsilon_i = Y_i - m(X_i)$ satisfies $\mathbf{E}(\epsilon_i \mid X_i) = 0$. Thus $Y_i$ can be considered as the sum of the value of the regression function at $X_i$ and some error $\epsilon_i$, where the expected value of the error is zero. This motivates the construction of estimates by local averaging, i.e., estimation of $m(x)$ by the average of those $Y_i$ where $X_i$ is "close" to $x$. Such an estimate can be written as
$$m_n(x) = \sum_{i=1}^n W_{n,i}(x)\cdot Y_i,$$
where the weights $W_{n,i}(x) = W_{n,i}(x, X_1, \dots, X_n) \in \mathbb{R}$ depend on $X_1, \dots, X_n$. Usually the weights are nonnegative and $W_{n,i}(x)$ is "small" if $X_i$ is "far" from $x$.

2.3. Partitioning estimate

An example of such an estimate is the partitioning estimate. Here one chooses a finite or countably infinite partition $\mathcal{P}_n = \{A_{n,1}, A_{n,2}, \dots\}$ of $\mathbb{R}^d$ consisting of cells $A_{n,j} \subseteq \mathbb{R}^d$ and defines, for $x \in A_{n,j}$, the estimate by averaging the $Y_i$'s whose corresponding $X_i$'s fall in $A_{n,j}$, i.e.,
$$m_n(x) = \frac{\sum_{i=1}^n I_{\{X_i \in A_{n,j}\}} Y_i}{\sum_{i=1}^n I_{\{X_i \in A_{n,j}\}}} \quad \text{for } x \in A_{n,j}, \tag{5}$$
where $I_A$ denotes the indicator function of the set $A$, so
$$W_{n,i}(x) = \frac{I_{\{X_i \in A_{n,j}\}}}{\sum_{l=1}^n I_{\{X_l \in A_{n,j}\}}} \quad \text{for } x \in A_{n,j}.$$
Here and in the following we use the convention $\frac{0}{0} = 0$. In order to have consistency, on the one hand we need the cells $A_{n,j}$ to be "small", and on the other hand the number of non-zero terms in the denominator of (5) to be "large". These requirements can be satisfied if the sequence of partitions $\mathcal{P}_n$ is asymptotically fine, i.e., if
$$\mathrm{diam}(A) = \sup_{x,y \in A} \|x - y\|$$
denotes the diameter of a set, then for each sphere $S$ centered at the origin
$$\lim_{n\to\infty} \max_{j:\,A_{n,j}\cap S \neq \emptyset} \mathrm{diam}(A_{n,j}) = 0
\quad\text{and}\quad
\lim_{n\to\infty} \frac{|\{j: A_{n,j}\cap S \neq \emptyset\}|}{n} = 0.$$

For the partition $\mathcal{P}_n$, the most important example is when the cells $A_{n,j}$ are cubes of volume $h_n^d$. For cubic partitions, the consistency conditions above mean that
$$\lim_{n\to\infty} h_n = 0 \quad \text{and} \quad \lim_{n\to\infty} n h_n^d = \infty. \tag{6}$$
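A minimal sketch of the cubic partitioning estimate (5): the cells are cubes of side $h$, obtained here by flooring the coordinates, and the estimate at $x$ is the average of the $Y_i$'s in the cell of $x$, with the convention $0/0 = 0$. Function and variable names are illustrative only.

```python
import numpy as np
from collections import defaultdict

def partitioning_estimate(x_train, y_train, x_query, h):
    """Cubic partitioning regression estimate with side length h."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for xi, yi in zip(x_train, y_train):
        cell = tuple(np.floor(xi / h).astype(int))   # index of the cube containing xi
        sums[cell] += yi
        counts[cell] += 1
    preds = []
    for xq in x_query:
        cell = tuple(np.floor(xq / h).astype(int))
        c = counts[cell]
        preds.append(sums[cell] / c if c > 0 else 0.0)  # convention 0/0 = 0
    return np.array(preds)

# Example with d = 2 and the bandwidth choice h_n ~ n^{-1/(d+2)} of Proposition 1 below.
rng = np.random.default_rng(2)
n, d = 2000, 2
X = rng.uniform(-1, 1, size=(n, d))
Y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.2 * rng.normal(size=n)
h = n ** (-1.0 / (d + 2))
print(partitioning_estimate(X, Y, X[:5], h))
```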

Next we bound the rate of convergence of $\mathbf{E}\|m_n - m\|^2$ for cubic partitions and regression functions which are Lipschitz continuous.

Proposition 1 For a cubic partition with side length $h_n$ assume that
$$\mathrm{Var}(Y \mid X = x) \le \sigma^2, \quad x \in \mathbb{R}^d,$$
$$|m(x) - m(z)| \le C\|x - z\|, \quad x, z \in \mathbb{R}^d, \tag{7}$$
and that $X$ has a compact support $S$. Then
$$\mathbf{E}\|m_n - m\|^2 \le \frac{c_1}{n\, h_n^d} + d\cdot C^2\cdot h_n^2,$$
thus for
$$h_n = c_2 n^{-\frac{1}{d+2}}$$
we get
$$\mathbf{E}\|m_n - m\|^2 \le c_3 n^{-2/(d+2)}.$$

In order to prove Proposition 1 we need the following technical lemma. An integer-valued random variable $B(n,p)$ is said to be binomially distributed with parameters $n$ and $0 \le p \le 1$ if
$$\mathbf{P}\{B(n,p) = k\} = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \dots, n.$$

Lemma 1 Let the random variable $B(n,p)$ be binomially distributed with parameters $n$ and $p$. Then:
(i)
$$\mathbf{E}\left\{\frac{1}{1+B(n,p)}\right\} \le \frac{1}{(n+1)p},$$
(ii)
$$\mathbf{E}\left\{\frac{1}{B(n,p)}\, I_{\{B(n,p)>0\}}\right\} \le \frac{2}{(n+1)p}.$$

Proof. Part (i) follows from the following simple calculation:
$$\begin{aligned}
\mathbf{E}\left\{\frac{1}{1+B(n,p)}\right\}
&= \sum_{k=0}^n \frac{1}{k+1}\binom{n}{k} p^k(1-p)^{n-k}\\
&= \frac{1}{(n+1)p}\sum_{k=0}^n \binom{n+1}{k+1} p^{k+1}(1-p)^{n-k}\\
&\le \frac{1}{(n+1)p}\sum_{k=0}^{n+1} \binom{n+1}{k} p^{k}(1-p)^{n-k+1}\\
&= \frac{1}{(n+1)p}\,(p + (1-p))^{n+1}\\
&= \frac{1}{(n+1)p}.
\end{aligned}$$
For (ii) we have
$$\mathbf{E}\left\{\frac{1}{B(n,p)}\, I_{\{B(n,p)>0\}}\right\} \le \mathbf{E}\left\{\frac{2}{1+B(n,p)}\right\} \le \frac{2}{(n+1)p}$$
by (i).
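The bound of Lemma 1(i) is easy to check numerically; the short simulation below is purely illustrative and compares a Monte Carlo estimate of $\mathbf{E}\{1/(1+B(n,p))\}$ with the bound $1/((n+1)p)$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, trials = 50, 0.1, 200_000
B = rng.binomial(n, p, size=trials)
print(np.mean(1.0 / (1.0 + B)))   # Monte Carlo estimate of E{1/(1+B(n,p))}
print(1.0 / ((n + 1) * p))        # the upper bound of Lemma 1(i)
```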


Proof of Proposition 1. Set
$$\hat m_n(x) = \mathbf{E}\{m_n(x)\mid X_1,\dots,X_n\} = \frac{\sum_{i=1}^n m(X_i)\, I_{\{X_i\in A_n(x)\}}}{n\mu_n(A_n(x))},$$
where $A_n(x)$ denotes the cell of $\mathcal{P}_n$ containing $x$ and $\mu_n$ denotes the empirical distribution for $X_1,\dots,X_n$. Then
$$\mathbf{E}\{(m_n(x)-m(x))^2\mid X_1,\dots,X_n\} = \mathbf{E}\{(m_n(x)-\hat m_n(x))^2\mid X_1,\dots,X_n\} + (\hat m_n(x)-m(x))^2. \tag{8}$$
We have
$$\begin{aligned}
\mathbf{E}\{(m_n(x)-\hat m_n(x))^2\mid X_1,\dots,X_n\}
&= \mathbf{E}\left\{\left(\frac{\sum_{i=1}^n (Y_i-m(X_i))\, I_{\{X_i\in A_n(x)\}}}{n\mu_n(A_n(x))}\right)^2 \,\Bigg|\, X_1,\dots,X_n\right\}\\
&= \frac{\sum_{i=1}^n \mathrm{Var}(Y_i\mid X_i)\, I_{\{X_i\in A_n(x)\}}}{(n\mu_n(A_n(x)))^2}\\
&\le \frac{\sigma^2}{n\mu_n(A_n(x))}\, I_{\{n\mu_n(A_n(x))>0\}}.
\end{aligned}$$
By Jensen's inequality,
$$\begin{aligned}
(\hat m_n(x)-m(x))^2
&= \left(\frac{\sum_{i=1}^n (m(X_i)-m(x))\, I_{\{X_i\in A_n(x)\}}}{n\mu_n(A_n(x))}\right)^2 I_{\{n\mu_n(A_n(x))>0\}} + m(x)^2 I_{\{n\mu_n(A_n(x))=0\}}\\
&\le \frac{\sum_{i=1}^n (m(X_i)-m(x))^2\, I_{\{X_i\in A_n(x)\}}}{n\mu_n(A_n(x))}\, I_{\{n\mu_n(A_n(x))>0\}} + m(x)^2 I_{\{n\mu_n(A_n(x))=0\}}\\
&\le d\,C^2 h_n^2\, I_{\{n\mu_n(A_n(x))>0\}} + m(x)^2 I_{\{n\mu_n(A_n(x))=0\}}
\qquad\left(\text{by (7) and } \max_{z\in A_n(x)}\|x-z\|^2 \le d\, h_n^2\right)\\
&\le d\,C^2 h_n^2 + m(x)^2 I_{\{n\mu_n(A_n(x))=0\}}.
\end{aligned}$$
Without loss of generality assume that $S$ is a cube and that the union of $A_{n,1},\dots,A_{n,l_n}$ is $S$. Then
$$l_n \le \frac{\tilde c}{h_n^d}$$

for some constant $\tilde c$ proportional to the volume of $S$ and, by Lemma 1 and (8),
$$\begin{aligned}
\mathbf{E}\int (m_n(x)-m(x))^2\mu(dx)
&= \mathbf{E}\int (m_n(x)-\hat m_n(x))^2\mu(dx) + \mathbf{E}\int (\hat m_n(x)-m(x))^2\mu(dx)\\
&= \sum_{j=1}^{l_n} \mathbf{E}\left\{\int_{A_{n,j}} (m_n(x)-\hat m_n(x))^2\mu(dx)\right\} + \sum_{j=1}^{l_n} \mathbf{E}\left\{\int_{A_{n,j}} (\hat m_n(x)-m(x))^2\mu(dx)\right\}\\
&\le \sum_{j=1}^{l_n} \mathbf{E}\left\{\frac{\sigma^2\mu(A_{n,j})}{n\mu_n(A_{n,j})}\, I_{\{\mu_n(A_{n,j})>0\}}\right\} + d\,C^2 h_n^2 + \sum_{j=1}^{l_n} \mathbf{E}\left\{\int_{A_{n,j}} m(x)^2\mu(dx)\, I_{\{\mu_n(A_{n,j})=0\}}\right\}\\
&\le \sum_{j=1}^{l_n} \frac{2\sigma^2\mu(A_{n,j})}{n\mu(A_{n,j})} + d\,C^2 h_n^2 + \sum_{j=1}^{l_n} \int_{A_{n,j}} m(x)^2\mu(dx)\,\mathbf{P}\{\mu_n(A_{n,j})=0\}\\
&\le \frac{l_n 2\sigma^2}{n} + d\,C^2 h_n^2 + \sup_{z\in S} m(z)^2 \sum_{j=1}^{l_n} \mu(A_{n,j})(1-\mu(A_{n,j}))^n\\
&\le \frac{l_n 2\sigma^2}{n} + d\,C^2 h_n^2 + \frac{l_n \sup_{z\in S} m(z)^2}{n}\,\sup_j n\mu(A_{n,j}) e^{-n\mu(A_{n,j})}\\
&\le \frac{l_n 2\sigma^2}{n} + d\,C^2 h_n^2 + \frac{l_n \sup_{z\in S} m(z)^2\, e^{-1}}{n}
\qquad\left(\text{since } \sup_z z e^{-z} = e^{-1}\right)\\
&\le \frac{\left(2\sigma^2 + \sup_{z\in S} m(z)^2\, e^{-1}\right)\tilde c}{n h_n^d} + d\,C^2 h_n^2.
\end{aligned}$$

2.4. Kernel estimate

The second example of a local averaging estimate is the Nadaraya–Watson kernel estimate.

Let $K: \mathbb{R}^d \to \mathbb{R}_+$ be a function called the kernel function, and let $h > 0$ be a bandwidth. The kernel estimate is defined by
$$m_n(x) = \frac{\sum_{i=1}^n K\!\left(\frac{x-X_i}{h}\right) Y_i}{\sum_{i=1}^n K\!\left(\frac{x-X_i}{h}\right)}, \tag{9}$$
so
$$W_{n,i}(x) = \frac{K\!\left(\frac{x-X_i}{h}\right)}{\sum_{j=1}^n K\!\left(\frac{x-X_j}{h}\right)}.$$
Here the estimate is a weighted average of the $Y_i$, where the weight of $Y_i$ (i.e., the influence of $Y_i$ on the value of the estimate at $x$) depends on the distance between $X_i$ and $x$. For the bandwidth $h = h_n$, the consistency conditions are (6). If one uses the so-called naive kernel (or window kernel) $K(x) = I_{\{\|x\|\le 1\}}$, then
$$m_n(x) = \frac{\sum_{i=1}^n I_{\{\|x-X_i\|\le h\}} Y_i}{\sum_{i=1}^n I_{\{\|x-X_i\|\le h\}}},$$
i.e., one estimates $m(x)$ by averaging the $Y_i$'s such that the distance between $X_i$ and $x$ is not greater than $h$.
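A minimal sketch of the kernel estimate (9) with the naive kernel: the prediction at $x$ is the average of the $Y_i$'s whose $X_i$ lies within distance $h$ of $x$, again with the convention $0/0 = 0$. Names are illustrative only.

```python
import numpy as np

def naive_kernel_estimate(x_train, y_train, x_query, h):
    """Nadaraya-Watson estimate with the window kernel K(x) = I{||x|| <= 1}."""
    preds = []
    for xq in x_query:
        dist = np.linalg.norm(x_train - xq, axis=1)
        inside = dist <= h                      # indicator I{||x - X_i|| <= h}
        k = inside.sum()
        preds.append(y_train[inside].mean() if k > 0 else 0.0)  # 0/0 = 0
    return np.array(preds)

# Usage with the bandwidth of Proposition 2 below, h_n ~ n^{-1/(d+2)}.
rng = np.random.default_rng(4)
n, d = 2000, 2
X = rng.uniform(-1, 1, size=(n, d))
Y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.2 * rng.normal(size=n)
h = n ** (-1.0 / (d + 2))
print(naive_kernel_estimate(X, Y, X[:5], h))
```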

In the sequel we bound the rate of convergence of $\mathbf{E}\|m_n - m\|^2$ for a naive kernel and a Lipschitz continuous regression function.

Proposition 2 For a kernel estimate with a naive kernel assume that
$$\mathrm{Var}(Y\mid X=x)\le\sigma^2, \quad x\in\mathbb{R}^d,$$
and
$$|m(x)-m(z)|\le C\|x-z\|,\quad x,z\in\mathbb{R}^d,$$
and that $X$ has a compact support $S$. Then
$$\mathbf{E}\|m_n-m\|^2 \le \frac{c_1}{n\, h_n^d} + C^2 h_n^2,$$
thus for
$$h_n = c_2 n^{-\frac{1}{d+2}}$$
we have
$$\mathbf{E}\|m_n-m\|^2 \le c_3 n^{-2/(d+2)}.$$

Proof. We proceed similarly to Proposition 1. Denoting by $S_{x,h}$ the closed ball of radius $h$ centered at $x$, put
$$\hat m_n(x) = \frac{\sum_{i=1}^n m(X_i)\, I_{\{X_i\in S_{x,h_n}\}}}{n\mu_n(S_{x,h_n})};$$
then we have the decomposition (8). If $B_n(x) = \{n\mu_n(S_{x,h_n}) > 0\}$, then
$$\begin{aligned}
\mathbf{E}\{(m_n(x)-\hat m_n(x))^2\mid X_1,\dots,X_n\}
&= \mathbf{E}\left\{\left(\frac{\sum_{i=1}^n (Y_i-m(X_i))\, I_{\{X_i\in S_{x,h_n}\}}}{n\mu_n(S_{x,h_n})}\right)^2 \,\Bigg|\, X_1,\dots,X_n\right\}\\
&= \frac{\sum_{i=1}^n \mathrm{Var}(Y_i\mid X_i)\, I_{\{X_i\in S_{x,h_n}\}}}{(n\mu_n(S_{x,h_n}))^2}\\
&\le \frac{\sigma^2}{n\mu_n(S_{x,h_n})}\, I_{B_n(x)}.
\end{aligned}$$

By Jensen's inequality and the Lipschitz property of $m$,
$$\begin{aligned}
(\hat m_n(x)-m(x))^2
&= \left(\frac{\sum_{i=1}^n (m(X_i)-m(x))\, I_{\{X_i\in S_{x,h_n}\}}}{n\mu_n(S_{x,h_n})}\right)^2 I_{B_n(x)} + m(x)^2 I_{B_n(x)^c}\\
&\le \frac{\sum_{i=1}^n (m(X_i)-m(x))^2\, I_{\{X_i\in S_{x,h_n}\}}}{n\mu_n(S_{x,h_n})}\, I_{B_n(x)} + m(x)^2 I_{B_n(x)^c}\\
&\le C^2 h_n^2\, I_{B_n(x)} + m(x)^2 I_{B_n(x)^c}\\
&\le C^2 h_n^2 + m(x)^2 I_{B_n(x)^c}.
\end{aligned}$$
Using this, together with Lemma 1,

$$\begin{aligned}
\mathbf{E}\int (m_n(x)-m(x))^2\mu(dx)
&= \mathbf{E}\int (m_n(x)-\hat m_n(x))^2\mu(dx) + \mathbf{E}\int (\hat m_n(x)-m(x))^2\mu(dx)\\
&\le \int_S \mathbf{E}\left\{\frac{\sigma^2}{n\mu_n(S_{x,h_n})}\, I_{\{\mu_n(S_{x,h_n})>0\}}\right\}\mu(dx) + C^2 h_n^2 + \int_S \mathbf{E}\left\{m(x)^2 I_{\{\mu_n(S_{x,h_n})=0\}}\right\}\mu(dx)\\
&\le \int_S \frac{2\sigma^2}{n\mu(S_{x,h_n})}\mu(dx) + C^2 h_n^2 + \int_S m(x)^2(1-\mu(S_{x,h_n}))^n\mu(dx)\\
&\le \int_S \frac{2\sigma^2}{n\mu(S_{x,h_n})}\mu(dx) + C^2 h_n^2 + \sup_{z\in S} m(z)^2 \int_S e^{-n\mu(S_{x,h_n})}\mu(dx)\\
&\le 2\sigma^2 \int_S \frac{1}{n\mu(S_{x,h_n})}\mu(dx) + C^2 h_n^2 + \sup_{z\in S} m(z)^2\, \max_u u e^{-u} \int_S \frac{1}{n\mu(S_{x,h_n})}\mu(dx).
\end{aligned}$$
We can find $z_1,\dots,z_{M_n}$ such that the union of $S_{z_1,h_n/2},\dots,S_{z_{M_n},h_n/2}$ covers $S$, and
$$M_n \le \frac{\tilde c}{h_n^d}.$$
Then
$$\begin{aligned}
\int_S \frac{1}{n\mu(S_{x,h_n})}\mu(dx)
&\le \sum_{j=1}^{M_n} \int \frac{I_{\{x\in S_{z_j,h_n/2}\}}}{n\mu(S_{x,h_n})}\mu(dx)\\
&\le \sum_{j=1}^{M_n} \int \frac{I_{\{x\in S_{z_j,h_n/2}\}}}{n\mu(S_{z_j,h_n/2})}\mu(dx)\\
&\le \frac{M_n}{n}\\
&\le \frac{\tilde c}{n h_n^d}.
\end{aligned}$$
Combining these inequalities, the proof is complete.

2.5. Nearest neighbor estimate

Our final example of local averaging estimates is the $k$-nearest neighbor ($k$-NN) estimate. Here one determines the $k$ nearest $X_i$'s to $x$ in terms of the distance $\|x-X_i\|$ and estimates $m(x)$ by the average of the corresponding $Y_i$'s. More precisely, for $x\in\mathbb{R}^d$, let
$$(X_{(1)}(x), Y_{(1)}(x)), \dots, (X_{(n)}(x), Y_{(n)}(x))$$
be a permutation of
$$(X_1, Y_1), \dots, (X_n, Y_n)$$
such that
$$\|x - X_{(1)}(x)\| \le \dots \le \|x - X_{(n)}(x)\|.$$
The $k$-NN estimate is defined by
$$m_n(x) = \frac{1}{k}\sum_{i=1}^k Y_{(i)}(x). \tag{10}$$
Here the weight $W_{n,i}(x)$ equals $1/k$ if $X_i$ is among the $k$ nearest neighbors of $x$, and equals $0$ otherwise. If $k = k_n\to\infty$ such that $k_n/n\to 0$, then the $k$-nearest-neighbor regression estimate is consistent.
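A minimal sketch of the $k$-NN estimate (10): sort the sample by distance to $x$ and average the $Y$-values of the $k$ closest points. Names are illustrative; ties are resolved by the sort order.

```python
import numpy as np

def knn_estimate(x_train, y_train, x_query, k):
    """k-nearest-neighbor regression estimate."""
    preds = []
    for xq in x_query:
        dist = np.linalg.norm(x_train - xq, axis=1)
        nearest = np.argsort(dist)[:k]       # indices of the k nearest X_i's
        preds.append(y_train[nearest].mean())
    return np.array(preds)

# Usage with the choice k_n ~ n^{2/(d+2)} of Proposition 3 below.
rng = np.random.default_rng(5)
n, d = 2000, 3
X = rng.uniform(-1, 1, size=(n, d))
Y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.2 * rng.normal(size=n)
k = max(1, int(n ** (2.0 / (d + 2))))
print(knn_estimate(X, Y, X[:5], k))
```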

Next we bound the rate of convergence of $\mathbf{E}\|m_n - m\|^2$ for a $k_n$-nearest neighbor estimate.

Proposition 3 Assume that $X$ is bounded,
$$\sigma^2(x) = \mathrm{Var}(Y\mid X=x) \le \sigma^2 \quad (x\in\mathbb{R}^d)$$
and
$$|m(x)-m(z)| \le C\|x-z\| \quad (x,z\in\mathbb{R}^d).$$
Assume that $d \ge 3$, and let $m_n$ be the $k_n$-NN estimate. Then
$$\mathbf{E}\|m_n-m\|^2 \le \frac{\sigma^2}{k_n} + c_1\left(\frac{k_n}{n}\right)^{2/d},$$
thus for $k_n = c_2 n^{\frac{2}{d+2}}$,
$$\mathbf{E}\|m_n-m\|^2 \le c_3 n^{-\frac{2}{d+2}}.$$

For the proof of Proposition 3 we need the rate of convergence of nearest neighbor distances. Below, $X_{(1,n)}(x)$ denotes the first nearest neighbor of $x$ among $X_1,\dots,X_n$.

Lemma 2 Assume that $X$ is bounded. If $d \ge 3$, then
$$\mathbf{E}\{\|X_{(1,n)}(X)-X\|^2\} \le \frac{\tilde c}{n^{2/d}}.$$

Proof. For fixed $\epsilon > 0$,
$$\mathbf{P}\{\|X_{(1,n)}(X)-X\| > \epsilon\} = \mathbf{E}\{(1-\mu(S_{X,\epsilon}))^n\}.$$
Let $A_1,\dots,A_{N(\epsilon)}$ be a cubic partition of the bounded support of $\mu$ such that the $A_j$'s have diameter $\epsilon$ and
$$N(\epsilon) \le \frac{c}{\epsilon^d}.$$
If $x\in A_j$, then $A_j \subset S_{x,\epsilon}$, therefore
$$\mathbf{E}\{(1-\mu(S_{X,\epsilon}))^n\} = \sum_{j=1}^{N(\epsilon)} \int_{A_j} (1-\mu(S_{x,\epsilon}))^n\mu(dx) \le \sum_{j=1}^{N(\epsilon)} \int_{A_j} (1-\mu(A_j))^n\mu(dx) = \sum_{j=1}^{N(\epsilon)} \mu(A_j)(1-\mu(A_j))^n.$$
Obviously,
$$\sum_{j=1}^{N(\epsilon)} \mu(A_j)(1-\mu(A_j))^n \le \sum_{j=1}^{N(\epsilon)} \max_z z(1-z)^n \le \sum_{j=1}^{N(\epsilon)} \max_z z e^{-nz} = \frac{e^{-1}N(\epsilon)}{n}.$$
If $L$ stands for the diameter of the support of $\mu$, then
$$\begin{aligned}
\mathbf{E}\{\|X_{(1,n)}(X)-X\|^2\}
&= \int_0^\infty \mathbf{P}\{\|X_{(1,n)}(X)-X\|^2 > \epsilon\}\,d\epsilon\\
&= \int_0^{L^2} \mathbf{P}\{\|X_{(1,n)}(X)-X\| > \sqrt\epsilon\}\,d\epsilon\\
&\le \int_0^{L^2} \min\left\{1, \frac{e^{-1}N(\sqrt\epsilon)}{n}\right\}d\epsilon\\
&\le \int_0^{L^2} \min\left\{1, \frac{c}{e n \epsilon^{d/2}}\right\}d\epsilon\\
&= \int_0^{(c/(en))^{2/d}} 1\,d\epsilon + \frac{c}{en}\int_{(c/(en))^{2/d}}^{L^2} \epsilon^{-d/2}\,d\epsilon\\
&\le \frac{\tilde c}{n^{2/d}}
\end{aligned}$$
for $d \ge 3$.

Proof of Proposition 3. We have the decomposition
$$\begin{aligned}
\mathbf{E}\{(m_n(x)-m(x))^2\}
&= \mathbf{E}\{(m_n(x)-\mathbf{E}\{m_n(x)\mid X_1,\dots,X_n\})^2\} + \mathbf{E}\{(\mathbf{E}\{m_n(x)\mid X_1,\dots,X_n\}-m(x))^2\}\\
&= I_1(x) + I_2(x).
\end{aligned}$$
The first term is easier:
$$I_1(x) = \mathbf{E}\left\{\left(\frac{1}{k_n}\sum_{i=1}^{k_n} \left(Y_{(i,n)}(x)-m(X_{(i,n)}(x))\right)\right)^2\right\} = \mathbf{E}\left\{\frac{1}{k_n^2}\sum_{i=1}^{k_n} \sigma^2(X_{(i,n)}(x))\right\} \le \frac{\sigma^2}{k_n}.$$
For the second term,
$$\begin{aligned}
I_2(x) &= \mathbf{E}\left\{\left(\frac{1}{k_n}\sum_{i=1}^{k_n} \left(m(X_{(i,n)}(x))-m(x)\right)\right)^2\right\}\\
&\le \mathbf{E}\left\{\left(\frac{1}{k_n}\sum_{i=1}^{k_n} |m(X_{(i,n)}(x))-m(x)|\right)^2\right\}\\
&\le \mathbf{E}\left\{\left(\frac{1}{k_n}\sum_{i=1}^{k_n} C\|X_{(i,n)}(x)-x\|\right)^2\right\}.
\end{aligned}$$
Put $N = k_n\lfloor n/k_n\rfloor$. Split the data $X_1,\dots,X_n$ into $k_n+1$ segments such that the first $k_n$ segments have length $\lfloor n/k_n\rfloor$, and let $\tilde X_j^x$ be the first nearest neighbor of $x$ from the $j$th segment. Then $\tilde X_1^x,\dots,\tilde X_{k_n}^x$ are $k_n$ different elements of $\{X_1,\dots,X_n\}$, which implies
$$\sum_{i=1}^{k_n} \|X_{(i,n)}(x)-x\| \le \sum_{j=1}^{k_n} \|\tilde X_j^x - x\|,$$
therefore, by Jensen's inequality,
$$\begin{aligned}
I_2(x) &\le C^2\, \mathbf{E}\left\{\left(\frac{1}{k_n}\sum_{j=1}^{k_n} \|\tilde X_j^x-x\|\right)^2\right\}\\
&\le C^2\, \frac{1}{k_n}\sum_{j=1}^{k_n} \mathbf{E}\left\{\|\tilde X_j^x-x\|^2\right\}\\
&= C^2\, \mathbf{E}\left\{\|\tilde X_1^x-x\|^2\right\}\\
&= C^2\, \mathbf{E}\left\{\|X_{(1,\lfloor n/k_n\rfloor)}(x)-x\|^2\right\}.
\end{aligned}$$
Thus, by Lemma 2,
$$\frac{1}{C^2}\left\lfloor\frac{n}{k_n}\right\rfloor^{2/d}\int I_2(x)\mu(dx) \le \left\lfloor\frac{n}{k_n}\right\rfloor^{2/d} \mathbf{E}\left\{\|X_{(1,\lfloor n/k_n\rfloor)}(X)-X\|^2\right\} \le \mathrm{const.}$$

2.6. Empirical error minimization

A generalization of the partitioning estimate leads to global modelling or least squares estimates. Let $\mathcal{P}_n = \{A_{n,1}, A_{n,2}, \dots\}$ be a partition of $\mathbb{R}^d$ and let $\mathcal{F}_n$ be the set of all piecewise constant functions with respect to that partition, i.e.,
$$\mathcal{F}_n = \left\{\sum_j a_j I_{A_{n,j}}:\ a_j \in \mathbb{R}\right\}. \tag{11}$$
Then it is easy to see that the partitioning estimate (5) satisfies
$$m_n(\cdot) = \arg\min_{f\in\mathcal{F}_n}\left\{\frac{1}{n}\sum_{i=1}^n |f(X_i)-Y_i|^2\right\}. \tag{12}$$
Hence it minimizes the empirical $L_2$ risk
$$\frac{1}{n}\sum_{i=1}^n |f(X_i)-Y_i|^2 \tag{13}$$
over $\mathcal{F}_n$. Least squares estimates are defined by minimizing the empirical $L_2$ risk over a general set of functions $\mathcal{F}_n$ (instead of (11)). Observe that it does not make sense to minimize (13) over all functions $f$, because this may lead to a function which interpolates the data and hence is not a reasonable estimate. Thus one has to restrict the set of functions over which one minimizes the empirical $L_2$ risk. Examples of possible choices of the set $\mathcal{F}_n$ are sets of piecewise polynomials with respect to a partition $\mathcal{P}_n$, or sets of smooth piecewise polynomials (splines). The use of spline spaces ensures that the estimate is a smooth function. An important class of least squares estimates is the generalized linear estimates. Let $\{\phi_j\}_{j=1}^\infty$ be real-valued functions defined on $\mathbb{R}^d$ and let $\mathcal{F}_n$ be defined by
$$\mathcal{F}_n = \left\{f;\ f = \sum_{j=1}^{\ell_n} c_j\phi_j\right\}.$$
Then the generalized linear estimate is defined by
$$m_n(\cdot) = \arg\min_{f\in\mathcal{F}_n}\left\{\frac{1}{n}\sum_{i=1}^n (f(X_i)-Y_i)^2\right\} = \arg\min_{c_1,\dots,c_{\ell_n}}\left\{\frac{1}{n}\sum_{i=1}^n\left(\sum_{j=1}^{\ell_n} c_j\phi_j(X_i)-Y_i\right)^2\right\}.$$

If the set
$$\left\{\sum_{j=1}^{\ell} c_j\phi_j;\ (c_1,\dots,c_\ell),\ \ell = 1,2,\dots\right\}$$
is dense in the set of continuous functions of $d$ variables, $\ell_n\to\infty$, and $\ell_n/n\to 0$, then the generalized linear regression estimate defined above is consistent. Other examples of least squares estimates include neural networks, radial basis functions, and orthogonal series estimates.
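A minimal sketch of the generalized linear estimate: with basis functions $\phi_1,\dots,\phi_{\ell_n}$ the minimization over the coefficients is an ordinary least squares problem, solved below with `numpy.linalg.lstsq`. The polynomial basis for $d = 1$ and all names are illustrative choices, not the chapter's.

```python
import numpy as np

def generalized_linear_estimate(x_train, y_train, x_query, basis):
    """Least squares fit over span{phi_1, ..., phi_l}; basis maps x -> feature vector."""
    Phi = np.array([basis(xi) for xi in x_train])         # n x l design matrix
    coef, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)  # same minimizer as the empirical L2 risk
    Phi_q = np.array([basis(xq) for xq in x_query])
    return Phi_q @ coef

# Example: polynomial basis phi_j(x) = x^j, j = 0, ..., l_n - 1, for d = 1,
# with l_n growing slowly (l_n/n -> 0), as the consistency condition requires.
rng = np.random.default_rng(6)
n = 2000
X = rng.uniform(-1, 1, size=n)
Y = np.sin(3 * X) + 0.2 * rng.normal(size=n)
l_n = max(2, int(n ** 0.25))
basis = lambda x: np.array([x ** j for j in range(l_n)])
print(generalized_linear_estimate(X, Y, X[:5], basis))
```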

Next we bound the rate of convergence of empirical error minimization estimates.

Condition (sG). The error $\varepsilon := Y - m(X)$ is a subGaussian random variable, that is, there exist constants $\lambda > 0$ and $\Lambda < \infty$ with
$$\mathbf{E}\left\{\exp(\lambda\varepsilon^2)\mid X\right\} < \Lambda \quad \text{a.s.}$$
Furthermore, define $\sigma^2 := \mathbf{E}\{\varepsilon^2\}$ and set $\lambda_0 = 4\Lambda/\lambda$.

Condition (C). The class $\mathcal{F}_n$ is totally bounded with respect to the supremum norm. For each $\delta > 0$, let $\mathcal{M}(\delta)$ denote the $\delta$-covering number of $\mathcal{F}_n$. This means that for every $\delta > 0$ there is a $\delta$-cover $f_1,\dots,f_{\mathcal{M}}$ with $\mathcal{M} = \mathcal{M}(\delta)$ such that
$$\min_{1\le i\le\mathcal{M}}\sup_x |f_i(x)-f(x)| \le \delta$$
for all $f\in\mathcal{F}_n$. In addition, assume that $\mathcal{F}_n$ is uniformly bounded by $L$, that is,
$$|f(x)| \le L < \infty \quad \text{for all } x\in\mathbb{R}^d \text{ and } f\in\mathcal{F}_n.$$

Proposition 4 Assume that conditions (C) and (sG) hold and
$$|m(x)| \le L < \infty.$$
Then, for the estimate $m_n$ defined by (12) and for all $\delta_n > 0$, $n\ge 1$,
$$\mathbf{E}\left\{(m_n(X)-m(X))^2\right\} \le 2\inf_{f\in\mathcal{F}_n}\mathbf{E}\{(f(X)-m(X))^2\} + (16L+4\sigma)\delta_n + \left(16L^2 + 4\max\left\{L\sqrt{2\lambda_0},\,8\lambda_0\right\}\right)\frac{\log\mathcal{M}(\delta_n)}{n}.$$

In the proof of this proposition we use the following lemmata.

Lemma 3 (Wegkamp [41]) Let $Z$ be a random variable with $\mathbf{E}\{Z\} = 0$ and $\mathbf{E}\{\exp(\lambda Z^2)\} \le A$ for some constants $\lambda > 0$ and $A \ge 1$. Then
$$\mathbf{E}\{\exp(\beta Z)\} \le \exp\left(\frac{2A\beta^2}{\lambda}\right)$$
holds for every $\beta\in\mathbb{R}$.

Proof. Since $\mathbf{P}\{|Z| > t\} \le A\exp(-\lambda t^2)$ holds for all $t > 0$, we have for all integers $m \ge 2$,
$$\mathbf{E}\{|Z|^m\} = \int_0^\infty \mathbf{P}\left\{|Z|^m > t\right\}dt \le A\int_0^\infty \exp\left(-\lambda t^{2/m}\right)dt = A\lambda^{-m/2}\,\Gamma\!\left(\frac{m}{2}+1\right).$$
Note that $\Gamma^2\!\left(\frac{m}{2}+1\right) \le \Gamma(m+1)$ by Cauchy-Schwarz. The following inequalities are now self-evident:
$$\begin{aligned}
\mathbf{E}\{\exp(\beta Z)\}
&= 1 + \sum_{m=2}^\infty \frac{1}{m!}\,\mathbf{E}(\beta Z)^m\\
&\le 1 + \sum_{m=2}^\infty \frac{1}{m!}\,|\beta|^m\,\mathbf{E}|Z|^m\\
&\le 1 + A\sum_{m=2}^\infty \lambda^{-m/2}|\beta|^m\frac{\Gamma\!\left(\frac{m}{2}+1\right)}{\Gamma(m+1)}\\
&\le 1 + A\sum_{m=2}^\infty \lambda^{-m/2}|\beta|^m\frac{1}{\Gamma\!\left(\frac{m}{2}+1\right)}\\
&= 1 + A\sum_{m=1}^\infty\left(\frac{\beta^2}{\lambda}\right)^m\frac{1}{\Gamma(m+1)} + A\sum_{m=1}^\infty\left(\frac{\beta^2}{\lambda}\right)^{m+\frac12}\frac{1}{\Gamma\!\left(m+\frac32\right)}\\
&\le 1 + A\sum_{m=1}^\infty\left(\frac{\beta^2}{\lambda}\right)^m\left(1 + \left(\frac{\beta^2}{\lambda}\right)^{\frac12}\right)\frac{1}{\Gamma(m+1)}.
\end{aligned}$$
Finally, invoke the inequality $1 + (1+\sqrt x)(\exp(x)-1) \le \exp(2x)$ for $x > 0$ to obtain the result.

Lemma 4 (Antos, Györfi, György [3]) Let $X_{ij}$, $i = 1,\dots,n$, $j = 1,\dots,M$ be random variables such that for each fixed $j$, $X_{1j},\dots,X_{nj}$ are independent and identically distributed and such that for each $s_0 \ge s > 0$
$$\mathbf{E}\{e^{sX_{ij}}\} \le e^{s^2\sigma_j^2}.$$
For $\delta_j > 0$, put
$$\vartheta = \min_{j\le M}\frac{\delta_j}{\sigma_j^2}.$$
Then
$$\mathbf{E}\left\{\max_{j\le M}\left(\frac{1}{n}\sum_{i=1}^n X_{ij} - \delta_j\right)\right\} \le \frac{\log M}{\min\{\vartheta, s_0\}\, n}. \tag{14}$$
If
$$\mathbf{E}\{X_{ij}\} = 0$$
and
$$|X_{ij}| \le K,$$
then
$$\mathbf{E}\left\{\max_{j\le M}\left(\frac{1}{n}\sum_{i=1}^n X_{ij} - \delta_j\right)\right\} \le \max\{1/\vartheta, K\}\,\frac{\log M}{n}, \tag{15}$$
where
$$\vartheta = \min_{j\le M}\frac{\delta_j}{\mathrm{Var}(X_{ij})}.$$

Proof. With the notation
$$Y_j = \frac{1}{n}\sum_{i=1}^n X_{ij} - \delta_j$$
we have that for any $s_0 \ge s > 0$,
$$\begin{aligned}
\mathbf{E}\{e^{snY_j}\} &= \mathbf{E}\{e^{sn(\frac1n\sum_{i=1}^n X_{ij}-\delta_j)}\}\\
&= e^{-sn\delta_j}\left(\mathbf{E}\{e^{sX_{1j}}\}\right)^n\\
&\le e^{-sn\delta_j} e^{ns^2\sigma_j^2}\\
&\le e^{-sn\vartheta\sigma_j^2 + s^2 n\sigma_j^2}.
\end{aligned}$$
Thus
$$e^{sn\mathbf{E}\{\max_{j\le M} Y_j\}} \le \mathbf{E}\{e^{sn\max_{j\le M} Y_j}\} = \mathbf{E}\{\max_{j\le M} e^{snY_j}\} \le \sum_{j\le M}\mathbf{E}\{e^{snY_j}\} \le \sum_{j\le M} e^{sn\sigma_j^2(s-\vartheta)}.$$
For $s = \min\{\vartheta, s_0\}$ this implies that
$$\mathbf{E}\{\max_{j\le M} Y_j\} \le \frac{1}{sn}\log\sum_{j\le M} e^{sn\sigma_j^2(s-\vartheta)} \le \frac{\log M}{\min\{\vartheta, s_0\}\, n}.$$
In order to prove the second half of the lemma, notice that for any $L > 0$ and $|x| \le L$ we have the inequality
$$e^x = 1 + x + x^2\sum_{i=2}^\infty \frac{x^{i-2}}{i!} \le 1 + x + x^2\sum_{i=2}^\infty \frac{L^{i-2}}{i!} = 1 + x + x^2\,\frac{e^L - 1 - L}{L^2},$$
therefore $0 < s \le s_0 = L/K$ implies that $s|X_{ij}| \le L$, so
$$e^{sX_{ij}} \le 1 + sX_{ij} + (sX_{ij})^2\,\frac{e^L - 1 - L}{L^2}.$$
Thus,
$$\mathbf{E}\{e^{sX_{ij}}\} \le 1 + s^2\,\mathrm{Var}(X_{ij})\,\frac{e^L - 1 - L}{L^2} \le e^{s^2\mathrm{Var}(X_{ij})\frac{e^L-1-L}{L^2}},$$
so (15) follows from (14).

Proof of Proposition 4. This proof is due to Györfi and Wegkamp [23]. Set
$$D(f) = \mathbf{E}\{(f(X)-Y)^2\},\qquad \widehat D(f) = \frac{1}{n}\sum_{i=1}^n (f(X_i)-Y_i)^2,\qquad \Delta_f(x) = (m(x)-f(x))^2,$$
and define
$$R(\mathcal{F}_n) := \sup_{f\in\mathcal{F}_n}\left[D(f) - D(m) - 2\left(\widehat D(f)-\widehat D(m)\right)\right] \le R_1(\mathcal{F}_n) + R_2(\mathcal{F}_n),$$
where
$$R_1(\mathcal{F}_n) := \sup_{f\in\mathcal{F}_n}\left[\frac{2}{n}\sum_{i=1}^n\{\mathbf{E}\Delta_f(X_i) - \Delta_f(X_i)\} - \frac12\mathbf{E}\{\Delta_f(X)\}\right]$$
and
$$R_2(\mathcal{F}_n) := \sup_{f\in\mathcal{F}_n}\left[\frac{4}{n}\sum_{i=1}^n \varepsilon_i(f(X_i)-m(X_i)) - \frac12\mathbf{E}\{\Delta_f(X)\}\right],$$
with $\varepsilon_i := Y_i - m(X_i)$. By the definition of $R(\mathcal{F}_n)$ and $m_n$, we have for all $f\in\mathcal{F}_n$
$$\begin{aligned}
\mathbf{E}\left\{(m_n(X)-m(X))^2\mid\mathcal{D}_n\right\} &= \mathbf{E}\{D(m_n)\mid\mathcal{D}_n\} - D(m)\\
&\le 2\{\widehat D(m_n)-\widehat D(m)\} + R(\mathcal{F}_n)\\
&\le 2\{\widehat D(f)-\widehat D(m)\} + R(\mathcal{F}_n).
\end{aligned}$$
After taking expectations on both sides, we obtain
$$\mathbf{E}\left\{(m_n(X)-m(X))^2\right\} \le 2\,\mathbf{E}\left\{(f(X)-m(X))^2\right\} + \mathbf{E}\{R(\mathcal{F}_n)\}.$$
Let $\mathcal{F}_n'$ be a finite $\delta_n$-covering net (with respect to the sup-norm) of $\mathcal{F}_n$ with $\mathcal{M}(\delta_n) = |\mathcal{F}_n'|$. This means that for any $f\in\mathcal{F}_n$ there is an $f'\in\mathcal{F}_n'$ such that
$$\sup_x |f(x)-f'(x)| \le \delta_n,$$
which implies that
$$\begin{aligned}
|(m(X_i)-f(X_i))^2 - (m(X_i)-f'(X_i))^2|
&\le |f(X_i)-f'(X_i)|\cdot\left(|m(X_i)-f(X_i)| + |m(X_i)-f'(X_i)|\right)\\
&\le 4L\,|f(X_i)-f'(X_i)|\\
&\le 4L\delta_n,
\end{aligned}$$
and, by the Cauchy-Schwarz inequality,
$$\mathbf{E}\{|\varepsilon_i(m(X_i)-f(X_i)) - \varepsilon_i(m(X_i)-f'(X_i))|\} \le \sqrt{\mathbf{E}\{\varepsilon_i^2\}}\,\sqrt{\mathbf{E}\{(f(X_i)-f'(X_i))^2\}} \le \sigma\delta_n.$$
Thus,

$$\mathbf{E}\{R(\mathcal{F}_n)\} \le 2\delta_n(4L+\sigma) + \mathbf{E}\{R(\mathcal{F}_n')\},$$
and therefore
$$\begin{aligned}
\mathbf{E}\left\{(m_n(X)-m(X))^2\right\}
&\le 2\,\mathbf{E}\left\{(f(X)-m(X))^2\right\} + \mathbf{E}\{R(\mathcal{F}_n)\}\\
&\le 2\,\mathbf{E}\left\{(f(X)-m(X))^2\right\} + (16L+4\sigma)\delta_n + \mathbf{E}\{R(\mathcal{F}_n')\}\\
&\le 2\,\mathbf{E}\left\{(f(X)-m(X))^2\right\} + (16L+4\sigma)\delta_n + \mathbf{E}\{R_1(\mathcal{F}_n')\} + \mathbf{E}\{R_2(\mathcal{F}_n')\}.
\end{aligned}$$
Define, for all $f\in\mathcal{F}_n$ with $D(f) > D(m)$,
$$\tilde\rho(f) := \frac{\mathbf{E}\left\{(m(X)-f(X))^4\right\}}{\mathbf{E}\{(m(X)-f(X))^2\}}.$$
Since $|m(x)|\le L$ and $|f(x)|\le L$, we have that
$$\tilde\rho(f)\le 4L^2.$$
Invoke the second part of Lemma 4 to obtain
$$\mathbf{E}\{R_1(\mathcal{F}_n')\} \le \max\left(8L^2,\ 4\sup_{f\in\mathcal{F}_n'}\tilde\rho(f)\right)\frac{\log\mathcal{M}(\delta_n)}{n} \le \max\left(8L^2, 16L^2\right)\frac{\log\mathcal{M}(\delta_n)}{n} = 16L^2\,\frac{\log\mathcal{M}(\delta_n)}{n}.$$
By Condition (sG) and Lemma 3, we have for all $s>0$,
$$\mathbf{E}\{\exp\left(s\varepsilon(f(X)-m(X))\right)\mid X\} \le \exp\left(\lambda_0 s^2(m(X)-f(X))^2/2\right).$$
For $0\le z\le 1$, apply the inequality $e^z\le 1+2z$. Choose
$$s_0 = \frac{1}{L\sqrt{2\lambda_0}};$$
then
$$\frac12\lambda_0 s^2(f(X)-m(X))^2 \le 1,$$
therefore, for $0 < s \le s_0$,
$$\begin{aligned}
\mathbf{E}\{\exp\left(s\varepsilon(f(X)-m(X))\right)\}
&\le \mathbf{E}\left\{\exp\left(\tfrac12\lambda_0 s^2(f(X)-m(X))^2\right)\right\}\\
&\le 1 + \lambda_0 s^2\,\mathbf{E}\left\{(f(X)-m(X))^2\right\}\\
&\le \exp\left(\lambda_0 s^2\,\mathbf{E}\left\{(f(X)-m(X))^2\right\}\right).
\end{aligned}$$
Next we invoke the first part of Lemma 4. We find that the value $\vartheta$ in Lemma 4 satisfies
$$\frac{1}{\vartheta} = 8\sup_{f\in\mathcal{F}_n'}\frac{\lambda_0\,\mathbf{E}\{(f(X)-m(X))^2\}}{\mathbf{E}\{\Delta_f(X)\}} \le 8\lambda_0,$$
and we get
$$\mathbf{E}\{R_2(\mathcal{F}_n')\} \le 4\,\frac{\log\mathcal{M}(\delta_n)}{n}\max\left\{L\sqrt{2\lambda_0},\,8\lambda_0\right\},$$
which completes the proof of Proposition 4.

Instead of restricting the set of functions over which one minimizes, one can also add a penalty term to the functional to be minimized. Let $J_n(f) \ge 0$ be a penalty term penalizing the "roughness" of a function $f$. The penalized modelling or penalized least squares estimate $m_n$ is defined by
$$m_n = \arg\min_f\left\{\frac{1}{n}\sum_{i=1}^n |f(X_i)-Y_i|^2 + J_n(f)\right\}, \tag{16}$$
where one minimizes over all measurable functions $f$. Again we do not require that the minimum in (16) be unique. In case it is not unique, we randomly select one function which achieves the minimum.

A popular choice for $J_n(f)$ in the case $d = 1$ is
$$J_n(f) = \lambda_n\int |f''(t)|^2\,dt, \tag{17}$$
where $f''$ denotes the second derivative of $f$ and $\lambda_n$ is some positive constant. One can show that for this penalty term the minimum in (16) is achieved by a cubic spline with knots at the $X_i$'s, i.e., by a twice differentiable function which is equal to a polynomial of degree $3$ (or less) between adjacent values of the $X_i$'s (a so-called smoothing spline).
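A minimal sketch of a penalized least squares fit in the spirit of (16)-(17): instead of the exact cubic smoothing spline, the roughness of $f$ is approximated on a grid of function values by squared second differences, which turns the minimization into a ridge-type linear system. The discretization and all names are illustrative assumptions, not the chapter's construction.

```python
import numpy as np

def penalized_ls_fit(x_train, y_train, grid, lam):
    """Penalized least squares on a 1-d grid: data fit + lam * sum of squared
    second differences of the fitted values (a discrete analogue of (17))."""
    n, g = len(y_train), len(grid)
    # A assigns each observation to its nearest grid point.
    idx = np.argmin(np.abs(x_train[:, None] - grid[None, :]), axis=1)
    A = np.zeros((n, g))
    A[np.arange(n), idx] = 1.0
    # D computes second differences of the grid values.
    D = np.zeros((g - 2, g))
    for j in range(g - 2):
        D[j, j:j + 3] = [1.0, -2.0, 1.0]
    # Normal equations of (1/n)*||A v - y||^2 + lam * ||D v||^2.
    lhs = A.T @ A / n + lam * D.T @ D
    rhs = A.T @ y_train / n
    return np.linalg.solve(lhs, rhs)       # fitted values on the grid

# Usage: a noisy sine curve smoothed on a 50-point grid.
rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, size=500)
Y = np.sin(3 * X) + 0.3 * rng.normal(size=500)
grid = np.linspace(-1, 1, 50)
print(penalized_ls_fit(X, Y, grid, lam=1e-3)[:5])
```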

3. Universally consistent predictions: bounded Y

3.1. Partition-based prediction strategies

In this section we introduce our first prediction strategy for bounded ergodic processes. We assume throughout the section that $|Y_0|$ is bounded by a constant $B > 0$ with probability one, and that the bound $B$ is known.
