“Broadening the knowledge base and supporting the long term professional sustainability of the Research University Centre of Excellence at the University of Szeged by ensuring the rising generation of excellent scientists.”
Doctoral School of Mathematics and Computer Science
Stochastic Days in Szeged 27.07.2012.
On the supremum of partial sums of independent random variables
Péter Major
(Alfréd Rényi Institute of Mathematics)
TÁMOP‐4.2.2/B‐10/1‐2010‐0012 project
The original problem:
Let ξ_1, ..., ξ_n be a sequence of i.i.d. random variables with some distribution µ on a space (X, X).
Let us have a nice class of functions F on the space (X, X) such that ∫ f(x) µ(dx) = 0 for all f ∈ F.
Give a good estimate on the tail distribution
P( sup_{f∈F} (1/√n) ∑_{j=1}^n f(ξ_j) > x )
for all x > 0.
(The actual problem was its multivariate version, about the tail distribution of the supremum of (degenerate) U-statistics.)
The main questions discussed in this talk:
(a) What is the natural Gaussian counterpart of this problem, and what kind of result does it suggest?
(b) When does the similarity with the Gaussian counterpart break down, and what can be said in that case?
(c) A conjecture of Talagrand, which he called the Bernoulli conjecture.
A problem similar to the tail distribution of the supremum of partial sums was investigated by Michel Talagrand: Give a good upper bound on
E sup_{f∈F} (1/√n) ∑_{j=1}^n f(ξ_j).
There is a concentration inequality which says that
P( sup_{f∈F} (1/√n) ∑_{j=1}^n f(ξ_j) − E sup_{f∈F} (1/√n) ∑_{j=1}^n f(ξ_j) > x )
is small. Hence the two problems are equivalent. Moreover, Talagrand’s estimate on the expectation of the supremum contains an implicit estimate on its tail distribution.
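For the Gaussian setting considered next, the concentration phenomenon can be stated explicitly; the following Borell–TIS inequality is quoted here for orientation (it is a standard fact, not part of the original slides):

```latex
P\Bigl(\Bigl|\sup_{t\in T}\eta_t - E\sup_{t\in T}\eta_t\Bigr| > x\Bigr)
  \le 2\exp\Bigl(-\frac{x^2}{2\sigma_T^2}\Bigr),
\qquad \sigma_T^2 = \sup_{t\in T} E\eta_t^2 .
```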
First we consider the analogous Gaussian problem.
Let η_t, Eη_t = 0, t ∈ T, be a countable set of (jointly) Gaussian random variables. Put
d_2(s, t) = [E(η_s − η_t)^2]^{1/2}, s, t ∈ T.
Then d_2(s, t) is a metric on the parameter set T.
The problem: Give a good estimate on E sup_{t∈T} η_t with the help of the function d_2(s, t).
Results obtained with a classical and natural method, the chaining argument.
A classical estimate due to R. M. Dudley.
Define
e(n) = min_{T_n⊂T, card T_n ≤ 2^{2^n}} {α: for all t ∈ T there is some t̄ ∈ T_n such that d_2(t, t̄) ≤ α}.
Theorem of Dudley:
E sup_{t∈T} η_t ≤ ∑_{n=0}^∞ 2^{n/2} e(n).
The content of the notion e(n): place 2^{2^n} points in the set T in such a way that all points t ∈ T are close to some point of this set. The number e(n) is the smallest α for which all points of T are within distance α of this set.
(We look for a dense net of T consisting of 2^{2^n} points.)
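To make the covering numbers concrete, here is a small numerical sketch (a toy example of mine, not from the talk): take a finite set T ⊂ R², put η_t = ⟨t, g⟩ with g standard Gaussian, so that d_2 is the Euclidean distance; approximate e(n) by a greedy cover and compare the Dudley sum with a Monte Carlo estimate of E sup η_t.

```python
import math, random

random.seed(0)

# Toy finite Gaussian family: eta_t = <t, g> for t in T ⊂ R^2, g standard
# Gaussian, so d_2(s, t) is the Euclidean distance between s and t.
T = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]

def dist(s, t):
    return math.hypot(s[0] - t[0], s[1] - t[1])

def greedy_net(points, size):
    # Farthest-point greedy cover: only an upper bound on the optimal e(n).
    net = [points[0]]
    while len(net) < size:
        net.append(max(points, key=lambda p: min(dist(p, q) for q in net)))
    return net

def covering_radius(net, points):
    # Smallest alpha such that every point lies within alpha of the net.
    return max(min(dist(p, q) for q in net) for p in points)

# Truncated Dudley sum: once 2^{2^n} >= card T the net can be T itself, e(n) = 0.
dudley = 0.0
n = 0
while 2 ** (2 ** n) < len(T):
    dudley += 2 ** (n / 2) * covering_radius(greedy_net(T, 2 ** (2 ** n)), T)
    n += 1

# Monte Carlo estimate of E sup_t eta_t.
trials = 2000
total = 0.0
for _ in range(trials):
    g = (random.gauss(0, 1), random.gauss(0, 1))
    total += max(t[0] * g[0] + t[1] * g[1] for t in T)
mc = total / trials

print(f"truncated Dudley sum ~= {dudley:.3f}, Monte Carlo E sup ~= {mc:.3f}")
```

As expected for an upper bound obtained by chaining, the Dudley sum overshoots the Monte Carlo value by a moderate constant factor.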
The idea of the proof:
Fix some u > 0 and sets T_n ⊂ T, card T_n ≤ 2^{2^n}, n = 1, 2, ..., with ∪_{n=1}^∞ T_n = T, and put
Q(u) = P( sup_{t∈T} η_t > u ∑_{n=0}^∞ 2^{n/2} e(n) )
and
Q_N(u) = P( sup_{t∈T_1∪···∪T_N} η_t > u ∑_{n=0}^{N−1} 2^{n/2} e(n) ), N = 0, 1, 2, ....
We can estimate Q(u) well for all u ≥ 2 by giving a good bound on Q_N(u) − Q_{N−1}(u) for all N = 1, 2, .... Here we exploit that for all t ∈ T_N there is some t̄ ∈ T_{N−1} which is close to it. We get a good estimate on the tail distribution of sup_{t∈T} η_t which shows that the main contribution to the expectation we want to bound comes from the event
sup_{t∈T} η_t ≤ 2 ∑_{n=0}^∞ 2^{n/2} e(n).
Talagrand found a sharper estimate by introducing the right quantity γ_2(T, d) needed in the study of this problem.
To define it, let us first introduce the diameter
∆(A) = sup_{s,t∈A} d(s, t)
of a set A ⊂ T in a metric space (T, d), and the notion of an
Admissible sequence of partitions. A sequence of refining partitions A_0 ⊂ A_1 ⊂ A_2 ⊂ · · · of the parameter set T is an admissible sequence of partitions if card A_0 = 1 and card A_n ≤ 2^{2^n}, n = 1, 2, ....
Given an admissible sequence of partitions A_0 ⊂ A_1 ⊂ A_2 ⊂ · · · and a point t ∈ T, let A_n(t) be that element B of the partition A_n for which t ∈ B.
Given a countable parameter set T with a metric d we define
γ_2(T, d) = inf sup_{t∈T} ∑_{n=0}^∞ 2^{n/2} ∆(A_n(t)),
where the infimum is taken over all admissible sequences of partitions of T.
The estimate of Talagrand.
Let η_t, Eη_t = 0, t ∈ T, be a sequence of Gaussian random variables, with the metric d_2(s, t) = [E(η_t − η_s)^2]^{1/2} on T. Then
E sup_{t∈T} η_t ≤ L γ_2(T, d_2).
Moreover,
(1/L) γ_2(T, d_2) ≤ E sup_{t∈T} η_t ≤ L γ_2(T, d_2)
with a universal constant L.
The same upper bound holds for the supremum of random variables U_t, t ∈ T, with d(s, t)^2 = E(U_s − U_t)^2 if their tail distribution satisfies the inequality
P(|U_s − U_t| > u) ≤ e^{−Cu^2/d(s,t)^2}
for all s, t ∈ T and u > 0 with a universal constant C.
Here we demanded a Gaussian type tail behaviour.
What can be told about the tail distribution of sums of independent random variables?
A classical result, Bernstein’s inequality, says:
Bernstein’s inequality. Let ξ_1, ..., ξ_n be independent random variables with
P(|ξ_j| ≤ 1) = 1 and Eξ_j = 0, 1 ≤ j ≤ n.
Put σ_j^2 = Eξ_j^2, 1 ≤ j ≤ n, S_n = ∑_{j=1}^n ξ_j and Var S_n = V_n^2 = ∑_{j=1}^n σ_j^2. Then
P(S_n > u) ≤ exp( −u^2 / (2V_n^2 (1 + u/(3V_n^2))) )
for all numbers u > 0.
If u ≤ const · V_n^2, it supplies a Gaussian type estimate, but if u ≫ V_n^2, it supplies a bad estimate. Only a very weak improvement is possible, which does not help if u ≫ V_n^2.
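As a quick sanity check (a toy sketch with ±1-valued ξ_j of my choosing, not part of the talk), one can compare the empirical tail of S_n with the Bernstein bound in the Gaussian regime u ≲ V_n^2:

```python
import math, random

random.seed(1)

# xi_j uniform on {-1, +1}: |xi_j| <= 1, E xi_j = 0, sigma_j^2 = 1, V_n^2 = n.
n = 100
trials = 20000
Vn2 = float(n)

def bernstein_bound(u):
    # exp(-u^2 / (2 V_n^2 (1 + u / (3 V_n^2))))
    return math.exp(-u * u / (2 * Vn2 * (1 + u / (3 * Vn2))))

counts = {10: 0, 20: 0, 30: 0}
for _ in range(trials):
    s = sum(random.choice((-1, 1)) for _ in range(n))
    for u in counts:
        if s > u:
            counts[u] += 1

for u in sorted(counts):
    print(f"u={u}: empirical tail {counts[u] / trials:.4f}, "
          f"Bernstein bound {bernstein_bound(u):.4f}")
```

Here all three levels u lie below V_n^2 = 100, so the bound is of Gaussian type, although visibly not tight.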
For normalized partial sums
S_n(f) = (1/√n) ∑_{j=1}^n f(ξ_j), f ∈ F,
of i.i.d. random variables ξ_j with Ef(ξ_1) = 0 and sup_x |f(x)| ≤ 1 for all f ∈ F, the following Gaussian type estimate holds:
P(|S_n(f) − S_n(g)| ≥ u) ≤ 2 e^{−u^2/(100 d_2^2(f,g))}  if u ≤ 3 d_2(f, g)^2 √n,
with d_2(f, g)^2 = ∫ (f(x) − g(x))^2 µ(dx), where µ is the distribution of ξ_j.
The main problem in the study of the supremum of partial sums:
We have no good bound on the tail probability P(|S_n(f) − S_n(g)| > u) if d_2(f, g)^2 = ∫ (f(x) − g(x))^2 µ(dx) is small and the level u in the probability is large. How can we overcome this problem?
We have to impose some good conditions on the class of functions F.
Put T = F, and define on it the metrics
d_2(f, g) = [∫ (f − g)^2 dµ]^{1/2}, f, g ∈ F,
and
d_∞(f, g) = sup_x |f(x) − g(x)|, f, g ∈ F.
Two approaches:
(a) Talagrand’s approach.
It exploits that if sup_x |f(x)| < c with a small number c > 0, then a good (Gaussian type) estimate holds in the Bernstein inequality in a larger interval. In this case we have a good tail estimate for u ≤ (3/c) d_2(f, g)^2.
Put (similarly to γ_2(T, d))
γ_α(F, d) = inf sup_{t∈F} ∑_{n=0}^∞ 2^{n/α} ∆(A_n(t)),
with an arbitrary number α > 0 and metric d on F, where the infimum is taken over all admissible sequences of partitions A_n, n = 0, 1, 2, ..., of F, and ∆(A_n(t)) is the diameter of A_n(t) with respect to the metric d.
Theorem A (Talagrand’s estimate on the supremum of a class of partial sums). Let γ_1(F, d_∞) denote the quantity γ_α(F, d) with α = 1 and d(f, g) = d_∞(f, g) on the set T = F. Then
E sup_{f∈F} (1/√n) ∑_{j=1}^n f(ξ_j) ≤ L ( γ_2(F, d_2) + (1/√n) γ_1(F, d_∞) )
with an appropriate universal constant L > 0.
This result gives a good estimate on the supremum of (normalized) partial sums if both γ_2(F, d_2) and γ_1(F, d_∞) are small. The idea of the proof is to adapt the proof of the Gaussian counterpart to this case and to exploit that if we have a subclass of F of not too large cardinality which is dense also in the L_∞ norm, then the Bernstein inequality (applied to random variables whose supremum is bounded by a small number) gives a sufficiently good estimate, and the chaining argument can be applied.
An example when this result gives a sharp estimate.
Let X_1, ..., X_n be a sequence of independent random variables, uniformly distributed on the square [0,1] × [0,1], and let C be the class of Lipschitz 1 functions f(x) on [0,1] × [0,1] such that
∫_{[0,1]×[0,1]} f(x) dx = 0.
Then
E sup_{f∈C} (1/√n) ∑_{l=1}^n f(X_l) ≤ L √(log n)
with a universal constant L.
This result is equivalent to a (famous) result of Ajtai–Komlós–Tusnády.
The problem solved by Ajtai–Komlós–Tusnády:
Take two independent sequences X_1, ..., X_n and Y_1, ..., Y_n of independent random variables uniformly distributed on the unit square [0,1] × [0,1]. Let us take such a (random) rearrangement Y_{π(1)}, ..., Y_{π(n)} of the random variables Y_1, ..., Y_n for which X_j and Y_{π(j)} are close to each other for all indices j. More precisely, we want that
E ∑_{j=1}^n ρ(X_j, Y_{π(j)})
be as small as possible, where ρ(·,·) is the Euclidean metric.
Theorem of Ajtai–Komlós–Tusnády.
E ∑_{j=1}^n ρ(X_j, Y_{π(j)}) ≤ L √(n log n)
for an appropriate permutation (π(1), ..., π(n)) of the set {1, ..., n} with a universal constant L, and this estimate is sharp.
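A quick numerical illustration (an assumed toy setup, not from the talk): a greedy matching only gives an upper bound on the optimal matching cost, so the figures below are merely a sanity check of the √(n log n) scale, not a verification of the theorem.

```python
import math, random

random.seed(2)

def greedy_matching_cost(n):
    # Match each X_j greedily to the nearest still-unmatched Y point;
    # the greedy cost upper-bounds the optimal matching cost.
    xs = [(random.random(), random.random()) for _ in range(n)]
    ys = [(random.random(), random.random()) for _ in range(n)]
    unmatched = list(range(n))
    total = 0.0
    for x in xs:
        best = min(unmatched,
                   key=lambda i: (x[0] - ys[i][0]) ** 2 + (x[1] - ys[i][1]) ** 2)
        unmatched.remove(best)
        total += math.hypot(x[0] - ys[best][0], x[1] - ys[best][1])
    return total

costs = {}
for n in (100, 400, 1600):
    costs[n] = greedy_matching_cost(n)
    print(f"n={n}: greedy cost {costs[n]:.2f}, "
          f"sqrt(n log n) = {math.sqrt(n * math.log(n)):.2f}")
```

Note that the per-point cost shrinks as n grows, in rough agreement with the √(log n / n) rate per point implied by the theorem.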
Example when the above estimate of Talagrand does not give a good estimate.
Let f(x_1, ..., x_k), |f(x_1, ..., x_k)| ≤ 1, be a function on R^k, and µ a probability measure on R^k. Take a nice class D = {D_1, D_2, ...} of sets D_l ⊂ R^k, l = 1, 2, ..., and let f̄_l be the restriction of f to D_l, i.e. let
f̄_l(x_1, ..., x_k) = f(x_1, ..., x_k) if (x_1, ..., x_k) ∈ D_l, and f̄_l(x_1, ..., x_k) = 0 if (x_1, ..., x_k) ∉ D_l,
and put
f_l(x_1, ..., x_k) = f̄_l(x_1, ..., x_k) − ∫ f̄_l(x_1, ..., x_k) dµ.
Give a good bound on E sup_l (1/√n) ∑_{j=1}^n f_l(ξ_j).
In this case the quantity γ_1(F, d_∞) in Theorem A cannot be well bounded.
To get a good estimate in this case we introduce the following notion.
Definition of L_2-dense classes of functions.
Let a measurable space (X, X) be given together with a set F of X-measurable real valued functions on this space. F is called an L_2-dense class of functions with parameter D and exponent L if for all numbers 1 ≥ ε > 0 and all probability measures ν there exists a finite ε-dense subset F_ε = {f_1, ..., f_m} ⊂ F in the space L_2(X, X, ν) with m ≤ Dε^{−L} elements, i.e.
inf_{f_j∈F_ε} ∫ |f − f_j|^2 dν < ε^2 for all functions f ∈ F.
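To make the definition concrete, here is a sketch for what I believe is the simplest example (my own illustration, not from the talk): the class of half-line indicators f_t(x) = 1{x ≤ t} is L_2-dense with exponent 2, because thresholds placed at the ε²-quantiles of ν form an ε-net in L_2(ν).

```python
import bisect, random

random.seed(3)

# For F = {x -> 1{x <= t}} and any measure nu,
# ||1{. <= t} - 1{. <= s}||^2_{L2(nu)} = nu((min(s,t), max(s,t)]),
# so thresholds at the eps^2-quantiles of nu give an eps-net of size ~ eps^-2.
sample = sorted(random.gauss(0, 1) for _ in range(5000))  # empirical measure nu_n
eps = 0.2

step = max(1, int(eps * eps * len(sample)))  # quantile spacing eps^2
net = sample[::step] + [sample[-1]]          # net of ~ eps^-2 thresholds

def dist2(t, s):
    # squared empirical L2 distance between the indicators 1{x<=t} and 1{x<=s}
    lo, hi = min(t, s), max(t, s)
    return (bisect.bisect_right(sample, hi)
            - bisect.bisect_right(sample, lo)) / len(sample)

# Every f_t, t in the support, is within eps of some net element in L2(nu_n).
worst = max(min(dist2(t, s) for s in net) for t in sample)
print(f"net size {len(net)}, worst squared L2 distance {worst:.4f}, eps^2 = {eps * eps}")
```

The net size stays near ε^{-2} regardless of the underlying measure, which is the point of the uniformity over ν in the definition.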
Then we have the following result.
Theorem B (Estimate on the supremum of a class of partial sums). Let us consider a sequence of independent and identically distributed random variables ξ_1, ..., ξ_n, n ≥ 2, with values in a measurable space (X, X) and with some distribution µ. Beside this, let a countable and L_2-dense class of functions F with some parameter D ≥ 1 and exponent L ≥ 1 be given on the space (X, X) which satisfies the conditions
‖f‖_∞ = sup_{x∈X} |f(x)| ≤ 1 for all f ∈ F,
‖f‖_2^2 = ∫ f^2(x) µ(dx) ≤ σ^2 for all f ∈ F with some constant 0 < σ ≤ 1, and
∫ f(x) µ(dx) = 0 for all f ∈ F.
Define the normalized partial sums S_n(f) = (1/√n) ∑_{k=1}^n f(ξ_k) for all f ∈ F.
There exist some universal constants C > 0, α > 0 and M > 0 such that the supremum of the normalized random sums S_n(f), f ∈ F, satisfies the inequality
P( sup_{f∈F} |S_n(f)| ≥ u ) ≤ C exp( −α (u/σ)^2 )
for those numbers u for which
√n σ^2 ≥ u ≥ M σ ( L^{3/4} log^{1/2}(2/σ) + (log D)^{3/4} )
with the parameter D and exponent L of the L_2-dense class F.
Under the conditions of this theorem the supremum of partial sums is not much greater than its largest (worst) term.
If we take a Vapnik–Červonenkis class of sets D = {D_1, D_2, ...} and define the above considered functions f_l, l = 1, 2, ..., then F = {f_1, f_2, ...} is an L_2-dense class of functions, and Theorem B can be applied to it.
The proof of Theorem B is based on a symmetrization and conditioning argument. We apply the following results.
Symmetrization Lemma. Let us fix a countable class of functions F on a measurable space (X, X) together with a real number 0 < σ < 1. Consider a sequence of independent and identically distributed random variables ξ_1, ..., ξ_n with values in the space (X, X) such that
Ef(ξ_1) = 0, Ef^2(ξ_1) ≤ σ^2 for all f ∈ F,
together with another sequence ε_1, ..., ε_n of independent random variables with distribution P(ε_j = 1) = P(ε_j = −1) = 1/2, 1 ≤ j ≤ n, independent also of the random sequence ξ_1, ..., ξ_n. Then
P( (1/√n) sup_{f∈F} ∑_{j=1}^n f(ξ_j) ≥ A n^{1/2} σ^2 ) ≤ 4 P( (1/√n) sup_{f∈F} ∑_{j=1}^n ε_j f(ξ_j) ≥ (A/3) n^{1/2} σ^2 )
if A ≥ 3√2/(√n σ).
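A small empirical sanity check of the lemma's conclusion (with an assumed toy class F = {cos(2πkx) : k = 1, ..., 10} on [0,1], which is centered and bounded by 1; the threshold is chosen for illustration rather than to match the lemma's A n^{1/2} σ² form):

```python
import math, random

random.seed(4)

# Toy class F = {cos(2*pi*k*x) : k = 1..10} on [0,1]: each f is centered
# under the uniform distribution and bounded by 1.
ks = range(1, 11)
n, trials, level = 50, 2000, 10.0

def sup_sum(xs, signs=None):
    # sup over f in F of sum_j eps_j f(xi_j); signs=None means all eps_j = 1
    return max(
        sum((1 if signs is None else signs[j]) * math.cos(2 * math.pi * k * xs[j])
            for j in range(n))
        for k in ks)

plain = sym = 0
for _ in range(trials):
    xs = [random.random() for _ in range(n)]
    signs = [random.choice((-1, 1)) for _ in range(n)]
    if sup_sum(xs) >= level:
        plain += 1
    if sup_sum(xs, signs) >= level / 3:
        sym += 1

p_plain, p_sym = plain / trials, sym / trials
print(f"P(sup >= {level}) ~= {p_plain:.3f},  "
      f"4 * P(symmetrized sup >= {level / 3:.2f}) ~= {4 * p_sym:.3f}")
```

The symmetrized tail at a third of the level, multiplied by 4, comfortably dominates the original tail, as the lemma predicts.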
Theorem (Hoeffding’s inequality). Let ε_1, ..., ε_n be independent random variables with P(ε_j = 1) = P(ε_j = −1) = 1/2, 1 ≤ j ≤ n, and let a_1, ..., a_n be arbitrary real numbers. Put V = ∑_{j=1}^n a_j ε_j. Then
P(V > u) ≤ exp( −u^2 / (2 ∑_{j=1}^n a_j^2) )
for all u > 0.
The Hoeffding inequality always gives a good Gaussian type estimate.
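A direct Monte Carlo check of this Gaussian type behaviour (with toy coefficients of my choosing):

```python
import math, random

random.seed(5)

# Fixed coefficients a_j; V = sum_j a_j eps_j with eps_j = +-1 fair signs.
a = [random.uniform(0.5, 1.5) for _ in range(60)]
sum_a2 = sum(x * x for x in a)

trials = 30000
hits = {8.0: 0, 16.0: 0, 24.0: 0}
for _ in range(trials):
    v = sum(x * random.choice((-1, 1)) for x in a)
    for u in hits:
        if v > u:
            hits[u] += 1

for u in sorted(hits):
    bound = math.exp(-u * u / (2 * sum_a2))
    print(f"u={u}: empirical {hits[u] / trials:.4f} vs Hoeffding bound {bound:.4f}")
```

Unlike Bernstein's inequality, the bound keeps its Gaussian shape at every level u, which is what the chaining argument below relies on.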
The symmetrization lemma enables us to replace the estimation of
P( (1/√n) sup_{f∈F} ∑_{j=1}^n f(ξ_j) ≥ A n^{1/2} σ^2 )
by the estimation of
P( (1/√n) sup_{f∈F} ∑_{j=1}^n ε_j f(ξ_j) ≥ (A/3) n^{1/2} σ^2 ).
This can be done by estimating the conditional probabilities
P( (1/√n) sup_{f∈F} ∑_{j=1}^n ε_j f(ξ_j) ≥ (A/3) n^{1/2} σ^2 | ξ_1 = x_1, ..., ξ_n = x_n )
= P( (1/√n) sup_{f∈F} ∑_{j=1}^n ε_j f(x_j) ≥ (A/3) n^{1/2} σ^2 ).
The right-hand side can be bounded by the Hoeffding inequality.
On the basis of this observation a proof of Theorem B can be worked out.
The conjecture of Talagrand
Let a sequence of Bernoulli sums η_j = ∑_{l=1}^N a_{j,l} ε_l, j = 1, ..., M, be given, where ε_1, ..., ε_N are independent random variables with P(ε_1 = 1) = P(ε_1 = −1) = 1/2. Give a good estimate on
E sup_{1≤j≤M} η_j.
We would like to give both an upper and a lower bound in such a way that the ratio of these two estimates is less than a universal constant.
Put T = {1, ..., M},
d_2^2(i, j) = E(η_i − η_j)^2 = ∑_{l=1}^N (a_{i,l} − a_{j,l})^2, i, j ∈ T,
and define with the help of this metric the quantities γ_2(T, d_2) and γ_2(T_1, d_2) for all sets T_1 ⊂ T. The Gaussian estimate holds for E sup_{j∈T_1} η_j (Hoeffding inequality), hence
E sup_{j∈T_1} η_j ≤ L γ_2(T_1, d_2) for all T_1 ⊂ T.
On the other hand,
|∑_{l=1}^N a_{j,l} ε_l| ≤ ∑_{l=1}^N |a_{j,l}|.
Put b(T_2) = sup_{j∈T_2} ∑_{l=1}^N |a_{j,l}|. Then
E sup_{1≤j≤M} η_j ≤ L inf_{T_1,T_2⊂T, T_1∪T_2=T} ( γ_2(T_1, d_2) + b(T_2) ).
Talagrand’s conjecture: This estimate is sharp:
E sup_{1≤j≤M} η_j ≥ (1/L) inf_{T_1,T_2⊂T, T_1∪T_2=T} ( γ_2(T_1, d_2) + b(T_2) ).
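The two terms of the bound dominate in different regimes, which can be seen on toy coefficient matrices (an illustration of mine, not from the talk): spread-out coefficients make η_j behave like Gaussians, so the γ_2 term governs, while concentrated coefficients are governed by the ℓ_1 term b(T).

```python
import math, random

random.seed(7)

M, N, trials = 50, 100, 500

# Regime 1: a_{j,l} = +-1/sqrt(N), so each eta_j is approximately N(0, 1).
spread = [[random.choice((-1, 1)) / math.sqrt(N) for _ in range(N)]
          for _ in range(M)]
# Regime 2: a_{j,l} = 1 if l == j else 0, so eta_j = eps_j and sup eta_j <= 1 = b(T).
concentrated = [[1.0 if l == j else 0.0 for l in range(N)] for j in range(M)]

def mean_sup(coeffs):
    # Monte Carlo estimate of E sup_j eta_j for the coefficient matrix coeffs.
    total = 0.0
    for _ in range(trials):
        eps = [random.choice((-1, 1)) for _ in range(N)]
        total += max(sum(a * e for a, e in zip(row, eps)) for row in coeffs)
    return total / trials

sup_spread = mean_sup(spread)
sup_conc = mean_sup(concentrated)
print(f"spread:       E sup ~= {sup_spread:.2f} "
      f"(sqrt(2 log M) = {math.sqrt(2 * math.log(M)):.2f}, b(T) = {math.sqrt(N):.2f})")
print(f"concentrated: E sup ~= {sup_conc:.2f} (b(T) = 1)")
```

In the spread regime E sup is on the √(log M) Gaussian scale, far below b(T) = √N; in the concentrated regime the ℓ_1 bound b(T) = 1 is attained exactly.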
My conjecture: Talagrand’s conjecture does not hold. Moreover, there is no good estimate (where the ratio of the upper and lower bound is less than a universal constant) by means of γ_2(T_1, d_2) and b(T_2).
To prove the lower bound for E sup_{j∈T} η_j we need some estimates which say that if some random variables are far from each other in some sense, then their supremum is large.
In the Gaussian case Sudakov’s inequality holds.
Sudakov’s inequality. Let M Gaussian random variables ξ_1, ..., ξ_M be given, for which Eξ_j = 0 and E(ξ_j − ξ_k)^2 ≥ a^2 for all pairs 1 ≤ j < k ≤ M and some number a > 0. Then
E sup_{1≤j≤M} ξ_j ≥ (a/L) √(log M)
with a universal number L > 0.
This inequality is sharp.
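In the simplest case of i.i.d. standard Gaussians (so E(ξ_j − ξ_k)² = 2, i.e. a = √2) the √(log M) growth is easy to see numerically; this toy check is mine, not the talk's:

```python
import math, random

random.seed(6)

def mean_max(M, trials=3000):
    # Monte Carlo estimate of E max of M i.i.d. standard Gaussians.
    return sum(max(random.gauss(0, 1) for _ in range(M))
               for _ in range(trials)) / trials

results = {}
for M in (10, 100, 1000):
    results[M] = mean_max(M)
    print(f"M={M}: E max ~= {results[M]:.3f}, "
          f"sqrt(2 log M) = {math.sqrt(2 * math.log(M)):.3f}")
```

The estimates track √(2 log M) up to a bounded factor, which is exactly the content (and the sharpness) of Sudakov's bound in this case.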
A version of it for Bernoulli sums:
Theorem. Let a set of Bernoulli sums η_j = ∑_{l=1}^N a_{j,l} ε_l, j = 1, ..., M, be given. Put a_j^2 = ∑_{l=1}^N a_{j,l}^2 for all 1 ≤ j ≤ M,
B = sup_{1≤j≤M} a_j and C = sup_{1≤j≤M} sup_{1≤l≤N} |a_{j,l}|.
If |a_j − a_{j′}| ≥ B/4 for all 1 ≤ j ≠ j′ ≤ M (with a_j denoting the coefficient vector (a_{j,1}, ..., a_{j,N})), and C ≤ B/(L_0 √(log M)) with a sufficiently large universal constant L_0, then
E sup_{1≤j≤M} η_j ≥ (B/L) √(log M).
This inequality is sharp only if the Bernoulli sequences are in the ‘Gaussian domain’.