The theory of statistical decisions

László Györfi

Department of Computer Science and Information Theory
Budapest University of Technology and Economics
1521 Stoczek u. 2, Budapest, Hungary
gyorfi@cs.bme.hu

March 14, 2012

Abstract

In this note we survey the theory of statistical decisions, i.e., we consider statistical inferences where the target of the inference takes finitely many values. In the formulation of the Bayes decision, the aim is to minimize the weighted average of the conditional error probabilities. In the scheme of testing simple statistical hypotheses we constrain one conditional error probability and minimize the other one. We also study composite hypotheses, the testing of homogeneity and the testing of independence. In the analysis, divergences between probability distributions (the $L_1$ distance, the I-divergence, the Hellinger distance, etc.) play an important role.


Contents

1 Bayes decision
   1.1 Bayes risk
   1.2 Approximation of Bayes decision

2 Testing simple hypotheses
   2.1 α-level tests
   2.2 φ-divergences
   2.3 Repeated observations

3 Testing simple versus composite hypotheses
   3.1 Total variation and I-divergence
   3.2 Large deviation of L1 distance
   3.3 L1-distance-based strongly consistent test
   3.4 L1-distance-based α-level test
   3.5 I-divergence-based strongly consistent test
   3.6 I-divergence-based α-level test

4 Robust detection: testing composite versus composite hypotheses

5 Testing homogeneity
   5.1 The testing problem
   5.2 L1-distance-based strongly consistent test
   5.3 L1-distance-based α-level test

6 Testing independence
   6.1 The testing problem
   6.2 L1-based strongly consistent test
   6.3 L1-based α-level test
   6.4 I-divergence-based strongly consistent test
   6.5 I-divergence-based α-level test

1 Bayes decision

1.1 Bayes risk

For the statistical inference, a d-dimensional observation vector X is given, and based on X the statistician has to make an inference on a random variable Y which takes finitely many values, i.e., it takes values from the set $\{1,2,\dots,m\}$. In fact, the inference is a decision formulated by a decision function
$$g: \mathbb{R}^d \to \{1,2,\dots,m\}.$$
If $g(X) \neq Y$, then the decision makes an error.

In the formulation of the Bayes decision problem, introduce a cost function $C(y,y') \ge 0$, which is the cost if the label is $Y = y$ and the decision is $g(X) = y'$. For a decision function $g$, the risk is the expectation of the cost:
$$R(g) = E\{C(Y, g(X))\}.$$
In the Bayes decision problem the aim is to minimize the risk, i.e., the goal is to find a function $g^*: \mathbb{R}^d \to \{1,2,\dots,m\}$ such that
$$R(g^*) = \min_{g: \mathbb{R}^d \to \{1,2,\dots,m\}} R(g), \qquad (1)$$
where $g^*$ is called the Bayes decision function, and $R^* = R(g^*)$ is the Bayes risk.

For the posterior probabilities, introduce the notation
$$P_y(X) = P\{Y = y \mid X\}.$$
Let the decision function $g^*$ be defined by
$$g^*(X) = \arg\min_{y'} \sum_{y=1}^m C(y,y') P_y(X).$$
If the arg min is not unique, then choose the smallest $y'$ which minimizes $\sum_{y=1}^m C(y,y') P_y(X)$. This definition implies that for any decision function $g$,
$$\sum_{y=1}^m C(y, g^*(X)) P_y(X) \le \sum_{y=1}^m C(y, g(X)) P_y(X). \qquad (2)$$

Theorem 1 For any decision function $g$, we have that $R(g^*) \le R(g)$.

Proof. For a decision function $g$, let us calculate the risk:
$$
\begin{aligned}
R(g) &= E\{C(Y, g(X))\} \\
&= E\{E\{C(Y, g(X)) \mid X\}\} \\
&= E\left\{ \sum_{y=1}^m \sum_{y'=1}^m C(y,y')\, P\{Y=y,\, g(X)=y' \mid X\} \right\} \\
&= E\left\{ \sum_{y=1}^m \sum_{y'=1}^m C(y,y')\, I_{\{g(X)=y'\}}\, P\{Y=y \mid X\} \right\} \\
&= E\left\{ \sum_{y=1}^m C(y, g(X))\, P_y(X) \right\},
\end{aligned}
$$
where $I$ denotes the indicator. Now (2) implies that
$$R(g) = E\left\{ \sum_{y=1}^m C(y, g(X))\, P_y(X) \right\} \ge E\left\{ \sum_{y=1}^m C(y, g^*(X))\, P_y(X) \right\} = R(g^*). \qquad \Box$$

Concerning the cost function, the most frequently studied example is the so-called 0-1 loss:
$$C(y,y') = \begin{cases} 1 & \text{if } y \neq y', \\ 0 & \text{if } y = y'. \end{cases}$$
For the 0-1 loss, the corresponding risk is the error probability:
$$R(g) = E\{C(Y,g(X))\} = E\{I_{\{Y \neq g(X)\}}\} = P\{Y \neq g(X)\},$$
and the Bayes decision is of the form
$$g^*(X) = \arg\min_{y'} \sum_{y=1}^m C(y,y') P_y(X) = \arg\min_{y'} \sum_{y \neq y'} P_y(X) = \arg\max_{y'} P_{y'}(X),$$
which is also called the maximum a posteriori decision.
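To make the decision rule concrete, the following sketch (the cost matrix and the posterior values are hypothetical numbers chosen for illustration, not taken from the text) evaluates the general Bayes decision $\arg\min_{y'} \sum_y C(y,y') P_y(x)$ and its 0-1-loss special case, the maximum a posteriori decision.

```python
import numpy as np

def bayes_decision(posteriors, cost):
    """General Bayes decision: argmin over y' of sum_y C(y, y') P_y(x).

    posteriors: shape (m,), the values P_y(x) for y = 1, ..., m
    cost:       shape (m, m), cost[y-1, y'-1] = C(y, y')
    Returns a label in {1, ..., m} (the smallest minimizer, as in the text).
    """
    expected_cost = posteriors @ cost          # entry y' is sum_y C(y, y') P_y(x)
    return int(np.argmin(expected_cost)) + 1   # argmin returns the first minimizer

def map_decision(posteriors):
    """0-1 loss special case: the maximum a posteriori decision."""
    return int(np.argmax(posteriors)) + 1

# Hypothetical numbers, m = 3 classes; the costs make a wrong decision for class 1 expensive.
P = np.array([0.5, 0.3, 0.2])                  # posterior probabilities P_y(x)
C = np.array([[0.0, 1.0, 1.0],
              [10.0, 0.0, 1.0],
              [10.0, 1.0, 0.0]])

print(bayes_decision(P, C))                    # cost-weighted decision (here: 2)
print(map_decision(P))                         # plain argmax of the posterior (here: 1)
```

With an asymmetric cost matrix the two rules can disagree, which is exactly the point of weighting the posteriors by the costs.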

If the distribution of the observation vector X has a density, then the Bayes decision has an equivalent formulation. Introduce the notation for the density of X,
$$P\{X \in B\} = \int_B f(x)\,dx,$$
for the conditional densities,
$$P\{X \in B \mid Y = y\} = \int_B f_y(x)\,dx,$$
and for the a priori probabilities,
$$q_y = P\{Y = y\}.$$
Then it is easy to check that
$$P_y(x) = P\{Y = y \mid X = x\} = \frac{q_y f_y(x)}{f(x)},$$
and therefore
$$
g^*(x) = \arg\min_{y'} \sum_{y=1}^m C(y,y') P_y(x)
= \arg\min_{y'} \sum_{y=1}^m C(y,y') \frac{q_y f_y(x)}{f(x)}
= \arg\min_{y'} \sum_{y=1}^m C(y,y') q_y f_y(x).
$$
From the proof of Theorem 1 we may derive a formula for the optimal risk:
$$R(g^*) = E\left\{ \min_{y'} \sum_{y=1}^m C(y,y') P_y(X) \right\}.$$

If X has a density, then
$$
R(g^*) = E\left\{ \min_{y'} \sum_{y=1}^m C(y,y') \frac{q_y f_y(X)}{f(X)} \right\}
= \int_{\mathbb{R}^d} \min_{y'} \sum_{y=1}^m C(y,y') \frac{q_y f_y(x)}{f(x)}\, f(x)\,dx
= \int_{\mathbb{R}^d} \min_{y'} \sum_{y=1}^m C(y,y') q_y f_y(x)\,dx.
$$
For the 0-1 loss, we get that
$$R(g^*) = E\left\{ \min_{y'} \left( 1 - P_{y'}(X) \right) \right\},$$
which, for densities, has the form
$$R(g^*) = \int_{\mathbb{R}^d} \min_{y'} \left( f(x) - q_{y'} f_{y'}(x) \right) dx = 1 - \int_{\mathbb{R}^d} \max_{y'} q_{y'} f_{y'}(x)\,dx.$$
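The closed-form expression $R(g^*) = 1 - \int \max_{y'} q_{y'} f_{y'}(x)\,dx$ is easy to evaluate numerically in low dimension. Below is a minimal sketch for a hypothetical one-dimensional two-class example (the priors, means and variances are made up); it only illustrates the formula and is not part of the text.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Hypothetical two-class problem in d = 1: priors and class-conditional densities.
q = [0.4, 0.6]                                   # a priori probabilities q_y
components = [norm(0.0, 1.0), norm(2.0, 1.0)]    # f_1 = N(0, 1), f_2 = N(2, 1)

def max_weighted_density(x):
    # max over y' of q_{y'} f_{y'}(x)
    return max(qy * comp.pdf(x) for qy, comp in zip(q, components))

# Bayes risk for the 0-1 loss: R* = 1 - integral of max_{y'} q_{y'} f_{y'}(x) dx.
integral, _ = quad(max_weighted_density, -10.0, 12.0)   # the densities are negligible outside
print("Bayes risk:", 1.0 - integral)
```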

1.2 Approximation of Bayes decision

In practice, the posterior probabilities $\{P_y(X)\}$ are unknown. If we are given some approximations $\{\hat P_y(X)\}$, from which one may derive the approximate decision
$$\hat g(X) = \arg\min_{y'} \sum_{y=1}^m C(y,y') \hat P_y(X),$$
then the question is how well $R(\hat g)$ approximates $R^*$.

Lemma 1 Put $C_{\max} = \max_{y,y'} C(y,y')$. Then
$$0 \le R(\hat g) - R(g^*) \le 2 C_{\max} \sum_{y=1}^m E\left\{ |P_y(X) - \hat P_y(X)| \right\}.$$

Proof. We have that
$$
\begin{aligned}
R(\hat g) - R(g^*) &= E\left\{ \sum_{y=1}^m C(y, \hat g(X)) P_y(X) \right\} - E\left\{ \sum_{y=1}^m C(y, g^*(X)) P_y(X) \right\} \\
&= E\left\{ \sum_{y=1}^m C(y, \hat g(X)) P_y(X) - \sum_{y=1}^m C(y, \hat g(X)) \hat P_y(X) \right\} \\
&\quad + E\left\{ \sum_{y=1}^m C(y, \hat g(X)) \hat P_y(X) - \sum_{y=1}^m C(y, g^*(X)) \hat P_y(X) \right\} \\
&\quad + E\left\{ \sum_{y=1}^m C(y, g^*(X)) \hat P_y(X) - \sum_{y=1}^m C(y, g^*(X)) P_y(X) \right\}.
\end{aligned}
$$
The definition of $\hat g$ implies that
$$\sum_{y=1}^m C(y, \hat g(X)) \hat P_y(X) - \sum_{y=1}^m C(y, g^*(X)) \hat P_y(X) \le 0,$$
therefore
$$
R(\hat g) - R(g^*) \le E\left\{ \sum_{y=1}^m C(y, \hat g(X))\, |P_y(X) - \hat P_y(X)| \right\} + E\left\{ \sum_{y=1}^m C(y, g^*(X))\, |\hat P_y(X) - P_y(X)| \right\}
\le 2 C_{\max} \sum_{y=1}^m E\left\{ |P_y(X) - \hat P_y(X)| \right\}. \qquad \Box
$$
In the special case of the approximate maximum a posteriori decision, the inequality in Lemma 1 can be slightly improved:
$$0 \le R(\hat g) - R(g^*) \le \sum_{y=1}^m E\left\{ |P_y(X) - \hat P_y(X)| \right\}.$$
Based on this relation, one can introduce efficient pattern recognition rules. (For the details, see Devroye, Györfi, and Lugosi [21].)
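A small Monte Carlo sketch can illustrate the improved bound for the approximate maximum a posteriori decision. Everything below is an assumption made for illustration: a two-class Gaussian model in one dimension, labels coded as 0/1, and a deliberately biased posterior "estimate"; the point is only that the estimated excess risk stays below $\sum_y E\{|P_y(X) - \hat P_y(X)|\}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed model: labels 0/1 with equal priors, X | Y ~ N(m_Y, sigma^2), d = 1.
m0, m1, sigma, n = 0.0, 1.5, 1.0, 200_000
Y = rng.integers(0, 2, size=n)
X = rng.normal(np.where(Y == 0, m0, m1), sigma)

def posterior1(x, bias=0.0):
    # P{Y = 1 | X = x} for this model; `bias` produces a deliberately wrong "estimate".
    z = (m1 - m0) * (x - (m0 + m1) / 2.0) / sigma**2 + bias
    return 1.0 / (1.0 + np.exp(-z))

p1 = posterior1(X)                      # true posteriors P_1(X)
p1_hat = posterior1(X, bias=0.8)        # hypothetical approximation of P_1(X)

g_star = (p1 > 0.5).astype(int)         # Bayes (maximum a posteriori) decision
g_hat = (p1_hat > 0.5).astype(int)      # approximate decision based on the estimate

excess_risk = np.mean(g_hat != Y) - np.mean(g_star != Y)
bound = 2.0 * np.mean(np.abs(p1 - p1_hat))   # sum over y = 0, 1 of E|P_y - \hat P_y|
print(excess_risk, bound)                    # the excess risk should not exceed the bound
```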


2 Testing simple hypotheses

2.1 α-level tests

In this section we consider decision problems where the consequences of the various errors are very different. For example, in a diagnostic problem Y = 0 means that the patient is healthy, while Y = 1 means that the patient is ill. For Y = 0 the false decision is that the patient is ill, which implies some superfluous medical treatment, while for Y = 1 the false decision is that the illness is not detected, and the patient's state may become worse. A similar situation arises in radar detection.

The event Y = 0 is called the null hypothesis and is denoted by $H_0$, and the event Y = 1 is called the alternative hypothesis and is denoted by $H_1$. The decision, i.e., the test, is formulated by a set $A \subset \mathbb{R}^d$, called the acceptance region: accept $H_0$ if $X \in A$, otherwise reject $H_0$, i.e., accept $H_1$. The set $A^c$ is called the critical region.

Let $P_0$ and $P_1$ be the probability distributions of X under $H_0$ and $H_1$, respectively. There are two types of errors:

Error of the first kind: under the null hypothesis $H_0$ we reject $H_0$. This error has probability $P_0(A^c)$.

Error of the second kind: under the alternative hypothesis $H_1$ we reject $H_1$. This error has probability $P_1(A)$.

Obviously, one can decrease the error of the first kind $P_0(A^c)$ only at the price of increasing the error of the second kind $P_1(A)$. We therefore formulate the optimization problem of minimizing the error of the second kind under the condition that the error of the first kind is at most $\alpha$, $0 < \alpha < 1$:
$$\min_{A: P_0(A^c) \le \alpha} P_1(A). \qquad (3)$$
In order to solve this problem the Neyman-Pearson Lemma plays an important role.

Theorem 2 (Neyman, Pearson [45]) Assume that the distributions $P_0$ and $P_1$ have densities $f_0$ and $f_1$:
$$P_0(B) = \int_B f_0(x)\,dx \quad \text{and} \quad P_1(B) = \int_B f_1(x)\,dx.$$
For a $\gamma > 0$, put
$$A_\gamma = \{x : f_0(x) \ge \gamma f_1(x)\}.$$
If for a set $A$
$$P_0(A^c) \le P_0(A_\gamma^c),$$
then
$$P_1(A) \ge P_1(A_\gamma).$$

Proof. Because of the condition of the theorem, we have the following chain of inequalities:
$$P_0(A^c) \le P_0(A_\gamma^c),$$
$$P_0(A^c \cap A_\gamma) + P_0(A^c \cap A_\gamma^c) \le P_0(A \cap A_\gamma^c) + P_0(A^c \cap A_\gamma^c),$$
$$\int_{A^c \cap A_\gamma} f_0(x)\,dx \le \int_{A \cap A_\gamma^c} f_0(x)\,dx.$$
The definition of $A_\gamma$ implies that
$$\gamma \int_{A^c \cap A_\gamma} f_1(x)\,dx \le \int_{A^c \cap A_\gamma} f_0(x)\,dx \le \int_{A \cap A_\gamma^c} f_0(x)\,dx \le \gamma \int_{A \cap A_\gamma^c} f_1(x)\,dx,$$
therefore, using the previous chain of derivations in reverse order, we get that
$$P_1(A^c) \le P_1(A_\gamma^c),$$
which is equivalent to $P_1(A) \ge P_1(A_\gamma)$. $\Box$

If for an $0 < \alpha < 1$ there is a $\gamma = \gamma(\alpha)$ which solves the equation
$$P_0(A_\gamma^c) = \alpha,$$
then the Neyman-Pearson Lemma implies that in order to solve the problem (3) it is enough to search among sets of the form $A_\gamma$, i.e.,
$$\min_{A: P_0(A^c) \le \alpha} P_1(A) = \min_{A_\gamma: P_0(A_\gamma^c) \le \alpha} P_1(A_\gamma).$$
Then $A_\gamma$ is called the most powerful $\alpha$-level test.

Because of the Neyman-Pearson Lemma, we introduce the likelihood ratio statistic
$$T(X) = \frac{f_0(X)}{f_1(X)},$$
and so the null hypothesis $H_0$ is accepted if $T(X) \ge \gamma$.

Example 1. As an illustration of the Neyman-Pearson Lemma, consider an experiment where, under the null hypothesis, the components of X are i.i.d. normal with mean $m = m_0 > 0$ and variance $\sigma^2$, while under the alternative hypothesis the components of X are i.i.d. normal with mean $m_1 = 0$ and the same variance $\sigma^2$. Then
$$f_0(x) = f_0(x_1,\dots,x_d) = \prod_{i=1}^d \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - m)^2}{2\sigma^2}}$$
and
$$f_1(x) = f_1(x_1,\dots,x_d) = \prod_{i=1}^d \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x_i^2}{2\sigma^2}},$$
and $\frac{f_0(X)}{f_1(X)} \ge \gamma$ means that
$$-\sum_{i=1}^d \frac{(X_i - m)^2}{2\sigma^2} + \sum_{i=1}^d \frac{X_i^2}{2\sigma^2} \ge \ln\gamma,$$
or equivalently,
$$\sum_{i=1}^d \left( 2 X_i m - m^2 \right) \ge 2\sigma^2 \ln\gamma.$$
This test accepts the null hypothesis if
$$\frac{1}{d}\sum_{i=1}^d X_i \ge \frac{2\sigma^2 \ln\gamma/d + m^2}{2m} = \frac{\sigma^2 \ln\gamma}{dm} + \frac{m}{2} =: \gamma_0.$$
This test is based on the linear statistic $\sum_{i=1}^d X_i / d$, and the question left is how to choose the critical value $\gamma_0$ for which it is an $\alpha$-level test, i.e., the error of the first kind is $\alpha$:
$$P_0\left\{ \frac{1}{d}\sum_{i=1}^d X_i \le \gamma_0 \right\} = \alpha.$$
Under the null hypothesis, the distribution of $\frac{1}{d}\sum_{i=1}^d X_i$ is normal with mean $m$ and variance $\sigma^2/d$, therefore
$$P_0\left\{ \frac{1}{d}\sum_{i=1}^d X_i \le \gamma_0 \right\} = \Phi\left( \frac{\gamma_0 - m}{\sigma/\sqrt{d}} \right),$$
where $\Phi$ denotes the standard normal distribution function, and so the critical value $\gamma_0$ of an $\alpha$-level test solves the equation
$$\Phi\left( -\frac{m - \gamma_0}{\sigma/\sqrt{d}} \right) = \alpha,$$
i.e.,
$$\gamma_0 = m - \Phi^{-1}(1-\alpha)\,\sigma/\sqrt{d}.$$
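The critical value $\gamma_0 = m - \Phi^{-1}(1-\alpha)\sigma/\sqrt{d}$ of Example 1 is straightforward to compute with the standard normal quantile function. The sketch below uses made-up values of $m$, $\sigma$, $d$ and $\alpha$, and checks by simulation that the rejection probability under $H_0$ is close to $\alpha$.

```python
import numpy as np
from scipy.stats import norm

m, sigma, d, alpha = 1.0, 2.0, 25, 0.05            # made-up parameters

# Critical value of the alpha-level test from Example 1:
gamma0 = m - norm.ppf(1 - alpha) * sigma / np.sqrt(d)

# Monte Carlo check: under H_0 the sample mean is N(m, sigma^2 / d),
# so the rejection rate P_0{ sample mean < gamma0 } should be close to alpha.
rng = np.random.default_rng(1)
samples = rng.normal(m, sigma, size=(100_000, d))
rejection_rate = np.mean(samples.mean(axis=1) < gamma0)
print(gamma0, rejection_rate)
```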

Remark 1. In many situations, when d is large enough, one can refer to the central limit theorem, so that the log-likelihood ratio
$$\ln \frac{f_0(X)}{f_1(X)}$$
is asymptotically normal. The argument of Example 1 can be extended if, under $H_0$, the log-likelihood ratio is approximately normal with mean $m_0$ and variance $\sigma_0^2$. Let the test be defined such that it accepts $H_0$ if
$$\ln \frac{f_0(X)}{f_1(X)} \ge \gamma_0, \quad \text{where} \quad \gamma_0 = m_0 - \Phi^{-1}(1-\alpha)\,\sigma_0.$$
Then this test is approximately an $\alpha$-level test.

2.2 φ-divergences

In the analysis of repeated observations the divergences between distributions play an important role. Imre Csiszár [14] introduced the concept of φ-divergences. Let $\phi: (0,\infty) \to \mathbb{R}$ be a convex function, extended to $[0,\infty)$ by continuity, such that $\phi(1) = 0$. For probability distributions $\mu$ and $\nu$, let $\lambda$ be a $\sigma$-finite dominating measure of $\mu$ and $\nu$, for example $\lambda = \mu + \nu$. Introduce the notations
$$f = \frac{d\mu}{d\lambda} \quad \text{and} \quad g = \frac{d\nu}{d\lambda}.$$
Then the φ-divergence of $\mu$ and $\nu$ is defined by
$$D_\phi(\mu,\nu) = \int_{\mathbb{R}^d} \phi\!\left( \frac{f(x)}{g(x)} \right) g(x)\,\lambda(dx). \qquad (4)$$
The Jensen inequality implies the most important property of φ-divergences:
$$D_\phi(\mu,\nu) = \int_{\mathbb{R}^d} \phi\!\left( \frac{f(x)}{g(x)} \right) g(x)\,\lambda(dx) \ge \phi\!\left( \int_{\mathbb{R}^d} \frac{f(x)}{g(x)}\, g(x)\,\lambda(dx) \right) = \phi(1) = 0.$$
It means that $D_\phi(\mu,\nu) \ge 0$, and if $\mu = \nu$ then $D_\phi(\mu,\nu) = 0$. If, in addition, $\phi$ is strictly convex at 1, then $D_\phi(\mu,\nu) = 0$ iff $\mu = \nu$.

Next we show some examples.

For
$$\phi_1(t) = |t-1|,$$
we get the $L_1$ distance
$$D_{\phi_1}(\mu,\nu) = \int_{\mathbb{R}^d} |f(x) - g(x)|\,\lambda(dx).$$
For
$$\phi_2(t) = (\sqrt{t} - 1)^2,$$
we get the squared Hellinger distance
$$D_{\phi_2}(\mu,\nu) = \int_{\mathbb{R}^d} \left( \sqrt{f(x)} - \sqrt{g(x)} \right)^2 \lambda(dx) = 2\left( 1 - \int_{\mathbb{R}^d} \sqrt{f(x) g(x)}\,\lambda(dx) \right).$$
For
$$\phi_3(t) = -\ln t,$$
we get the I-divergence
$$I(\mu,\nu) = D_{\phi_3}(\mu,\nu) = \int_{\mathbb{R}^d} \ln\!\left( \frac{g(x)}{f(x)} \right) g(x)\,\lambda(dx).$$
For
$$\phi_4(t) = (t-1)^2,$$
we get the $\chi^2$-divergence
$$\chi^2(\mu,\nu) = D_{\phi_4}(\mu,\nu) = \int_{\mathbb{R}^d} \frac{(f(x) - g(x))^2}{g(x)}\,\lambda(dx).$$
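For finite (discrete) distributions, taking the counting measure as the dominating measure $\lambda$, the four φ-divergences above reduce to finite sums. The sketch below computes them for two made-up probability vectors; the I-divergence line follows the $\phi_3$ formula above, i.e., it integrates against $\nu$.

```python
import numpy as np

# Two made-up discrete distributions; lambda is the counting measure on {1, ..., 5}.
mu = np.array([0.10, 0.20, 0.30, 0.25, 0.15])   # f = d(mu)/d(lambda)
nu = np.array([0.20, 0.20, 0.20, 0.20, 0.20])   # g = d(nu)/d(lambda)

l1_distance = np.sum(np.abs(mu - nu))                        # phi_1(t) = |t - 1|
hellinger_sq = np.sum((np.sqrt(mu) - np.sqrt(nu)) ** 2)      # phi_2(t) = (sqrt(t) - 1)^2
i_divergence = np.sum(nu * np.log(nu / mu))                  # phi_3(t) = -ln t, against nu
chi2_divergence = np.sum((mu - nu) ** 2 / nu)                # phi_4(t) = (t - 1)^2

print(l1_distance, hellinger_sq, i_divergence, chi2_divergence)
```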

An equivalent definition of the φ-divergence is
$$D_\phi(\mu,\nu) = \sup_{\mathcal{P}} \sum_j \phi\!\left( \frac{\mu(A_j)}{\nu(A_j)} \right) \nu(A_j), \qquad (5)$$
where the supremum is taken over all finite Borel measurable partitions $\mathcal{P} = \{A_j\}$ of $\mathbb{R}^d$.

The main reasoning behind this equivalence is that for any partition $\mathcal{P} = \{A_j\}$, the Jensen inequality implies that
$$
\begin{aligned}
D_\phi(\mu,\nu) &= \int_{\mathbb{R}^d} \phi\!\left( \frac{f(x)}{g(x)} \right) g(x)\,\lambda(dx) \\
&= \sum_j \int_{A_j} \phi\!\left( \frac{f(x)}{g(x)} \right) g(x)\,\lambda(dx) \\
&= \sum_j \frac{1}{\nu(A_j)} \int_{A_j} \phi\!\left( \frac{f(x)}{g(x)} \right) g(x)\,\lambda(dx)\;\nu(A_j) \\
&\ge \sum_j \phi\!\left( \frac{1}{\nu(A_j)} \int_{A_j} \frac{f(x)}{g(x)}\, g(x)\,\lambda(dx) \right) \nu(A_j) \\
&= \sum_j \phi\!\left( \frac{\mu(A_j)}{\nu(A_j)} \right) \nu(A_j).
\end{aligned}
$$

The sequence of partitions $\mathcal{P}_1, \mathcal{P}_2, \dots$ is called nested if any cell $A \in \mathcal{P}_{n+1}$ is a subset of a cell $A' \in \mathcal{P}_n$. Next we show that for a nested sequence of partitions,
$$\sum_{A \in \mathcal{P}_n} \phi\!\left( \frac{\mu(A)}{\nu(A)} \right) \nu(A) \uparrow.$$
Again, this property is a consequence of the Jensen inequality:
$$
\begin{aligned}
\sum_{A' \in \mathcal{P}_{n+1}} \phi\!\left( \frac{\mu(A')}{\nu(A')} \right) \nu(A')
&= \sum_{A \in \mathcal{P}_n} \left( \sum_{A' \in \mathcal{P}_{n+1},\, A' \subset A} \phi\!\left( \frac{\mu(A')}{\nu(A')} \right) \nu(A') \right) \\
&= \sum_{A \in \mathcal{P}_n} \left( \sum_{A' \in \mathcal{P}_{n+1},\, A' \subset A} \phi\!\left( \frac{\mu(A')}{\nu(A')} \right) \frac{\nu(A')}{\nu(A)} \right) \nu(A) \\
&\ge \sum_{A \in \mathcal{P}_n} \phi\!\left( \sum_{A' \in \mathcal{P}_{n+1},\, A' \subset A} \frac{\mu(A')}{\nu(A')} \cdot \frac{\nu(A')}{\nu(A)} \right) \nu(A) \\
&= \sum_{A \in \mathcal{P}_n} \phi\!\left( \frac{\mu(A)}{\nu(A)} \right) \nu(A).
\end{aligned}
$$
It implies that there is a nested sequence of partitions $\mathcal{P}_1, \mathcal{P}_2, \dots$ such that
$$\sum_{A \in \mathcal{P}_n} \phi\!\left( \frac{\mu(A)}{\nu(A)} \right) \nu(A) \uparrow \sup_{\mathcal{P}} \sum_{A \in \mathcal{P}} \phi\!\left( \frac{\mu(A)}{\nu(A)} \right) \nu(A).$$
The sequence of partitions $\mathcal{P}_1, \mathcal{P}_2, \dots$ is called asymptotically fine if for any sphere $S$ centered at the origin,
$$\lim_{n\to\infty} \max_{A \in \mathcal{P}_n,\, A \cap S \neq \emptyset} \mathrm{diam}(A) = 0. \qquad (6)$$

One can show that if the nested sequence of partitions $\mathcal{P}_1, \mathcal{P}_2, \dots$ is asymptotically fine, then
$$\sum_{A \in \mathcal{P}_n} \phi\!\left( \frac{\mu(A)}{\nu(A)} \right) \nu(A) \uparrow \int_{\mathbb{R}^d} \phi\!\left( \frac{f(x)}{g(x)} \right) g(x)\,\lambda(dx).$$
This final step is verified in the particular case of the $L_1$ distance (cf. Section 3.3). In general, we may introduce a cell-wise constant approximation of $\frac{f(x)}{g(x)}$:
$$F_n(x) := \frac{\mu(A)}{\nu(A)} \quad \text{if } x \in A.$$
Thus,
$$\sum_{A \in \mathcal{P}_n} \phi\!\left( \frac{\mu(A)}{\nu(A)} \right) \nu(A) = \int_{\mathbb{R}^d} \phi(F_n(x))\, g(x)\,\lambda(dx),$$
and
$$F_n(x) \to \frac{f(x)}{g(x)}$$
for λ-almost all $x$ with $g(x) > 0$, such that
$$\int_{\mathbb{R}^d} \phi(F_n(x))\, g(x)\,\lambda(dx) \to \int_{\mathbb{R}^d} \phi\!\left( \frac{f(x)}{g(x)} \right) g(x)\,\lambda(dx).$$
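The monotone convergence along nested, asymptotically fine partitions can be observed numerically. The sketch below is an illustration only: the two Gaussian distributions and the dyadic partitions of $[-8,8]$ are made-up choices; it computes the partition-based value for $\phi_1(t) = |t-1|$ and compares it with the full $L_1$ distance.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Two made-up distributions on the real line.
mu, nu = norm(0.0, 1.0), norm(1.0, 1.5)

def partition_l1(inner_breaks):
    """Partition-based value for phi(t) = |t - 1|: sum over cells A of |mu(A) - nu(A)|."""
    edges = np.concatenate(([-np.inf], inner_breaks, [np.inf]))
    return np.sum(np.abs(np.diff(mu.cdf(edges)) - np.diff(nu.cdf(edges))))

# Nested, asymptotically fine partitions: dyadic grids on [-8, 8] plus the two tails.
for k in (1, 2, 4, 8, 16):
    breaks = np.arange(-8.0, 8.0 + 1e-9, 1.0 / k)
    print(k, partition_l1(breaks))                 # non-decreasing in the refinement

# The limit: the L1 distance between the two densities.
l1, _ = quad(lambda x: abs(mu.pdf(x) - nu.pdf(x)), -12.0, 12.0)
print("L1 distance:", l1)
```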

2.3 Repeated observations

The error probabilities can be decreased if, instead of a single observation vector X, we are given n vectors $X_1,\dots,X_n$ such that under $H_0$, $X_1,\dots,X_n$ are independent and identically distributed (i.i.d.) with distribution $P_0$, while under $H_1$, $X_1,\dots,X_n$ are i.i.d. with distribution $P_1$. In this case the likelihood ratio statistic is of the form
$$T(X) = \frac{f_0(X_1)\cdots f_0(X_n)}{f_1(X_1)\cdots f_1(X_n)}.$$
The Stein Lemma below says that there are tests for which both the error of the first kind $\alpha_n$ and the error of the second kind $\beta_n$ tend to 0 as $n \to \infty$. In order to formulate the Stein Lemma, we introduce the I-divergence (also called relative entropy)
$$D(f_0, f_1) = \int_{\mathbb{R}^d} f_0(x) \ln\frac{f_0(x)}{f_1(x)}\,dx \qquad (7)$$
(cf. Section 2.2). The I-divergence is always non-negative:
$$-D(f_0,f_1) = \int_{\mathbb{R}^d} f_0(x) \ln\frac{f_1(x)}{f_0(x)}\,dx \le \int_{\mathbb{R}^d} f_0(x)\left( \frac{f_1(x)}{f_0(x)} - 1 \right) dx = 0.$$

Theorem 3 (Stein [58]) For any $0 < \delta < D(f_0,f_1)$, there is a test such that the error of the first kind
$$\alpha_n \to 0,$$
and the error of the second kind
$$\beta_n \le e^{-n(D(f_0,f_1) - \delta)} \to 0.$$

Proof. Construct a test such that it accepts the null hypothesis $H_0$ if
$$\frac{f_0(X_1)\cdots f_0(X_n)}{f_1(X_1)\cdots f_1(X_n)} \ge e^{n(D(f_0,f_1)-\delta)},$$
or equivalently,
$$\frac{1}{n}\sum_{i=1}^n \ln\frac{f_0(X_i)}{f_1(X_i)} \ge D(f_0,f_1) - \delta.$$
Under $H_0$, the strong law of large numbers implies that
$$\frac{1}{n}\sum_{i=1}^n \ln\frac{f_0(X_i)}{f_1(X_i)} \to D(f_0,f_1)$$
almost surely (a.s.), therefore for the error of the first kind $\alpha_n$ we get that
$$\alpha_n = P_0\left\{ \frac{1}{n}\sum_{i=1}^n \ln\frac{f_0(X_i)}{f_1(X_i)} < D(f_0,f_1) - \delta \right\} \to 0.$$
Concerning the error of the second kind $\beta_n$, we have the following simple bound:
$$
\begin{aligned}
\beta_n &= P_1\left\{ \frac{f_0(X_1)\cdots f_0(X_n)}{f_1(X_1)\cdots f_1(X_n)} \ge e^{n(D(f_0,f_1)-\delta)} \right\} \\
&= \int_{\left\{ \frac{f_0(x_1)\cdots f_0(x_n)}{f_1(x_1)\cdots f_1(x_n)} \ge e^{n(D(f_0,f_1)-\delta)} \right\}} f_1(x_1)\cdots f_1(x_n)\,dx_1 \cdots dx_n \\
&\le e^{-n(D(f_0,f_1)-\delta)} \int_{\left\{ \frac{f_0(x_1)\cdots f_0(x_n)}{f_1(x_1)\cdots f_1(x_n)} \ge e^{n(D(f_0,f_1)-\delta)} \right\}} f_0(x_1)\cdots f_0(x_n)\,dx_1 \cdots dx_n \\
&\le e^{-n(D(f_0,f_1)-\delta)}. \qquad \Box
\end{aligned}
$$
The critical value of the test in the proof of the Stein Lemma used the I-divergence $D(f_0,f_1)$. Without knowing $D(f_0,f_1)$, the Chernoff Lemma below results in an exponential rate of convergence of the errors.
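The test constructed in the proof is easy to simulate. In the sketch below the two simple hypotheses are made-up unit-variance Gaussians, for which $D(f_0,f_1) = (m_0-m_1)^2/(2\sigma^2)$; the estimated error of the second kind obeys the bound $e^{-n(D(f_0,f_1)-\delta)}$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up simple hypotheses: f_0 = N(1, 1), f_1 = N(0, 1), so D(f_0, f_1) = 0.5.
m0, m1, sigma, D = 1.0, 0.0, 1.0, 0.5
delta, n, trials = 0.2, 200, 5000

def mean_log_lr(x):
    # (1/n) sum_i ln f_0(X_i) / f_1(X_i) for Gaussians with common variance
    return np.mean(((x - m1) ** 2 - (x - m0) ** 2) / (2.0 * sigma**2))

threshold = D - delta                                  # accept H_0 iff mean_log_lr >= threshold
accept_H0_under_H0 = [mean_log_lr(rng.normal(m0, sigma, n)) >= threshold for _ in range(trials)]
accept_H0_under_H1 = [mean_log_lr(rng.normal(m1, sigma, n)) >= threshold for _ in range(trials)]

alpha_n = 1.0 - np.mean(accept_H0_under_H0)            # error of the first kind
beta_n = np.mean(accept_H0_under_H1)                   # error of the second kind
print(alpha_n, beta_n, np.exp(-n * (D - delta)))       # beta_n is bounded by the last value
```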


Theorem 4 (Chernoff [12]) Construct a test such that it accepts the null hypothesis $H_0$ if
$$\frac{f_0(X_1)\cdots f_0(X_n)}{f_1(X_1)\cdots f_1(X_n)} \ge 1,$$
or equivalently,
$$\sum_{i=1}^n \ln\frac{f_0(X_i)}{f_1(X_i)} \ge 0.$$
(This test is called the maximum likelihood test.) Then
$$\alpha_n \le \left( \inf_{s>0} \int_{\mathbb{R}^d} f_1(x)^s f_0(x)^{1-s}\,dx \right)^n$$
and
$$\beta_n \le \left( \inf_{s>0} \int_{\mathbb{R}^d} f_0(x)^s f_1(x)^{1-s}\,dx \right)^n.$$

Proof. Apply the Chernoff bounding technique: for any $s > 0$ the Markov inequality implies that
$$
\begin{aligned}
\alpha_n &= P_0\left\{ \sum_{i=1}^n \ln\frac{f_0(X_i)}{f_1(X_i)} < 0 \right\} \\
&= P_0\left\{ s \sum_{i=1}^n \ln\frac{f_1(X_i)}{f_0(X_i)} > 0 \right\} \\
&= P_0\left\{ e^{\,s \sum_{i=1}^n \ln\frac{f_1(X_i)}{f_0(X_i)}} > 1 \right\} \\
&\le E_0\left\{ e^{\,s \sum_{i=1}^n \ln\frac{f_1(X_i)}{f_0(X_i)}} \right\} \\
&= E_0\left\{ \prod_{i=1}^n \left( \frac{f_1(X_i)}{f_0(X_i)} \right)^s \right\}.
\end{aligned}
$$
Under $H_0$, $X_1,\dots,X_n$ are i.i.d., therefore
$$
\alpha_n \le E_0\left\{ \prod_{i=1}^n \left( \frac{f_1(X_i)}{f_0(X_i)} \right)^s \right\}
= \prod_{i=1}^n E_0\left\{ \left( \frac{f_1(X_i)}{f_0(X_i)} \right)^s \right\}
= E_0\left\{ \left( \frac{f_1(X_1)}{f_0(X_1)} \right)^s \right\}^n
= \left( \int_{\mathbb{R}^d} \left( \frac{f_1(x_1)}{f_0(x_1)} \right)^s f_0(x_1)\,dx_1 \right)^n.
$$
Since $s > 0$ is arbitrary, the first half of the lemma is proved, and the proof of the second half is similar. $\Box$

Remark 2. The Chernoff Lemma results in an exponential rate of convergence if
$$\inf_{s>0} \int_{\mathbb{R}^d} f_1(x)^s f_0(x)^{1-s}\,dx < 1 \quad \text{and} \quad \inf_{s>0} \int_{\mathbb{R}^d} f_0(x)^s f_1(x)^{1-s}\,dx < 1.$$
The Cauchy-Schwarz inequality implies that
$$\inf_{s>0} \int_{\mathbb{R}^d} f_1(x)^s f_0(x)^{1-s}\,dx \le \int_{\mathbb{R}^d} f_1(x)^{1/2} f_0(x)^{1/2}\,dx \le \sqrt{ \int_{\mathbb{R}^d} f_1(x)\,dx \int_{\mathbb{R}^d} f_0(x)\,dx } = 1,$$
with equality in the second inequality if and only if $f_0 = f_1$. Moreover, one can check that the function
$$g(s) := \int_{\mathbb{R}^d} f_1(x)^s f_0(x)^{1-s}\,dx$$
is convex with $g(0) = 1$ and $g(1) = 1$, therefore
$$\inf_{s>0} \int_{\mathbb{R}^d} f_1(x)^s f_0(x)^{1-s}\,dx = \inf_{0<s<1} \int_{\mathbb{R}^d} f_1(x)^s f_0(x)^{1-s}\,dx.$$
The quantity
$$He(f_0,f_1) = \int_{\mathbb{R}^d} f_1(x)^{1/2} f_0(x)^{1/2}\,dx \qquad (8)$$
is called the Hellinger integral. The previous derivations imply that
$$\alpha_n \le He(f_0,f_1)^n \quad \text{and} \quad \beta_n \le He(f_0,f_1)^n.$$
The squared Hellinger distance $D_{\phi_2}(\mu,\nu)$ was introduced in Section 2.2. One can check that
$$D_{\phi_2}(\mu,\nu) = 2\left( 1 - He(f_0,f_1) \right).$$
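The Chernoff exponent $\inf_{0<s<1} \int f_1^s f_0^{1-s}$ and the Hellinger integral can be computed numerically. The sketch below uses two made-up equal-variance Gaussian densities, for which the Hellinger integral also has the closed form $e^{-(m_0-m_1)^2/(8\sigma^2)}$ as a sanity check; for this symmetric case the infimum is attained at $s = 1/2$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Made-up simple hypotheses: f_0 = N(0, 1), f_1 = N(2, 1).
f0 = norm(0.0, 1.0).pdf
f1 = norm(2.0, 1.0).pdf

def g(s):
    # g(s) = integral of f_1(x)^s f_0(x)^(1-s) dx, computed numerically
    value, _ = quad(lambda x: f1(x) ** s * f0(x) ** (1 - s), -15.0, 15.0)
    return value

result = minimize_scalar(g, bounds=(1e-6, 1 - 1e-6), method="bounded")   # inf over 0 < s < 1
hellinger_integral = g(0.5)                                              # He(f_0, f_1)

print("Chernoff exponent base:", result.fun)
print("Hellinger integral:", hellinger_integral)
print("closed form:", np.exp(-(0.0 - 2.0) ** 2 / 8.0))   # exp(-(m0 - m1)^2 / (8 sigma^2))
```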

Remark 3. Besides the concept of α-level consistency, there is another kind of consistency, called strong consistency, meaning that both under $H_0$ and under its complement the test makes a.s. no error after a random sample size. In other words, denoting by $P_0$ (resp. $P_1$) the probability under the null hypothesis (resp. under the alternative), we have
$$P_0\{\text{rejecting } H_0 \text{ for only finitely many } n\} = 1 \qquad (9)$$
and
$$P_1\{\text{accepting } H_0 \text{ for only finitely many } n\} = 1. \qquad (10)$$
Because of the Chernoff bound, both errors tend to 0 exponentially fast, so the Borel-Cantelli Lemma implies that the maximum likelihood test is strongly consistent. In a real-life problem, for example when the data arrive sequentially, one gets the data just once and should make a good inference from these data. Strong consistency means that the single sequence of inferences is a.s. perfect if the sample size is large enough. This concept is close to the definition of discernibility introduced by Dembo and Peres [18]. For a discussion and references, we refer the reader to Devroye and Lugosi [23].


3 Testing simple versus composite hypotheses

3.1 Total variation and I-divergence

If $\mu$ and $\nu$ are probability distributions on $\mathbb{R}^d$ ($d \ge 1$), then the total variation distance between $\mu$ and $\nu$ is defined by
$$V(\mu,\nu) = \sup_A |\mu(A) - \nu(A)|,$$
where the supremum is taken over all Borel sets $A$. The Scheffé Theorem below shows that the total variation is half of the $L_1$ distance of the corresponding densities.

Theorem 5 (Scheffé [55]) If $\mu$ and $\nu$ are absolutely continuous with densities $f$ and $g$, respectively, then
$$\int_{\mathbb{R}^d} |f(x) - g(x)|\,dx = 2 V(\mu,\nu).$$
(The quantity
$$L_1(f,g) = \int_{\mathbb{R}^d} |f(x) - g(x)|\,dx \qquad (11)$$
is called the $L_1$ distance, cf. Section 2.2.)

Proof. Note that
$$
V(\mu,\nu) = \sup_A |\mu(A) - \nu(A)|
= \sup_A \left| \int_A f - \int_A g \right|
= \sup_A \left| \int_A (f - g) \right|
= \int_{\{f > g\}} (f - g)
= \int_{\{g > f\}} (g - f)
= \frac{1}{2} \int |f - g|. \qquad \Box
$$

The Scheffé Theorem implies an equivalent definition of the total variation:
$$V(\mu,\nu) = \frac{1}{2} \sup_{\{A_j\}} \sum_j |\mu(A_j) - \nu(A_j)|, \qquad (12)$$
where the supremum is taken over all finite Borel measurable partitions $\{A_j\}$.

The information divergence (also called I-divergence, Kullback-Leibler number, relative entropy) of $\mu$ and $\nu$ is defined by
$$I(\mu,\nu) = \sup_{\{A_j\}} \sum_j \mu(A_j) \ln\frac{\mu(A_j)}{\nu(A_j)}, \qquad (13)$$
where the supremum is taken over all finite Borel measurable partitions $\{A_j\}$. If the densities $f$ and $g$ exist, then one can prove that
$$I(\mu,\nu) = D(f,g) = \int_{\mathbb{R}^d} f(x) \ln\frac{f(x)}{g(x)}\,dx.$$

The following inequality, called Pinsker's inequality, gives an upper bound on the total variation in terms of the I-divergence:

Theorem 6 (Csiszár [14], Kullback [39] and Kemperman [38])
$$2\{V(\mu,\nu)\}^2 \le I(\mu,\nu). \qquad (14)$$

Proof. Applying the notations of the proof of the Scheffé Theorem, put
$$A = \{f > g\},$$
then the Scheffé Theorem implies that
$$V(\mu,\nu) = \mu(A) - \nu(A).$$
Moreover, from (13) we get that
$$I(\mu,\nu) \ge \mu(A) \ln\frac{\mu(A)}{\nu(A)} + (1 - \mu(A)) \ln\frac{1 - \mu(A)}{1 - \nu(A)}.$$
Introduce the notations
$$q = \nu(A) \quad \text{and} \quad p = \mu(A) > q,$$
and
$$h_p(q) = p \ln\frac{p}{q} + (1-p) \ln\frac{1-p}{1-q}.$$
Then we have to prove that
$$2(p-q)^2 \le h_p(q),$$
which follows from the facts that $h_p(p) - 2(p-p)^2 = 0$ and that, for $0 < q \le p$,
$$\frac{d}{dq}\left( h_p(q) - 2(p-q)^2 \right) = -\frac{p}{q} + \frac{1-p}{1-q} + 4(p-q) = -\frac{p-q}{q(1-q)} + 4(p-q) \le 0,$$
since $q(1-q) \le 1/4$. $\Box$
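Pinsker's inequality is easy to check numerically on discrete distributions, with $V$ computed through Scheffé's identity as half the $L_1$ distance. The sketch below uses randomly generated distributions on a made-up six-point alphabet.

```python
import numpy as np

rng = np.random.default_rng(3)

def total_variation(p, q):
    # Scheffe's identity: V(mu, nu) = (1/2) sum_j |p_j - q_j|
    return 0.5 * np.sum(np.abs(p - q))

def i_divergence(p, q):
    # I(mu, nu) = sum_j p_j ln(p_j / q_j)
    return np.sum(p * np.log(p / q))

for _ in range(5):
    p = rng.dirichlet(np.ones(6))        # random distribution on a 6-point alphabet
    q = rng.dirichlet(np.ones(6))
    V, I = total_variation(p, q), i_divergence(p, q)
    print(2 * V**2 <= I, round(2 * V**2, 4), round(I, 4))   # Pinsker: 2 V^2 <= I
```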

3.2 Large deviation of L1 distance

Consider a sample of i.i.d. $\mathbb{R}^d$-valued random vectors $X_1,\dots,X_n$ whose common distribution is denoted by $\nu$. For a fixed distribution $\mu$, we consider the problem of testing the hypotheses
$$H_0: \nu = \mu \quad \text{versus} \quad H_1: \nu \neq \mu$$
by means of test statistics $T_n = T_n(X_1,\dots,X_n)$.

For testing the simple hypothesis $H_0$ that the distribution of the sample is $\mu$, versus a composite alternative, Györfi and van der Meulen [31] introduced a related goodness-of-fit test statistic $L_n$ defined as
$$L_n = \sum_{j=1}^{m_n} |\mu_n(A_{n,j}) - \mu(A_{n,j})|,$$
where $\mu_n$ denotes the empirical measure associated with the sample $X_1,\dots,X_n$, so that
$$\mu_n(A) = \frac{\#\{i : X_i \in A,\ i = 1,\dots,n\}}{n}$$
for any Borel subset $A$, and $\mathcal{P}_n = \{A_{n,1},\dots,A_{n,m_n}\}$ is a finite partition of $\mathbb{R}^d$. These authors also showed that under $H_0$,
$$P(L_n \ge \varepsilon) \le e^{-n(\varepsilon^2/8 + o(1))}.$$
Next we characterize the large deviation properties of $L_n$:

Theorem 7 (Beirlant, Devroye, Györfi and Vajda [6]) Assume that
$$\lim_{n\to\infty} \max_j \mu(A_{n,j}) = 0 \qquad (15)$$
and
$$\lim_{n\to\infty} \frac{m_n \ln n}{n} = 0. \qquad (16)$$
Then for all $0 < \varepsilon < 2$,
$$\lim_{n\to\infty} \frac{1}{n} \ln P\{L_n > \varepsilon\} = -g_L(\varepsilon), \qquad (17)$$
where
$$g_L(\varepsilon) = \inf_{0 < p < 1 - \varepsilon/2} \left( p \ln\frac{p}{p + \varepsilon/2} + (1-p) \ln\frac{1-p}{1-p-\varepsilon/2} \right). \qquad (18)$$

Remark 4. Note that a lower bound for $g_L$ follows from Pinsker's inequality (14):
$$g_L(\varepsilon) \ge \varepsilon^2/2.$$
The best known lower bound is due to Toussaint [61]:
$$g_L(\varepsilon) \ge \varepsilon^2/2 + \varepsilon^4/36 + \varepsilon^6/280.$$
An upper bound $\hat g(\varepsilon)$ of $g_L(\varepsilon)$ can be obtained by substituting $p = \frac{1-\varepsilon/2}{2}$ in the definition of $g_L(\varepsilon)$. Then
$$\hat g(\varepsilon) = \frac{\varepsilon}{2} \ln\frac{2+\varepsilon}{2-\varepsilon} \ge g_L(\varepsilon)$$
(Vajda [66]). Further bounds can be found on pp. 294-295 in Vajda [65]. Remark also that in Lemma 5.1 of Bahadur [2] it was observed that
$$g_L(\varepsilon) = \frac{\varepsilon^2}{2}\left( 1 + o(1) \right) \quad \text{as } \varepsilon \to 0.$$
The observations above mean that
$$P\{L_n > \varepsilon\} \approx e^{-n g_L(\varepsilon)} \le e^{-n\varepsilon^2/2}.$$
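The statistic $L_n$ is simple to compute once a partition is fixed. The sketch below (all choices are made up for illustration: the null distribution $N(0,1)$, the partition into $m_n$ intervals, the shifted alternative and the sample size) evaluates $L_n$ for a sample drawn from $\mu$ and for a sample drawn from an alternative; under $H_0$ it is small, of the order $\sqrt{m_n/n}$, while under the alternative it approaches $\sum_j |\nu(A_{n,j}) - \mu(A_{n,j})| > 0$.

```python
import numpy as np
from scipy.stats import norm

def L_n(sample, inner_breaks, mu_cdf):
    """L_n = sum_j |mu_n(A_{n,j}) - mu(A_{n,j})| for the partition defined by inner_breaks."""
    edges = np.concatenate(([-np.inf], inner_breaks, [np.inf]))
    mu_cells = np.diff(mu_cdf(edges))                              # mu(A_{n,j})
    cell_index = np.searchsorted(inner_breaks, sample)             # cell of each observation
    emp_cells = np.bincount(cell_index, minlength=len(inner_breaks) + 1) / len(sample)
    return np.sum(np.abs(emp_cells - mu_cells))

rng = np.random.default_rng(4)
n, m_n = 5000, 20
inner_breaks = np.linspace(-3.0, 3.0, m_n - 1)    # m_n cells, the two outer ones are half-lines

x_null = rng.normal(0.0, 1.0, n)                  # sample from mu = N(0, 1)   (H_0 true)
x_alt = rng.normal(0.4, 1.0, n)                   # sample from an alternative nu = N(0.4, 1)

print("L_n under H_0:", L_n(x_null, inner_breaks, norm.cdf))   # small, about sqrt(m_n / n)
print("L_n under H_1:", L_n(x_alt, inner_breaks, norm.cdf))    # near the population partition L1
```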

In the proof of Theorem 7 we shall use the following lemma.


Lemma 2 (Sanov [54]; see p. 16 in Dembo, Zeitouni [19], or Problem 1.2.11 in Csiszár and Körner [15]) Let $\Sigma$ be a finite set (alphabet), let $\mathcal{L}_n$ be the set of types (possible empirical distributions) on $\Sigma$, and let $\Gamma$ be a set of distributions on $\Sigma$. If $Z_1,\dots,Z_n$ are i.i.d. random variables taking values in $\Sigma$ with distribution $\mu$, and $\mu_n$ denotes the empirical distribution, then
$$\left| \frac{1}{n} \ln P\{\mu_n \in \Gamma\} + \inf_{\tau \in \Gamma \cap \mathcal{L}_n} I(\tau,\mu) \right| \le \frac{|\Sigma| \ln(n+1)}{n}, \qquad (19)$$
where $|\Sigma|$ denotes the cardinality of $\Sigma$.

Proof. Without loss of generality assume that $\Sigma = \{1,\dots,m\}$. We shall prove that
$$P\{\mu_n \in \Gamma\} \le |\mathcal{L}_n|\, e^{-n \min_{\tau \in \Gamma} I(\tau,\mu)}$$
and
$$P\{\mu_n \in \Gamma\} \ge \frac{1}{|\mathcal{L}_n|}\, e^{-n \min_{\tau \in \Gamma} I(\tau,\mu)}.$$
Because of our assumptions,
$$
\begin{aligned}
P\{Z_1 = z_1,\dots,Z_n = z_n\} &= \prod_{i=1}^n P\{Z_i = z_i\} = \prod_{i=1}^n \mu(z_i) = e^{\sum_{i=1}^n \ln\mu(z_i)} \\
&= e^{\sum_{i=1}^n \sum_{j=1}^m I_{\{z_i = j\}} \ln\mu(j)} = e^{\sum_{j=1}^m n\mu_n(j) \ln\mu(j)} \\
&= e^{-n(H(\mu_n) + I(\mu_n,\mu))} =: P_\mu(z_1^n),
\end{aligned}
$$
where $H(\mu_n)$ stands for the Shannon entropy of the distribution $\mu_n$. For any probability distribution $\tau \in \mathcal{L}_n$ we can define a probability distribution $P_\tau(z_1^n)$ in the same way:
$$P_\tau(z_1^n) := e^{-n(H(\mu_n) + I(\mu_n,\tau))}.$$
Put
$$T_n(\tau) = \{z_1^n : \mu_n(z_1^n) = \tau\},$$
then
$$1 \ge P_\tau\{\mu_n = \tau\} = P_\tau\{z_1^n \in T_n(\tau)\} = |T_n(\tau)|\, e^{-nH(\tau)},$$
therefore
$$|T_n(\tau)| \le e^{nH(\tau)},$$
which implies the upper bound:
$$
\begin{aligned}
P\{\mu_n \in \Gamma\} &= \sum_{\tau \in \Gamma} P_\mu\{\mu_n = \tau\} \\
&\le |\mathcal{L}_n| \max_{\tau \in \Gamma} P_\mu\{\mu_n = \tau\} \\
&= |\mathcal{L}_n| \max_{\tau \in \Gamma} |T_n(\tau)|\, e^{-n(H(\tau) + I(\tau,\mu))} \\
&\le |\mathcal{L}_n| \max_{\tau \in \Gamma} e^{-n I(\tau,\mu)} \\
&= |\mathcal{L}_n|\, e^{-n \min_{\tau \in \Gamma} I(\tau,\mu)}.
\end{aligned}
$$
Concerning the lower bound, notice that for any probability distribution $\nu \in \mathcal{L}_n$,
$$
\frac{P_\tau\{\mu_n = \tau\}}{P_\tau\{\mu_n = \nu\}}
= \frac{|T_n(\tau)| \prod_{a\in\Sigma} \tau(a)^{n\tau(a)}}{|T_n(\nu)| \prod_{a\in\Sigma} \tau(a)^{n\nu(a)}}
= \prod_{a\in\Sigma} \frac{(n\nu(a))!}{(n\tau(a))!}\, \tau(a)^{n(\tau(a) - \nu(a))} \ge 1.
$$
This last inequality can be seen as follows: the terms of the last product are of the form $\frac{m!}{l!} \left( \frac{l}{n} \right)^{l-m}$. It is easy to check that $\frac{m!}{l!} \ge l^{m-l}$, therefore
$$\prod_{a\in\Sigma} \frac{(n\nu(a))!}{(n\tau(a))!}\, \tau(a)^{n(\tau(a) - \nu(a))} \ge \prod_{a\in\Sigma} n^{n(\nu(a) - \tau(a))} = n^{n\left( \sum_{a\in\Sigma} \nu(a) - \sum_{a\in\Sigma} \tau(a) \right)} = 1.$$
It implies that
$$P_\tau\{\mu_n = \tau\} \ge P_\tau\{\mu_n = \nu\},$$
and thus
$$1 = \sum_\nu P_\tau\{\mu_n = \nu\} \le |\mathcal{L}_n|\, P_\tau\{\mu_n = \tau\} = |\mathcal{L}_n|\, |T_n(\tau)|\, e^{-nH(\tau)},$$
