The theory of statistical decisions

László Györfi

Department of Computer Science and Information Theory
Budapest University of Technology and Economics
1521 Stoczek u. 2, Budapest, Hungary
gyorfi@cs.bme.hu

March 14, 2012

Abstract

In this note we survey the theory of statistical decisions, i.e., we consider statistical inferences where the target of the inference takes finitely many values. In the formulation of the Bayes decision, the aim is to minimize the weighted average of the conditional error probabilities. In the scheme of testing simple statistical hypotheses we constrain one conditional error probability and minimize the other one. We also study composite hypotheses, the testing of homogeneity and the testing of independence. In the analysis, divergences between probability distributions (the $L_1$ distance, the I-divergence, the Hellinger distance, etc.) play an important role.


Contents

1 Bayes decision
   1.1 Bayes risk
   1.2 Approximation of Bayes decision

2 Testing simple hypotheses
   2.1 α-level tests
   2.2 φ-divergences
   2.3 Repeated observations

3 Testing simple versus composite hypotheses
   3.1 Total variation and I-divergence
   3.2 Large deviation of L1 distance
   3.3 L1-distance-based strongly consistent test
   3.4 L1-distance-based α-level test
   3.5 I-divergence-based strongly consistent test
   3.6 I-divergence-based α-level test

4 Robust detection: testing composite versus composite hypotheses

5 Testing homogeneity
   5.1 The testing problem
   5.2 L1-distance-based strongly consistent test
   5.3 L1-distance-based α-level test

6 Testing independence
   6.1 The testing problem
   6.2 L1-based strongly consistent test
   6.3 L1-based α-level test
   6.4 I-divergence-based strongly consistent test
   6.5 I-divergence-based α-level test

1 Bayes decision

1.1 Bayes risk

For the statistical inference, a d-dimensional observation vector X is given, and based on X the statistician has to make an inference on a random variable Y which takes finitely many values, i.e., it takes values from the set $\{1,2,\dots,m\}$. In fact, the inference is a decision formulated by a decision function
$$g: \mathbb{R}^d \to \{1,2,\dots,m\}.$$
If $g(X) \neq Y$, then the decision makes an error.

In the formulation of the Bayes decision problem, introduce a cost function $C(y,y') \ge 0$, which is the cost if the label is $Y = y$ and the decision is $g(X) = y'$. For a decision function $g$, the risk is the expectation of the cost:
$$R(g) = E\{C(Y, g(X))\}.$$
In the Bayes decision problem the aim is to minimize the risk, i.e., the goal is to find a function $g^*: \mathbb{R}^d \to \{1,2,\dots,m\}$ such that
$$R(g^*) = \min_{g: \mathbb{R}^d \to \{1,2,\dots,m\}} R(g), \qquad (1)$$
where $g^*$ is called the Bayes decision function, and $R^* = R(g^*)$ is the Bayes risk.

For the posterior probabilities, introduce the notation
$$P_y(X) = P\{Y = y \mid X\}.$$
Let the decision function $g^*$ be defined by
$$g^*(X) = \arg\min_{y'} \sum_{y=1}^m C(y,y') P_y(X).$$
If the arg min is not unique, then choose the smallest $y'$ which minimizes $\sum_{y=1}^m C(y,y') P_y(X)$. This definition implies that for any decision function $g$,
$$\sum_{y=1}^m C(y, g^*(X)) P_y(X) \le \sum_{y=1}^m C(y, g(X)) P_y(X). \qquad (2)$$

Theorem 1 For any decision function $g$, we have that $R(g^*) \le R(g)$.

Proof. For a decision function $g$, let us calculate the risk:
$$
\begin{aligned}
R(g) &= E\{C(Y, g(X))\} \\
&= E\{E\{C(Y, g(X)) \mid X\}\} \\
&= E\left\{ \sum_{y=1}^m \sum_{y'=1}^m C(y,y')\, P\{Y=y,\, g(X)=y' \mid X\} \right\} \\
&= E\left\{ \sum_{y=1}^m \sum_{y'=1}^m C(y,y')\, I_{\{g(X)=y'\}}\, P\{Y=y \mid X\} \right\} \\
&= E\left\{ \sum_{y=1}^m C(y, g(X))\, P_y(X) \right\},
\end{aligned}
$$
where $I$ denotes the indicator. Now (2) implies that
$$R(g) = E\left\{ \sum_{y=1}^m C(y, g(X))\, P_y(X) \right\} \ge E\left\{ \sum_{y=1}^m C(y, g^*(X))\, P_y(X) \right\} = R(g^*). \qquad \Box$$

Concerning the cost function, the most frequently studied example is the so-called 0-1 loss:
$$C(y,y') = \begin{cases} 1 & \text{if } y \neq y', \\ 0 & \text{if } y = y'. \end{cases}$$
For the 0-1 loss, the corresponding risk is the error probability:
$$R(g) = E\{C(Y,g(X))\} = E\{I_{\{Y \neq g(X)\}}\} = P\{Y \neq g(X)\},$$
and the Bayes decision is of the form
$$g^*(X) = \arg\min_{y'} \sum_{y=1}^m C(y,y') P_y(X) = \arg\min_{y'} \sum_{y \neq y'} P_y(X) = \arg\max_{y'} P_{y'}(X),$$
which is also called the maximum a posteriori decision.
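To make the decision rule concrete, the following sketch (the cost matrix and the posterior values are hypothetical numbers chosen for illustration, not taken from the text) evaluates the general Bayes decision $\arg\min_{y'} \sum_y C(y,y') P_y(x)$ and its 0-1-loss special case, the maximum a posteriori decision.

```python
import numpy as np

def bayes_decision(posteriors, cost):
    """General Bayes decision: argmin over y' of sum_y C(y, y') P_y(x).

    posteriors: shape (m,), the values P_y(x) for y = 1, ..., m
    cost:       shape (m, m), cost[y-1, y'-1] = C(y, y')
    Returns a label in {1, ..., m} (the smallest minimizer, as in the text).
    """
    expected_cost = posteriors @ cost          # entry y' is sum_y C(y, y') P_y(x)
    return int(np.argmin(expected_cost)) + 1   # argmin returns the first minimizer

def map_decision(posteriors):
    """0-1 loss special case: the maximum a posteriori decision."""
    return int(np.argmax(posteriors)) + 1

# Hypothetical numbers, m = 3 classes; the costs make a wrong decision for class 1 expensive.
P = np.array([0.5, 0.3, 0.2])                  # posterior probabilities P_y(x)
C = np.array([[0.0, 1.0, 1.0],
              [10.0, 0.0, 1.0],
              [10.0, 1.0, 0.0]])

print(bayes_decision(P, C))                    # cost-weighted decision (here: 2)
print(map_decision(P))                         # plain argmax of the posterior (here: 1)
```

With an asymmetric cost matrix the two rules can disagree, which is exactly the point of weighting the posteriors by the costs.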

If the distribution of the observation vector X has a density, then the Bayes decision has an equivalent formulation. Introduce the notation for the density of X,
$$P\{X \in B\} = \int_B f(x)\,dx,$$
for the conditional densities,
$$P\{X \in B \mid Y = y\} = \int_B f_y(x)\,dx,$$
and for the a priori probabilities,
$$q_y = P\{Y = y\}.$$
Then it is easy to check that
$$P_y(x) = P\{Y = y \mid X = x\} = \frac{q_y f_y(x)}{f(x)},$$
and therefore
$$
g^*(x) = \arg\min_{y'} \sum_{y=1}^m C(y,y') P_y(x)
= \arg\min_{y'} \sum_{y=1}^m C(y,y') \frac{q_y f_y(x)}{f(x)}
= \arg\min_{y'} \sum_{y=1}^m C(y,y') q_y f_y(x).
$$
From the proof of Theorem 1 we may derive a formula for the optimal risk:
$$R(g^*) = E\left\{ \min_{y'} \sum_{y=1}^m C(y,y') P_y(X) \right\}.$$

If X has a density, then
$$
R(g^*) = E\left\{ \min_{y'} \sum_{y=1}^m C(y,y') \frac{q_y f_y(X)}{f(X)} \right\}
= \int_{\mathbb{R}^d} \min_{y'} \sum_{y=1}^m C(y,y') \frac{q_y f_y(x)}{f(x)}\, f(x)\,dx
= \int_{\mathbb{R}^d} \min_{y'} \sum_{y=1}^m C(y,y') q_y f_y(x)\,dx.
$$
For the 0-1 loss, we get that
$$R(g^*) = E\left\{ \min_{y'} \left( 1 - P_{y'}(X) \right) \right\},$$
which, for densities, has the form
$$R(g^*) = \int_{\mathbb{R}^d} \min_{y'} \left( f(x) - q_{y'} f_{y'}(x) \right) dx = 1 - \int_{\mathbb{R}^d} \max_{y'} q_{y'} f_{y'}(x)\,dx.$$
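The closed-form expression $R(g^*) = 1 - \int \max_{y'} q_{y'} f_{y'}(x)\,dx$ is easy to evaluate numerically in low dimension. Below is a minimal sketch for a hypothetical one-dimensional two-class example (the priors, means and variances are made up); it only illustrates the formula and is not part of the text.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Hypothetical two-class problem in d = 1: priors and class-conditional densities.
q = [0.4, 0.6]                                   # a priori probabilities q_y
components = [norm(0.0, 1.0), norm(2.0, 1.0)]    # f_1 = N(0, 1), f_2 = N(2, 1)

def max_weighted_density(x):
    # max over y' of q_{y'} f_{y'}(x)
    return max(qy * comp.pdf(x) for qy, comp in zip(q, components))

# Bayes risk for the 0-1 loss: R* = 1 - integral of max_{y'} q_{y'} f_{y'}(x) dx.
integral, _ = quad(max_weighted_density, -10.0, 12.0)   # the densities are negligible outside
print("Bayes risk:", 1.0 - integral)
```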

1.2 Approximation of Bayes decision

In practice, the posterior probabilities $\{P_y(X)\}$ are unknown. If we are given some approximations $\{\hat P_y(X)\}$, from which one may derive the approximate decision
$$\hat g(X) = \arg\min_{y'} \sum_{y=1}^m C(y,y') \hat P_y(X),$$
then the question is how well $R(\hat g)$ approximates $R^*$.

Lemma 1 Put $C_{\max} = \max_{y,y'} C(y,y')$. Then
$$0 \le R(\hat g) - R(g^*) \le 2 C_{\max} \sum_{y=1}^m E\left\{ |P_y(X) - \hat P_y(X)| \right\}.$$

Proof. We have that
$$
\begin{aligned}
R(\hat g) - R(g^*) &= E\left\{ \sum_{y=1}^m C(y, \hat g(X)) P_y(X) \right\} - E\left\{ \sum_{y=1}^m C(y, g^*(X)) P_y(X) \right\} \\
&= E\left\{ \sum_{y=1}^m C(y, \hat g(X)) P_y(X) - \sum_{y=1}^m C(y, \hat g(X)) \hat P_y(X) \right\} \\
&\quad + E\left\{ \sum_{y=1}^m C(y, \hat g(X)) \hat P_y(X) - \sum_{y=1}^m C(y, g^*(X)) \hat P_y(X) \right\} \\
&\quad + E\left\{ \sum_{y=1}^m C(y, g^*(X)) \hat P_y(X) - \sum_{y=1}^m C(y, g^*(X)) P_y(X) \right\}.
\end{aligned}
$$
The definition of $\hat g$ implies that
$$\sum_{y=1}^m C(y, \hat g(X)) \hat P_y(X) - \sum_{y=1}^m C(y, g^*(X)) \hat P_y(X) \le 0,$$
therefore
$$
R(\hat g) - R(g^*) \le E\left\{ \sum_{y=1}^m C(y, \hat g(X))\, |P_y(X) - \hat P_y(X)| \right\} + E\left\{ \sum_{y=1}^m C(y, g^*(X))\, |\hat P_y(X) - P_y(X)| \right\}
\le 2 C_{\max} \sum_{y=1}^m E\left\{ |P_y(X) - \hat P_y(X)| \right\}. \qquad \Box
$$
In the special case of the approximate maximum a posteriori decision, the inequality in Lemma 1 can be slightly improved:
$$0 \le R(\hat g) - R(g^*) \le \sum_{y=1}^m E\left\{ |P_y(X) - \hat P_y(X)| \right\}.$$
Based on this relation, one can introduce efficient pattern recognition rules. (For the details, see Devroye, Györfi, and Lugosi [21].)
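A small Monte Carlo sketch can illustrate the improved bound for the approximate maximum a posteriori decision. Everything below is an assumption made for illustration: a two-class Gaussian model in one dimension, labels coded as 0/1, and a deliberately biased posterior "estimate"; the point is only that the estimated excess risk stays below $\sum_y E\{|P_y(X) - \hat P_y(X)|\}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed model: labels 0/1 with equal priors, X | Y ~ N(m_Y, sigma^2), d = 1.
m0, m1, sigma, n = 0.0, 1.5, 1.0, 200_000
Y = rng.integers(0, 2, size=n)
X = rng.normal(np.where(Y == 0, m0, m1), sigma)

def posterior1(x, bias=0.0):
    # P{Y = 1 | X = x} for this model; `bias` produces a deliberately wrong "estimate".
    z = (m1 - m0) * (x - (m0 + m1) / 2.0) / sigma**2 + bias
    return 1.0 / (1.0 + np.exp(-z))

p1 = posterior1(X)                      # true posteriors P_1(X)
p1_hat = posterior1(X, bias=0.8)        # hypothetical approximation of P_1(X)

g_star = (p1 > 0.5).astype(int)         # Bayes (maximum a posteriori) decision
g_hat = (p1_hat > 0.5).astype(int)      # approximate decision based on the estimate

excess_risk = np.mean(g_hat != Y) - np.mean(g_star != Y)
bound = 2.0 * np.mean(np.abs(p1 - p1_hat))   # sum over y = 0, 1 of E|P_y - \hat P_y|
print(excess_risk, bound)                    # the excess risk should not exceed the bound
```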


2 Testing simple hypotheses

2.1 α-level tests

In this section we consider decision problems where the consequences of the various errors are very different. For example, in a diagnostic problem Y = 0 means that the patient is healthy, while Y = 1 means that the patient is ill. For Y = 0 the false decision is that the patient is ill, which implies some superfluous medical treatment, while for Y = 1 the false decision is that the illness is not detected, and the patient's state may become worse. A similar situation arises in radar detection.

The event Y = 0 is called the null hypothesis and is denoted by $H_0$, and the event Y = 1 is called the alternative hypothesis and is denoted by $H_1$. The decision, i.e., the test, is formulated by a set $A \subset \mathbb{R}^d$, called the acceptance region: accept $H_0$ if $X \in A$, otherwise reject $H_0$, i.e., accept $H_1$. The set $A^c$ is called the critical region.

Let $P_0$ and $P_1$ be the probability distributions of X under $H_0$ and $H_1$, respectively. There are two types of errors:

Error of the first kind: under the null hypothesis $H_0$ we reject $H_0$. This error has probability $P_0(A^c)$.

Error of the second kind: under the alternative hypothesis $H_1$ we reject $H_1$. This error has probability $P_1(A)$.

Obviously, one can decrease the error of the first kind $P_0(A^c)$ only at the price of increasing the error of the second kind $P_1(A)$. We therefore formulate the optimization problem of minimizing the error of the second kind under the condition that the error of the first kind is at most $\alpha$, $0 < \alpha < 1$:
$$\min_{A: P_0(A^c) \le \alpha} P_1(A). \qquad (3)$$
In order to solve this problem the Neyman-Pearson Lemma plays an important role.

Theorem 2 (Neyman, Pearson [45]) Assume that the distributions $P_0$ and $P_1$ have densities $f_0$ and $f_1$:
$$P_0(B) = \int_B f_0(x)\,dx \quad \text{and} \quad P_1(B) = \int_B f_1(x)\,dx.$$
For a $\gamma > 0$, put
$$A_\gamma = \{x : f_0(x) \ge \gamma f_1(x)\}.$$
If for a set $A$
$$P_0(A^c) \le P_0(A_\gamma^c),$$
then
$$P_1(A) \ge P_1(A_\gamma).$$

Proof. Because of the condition of the theorem, we have the following chain of inequalities:
$$P_0(A^c) \le P_0(A_\gamma^c),$$
$$P_0(A^c \cap A_\gamma) + P_0(A^c \cap A_\gamma^c) \le P_0(A \cap A_\gamma^c) + P_0(A^c \cap A_\gamma^c),$$
$$\int_{A^c \cap A_\gamma} f_0(x)\,dx \le \int_{A \cap A_\gamma^c} f_0(x)\,dx.$$
The definition of $A_\gamma$ implies that
$$\gamma \int_{A^c \cap A_\gamma} f_1(x)\,dx \le \int_{A^c \cap A_\gamma} f_0(x)\,dx \le \int_{A \cap A_\gamma^c} f_0(x)\,dx \le \gamma \int_{A \cap A_\gamma^c} f_1(x)\,dx,$$
therefore, using the previous chain of derivations in reverse order, we get that
$$P_1(A^c) \le P_1(A_\gamma^c),$$
which is equivalent to $P_1(A) \ge P_1(A_\gamma)$. $\Box$

If for an $0 < \alpha < 1$ there is a $\gamma = \gamma(\alpha)$ which solves the equation
$$P_0(A_\gamma^c) = \alpha,$$
then the Neyman-Pearson Lemma implies that in order to solve the problem (3) it is enough to search among sets of the form $A_\gamma$, i.e.,
$$\min_{A: P_0(A^c) \le \alpha} P_1(A) = \min_{A_\gamma: P_0(A_\gamma^c) \le \alpha} P_1(A_\gamma).$$
Then $A_\gamma$ is called the most powerful $\alpha$-level test.

Because of the Neyman-Pearson Lemma, we introduce the likelihood ratio statistic
$$T(X) = \frac{f_0(X)}{f_1(X)},$$
and so the null hypothesis $H_0$ is accepted if $T(X) \ge \gamma$.

Example 1. As an illustration of the Neyman-Pearson Lemma, consider an experiment where, under the null hypothesis, the components of X are i.i.d. normal with mean $m = m_0 > 0$ and variance $\sigma^2$, while under the alternative hypothesis the components of X are i.i.d. normal with mean $m_1 = 0$ and the same variance $\sigma^2$. Then
$$f_0(x) = f_0(x_1,\dots,x_d) = \prod_{i=1}^d \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - m)^2}{2\sigma^2}}$$
and
$$f_1(x) = f_1(x_1,\dots,x_d) = \prod_{i=1}^d \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x_i^2}{2\sigma^2}},$$
and $\frac{f_0(X)}{f_1(X)} \ge \gamma$ means that
$$-\sum_{i=1}^d \frac{(X_i - m)^2}{2\sigma^2} + \sum_{i=1}^d \frac{X_i^2}{2\sigma^2} \ge \ln\gamma,$$
or equivalently,
$$\sum_{i=1}^d \left( 2 X_i m - m^2 \right) \ge 2\sigma^2 \ln\gamma.$$
This test accepts the null hypothesis if
$$\frac{1}{d}\sum_{i=1}^d X_i \ge \frac{2\sigma^2 \ln\gamma/d + m^2}{2m} = \frac{\sigma^2 \ln\gamma}{dm} + \frac{m}{2} =: \gamma_0.$$
This test is based on the linear statistic $\sum_{i=1}^d X_i / d$, and the question left is how to choose the critical value $\gamma_0$ for which it is an $\alpha$-level test, i.e., the error of the first kind is $\alpha$:
$$P_0\left\{ \frac{1}{d}\sum_{i=1}^d X_i \le \gamma_0 \right\} = \alpha.$$
Under the null hypothesis, the distribution of $\frac{1}{d}\sum_{i=1}^d X_i$ is normal with mean $m$ and variance $\sigma^2/d$, therefore
$$P_0\left\{ \frac{1}{d}\sum_{i=1}^d X_i \le \gamma_0 \right\} = \Phi\left( \frac{\gamma_0 - m}{\sigma/\sqrt{d}} \right),$$
where $\Phi$ denotes the standard normal distribution function, and so the critical value $\gamma_0$ of an $\alpha$-level test solves the equation
$$\Phi\left( -\frac{m - \gamma_0}{\sigma/\sqrt{d}} \right) = \alpha,$$
i.e.,
$$\gamma_0 = m - \Phi^{-1}(1-\alpha)\,\sigma/\sqrt{d}.$$
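The critical value $\gamma_0 = m - \Phi^{-1}(1-\alpha)\sigma/\sqrt{d}$ of Example 1 is straightforward to compute with the standard normal quantile function. The sketch below uses made-up values of $m$, $\sigma$, $d$ and $\alpha$, and checks by simulation that the rejection probability under $H_0$ is close to $\alpha$.

```python
import numpy as np
from scipy.stats import norm

m, sigma, d, alpha = 1.0, 2.0, 25, 0.05            # made-up parameters

# Critical value of the alpha-level test from Example 1:
gamma0 = m - norm.ppf(1 - alpha) * sigma / np.sqrt(d)

# Monte Carlo check: under H_0 the sample mean is N(m, sigma^2 / d),
# so the rejection rate P_0{ sample mean < gamma0 } should be close to alpha.
rng = np.random.default_rng(1)
samples = rng.normal(m, sigma, size=(100_000, d))
rejection_rate = np.mean(samples.mean(axis=1) < gamma0)
print(gamma0, rejection_rate)
```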

Remark 1. In many situations, when d is large enough, one can refer to the central limit theorem, so that the log-likelihood ratio
$$\ln \frac{f_0(X)}{f_1(X)}$$
is asymptotically normal. The argument of Example 1 can be extended if, under $H_0$, the log-likelihood ratio is approximately normal with mean $m_0$ and variance $\sigma_0^2$. Let the test be defined such that it accepts $H_0$ if
$$\ln \frac{f_0(X)}{f_1(X)} \ge \gamma_0, \quad \text{where} \quad \gamma_0 = m_0 - \Phi^{-1}(1-\alpha)\,\sigma_0.$$
Then this test is approximately an $\alpha$-level test.

2.2 φ-divergences

In the analysis of repeated observations the divergences between distributions play an important role. Imre Csiszár [14] introduced the concept of φ-divergences. Let $\phi: (0,\infty) \to \mathbb{R}$ be a convex function, extended to $[0,\infty)$ by continuity, such that $\phi(1) = 0$. For probability distributions $\mu$ and $\nu$, let $\lambda$ be a $\sigma$-finite dominating measure of $\mu$ and $\nu$, for example $\lambda = \mu + \nu$. Introduce the notations
$$f = \frac{d\mu}{d\lambda} \quad \text{and} \quad g = \frac{d\nu}{d\lambda}.$$
Then the φ-divergence of $\mu$ and $\nu$ is defined by
$$D_\phi(\mu,\nu) = \int_{\mathbb{R}^d} \phi\!\left( \frac{f(x)}{g(x)} \right) g(x)\,\lambda(dx). \qquad (4)$$
The Jensen inequality implies the most important property of φ-divergences:
$$D_\phi(\mu,\nu) = \int_{\mathbb{R}^d} \phi\!\left( \frac{f(x)}{g(x)} \right) g(x)\,\lambda(dx) \ge \phi\!\left( \int_{\mathbb{R}^d} \frac{f(x)}{g(x)}\, g(x)\,\lambda(dx) \right) = \phi(1) = 0.$$
It means that $D_\phi(\mu,\nu) \ge 0$, and if $\mu = \nu$ then $D_\phi(\mu,\nu) = 0$. If, in addition, $\phi$ is strictly convex at 1, then $D_\phi(\mu,\nu) = 0$ iff $\mu = \nu$.

Next we show some examples.

For
$$\phi_1(t) = |t-1|,$$
we get the $L_1$ distance
$$D_{\phi_1}(\mu,\nu) = \int_{\mathbb{R}^d} |f(x) - g(x)|\,\lambda(dx).$$
For
$$\phi_2(t) = (\sqrt{t} - 1)^2,$$
we get the squared Hellinger distance
$$D_{\phi_2}(\mu,\nu) = \int_{\mathbb{R}^d} \left( \sqrt{f(x)} - \sqrt{g(x)} \right)^2 \lambda(dx) = 2\left( 1 - \int_{\mathbb{R}^d} \sqrt{f(x) g(x)}\,\lambda(dx) \right).$$
For
$$\phi_3(t) = -\ln t,$$
we get the I-divergence
$$I(\mu,\nu) = D_{\phi_3}(\mu,\nu) = \int_{\mathbb{R}^d} \ln\!\left( \frac{g(x)}{f(x)} \right) g(x)\,\lambda(dx).$$
For
$$\phi_4(t) = (t-1)^2,$$
we get the $\chi^2$-divergence
$$\chi^2(\mu,\nu) = D_{\phi_4}(\mu,\nu) = \int_{\mathbb{R}^d} \frac{(f(x) - g(x))^2}{g(x)}\,\lambda(dx).$$
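For finite (discrete) distributions, taking the counting measure as the dominating measure $\lambda$, the four φ-divergences above reduce to finite sums. The sketch below computes them for two made-up probability vectors; the I-divergence line follows the $\phi_3$ formula above, i.e., it integrates against $\nu$.

```python
import numpy as np

# Two made-up discrete distributions; lambda is the counting measure on {1, ..., 5}.
mu = np.array([0.10, 0.20, 0.30, 0.25, 0.15])   # f = d(mu)/d(lambda)
nu = np.array([0.20, 0.20, 0.20, 0.20, 0.20])   # g = d(nu)/d(lambda)

l1_distance = np.sum(np.abs(mu - nu))                        # phi_1(t) = |t - 1|
hellinger_sq = np.sum((np.sqrt(mu) - np.sqrt(nu)) ** 2)      # phi_2(t) = (sqrt(t) - 1)^2
i_divergence = np.sum(nu * np.log(nu / mu))                  # phi_3(t) = -ln t, against nu
chi2_divergence = np.sum((mu - nu) ** 2 / nu)                # phi_4(t) = (t - 1)^2

print(l1_distance, hellinger_sq, i_divergence, chi2_divergence)
```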

An equivalent definition of the φ-divergence is
$$D_\phi(\mu,\nu) = \sup_{\mathcal{P}} \sum_j \phi\!\left( \frac{\mu(A_j)}{\nu(A_j)} \right) \nu(A_j), \qquad (5)$$
where the supremum is taken over all finite Borel measurable partitions $\mathcal{P} = \{A_j\}$ of $\mathbb{R}^d$.

The main reasoning behind this equivalence is that for any partition $\mathcal{P} = \{A_j\}$, the Jensen inequality implies that
$$
\begin{aligned}
D_\phi(\mu,\nu) &= \int_{\mathbb{R}^d} \phi\!\left( \frac{f(x)}{g(x)} \right) g(x)\,\lambda(dx) \\
&= \sum_j \int_{A_j} \phi\!\left( \frac{f(x)}{g(x)} \right) g(x)\,\lambda(dx) \\
&= \sum_j \frac{1}{\nu(A_j)} \int_{A_j} \phi\!\left( \frac{f(x)}{g(x)} \right) g(x)\,\lambda(dx)\;\nu(A_j) \\
&\ge \sum_j \phi\!\left( \frac{1}{\nu(A_j)} \int_{A_j} \frac{f(x)}{g(x)}\, g(x)\,\lambda(dx) \right) \nu(A_j) \\
&= \sum_j \phi\!\left( \frac{\mu(A_j)}{\nu(A_j)} \right) \nu(A_j).
\end{aligned}
$$

The sequence of partitions $\mathcal{P}_1, \mathcal{P}_2, \dots$ is called nested if any cell $A \in \mathcal{P}_{n+1}$ is a subset of a cell $A' \in \mathcal{P}_n$. Next we show that for a nested sequence of partitions,
$$\sum_{A \in \mathcal{P}_n} \phi\!\left( \frac{\mu(A)}{\nu(A)} \right) \nu(A) \uparrow.$$
Again, this property is a consequence of the Jensen inequality:
$$
\begin{aligned}
\sum_{A' \in \mathcal{P}_{n+1}} \phi\!\left( \frac{\mu(A')}{\nu(A')} \right) \nu(A')
&= \sum_{A \in \mathcal{P}_n} \left( \sum_{A' \in \mathcal{P}_{n+1},\, A' \subset A} \phi\!\left( \frac{\mu(A')}{\nu(A')} \right) \nu(A') \right) \\
&= \sum_{A \in \mathcal{P}_n} \left( \sum_{A' \in \mathcal{P}_{n+1},\, A' \subset A} \phi\!\left( \frac{\mu(A')}{\nu(A')} \right) \frac{\nu(A')}{\nu(A)} \right) \nu(A) \\
&\ge \sum_{A \in \mathcal{P}_n} \phi\!\left( \sum_{A' \in \mathcal{P}_{n+1},\, A' \subset A} \frac{\mu(A')}{\nu(A')} \cdot \frac{\nu(A')}{\nu(A)} \right) \nu(A) \\
&= \sum_{A \in \mathcal{P}_n} \phi\!\left( \frac{\mu(A)}{\nu(A)} \right) \nu(A).
\end{aligned}
$$
It implies that there is a nested sequence of partitions $\mathcal{P}_1, \mathcal{P}_2, \dots$ such that
$$\sum_{A \in \mathcal{P}_n} \phi\!\left( \frac{\mu(A)}{\nu(A)} \right) \nu(A) \uparrow \sup_{\mathcal{P}} \sum_{A \in \mathcal{P}} \phi\!\left( \frac{\mu(A)}{\nu(A)} \right) \nu(A).$$
The sequence of partitions $\mathcal{P}_1, \mathcal{P}_2, \dots$ is called asymptotically fine if for any sphere $S$ centered at the origin,
$$\lim_{n\to\infty} \max_{A \in \mathcal{P}_n,\, A \cap S \neq \emptyset} \mathrm{diam}(A) = 0. \qquad (6)$$

One can show that if the nested sequence of partitions $\mathcal{P}_1, \mathcal{P}_2, \dots$ is asymptotically fine, then
$$\sum_{A \in \mathcal{P}_n} \phi\!\left( \frac{\mu(A)}{\nu(A)} \right) \nu(A) \uparrow \int_{\mathbb{R}^d} \phi\!\left( \frac{f(x)}{g(x)} \right) g(x)\,\lambda(dx).$$
This final step is verified in the particular case of the $L_1$ distance (cf. Section 3.3). In general, we may introduce a cell-wise constant approximation of $\frac{f(x)}{g(x)}$:
$$F_n(x) := \frac{\mu(A)}{\nu(A)} \quad \text{if } x \in A.$$
Thus,
$$\sum_{A \in \mathcal{P}_n} \phi\!\left( \frac{\mu(A)}{\nu(A)} \right) \nu(A) = \int_{\mathbb{R}^d} \phi(F_n(x))\, g(x)\,\lambda(dx),$$
and
$$F_n(x) \to \frac{f(x)}{g(x)}$$
for λ-almost all $x$ with $g(x) > 0$, such that
$$\int_{\mathbb{R}^d} \phi(F_n(x))\, g(x)\,\lambda(dx) \to \int_{\mathbb{R}^d} \phi\!\left( \frac{f(x)}{g(x)} \right) g(x)\,\lambda(dx).$$
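The monotone convergence along nested, asymptotically fine partitions can be observed numerically. The sketch below is an illustration only: the two Gaussian distributions and the dyadic partitions of $[-8,8]$ are made-up choices; it computes the partition-based value for $\phi_1(t) = |t-1|$ and compares it with the full $L_1$ distance.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Two made-up distributions on the real line.
mu, nu = norm(0.0, 1.0), norm(1.0, 1.5)

def partition_l1(inner_breaks):
    """Partition-based value for phi(t) = |t - 1|: sum over cells A of |mu(A) - nu(A)|."""
    edges = np.concatenate(([-np.inf], inner_breaks, [np.inf]))
    return np.sum(np.abs(np.diff(mu.cdf(edges)) - np.diff(nu.cdf(edges))))

# Nested, asymptotically fine partitions: dyadic grids on [-8, 8] plus the two tails.
for k in (1, 2, 4, 8, 16):
    breaks = np.arange(-8.0, 8.0 + 1e-9, 1.0 / k)
    print(k, partition_l1(breaks))                 # non-decreasing in the refinement

# The limit: the L1 distance between the two densities.
l1, _ = quad(lambda x: abs(mu.pdf(x) - nu.pdf(x)), -12.0, 12.0)
print("L1 distance:", l1)
```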

2.3 Repeated observations

The error probabilities can be decreased if, instead of a single observation vector X, we are given n vectors $X_1,\dots,X_n$ such that under $H_0$, $X_1,\dots,X_n$ are independent and identically distributed (i.i.d.) with distribution $P_0$, while under $H_1$, $X_1,\dots,X_n$ are i.i.d. with distribution $P_1$. In this case the likelihood ratio statistic is of the form
$$T(X) = \frac{f_0(X_1)\cdots f_0(X_n)}{f_1(X_1)\cdots f_1(X_n)}.$$
The Stein Lemma below says that there are tests for which both the error of the first kind $\alpha_n$ and the error of the second kind $\beta_n$ tend to 0 as $n \to \infty$. In order to formulate the Stein Lemma, we introduce the I-divergence (also called relative entropy)
$$D(f_0, f_1) = \int_{\mathbb{R}^d} f_0(x) \ln\frac{f_0(x)}{f_1(x)}\,dx \qquad (7)$$
(cf. Section 2.2). The I-divergence is always non-negative:
$$-D(f_0,f_1) = \int_{\mathbb{R}^d} f_0(x) \ln\frac{f_1(x)}{f_0(x)}\,dx \le \int_{\mathbb{R}^d} f_0(x)\left( \frac{f_1(x)}{f_0(x)} - 1 \right) dx = 0.$$

Theorem 3 (Stein [58]) For any $0 < \delta < D(f_0,f_1)$, there is a test such that the error of the first kind
$$\alpha_n \to 0,$$
and the error of the second kind
$$\beta_n \le e^{-n(D(f_0,f_1) - \delta)} \to 0.$$

Proof. Construct a test such that it accepts the null hypothesis $H_0$ if
$$\frac{f_0(X_1)\cdots f_0(X_n)}{f_1(X_1)\cdots f_1(X_n)} \ge e^{n(D(f_0,f_1)-\delta)},$$
or equivalently,
$$\frac{1}{n}\sum_{i=1}^n \ln\frac{f_0(X_i)}{f_1(X_i)} \ge D(f_0,f_1) - \delta.$$
Under $H_0$, the strong law of large numbers implies that
$$\frac{1}{n}\sum_{i=1}^n \ln\frac{f_0(X_i)}{f_1(X_i)} \to D(f_0,f_1)$$
almost surely (a.s.), therefore for the error of the first kind $\alpha_n$ we get that
$$\alpha_n = P_0\left\{ \frac{1}{n}\sum_{i=1}^n \ln\frac{f_0(X_i)}{f_1(X_i)} < D(f_0,f_1) - \delta \right\} \to 0.$$
Concerning the error of the second kind $\beta_n$, we have the following simple bound:
$$
\begin{aligned}
\beta_n &= P_1\left\{ \frac{f_0(X_1)\cdots f_0(X_n)}{f_1(X_1)\cdots f_1(X_n)} \ge e^{n(D(f_0,f_1)-\delta)} \right\} \\
&= \int_{\left\{ \frac{f_0(x_1)\cdots f_0(x_n)}{f_1(x_1)\cdots f_1(x_n)} \ge e^{n(D(f_0,f_1)-\delta)} \right\}} f_1(x_1)\cdots f_1(x_n)\,dx_1 \cdots dx_n \\
&\le e^{-n(D(f_0,f_1)-\delta)} \int_{\left\{ \frac{f_0(x_1)\cdots f_0(x_n)}{f_1(x_1)\cdots f_1(x_n)} \ge e^{n(D(f_0,f_1)-\delta)} \right\}} f_0(x_1)\cdots f_0(x_n)\,dx_1 \cdots dx_n \\
&\le e^{-n(D(f_0,f_1)-\delta)}. \qquad \Box
\end{aligned}
$$
The critical value of the test in the proof of the Stein Lemma used the I-divergence $D(f_0,f_1)$. Without knowing $D(f_0,f_1)$, the Chernoff Lemma below results in an exponential rate of convergence of the errors.
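The test constructed in the proof is easy to simulate. In the sketch below the two simple hypotheses are made-up unit-variance Gaussians, for which $D(f_0,f_1) = (m_0-m_1)^2/(2\sigma^2)$; the estimated error of the second kind obeys the bound $e^{-n(D(f_0,f_1)-\delta)}$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up simple hypotheses: f_0 = N(1, 1), f_1 = N(0, 1), so D(f_0, f_1) = 0.5.
m0, m1, sigma, D = 1.0, 0.0, 1.0, 0.5
delta, n, trials = 0.2, 200, 5000

def mean_log_lr(x):
    # (1/n) sum_i ln f_0(X_i) / f_1(X_i) for Gaussians with common variance
    return np.mean(((x - m1) ** 2 - (x - m0) ** 2) / (2.0 * sigma**2))

threshold = D - delta                                  # accept H_0 iff mean_log_lr >= threshold
accept_H0_under_H0 = [mean_log_lr(rng.normal(m0, sigma, n)) >= threshold for _ in range(trials)]
accept_H0_under_H1 = [mean_log_lr(rng.normal(m1, sigma, n)) >= threshold for _ in range(trials)]

alpha_n = 1.0 - np.mean(accept_H0_under_H0)            # error of the first kind
beta_n = np.mean(accept_H0_under_H1)                   # error of the second kind
print(alpha_n, beta_n, np.exp(-n * (D - delta)))       # beta_n is bounded by the last value
```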


Theorem 4 (Chernoff [12]) Construct a test such that it accepts the null hypothesis $H_0$ if
$$\frac{f_0(X_1)\cdots f_0(X_n)}{f_1(X_1)\cdots f_1(X_n)} \ge 1,$$
or equivalently,
$$\sum_{i=1}^n \ln\frac{f_0(X_i)}{f_1(X_i)} \ge 0.$$
(This test is called the maximum likelihood test.) Then
$$\alpha_n \le \left( \inf_{s>0} \int_{\mathbb{R}^d} f_1(x)^s f_0(x)^{1-s}\,dx \right)^n$$
and
$$\beta_n \le \left( \inf_{s>0} \int_{\mathbb{R}^d} f_0(x)^s f_1(x)^{1-s}\,dx \right)^n.$$

Proof. Apply the Chernoff bounding technique: for any $s > 0$ the Markov inequality implies that
$$
\begin{aligned}
\alpha_n &= P_0\left\{ \sum_{i=1}^n \ln\frac{f_0(X_i)}{f_1(X_i)} < 0 \right\} \\
&= P_0\left\{ s \sum_{i=1}^n \ln\frac{f_1(X_i)}{f_0(X_i)} > 0 \right\} \\
&= P_0\left\{ e^{\,s \sum_{i=1}^n \ln\frac{f_1(X_i)}{f_0(X_i)}} > 1 \right\} \\
&\le E_0\left\{ e^{\,s \sum_{i=1}^n \ln\frac{f_1(X_i)}{f_0(X_i)}} \right\} \\
&= E_0\left\{ \prod_{i=1}^n \left( \frac{f_1(X_i)}{f_0(X_i)} \right)^s \right\}.
\end{aligned}
$$
Under $H_0$, $X_1,\dots,X_n$ are i.i.d., therefore
$$
\alpha_n \le E_0\left\{ \prod_{i=1}^n \left( \frac{f_1(X_i)}{f_0(X_i)} \right)^s \right\}
= \prod_{i=1}^n E_0\left\{ \left( \frac{f_1(X_i)}{f_0(X_i)} \right)^s \right\}
= E_0\left\{ \left( \frac{f_1(X_1)}{f_0(X_1)} \right)^s \right\}^n
= \left( \int_{\mathbb{R}^d} \left( \frac{f_1(x_1)}{f_0(x_1)} \right)^s f_0(x_1)\,dx_1 \right)^n.
$$
Since $s > 0$ is arbitrary, the first half of the lemma is proved, and the proof of the second half is similar. $\Box$

Remark 2. The Chernoff Lemma results in an exponential rate of convergence if
$$\inf_{s>0} \int_{\mathbb{R}^d} f_1(x)^s f_0(x)^{1-s}\,dx < 1 \quad \text{and} \quad \inf_{s>0} \int_{\mathbb{R}^d} f_0(x)^s f_1(x)^{1-s}\,dx < 1.$$
The Cauchy-Schwarz inequality implies that
$$\inf_{s>0} \int_{\mathbb{R}^d} f_1(x)^s f_0(x)^{1-s}\,dx \le \int_{\mathbb{R}^d} f_1(x)^{1/2} f_0(x)^{1/2}\,dx \le \sqrt{ \int_{\mathbb{R}^d} f_1(x)\,dx \int_{\mathbb{R}^d} f_0(x)\,dx } = 1,$$
with equality in the second inequality if and only if $f_0 = f_1$. Moreover, one can check that the function
$$g(s) := \int_{\mathbb{R}^d} f_1(x)^s f_0(x)^{1-s}\,dx$$
is convex with $g(0) = 1$ and $g(1) = 1$, therefore
$$\inf_{s>0} \int_{\mathbb{R}^d} f_1(x)^s f_0(x)^{1-s}\,dx = \inf_{0<s<1} \int_{\mathbb{R}^d} f_1(x)^s f_0(x)^{1-s}\,dx.$$
The quantity
$$He(f_0,f_1) = \int_{\mathbb{R}^d} f_1(x)^{1/2} f_0(x)^{1/2}\,dx \qquad (8)$$
is called the Hellinger integral. The previous derivations imply that
$$\alpha_n \le He(f_0,f_1)^n \quad \text{and} \quad \beta_n \le He(f_0,f_1)^n.$$
The squared Hellinger distance $D_{\phi_2}(\mu,\nu)$ was introduced in Section 2.2. One can check that
$$D_{\phi_2}(\mu,\nu) = 2\left( 1 - He(f_0,f_1) \right).$$
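The Chernoff exponent $\inf_{0<s<1} \int f_1^s f_0^{1-s}$ and the Hellinger integral can be computed numerically. The sketch below uses two made-up equal-variance Gaussian densities, for which the Hellinger integral also has the closed form $e^{-(m_0-m_1)^2/(8\sigma^2)}$ as a sanity check; for this symmetric case the infimum is attained at $s = 1/2$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Made-up simple hypotheses: f_0 = N(0, 1), f_1 = N(2, 1).
f0 = norm(0.0, 1.0).pdf
f1 = norm(2.0, 1.0).pdf

def g(s):
    # g(s) = integral of f_1(x)^s f_0(x)^(1-s) dx, computed numerically
    value, _ = quad(lambda x: f1(x) ** s * f0(x) ** (1 - s), -15.0, 15.0)
    return value

result = minimize_scalar(g, bounds=(1e-6, 1 - 1e-6), method="bounded")   # inf over 0 < s < 1
hellinger_integral = g(0.5)                                              # He(f_0, f_1)

print("Chernoff exponent base:", result.fun)
print("Hellinger integral:", hellinger_integral)
print("closed form:", np.exp(-(0.0 - 2.0) ** 2 / 8.0))   # exp(-(m0 - m1)^2 / (8 sigma^2))
```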

Remark 3. Besides the concept of α-level consistency, there is another kind of consistency, called strong consistency, meaning that both under $H_0$ and under its complement the test makes a.s. no error after a random sample size. In other words, denoting by $P_0$ (resp. $P_1$) the probability under the null hypothesis (resp. under the alternative), we have
$$P_0\{\text{rejecting } H_0 \text{ for only finitely many } n\} = 1 \qquad (9)$$
and
$$P_1\{\text{accepting } H_0 \text{ for only finitely many } n\} = 1. \qquad (10)$$
Because of the Chernoff bound, both errors tend to 0 exponentially fast, so the Borel-Cantelli Lemma implies that the maximum likelihood test is strongly consistent. In a real-life problem, for example when the data arrive sequentially, one gets the data just once and should make a good inference from these data. Strong consistency means that the single sequence of inferences is a.s. perfect if the sample size is large enough. This concept is close to the definition of discernibility introduced by Dembo and Peres [18]. For a discussion and references, we refer the reader to Devroye and Lugosi [23].


3 Testing simple versus composite hypotheses

3.1 Total variation and I-divergence

If $\mu$ and $\nu$ are probability distributions on $\mathbb{R}^d$ ($d \ge 1$), then the total variation distance between $\mu$ and $\nu$ is defined by
$$V(\mu,\nu) = \sup_A |\mu(A) - \nu(A)|,$$
where the supremum is taken over all Borel sets $A$. The Scheffé Theorem below shows that the total variation is half of the $L_1$ distance of the corresponding densities.

Theorem 5 (Scheffé [55]) If $\mu$ and $\nu$ are absolutely continuous with densities $f$ and $g$, respectively, then
$$\int_{\mathbb{R}^d} |f(x) - g(x)|\,dx = 2 V(\mu,\nu).$$
(The quantity
$$L_1(f,g) = \int_{\mathbb{R}^d} |f(x) - g(x)|\,dx \qquad (11)$$
is called the $L_1$ distance, cf. Section 2.2.)

Proof. Note that
$$
V(\mu,\nu) = \sup_A |\mu(A) - \nu(A)|
= \sup_A \left| \int_A f - \int_A g \right|
= \sup_A \left| \int_A (f - g) \right|
= \int_{\{f > g\}} (f - g)
= \int_{\{g > f\}} (g - f)
= \frac{1}{2} \int |f - g|. \qquad \Box
$$

The Scheffé Theorem implies an equivalent definition of the total variation:
$$V(\mu,\nu) = \frac{1}{2} \sup_{\{A_j\}} \sum_j |\mu(A_j) - \nu(A_j)|, \qquad (12)$$
where the supremum is taken over all finite Borel measurable partitions $\{A_j\}$.

The information divergence (also called I-divergence, Kullback-Leibler number, relative entropy) of $\mu$ and $\nu$ is defined by
$$I(\mu,\nu) = \sup_{\{A_j\}} \sum_j \mu(A_j) \ln\frac{\mu(A_j)}{\nu(A_j)}, \qquad (13)$$
where the supremum is taken over all finite Borel measurable partitions $\{A_j\}$. If the densities $f$ and $g$ exist, then one can prove that
$$I(\mu,\nu) = D(f,g) = \int_{\mathbb{R}^d} f(x) \ln\frac{f(x)}{g(x)}\,dx.$$

The following inequality, called Pinsker's inequality, gives an upper bound on the total variation in terms of the I-divergence:

Theorem 6 (Csiszár [14], Kullback [39] and Kemperman [38])
$$2\{V(\mu,\nu)\}^2 \le I(\mu,\nu). \qquad (14)$$

Proof. Applying the notations of the proof of the Scheffé Theorem, put
$$A = \{f > g\},$$
then the Scheffé Theorem implies that
$$V(\mu,\nu) = \mu(A) - \nu(A).$$
Moreover, from (13) we get that
$$I(\mu,\nu) \ge \mu(A) \ln\frac{\mu(A)}{\nu(A)} + (1 - \mu(A)) \ln\frac{1 - \mu(A)}{1 - \nu(A)}.$$
Introduce the notations
$$q = \nu(A) \quad \text{and} \quad p = \mu(A) > q,$$
and
$$h_p(q) = p \ln\frac{p}{q} + (1-p) \ln\frac{1-p}{1-q}.$$
Then we have to prove that
$$2(p-q)^2 \le h_p(q),$$
which follows from the facts that $h_p(p) - 2(p-p)^2 = 0$ and that, for $0 < q \le p$,
$$\frac{d}{dq}\left( h_p(q) - 2(p-q)^2 \right) = -\frac{p}{q} + \frac{1-p}{1-q} + 4(p-q) = -\frac{p-q}{q(1-q)} + 4(p-q) \le 0,$$
since $q(1-q) \le 1/4$. $\Box$
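Pinsker's inequality is easy to check numerically on discrete distributions, with $V$ computed through Scheffé's identity as half the $L_1$ distance. The sketch below uses randomly generated distributions on a made-up six-point alphabet.

```python
import numpy as np

rng = np.random.default_rng(3)

def total_variation(p, q):
    # Scheffe's identity: V(mu, nu) = (1/2) sum_j |p_j - q_j|
    return 0.5 * np.sum(np.abs(p - q))

def i_divergence(p, q):
    # I(mu, nu) = sum_j p_j ln(p_j / q_j)
    return np.sum(p * np.log(p / q))

for _ in range(5):
    p = rng.dirichlet(np.ones(6))        # random distribution on a 6-point alphabet
    q = rng.dirichlet(np.ones(6))
    V, I = total_variation(p, q), i_divergence(p, q)
    print(2 * V**2 <= I, round(2 * V**2, 4), round(I, 4))   # Pinsker: 2 V^2 <= I
```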

3.2 Large deviation of L1 distance

Consider a sample of i.i.d. $\mathbb{R}^d$-valued random vectors $X_1,\dots,X_n$ whose common distribution is denoted by $\nu$. For a fixed distribution $\mu$, we consider the problem of testing the hypotheses
$$H_0: \nu = \mu \quad \text{versus} \quad H_1: \nu \neq \mu$$
by means of test statistics $T_n = T_n(X_1,\dots,X_n)$.

For testing the simple hypothesis $H_0$ that the distribution of the sample is $\mu$, versus a composite alternative, Györfi and van der Meulen [31] introduced a related goodness-of-fit test statistic $L_n$ defined as
$$L_n = \sum_{j=1}^{m_n} |\mu_n(A_{n,j}) - \mu(A_{n,j})|,$$
where $\mu_n$ denotes the empirical measure associated with the sample $X_1,\dots,X_n$, so that
$$\mu_n(A) = \frac{\#\{i : X_i \in A,\ i = 1,\dots,n\}}{n}$$
for any Borel subset $A$, and $\mathcal{P}_n = \{A_{n,1},\dots,A_{n,m_n}\}$ is a finite partition of $\mathbb{R}^d$. These authors also showed that under $H_0$,
$$P(L_n \ge \varepsilon) \le e^{-n(\varepsilon^2/8 + o(1))}.$$
Next we characterize the large deviation properties of $L_n$:

Theorem 7 (Beirlant, Devroye, Györfi and Vajda [6]) Assume that
$$\lim_{n\to\infty} \max_j \mu(A_{n,j}) = 0 \qquad (15)$$
and
$$\lim_{n\to\infty} \frac{m_n \ln n}{n} = 0. \qquad (16)$$
Then for all $0 < \varepsilon < 2$,
$$\lim_{n\to\infty} \frac{1}{n} \ln P\{L_n > \varepsilon\} = -g_L(\varepsilon), \qquad (17)$$
where
$$g_L(\varepsilon) = \inf_{0 < p < 1 - \varepsilon/2} \left( p \ln\frac{p}{p + \varepsilon/2} + (1-p) \ln\frac{1-p}{1-p-\varepsilon/2} \right). \qquad (18)$$

Remark 4. Note that a lower bound for $g_L$ follows from Pinsker's inequality (14):
$$g_L(\varepsilon) \ge \varepsilon^2/2.$$
The best known lower bound is due to Toussaint [61]:
$$g_L(\varepsilon) \ge \varepsilon^2/2 + \varepsilon^4/36 + \varepsilon^6/280.$$
An upper bound $\hat g(\varepsilon)$ of $g_L(\varepsilon)$ can be obtained by substituting $p = \frac{1-\varepsilon/2}{2}$ in the definition of $g_L(\varepsilon)$. Then
$$\hat g(\varepsilon) = \frac{\varepsilon}{2} \ln\frac{2+\varepsilon}{2-\varepsilon} \ge g_L(\varepsilon)$$
(Vajda [66]). Further bounds can be found on pp. 294-295 in Vajda [65]. Remark also that in Lemma 5.1 of Bahadur [2] it was observed that
$$g_L(\varepsilon) = \frac{\varepsilon^2}{2}\left( 1 + o(1) \right) \quad \text{as } \varepsilon \to 0.$$
The observations above mean that
$$P\{L_n > \varepsilon\} \approx e^{-n g_L(\varepsilon)} \le e^{-n\varepsilon^2/2}.$$
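The statistic $L_n$ is simple to compute once a partition is fixed. The sketch below (all choices are made up for illustration: the null distribution $N(0,1)$, the partition into $m_n$ intervals, the shifted alternative and the sample size) evaluates $L_n$ for a sample drawn from $\mu$ and for a sample drawn from an alternative; under $H_0$ it is small, of the order $\sqrt{m_n/n}$, while under the alternative it approaches $\sum_j |\nu(A_{n,j}) - \mu(A_{n,j})| > 0$.

```python
import numpy as np
from scipy.stats import norm

def L_n(sample, inner_breaks, mu_cdf):
    """L_n = sum_j |mu_n(A_{n,j}) - mu(A_{n,j})| for the partition defined by inner_breaks."""
    edges = np.concatenate(([-np.inf], inner_breaks, [np.inf]))
    mu_cells = np.diff(mu_cdf(edges))                              # mu(A_{n,j})
    cell_index = np.searchsorted(inner_breaks, sample)             # cell of each observation
    emp_cells = np.bincount(cell_index, minlength=len(inner_breaks) + 1) / len(sample)
    return np.sum(np.abs(emp_cells - mu_cells))

rng = np.random.default_rng(4)
n, m_n = 5000, 20
inner_breaks = np.linspace(-3.0, 3.0, m_n - 1)    # m_n cells, the two outer ones are half-lines

x_null = rng.normal(0.0, 1.0, n)                  # sample from mu = N(0, 1)   (H_0 true)
x_alt = rng.normal(0.4, 1.0, n)                   # sample from an alternative nu = N(0.4, 1)

print("L_n under H_0:", L_n(x_null, inner_breaks, norm.cdf))   # small, about sqrt(m_n / n)
print("L_n under H_1:", L_n(x_alt, inner_breaks, norm.cdf))    # near the population partition L1
```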

In the proof of Theorem 7 we shall use the following lemma.


Lemma 2 (Sanov [54]; see p. 16 in Dembo, Zeitouni [19], or Problem 1.2.11 in Csiszár and Körner [15]) Let $\Sigma$ be a finite set (alphabet), let $\mathcal{L}_n$ be the set of types (possible empirical distributions) on $\Sigma$, and let $\Gamma$ be a set of distributions on $\Sigma$. If $Z_1,\dots,Z_n$ are i.i.d. random variables taking values in $\Sigma$ with distribution $\mu$, and $\mu_n$ denotes the empirical distribution, then
$$\left| \frac{1}{n} \ln P\{\mu_n \in \Gamma\} + \inf_{\tau \in \Gamma \cap \mathcal{L}_n} I(\tau,\mu) \right| \le \frac{|\Sigma| \ln(n+1)}{n}, \qquad (19)$$
where $|\Sigma|$ denotes the cardinality of $\Sigma$.

Proof. Without loss of generality assume that $\Sigma = \{1,\dots,m\}$. We shall prove that
$$P\{\mu_n \in \Gamma\} \le |\mathcal{L}_n|\, e^{-n \min_{\tau \in \Gamma} I(\tau,\mu)}$$
and
$$P\{\mu_n \in \Gamma\} \ge \frac{1}{|\mathcal{L}_n|}\, e^{-n \min_{\tau \in \Gamma} I(\tau,\mu)}.$$
Because of our assumptions,
$$
\begin{aligned}
P\{Z_1 = z_1,\dots,Z_n = z_n\} &= \prod_{i=1}^n P\{Z_i = z_i\} = \prod_{i=1}^n \mu(z_i) = e^{\sum_{i=1}^n \ln\mu(z_i)} \\
&= e^{\sum_{i=1}^n \sum_{j=1}^m I_{\{z_i = j\}} \ln\mu(j)} = e^{\sum_{j=1}^m n\mu_n(j) \ln\mu(j)} \\
&= e^{-n(H(\mu_n) + I(\mu_n,\mu))} =: P_\mu(z_1^n),
\end{aligned}
$$
where $H(\mu_n)$ stands for the Shannon entropy of the distribution $\mu_n$. For any probability distribution $\tau \in \mathcal{L}_n$ we can define a probability distribution $P_\tau(z_1^n)$ in the same way:
$$P_\tau(z_1^n) := e^{-n(H(\mu_n) + I(\mu_n,\tau))}.$$
Put
$$T_n(\tau) = \{z_1^n : \mu_n(z_1^n) = \tau\},$$
then
$$1 \ge P_\tau\{\mu_n = \tau\} = P_\tau\{z_1^n \in T_n(\tau)\} = |T_n(\tau)|\, e^{-nH(\tau)},$$
therefore
$$|T_n(\tau)| \le e^{nH(\tau)},$$
which implies the upper bound:
$$
\begin{aligned}
P\{\mu_n \in \Gamma\} &= \sum_{\tau \in \Gamma} P_\mu\{\mu_n = \tau\} \\
&\le |\mathcal{L}_n| \max_{\tau \in \Gamma} P_\mu\{\mu_n = \tau\} \\
&= |\mathcal{L}_n| \max_{\tau \in \Gamma} |T_n(\tau)|\, e^{-n(H(\tau) + I(\tau,\mu))} \\
&\le |\mathcal{L}_n| \max_{\tau \in \Gamma} e^{-n I(\tau,\mu)} \\
&= |\mathcal{L}_n|\, e^{-n \min_{\tau \in \Gamma} I(\tau,\mu)}.
\end{aligned}
$$
Concerning the lower bound, notice that for any probability distribution $\nu \in \mathcal{L}_n$,
$$
\frac{P_\tau\{\mu_n = \tau\}}{P_\tau\{\mu_n = \nu\}}
= \frac{|T_n(\tau)| \prod_{a\in\Sigma} \tau(a)^{n\tau(a)}}{|T_n(\nu)| \prod_{a\in\Sigma} \tau(a)^{n\nu(a)}}
= \prod_{a\in\Sigma} \frac{(n\nu(a))!}{(n\tau(a))!}\, \tau(a)^{n(\tau(a) - \nu(a))} \ge 1.
$$
This last inequality can be seen as follows: the terms of the last product are of the form $\frac{m!}{l!} \left( \frac{l}{n} \right)^{l-m}$. It is easy to check that $\frac{m!}{l!} \ge l^{m-l}$, therefore
$$\prod_{a\in\Sigma} \frac{(n\nu(a))!}{(n\tau(a))!}\, \tau(a)^{n(\tau(a) - \nu(a))} \ge \prod_{a\in\Sigma} n^{n(\nu(a) - \tau(a))} = n^{n\left( \sum_{a\in\Sigma} \nu(a) - \sum_{a\in\Sigma} \tau(a) \right)} = 1.$$
It implies that
$$P_\tau\{\mu_n = \tau\} \ge P_\tau\{\mu_n = \nu\},$$
and thus
$$1 = \sum_\nu P_\tau\{\mu_n = \nu\} \le |\mathcal{L}_n|\, P_\tau\{\mu_n = \tau\} = |\mathcal{L}_n|\, |T_n(\tau)|\, e^{-nH(\tau)},$$
