

Individual Convergence Rates in Empirical Vector Quantizer Design

András Antos, László Györfi, Fellow, IEEE, and András György, Member, IEEE

Abstract—We consider the rate of convergence of the expected distortion redundancy of empirically optimal vector quantizers. Earlier results show that the mean-squared distortion of an empirically optimal quantizer designed from independent and identically distributed (i.i.d.) source samples converges uniformly to the optimum at a rate of $O(1/\sqrt{n})$, and that this rate is sharp in the minimax sense. We prove that for any fixed distribution supported on a given finite set the convergence rate is $O(1/n)$ (faster than the minimax lower bound), where the corresponding constant depends on the source distribution. For more general source distributions we provide conditions implying a slightly worse $O(\log n/n)$ rate of convergence. Although these conditions are, in general, hard to verify, we show that sources with continuous densities satisfying certain regularity properties (similar to the ones of Pollard that were used to prove a central limit theorem for the code points of the empirically optimal quantizers) are included in the scope of this result. In particular, scalar distributions with strictly log-concave densities with bounded support (such as the truncated Gaussian distribution) satisfy these conditions.

Index Terms—Convergence rates, fixed-rate quantization, empirical design, individual convergence rate, log-concave densities.

Manuscript received October 2, 2003; revised July 28, 2005. This work was supported in part by the NATO Science Fellowship, a research grant from the Research Group for Informatics and Electronics of the Hungarian Academy of Sciences, and the NKFP-2/0017/2002 project Data Riddle. The material in this correspondence was presented in part at the IEEE International Symposium on Information Theory, Chicago, IL, June/July 2004. Part of this work was performed while A. Antos and A. György were also with the Department of Mathematics and Statistics, Queen's University, Kingston, ON K7L 3N6, Canada.

A. Antos and A. György are with the Informatics Laboratory, Computer and Automation Research Institute of the Hungarian Academy of Sciences, 1111 Budapest, Hungary (e-mail: antos@szit.bme.hu; gya@szit.bme.hu).

L. Györfi is with the Department of Computer Science and Information Theory, Budapest University of Technology and Economics, 1117 Budapest, Hungary (e-mail: gyorfi@szit.bme.hu).

Communicated by S. A. Savari, Associate Editor for Source Coding.

Digital Object Identifier 10.1109/TIT.2005.856976


I. INTRODUCTION

The problem of empirical vector quantizer design is an important issue in data compression, since in many practical situations good source models are not available, but it is possible to collect source samples, called the training data, to gain information about the source statistics. Then the goal is to design a quantizer of a given rate, based on this data, whose average distortion on the source is as close to the distortion of the optimal quantizer (that is, one with minimum distortion) of the same rate as possible.

The usual, quite intuitive approach to this problem is empirical error minimization, which is based on the concept that if the training data describes the source statistics accurately, then a quantizer that performs well on the training samples should also have a good performance on the real source. Most existing design algorithms employ this principle, and search for an empirically optimal quantizer, i.e., a quantizer minimizing the empirical error on the training data, expecting that it will have near-optimal performance when applied to the real source. (The reader is referred to Gersho and Gray [8] for a good summary of such algorithms.) Indeed, under general conditions on the source distribution, Pollard [20], [21] showed that this method is consistent when the training data consists of $n$ consecutive elements of a stationary and ergodic sequence drawn according to the source distribution: he proved that the mean-squared error (MSE) distortion $D(Q_n^*)$ of the empirically optimal quantizer $Q_n^*$ (when applied to the real source) converges with probability one to the minimum MSE $D^*$ achieved by an optimal quantizer.

Obviously, the above consistency result does not provide any information on how many training samples are needed to ensure that the distortion of the empirically optimal quantizer is close to the optimum.

This question can be answered by analyzing the rate of convergence in $D(Q_n^*) \to D^*$, that is, by giving finite sample upper bounds for the distortion redundancy $D(Q_n^*) - D^*$. Linder et al. [16] showed that the expected distortion redundancy (with respect to the training data) can be bounded as $\mathbf{E} D(Q_n^*) - D^* \le c/\sqrt{n}$ for some appropriate constant $c$ for all source distributions over a given bounded region. More precisely, in [16] only an $O(\sqrt{\log n/n})$ rate was shown, supported with a discussion on how to improve the convergence rate to $O(1/\sqrt{n})$, but in the latter case the resulting constant was impractically large. A practically applicable constant can be obtained by combining the results of [16] with recent results of Linder [14]. (See also [15] for a summary.) This result has been extended in many ways. An extension to vector quantizers designed for noisy channels or for "noisy" sources was given by Linder et al. [17], an extension to unbounded sources was provided by Merhav and Ziv [19], while the case of dependent (mixing) training data was examined by Zeevi [24].

Bartlett et al. [3] showed that the $O(1/\sqrt{n})$ bound on the expected distortion redundancy is tight in the minimax sense. They proved that (for at least three quantization levels) for any empirical quantizer design method, that is, when the resulting quantizer $Q_n$ is an arbitrary function of the training data, and for any $n$ large enough, there is a distribution in the class of distributions over a bounded region such that $\mathbf{E} D(Q_n) - D^* > c/\sqrt{n}$. These "bad" distributions are quite simple; e.g., the distributions used in the proof are concentrated on finitely many atoms. However, the minimax lower bound gives information about the maximum distortion within the class, but not about the behavior of the distortion for a single fixed source distribution as the sample size $n$ increases. Moreover, the chosen "bad" distributions in the proof of the above result are different for all $n$, allowing the possibility that the upper bound may be improved in an individual sense, that is, that a faster rate of convergence may be achievable, where the constant in the bound also depends on the (fixed) source distribution. Finding the best such individual rate (or weak rate) for the class of sources over a bounded region was labeled in [3] as an interesting and challenging problem.

There are some results suggesting that the convergence rate can be improved to $O(1/n)$: in the special case of a one-level quantizer, the code point of the empirically (MSE) optimal quantizer is simply the average of the training samples, and it is easy to see that in this case $\mathbf{E} D(Q_n^*) - D^* = c/n$, where $c$ is the variance of the source. Also, based on another result of Pollard [22] showing that for sources with continuous densities satisfying certain regularity properties the suitably scaled difference of the code points of the optimal and the empirically optimal quantizers has asymptotically a multidimensional normal distribution, Chou [4] pointed out that for such sources the distortion redundancy decreases as $O(1/n)$ in probability.

In this correspondence, we provide improved upper bounds on the convergence rate of the expected distortion redundancy individually for source distributions within the class of distributions over a bounded region. In Theorem 1, we show that $\mathbf{E} D(Q_n^*) - D^* \le c(\mu, N)/n$ for all source distributions $\mu$ concentrated on a finite set, where the constant $c(\mu, N)$ depends on the actual source distribution and the number of quantization levels $N$. The convergence rate for general source distributions is considered in Theorem 2. It is shown that for source distributions over a bounded region satisfying a certain regularity condition, the expected distortion redundancy can be upper-bounded by $c(\mu, N)\log n/n$, where the actual value of the constant again depends on the actual source distribution. In Corollary 1, we prove that source distributions with bounded support satisfying essentially the same conditions as in [22] satisfy the requirements of Theorem 2, and in Corollary 2 we show that the conditions of Corollary 1 hold for scalar sources having strictly log-concave densities with bounded support (such as the truncated Gaussian distribution), and for the uniform distribution, implying an $O(\log n/n)$ expected distortion redundancy. To give more insight into the problem we also illustrate that similar conditions of Hartigan [10] (valid only in one dimension) also imply the condition of Theorem 2.

Comparing our results with [3], it follows that the problem of empirical quantizer design is an interesting example of the unusual situation in which the orders of the minimax lower bound and the individual upper bound are different.

II. EMPIRICAL VECTOR QUANTIZER DESIGN

A $d$-dimensional $N$-level vector quantizer is a measurable mapping $Q: \mathbb{R}^d \to C$, where the codebook $C = \{y_1, \ldots, y_N\} \subset \mathbb{R}^d$ is a collection of $N$ distinct $d$-vectors, called the code points. The quantizer is completely characterized by its codebook and the sets

$S_i = \{x \in \mathbb{R}^d : Q(x) = y_i\}, \quad i = 1, \ldots, N$

called the cells or partition cells (as they form a partition of $\mathbb{R}^d$) via the rule

$Q(x) = y_i, \quad \text{if } x \in S_i.$

The set $\{S_1, \ldots, S_N\}$ is called the partition of $Q$. Throughout this correspondence, unless explicitly stated otherwise, all quantizers are assumed to be $d$-dimensional with $N$ code points.

The source to be quantized is a random vector $X \in \mathbb{R}^d$ with distribution $\mu$. We assume $\mathbf{E}\{\|X\|^2\} < \infty$, where $\|\cdot\|$ denotes the Euclidean norm. The performance of the quantizer $Q$ in quantizing $X$ is measured by the average (mean-squared) distortion

$D(Q) = \mathbf{E}\{\|X - Q(X)\|^2\}.$


A quantizer $Q^*$ achieving the minimum distortion $D^*$ is called optimal. Thus, in this case

$D^* = D(Q^*) \le D(Q), \quad \text{for all } Q \in \mathcal{Q}_N$

where $\mathcal{Q}_N$ denotes the set of all $N$-level quantizers. It is well known (see, e.g., [13], [8]) that any optimal quantizer satisfies the centroid and nearest neighbor conditions, also known as the Lloyd–Max conditions.

The quantizer $Q$ satisfies the centroid condition if each code point is chosen to minimize the distortion over its associated cell, that is,

$\mathbf{E}\{\|X - y_i\|^2 \mid X \in S_i\} = \min_y \mathbf{E}\{\|X - y\|^2 \mid X \in S_i\}$  (1)

and so

$y_i = \mathbf{E}\{X \mid X \in S_i\}$

for all $i = 1, \ldots, N$. A partition $\{S_1, \ldots, S_N\}$ is optimal if the quantizer $Q$ with cells $S_1, \ldots, S_N$ satisfying the centroid condition is optimal. $Q$ is called a nearest neighbor quantizer if it satisfies

$\|x - Q(x)\| = \min_i \|x - y_i\|, \quad \text{for all } x \in \mathbb{R}^d.$  (2)

Note that

i) a nearest neighbor quantizer is determined by its codebook $\{y_1, \ldots, y_N\}$ with ties arbitrarily broken;

ii) for any non-nearest-neighbor quantizer $Q'$, a nearest neighbor quantizer $Q$ with the same codebook has at most the same distortion as $Q'$, that is, $D(Q) \le D(Q')$, regardless of the distribution of $X$.

Thus, any optimal quantizer can be assumed to be a nearest neighbor quantizer, and so finding an optimal quantizer is equivalent to finding its codebook. Using this observation, Pollard [21] proved that if $\mathbf{E}\{\|X\|^2\} < \infty$, then there exists an optimal quantizer (which may not be unique).

In many situations, the distribution $\mu$ is unknown, and the only available information about it is given in the form of training data, that is, a sequence $X_1^n = X_1, \ldots, X_n$ of $n$ independent and identically distributed (i.i.d.) copies of $X$. The sequence $X_1^n$ is also assumed to be independent of $X$. $X_1^n$ is used to construct an empirically designed quantizer $Q_n(\cdot) = Q_n(\cdot; X_1, \ldots, X_n)$, which is a random function depending on the training data. The goal is to produce such quantizers with performance near $D^*$. The performance of $Q_n$ in quantizing $X$ is measured by the test distortion

$D(Q_n) = \mathbf{E}\{\|X - Q_n(X)\|^2 \mid X_1^n\} = \int \|x - Q_n(x)\|^2 \mu(dx).$

Note that $D(Q_n)$ is a random variable.

The empirical distortion (or training distortion) of any $Q$ is given by its MSE in quantizing the training data:

$D_n(Q) = \frac{1}{n}\sum_{i=1}^n \|X_i - Q(X_i)\|^2.$

Note that although $Q$ is a deterministic mapping, the empirical distortion $D_n(Q)$ is also a random variable depending on the training data $X_1^n$.

Assume that $Q_n^*$ minimizes the empirical distortion, that is,

$D_n(Q_n^*) = \min_{Q \in \mathcal{Q}_N} D_n(Q).$

Then $Q_n^*$ (which is a specific empirically designed quantizer) is called an empirically optimal vector quantizer. $Q_n^*$ is an optimal quantizer for the empirical distribution $\mu_n$ of the training data given as

$\mu_n(A) = \frac{1}{n}\sum_{i=1}^n I_{\{X_i \in A\}}$

for every Borel set $A \subset \mathbb{R}^d$, where $I_E$ denotes the indicator function of the event $E$. Note that $Q_n^*$ always exists (although it is not necessarily unique), and we can assume that it is a nearest neighbor quantizer.

Using $Q_n^*$ as an approximation of the optimal $Q^*$ is consistent in the sense that its test distortion converges to the optimal distortion, that is,

$\lim_{n\to\infty} D(Q_n^*) = D^*$

almost surely for any $N \ge 1$ if $\mathbf{E}\{\|X\|^2\} < \infty$; see [20], [21].

III. RATE OF CONVERGENCE

To determine the number of training samples necessary to achieve a preassigned level of distortion, the finite sample behavior of the expected distortion redundancy

$\mathbf{E} D(Q_n^*) - D^*$

has to be analyzed. To do this we assume the peak power constraint

$\mathbf{P}\{\|X\| \le B\} = 1$  (3)

and this assumption will be in effect throughout the correspondence. In other words, the distribution $\mu$ of the source $X$ is an element of $\mathcal{P}(B)$, the family of distributions supported on the sphere

$S(B) = \{x \in \mathbb{R}^d : \|x\| \le B\}.$

An important consequence of (3) is that it is sufficient for our purpose to consider only quantizers with code points in the sphere $S(B)$, since otherwise projecting a code point that is not in $S(B)$ to the surface of $S(B)$ clearly reduces the distortion.

It is of interest how fast $\mathbf{E} D(Q_n^*)$ converges to $D^*$. To our knowledge, the best accessible upper bound can be obtained by combining results of [16] with recent developments of [14], implying

$0 \le \sup_{\mu \in \mathcal{P}(B)} (\mathbf{E} D(Q_n^*) - D^*) \le c_u B^2 \sqrt{\frac{Nd}{n}}$  (4)

for all $n \ge 1$, where $c_u = 192$. A natural question is whether there exists a method, perhaps different from empirical distortion minimization, which provides an empirically designed quantizer with substantially smaller test distortion. In case of $N = 1$, it is easy to see that

$\mathbf{E} D(Q_n^*) - D^* = \frac{\mathrm{Var}(X)}{n}.$

Thus, the convergence rate is $O(1/n)$, substantially faster than the $O(1/\sqrt{n})$ rate above. However, for $N \ge 3$, the lower bound in [3] shows that the $O(1/\sqrt{n})$ convergence rate above cannot be improved in the minimax sense: there it is proved that if $N \ge 3$, then for any empirically designed quantizer $Q_n$ trained on $n \ge n_0 = 10^4 N$ samples, we have

$\sup_{\mu \in \mathcal{P}(B)} (\mathbf{E} D(Q_n) - D^*) \ge c_l B^2 \sqrt{\frac{N^{1-4/d}}{n}}$  (5)


where $c_l \approx 2.671\cdot 10^{-11}$. This result has recently been improved in [1], extending the results to the case $N = 2$, and improving the constant $c_l$ to approximately $1.68\cdot 10^{-4}$ and the constant $n_0$ to $8N$.

The results (4) and (5) imply that there exist positive constants $c_l(N, B, d)$ and $c_u(N, B, d)$ depending on $N$, $B$, and $d$ such that

$\frac{c_l(N, B, d)}{\sqrt{n}} \le \inf_{Q_n} \sup_{\mu \in \mathcal{P}(B)} (\mathbf{E} D(Q_n) - D^*) \le \frac{c_u(N, B, d)}{\sqrt{n}}$

where the infimum is taken over all empirically designed quantizers $Q_n$ (recall that $Q_n = Q_n(\cdot; X_1, \ldots, X_n)$ is a function $Q_n: \mathbb{R}^d \times \mathbb{R}^{dn} \to \mathbb{R}^d$). That is, the minimax bounds on the rate of convergence are $\Theta(1/\sqrt{n})$. The minimax lower bound expresses the minimum worst case error achievable for a given sample size $n$ over the distribution class $\mathcal{P}(B)$, by describing the behavior of any quantizer design method for a source distribution which is least suitable for the given $n$ and the given design method.

The "bad" distributions achieving the supremum of the expected distortion redundancy in (5) may be different for each $n$. Indeed, in the construction of [3] (or [1]), although the bad distributions are concentrated on the same finitely many atoms for each $n$, the exact probability mass function of the bad distribution depends on $n$. Thus, the bound does not describe the behavior of the distortion redundancy for a single fixed source distribution. For example, it does not exclude the possibility that for some sequence of empirical quantizers $\{Q_n\}$, $\mathbf{E} D(Q_n) - D^*$ converges to $0$ at an $O(1/n)$ rate for every fixed $\mu$, that is, it may be possible to get faster individual upper bounds of the form

$\mathbf{E} D(Q_n) - D^* \le \frac{c(\mu, N)}{n}$  (6)

for each $\mu \in \mathcal{P}(B)$ and $n \ge 1$, where the constant $c(\mu, N)$ depends on the source distribution. Upper bounds of this type are the main purpose of this correspondence. Next we show (6) for discrete distributions, and with an additional factor $\log n$ for general distributions satisfying some regularity condition. The proofs are deferred to the next section.

Our first result shows that the expected distortion redundancy converges to $0$ at a rate $O(1/n)$ for any fixed source distribution concentrated on a finite set of points in $\mathbb{R}^d$.

Theorem 1: Assume that the source distribution $\mu$ is concentrated on finitely many atoms. Then

$\mathbf{E} D(Q_n^*) - D^* \le \frac{c(\mu, N)}{n}$

where the constant $c(\mu, N)$ depends on $\mu$ and $N$.

The main idea in the proof is that with high probability, the empirical and the real source distributions are so close that the corresponding optimal quantizer partitions coincide, and in this case we only need to find the centroids of these partitions. Proving similar rate of convergence results for more general source distributions is significantly harder, since the partitions of optimal quantizers for "close" distributions are different in general.
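The following simulation sketch (assuming NumPy; the three-atom scalar source, $N = 2$, and all parameter values are hypothetical choices for illustration) is consistent with Theorem 1: $n(\mathbf{E} D(Q_n^*) - D^*)$ stays roughly constant as $n$ grows. For a scalar source and $N = 2$, empirical distortion minimization can be carried out exactly by comparing the two contiguous splits of the ordered support.

```python
import numpy as np

atoms = np.array([0.0, 1.0, 4.0])          # hypothetical 3-atom scalar source
probs = np.array([0.3, 0.4, 0.3])
rng = np.random.default_rng(0)

def split_distortion(w, k):
    """Distortion and centroid code points of the split {first k atoms | rest}
    under the weight vector w (a probability vector over the atoms)."""
    d, cbs = 0.0, []
    for lo, hi in ((0, k), (k, len(atoms))):
        ws, xs = w[lo:hi], atoms[lo:hi]
        c = (ws * xs).sum() / ws.sum() if ws.sum() > 0 else xs[0]
        cbs.append(c)
        d += (ws * (xs - c) ** 2).sum()
    return d, cbs

def empirically_optimal(w):
    """Exact empirical distortion minimization for d = 1, N = 2: optimal nearest
    neighbor cells are intervals, so only the contiguous splits compete."""
    return min((split_distortion(w, k) for k in (1, 2)), key=lambda t: t[0])[1]

def test_distortion(cbs):
    """True distortion of the nearest neighbor quantizer with code points cbs."""
    cbs = np.asarray(cbs)
    j = np.abs(atoms[:, None] - cbs[None, :]).argmin(axis=1)
    return (probs * (atoms - cbs[j]) ** 2).sum()

D_star = min(split_distortion(probs, k)[0] for k in (1, 2))
for n in (50, 100, 200, 400):
    reds = [test_distortion(empirically_optimal(rng.multinomial(n, probs) / n)) - D_star
            for _ in range(2000)]
    print(n, n * np.mean(reds))            # n * redundancy: roughly constant
```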

Next we give conditions on the source distribution which ensure that the expected distortion redundancy converges to $0$ at rate $O(\log n/n)$. For a nearest neighbor quantizer $Q$ let

$\Delta_Q(x) = \|x - Q(x)\|^2 - \|x - Q_Q^*(x)\|^2$

for all $x \in S(B)$, where $Q_Q^*$ is the "closest" optimal nearest neighbor quantizer to $Q$ in the sense that it achieves the minimum

$\min_{\hat Q: D(\hat Q) = D^*} \mathrm{Var}\{\|X - Q(X)\|^2 - \|X - \hat Q(X)\|^2\}.$  (7)

If the minimizing $\hat Q$ is not unique, $Q_Q^*$ can be chosen arbitrarily from among the optimal nearest neighbor quantizers realizing the above minimum. Note that the minimum can always be achieved, as can be seen by a continuity-compactness type argument. This minimization is introduced to avoid problems that occur if the optimal quantizer $Q^*$ is not unique.

Theorem 2: Assume that $\mathbf{P}\{\|X\| \le B\} = 1$, and let $Q_n^*$ be an empirically optimal quantizer. Assume furthermore that

$A = \inf_{Q: D(Q) > D^*} \frac{\mathbf{E}\{\Delta_Q(X)\}}{\mathrm{Var}\{\Delta_Q(X)\}} > 0$  (8)

where the infimum is taken over all nonoptimal nearest neighbor quantizers having all their code points in the sphere $S(B)$. Then

$\mathbf{E} D(Q_n^*) - D^* \le \frac{c_1 \log n}{n} + \frac{c_2}{n}$  (9)

with constants

$c_1 = 4dN \max\left\{\frac{e-2}{A},\, 4B^2\right\}$

and

$c_2 = 4N \max\left\{\frac{e-2}{A},\, 4B^2\right\} \log\left(V\left(\frac{3eB}{N\sqrt{d}\max\{\frac{e-2}{A},\, 4B^2\}}\right)^d\right)$

where $\log$ denotes the natural logarithm and $V$ denotes the volume of the sphere $S(B)$.

The proof of the theorem is based on a proof of [17] combined with a technique developed by Barron [2] and Lee et al. [12] (see also, e.g., [9, Ch. 16]). The essence of the latter, which is an interesting result in itself, is formulated in Lemma 1 in the next section.

Remark 1: It is expected that at the expense of a more complicated analysis, the $\log n$ term can be removed from the upper bound (9), giving the desired $O(1/n)$ rate.

Remark 2: The constants in the preceding theorem can be slightly improved to

$c_1 = \frac{16dNB^2}{G^{-1}(4AB^2)}$

and

$c_2 = \frac{16NB^2}{G^{-1}(4AB^2)} \log\left(V\left(\frac{3eG^{-1}(4AB^2)}{4BN\sqrt{d}}\right)^d\right)$

where $G^{-1}$ is the inverse of the function $G$ given as

$G(c) = \frac{e^c - c - 1}{c}, \quad \text{for } c > 0.$  (10)

This is shown at the end of the proof of the theorem.

Condition (8) is hard to check for general source distributions; therefore, the scope of Theorem 2 is not clear. The next corollary shows that the theorem is valid for sources with continuous densities satisfying certain regularity properties.

Let $\{y_1^*, \ldots, y_N^*\}$ be the code points of an optimal quantizer for $\mu$ and let $\{S_1^*, \ldots, S_N^*\}$ denote the corresponding nearest neighbor cells. It is known (see [22, Lemma C and Theorem]) that if $\mu$ has a continuous density $f$ with bounded support, then the distortion $D(y_1, \ldots, y_N) = D(Q)$ of the nearest neighbor quantizer $Q$ with code points $\{y_1, \ldots, y_N\}$ is a continuous function of the vector $(y_1, \ldots, y_N)$ which has at $(y_1^*, \ldots, y_N^*)$ a second derivative block matrix $\Gamma(y_1^*, \ldots, y_N^*) = [\Gamma_{ij}(y_1^*, \ldots, y_N^*)]$ made up of the $d \times d$ blocks

$\Gamma_{ij}(y_1^*, \ldots, y_N^*) = \begin{cases} 2\mu(S_i^*)I_d - 2\sum_{l \ne i}\int_{F_{il}} \frac{f(x)(x - y_i^*)(x - y_i^*)^T}{\|y_l^* - y_i^*\|}\,\lambda_{d-1}(dx), & j = i \\ 2\int_{F_{ij}} \frac{f(x)(x - y_i^*)(x - y_j^*)^T}{\|y_j^* - y_i^*\|}\,\lambda_{d-1}(dx), & j \ne i \end{cases} \quad (1 \le i, j \le N)$

where $F_{ij}$ is the (possibly empty) common face of $S_i^*$ and $S_j^*$ (it is a convex set in a $(d-1)$-dimensional hyperplane), $I_d$ is the $d \times d$ identity matrix, and $\lambda_{d-1}$ is the $(d-1)$-dimensional Lebesgue measure.¹ It is clear that since $\{y_1^*, \ldots, y_N^*\}$ is an optimal codebook, the matrix $\Gamma(y_1^*, \ldots, y_N^*)$ is positive semidefinite. The next corollary shows that if $\Gamma(y_1^*, \ldots, y_N^*)$ is also positive definite, then the desired $O(\log n/n)$ convergence rate can be established.

¹Note that Pollard [22] made a slight error in the derivation of $\Gamma_{ij}$ for $i \ne j$, and arrived at a formula with an incorrect sign.

Corollary 1: Assume that the random variable $X$ has a continuous density supported in $S(B)$, and the matrix $\Gamma(y_1^*, \ldots, y_N^*)$ is positive definite for all optimal codebooks. Then

$\mathbf{E} D(Q_n^*) - D^* = O\!\left(\frac{\log n}{n}\right).$

The conditions of Corollary 1 on the distribution are essentially the same as those of Pollard [22] (and of Chou [4]). The conditions in [22] are weaker in the sense that there the usual assumption of $X$ having a bounded support is replaced by a tail condition. This extension might be possible for Corollary 1 and Theorem 2 at the expense of some complication in the proof. On the other hand, while Pollard assumes the uniqueness of an optimal quantizer, we allow multiple optima. However, if the set of optimal quantizers (each represented by the $N$-vector of its codebook) has an accumulation point, then usually $\Gamma$ is not positive definite for all optimal codebooks. Thus, Corollary 1 is not applicable, for example, for a multidimensional truncated Gaussian distribution (although we suspect that the results can be sharpened to include such cases as well). Nevertheless, there are cases when the optimal quantizer is not unique and the set of optimal codebooks does not have an accumulation point. For example, for scalar sources with symmetric, not log-concave densities with a large spike in the middle, usually two asymmetric optimal two-level quantizers exist (if the density is log-concave, then the optimal quantizer is unique [7], [23]).

Note that Corollary 1 implies Chou's result for bounded distributions with an additional $\log n$ factor. However, the other direction would require some kind of uniform integrability of the random variables $\{n(D(Q_n^*) - D^*)\}_{n \ge 1}$, which can be arbitrarily large as $n$ goes to infinity, even for bounded source distributions.

Although it is not easy to determine in general whether or not the matrix $\Gamma(y_1^*, \ldots, y_N^*)$ is positive definite, sufficient conditions can be obtained easily in the scalar case. For example, it is easy to show that $\Gamma(y_1^*, \ldots, y_N^*)$ is positive definite for the uniform distribution, where the unique optimal quantizer is the $N$-level uniform quantizer. More general sufficient conditions can be given based on a result of Fleischer [7], who, while proving the uniqueness of an optimal quantizer, also showed that if the source has a density $f$ for which the derivative $d^2 \log f(x)/dx^2$ is negative over its total support, then the matrix $\Gamma(y_1^*, \ldots, y_N^*)$ is positive definite for all optimal codebooks, and hence for the unique optimal codebook. A slight modification of Fleischer's original proof allows us to replace the condition on the second derivative by the assumption that $\log f(x)$ is a strictly concave function (then $f$ is called a strictly log-concave function). Thus, we obtain the following result.

Corollary 2: Assume that the scalar random variable $X$ has a strictly log-concave density $f$ supported in the interval $[-B, B]$ (that is, $\log f(x)$ is strictly concave over its support), or that $X$ is uniformly distributed in $[-B, B]$. Then

$\mathbf{E} D(Q_n^*) - D^* = O\!\left(\frac{\log n}{n}\right).$

We note here that Fleischer's result proving that the matrix $\Gamma(y_1^*, \ldots, y_N^*)$ is positive definite shows that scalar sources with strictly log-concave densities and sufficiently light tails (such as a Gaussian source) satisfy the conditions of Pollard's central limit theorem [22] (a fact that escaped Pollard's attention), which also implies, by [4], that for such sources the distortion redundancy is $O(1/n)$ in probability.
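As an illustration of Corollary 2, consider the Gaussian density truncated to $[-B, B]$: there $\log f(x) = \mathrm{const} - x^2/(2\sigma^2)$ is strictly concave, so the corollary applies and the optimal quantizer is unique. The sketch below (assuming NumPy; the grid discretization, $B$, and $\sigma = 1$ are illustrative choices) finds the optimal two-level codebook by iterating the Lloyd–Max conditions on a discretized density:

```python
import numpy as np

B = 1.0
x = np.linspace(-B, B, 200001)
dx = x[1] - x[0]
f = np.exp(-x ** 2 / 2)            # Gaussian with sigma = 1, truncated to [-B, B]
f /= f.sum() * dx                  # normalize to a density on [-B, B]

y1, y2 = -0.5, 0.5                 # initial code points
for _ in range(200):
    t = (y1 + y2) / 2              # nearest neighbor cell boundary
    left, right = x < t, x >= t
    y1 = (x[left] * f[left]).sum() / f[left].sum()    # centroid condition
    y2 = (x[right] * f[right]).sum() / f[right].sum()
print(y1, y2)                      # converges to the unique optimum (-y*, y*)
```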

While Corollary 1 uses the nearest neighbor condition to capture the optimality of quantizers, another approach is to use the centroid condition instead. This approach was used by Hartigan [10], a precursor of [22] for one dimension, who applied differentiation with respect to the quantization thresholds instead of differentiation with respect to the code points. This method is illustrated in the next example for the scalar case when $N = 2$.

Example 1: For $d = 1$ and $N = 2$, define the split function

$D(t) = \mathrm{Var}\{X \mid X < t\}\mathbf{P}\{X < t\} + \mathrm{Var}\{X \mid X \ge t\}\mathbf{P}\{X \ge t\}$

that is, the minimal distortion corresponding to the partition $\{(-\infty, t), [t, \infty)\}$, and let $t^* = (y_1^* + y_2^*)/2$ denote the cell boundary (or threshold) of an optimal quantizer with codebook $\{y_1^*, y_2^*\}$. Then if the scalar random variable $X$ has a continuous density $f$ in the interval $[-B, B]$, and $\frac{d^2 D}{dt^2}(t^*)$ is positive for any optimal threshold $t^*$, then

$\mathbf{E} D(Q_n^*) - D^* = O\!\left(\frac{\log n}{n}\right).$

To see this, rewrite $D(t)$ as

$D(t) = \mathbf{E}\{X^2\} - \mathbf{E}^2\{X \mid X < t\}\,\mathbf{P}\{X < t\} - \mathbf{E}^2\{X \mid X \ge t\}\,\mathbf{P}\{X \ge t\}.$

Then

$\frac{d^2 D}{dt^2}(t^*) = f(t^*)(y_2^* - y_1^*)\left(2 - \frac{f(t^*)(y_2^* - y_1^*)}{2\,\mathbf{P}\{X < t^*\}\,\mathbf{P}\{X \ge t^*\}}\right)$

and the assumption $\frac{d^2 D}{dt^2}(t^*) > 0$ implies that

$2\Gamma_{1,1}(y_1^*, y_2^*) = 4\mathbf{P}\{X < t^*\} - f(t^*)(y_2^* - y_1^*) \ge 4\mathbf{P}\{X < t^*\}\mathbf{P}\{X \ge t^*\} - f(t^*)(y_2^* - y_1^*) = \det \Gamma(y_1^*, y_2^*) > 0$

and thus $\Gamma(y_1^*, y_2^*)$ is positive definite for any optimal codebook; the result then follows by Corollary 1.
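The split-function criterion of Example 1 is easy to check numerically. The sketch below (assuming NumPy; it reuses the truncated Gaussian of the previous sketch as an illustrative density) evaluates $D(t)$ on a grid, locates the optimal threshold $t^*$, and verifies that the second difference of $D$ at $t^*$ is positive:

```python
import numpy as np

B = 1.0
x = np.linspace(-B, B, 20001)
dx = x[1] - x[0]
f = np.exp(-x ** 2 / 2)
f /= f.sum() * dx                   # truncated Gaussian density on [-B, B]

def split_D(t):
    """Split function: within-cell variance times cell probability, summed."""
    d = 0.0
    for cell in (x < t, x >= t):
        p = f[cell].sum() * dx
        if p > 0:
            m = (x[cell] * f[cell]).sum() * dx / p
            d += ((x[cell] - m) ** 2 * f[cell]).sum() * dx
    return d

ts = np.linspace(-0.9, 0.9, 361)
D = np.array([split_D(t) for t in ts])
i = int(D.argmin())
h = ts[1] - ts[0]
print("t* ~", ts[i], "  D''(t*) ~", (D[i - 1] - 2 * D[i] + D[i + 1]) / h ** 2)
```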



Finally, a comparison of the results in this section with [3] shows that the problem of empirical quantizer design is an interesting example of the unusual situation in which the orders of the minimax lower bound and the individual upper bound are different (the latter being smaller).

IV. PROOFS

Proof of Theorem 1: Let $S = \{x_1, \ldots, x_m\} \subset \mathbb{R}^d$ denote the support of $\mu$ with corresponding probability masses $p_i = \mathbf{P}\{X = x_i\} > 0$, $i = 1, \ldots, m$. Assume $m \ge N + 1$, since otherwise

$\mathbf{E} D(Q_n^*) \le 4B^2 \sum_{i=1}^m (1 - p_i)^n$

by the power constraint (3).

Let $\Pi_\mu$ be the set of optimal partitions for $\mu$. Let $\mathcal{P}_n^* = \{S_{n,1}^*, \ldots, S_{n,N}^*\}$ denote the partition of an empirically optimal quantizer $Q_n^*$. Clearly, $\mathcal{P}_n^* \in \Pi_{\mu_n}$. By the centroid rule (1) the code point $y_{n,i}^*$ associated with the cell $S_{n,i}^*$ is the average of the samples falling into this cell. This can be computed whenever $\mu_n(S_{n,i}^*)$, the fraction of samples falling into the cell, is positive. Without loss of generality, we can assume that otherwise $S_{n,i}^* = \emptyset$, in which case the definition of $y_{n,i}^*$ is immaterial, since keeping only the code points corresponding to the nonempty cells can only increase the test distortion of $Q_n^*$, and hence increase the expected distortion redundancy.

Decompose the expected distortion redundancy of $Q_n^*$ in the following way:

$\mathbf{E} D(Q_n^*) - D^* = \mathbf{E}\{I_{\{\mathcal{P}_n^* \in \Pi_\mu\}}(D(Q_n^*) - D^*)\} + \mathbf{E}\{I_{\{\mathcal{P}_n^* \notin \Pi_\mu\}}(D(Q_n^*) - D^*)\}.$  (11)

Let $\bar y_{n,i}^* = \mathbf{E}\{X \mid X \in S_{n,i}^*\}$ for all $i$; now if $\mathcal{P}_n^* \in \Pi_\mu$ then the quantizer with partition $\mathcal{P}_n^*$ and codebook $\{\bar y_{n,1}^*, \ldots, \bar y_{n,N}^*\}$ is optimal for $\mu$, and so for the first term of (11) we have

$\mathbf{E}\{I_{\{\mathcal{P}_n^* \in \Pi_\mu\}}(D(Q_n^*) - D^*)\} = \mathbf{E}\left\{I_{\{\mathcal{P}_n^* \in \Pi_\mu\}}\sum_{i=1}^N \|y_{n,i}^* - \bar y_{n,i}^*\|^2 \mu(S_{n,i}^*)\right\} = \mathbf{E}\left\{I_{\{\mathcal{P}_n^* \in \Pi_\mu\}}\sum_{i=1}^N I_{\{\mu_n(S_{n,i}^*) > 0\}}\|y_{n,i}^* - \bar y_{n,i}^*\|^2 \mu(S_{n,i}^*)\right\}.$  (12)

For any partition $\mathcal{P} = \{S_1, \ldots, S_N\} \in \Pi_\mu$, let $y_{n,i}$ be the average of the samples falling into $S_i$ (it can be defined arbitrarily if $\mu_n(S_i) = 0$) and let $\bar y_i = \mathbf{E}\{X \mid X \in S_i\}$. Then (12) can be continued as

$\mathbf{E}\left\{I_{\{\mathcal{P}_n^* \in \Pi_\mu\}}\sum_{i=1}^N I_{\{\mu_n(S_{n,i}^*) > 0\}}\|y_{n,i}^* - \bar y_{n,i}^*\|^2 \mu(S_{n,i}^*)\right\}$
$\le \mathbf{E}\left\{\sum_{\mathcal{P}: \mathcal{P} \in \Pi_\mu}\sum_{i=1}^N I_{\{\mu_n(S_i) > 0\}}\|y_{n,i} - \bar y_i\|^2 \mu(S_i)\right\}$
$= \sum_{\mathcal{P}: \mathcal{P} \in \Pi_\mu}\sum_{i=1}^N \mathbf{E}\left\{I_{\{\mu_n(S_i) > 0\}}\,\mathbf{E}\{\|y_{n,i} - \bar y_i\|^2 \mid I_{\{X_1 \in S_i\}}, \ldots, I_{\{X_n \in S_i\}}\}\right\}\mu(S_i)$
$= \sum_{\mathcal{P}: \mathcal{P} \in \Pi_\mu}\sum_{i=1}^N \mathbf{E}\left\{I_{\{\mu_n(S_i) > 0\}}\frac{\mathrm{Var}\{X \mid X \in S_i\}}{n\mu_n(S_i)}\right\}\mu(S_i)$
$= \sum_{\mathcal{P}: \mathcal{P} \in \Pi_\mu}\sum_{i=1}^N \mathbf{E}\left\{\frac{I_{\{B_i > 0\}}}{B_i}\right\}\mathrm{Var}\{X \mid X \in S_i\}\mu(S_i)$
$\le \sum_{\mathcal{P}: \mathcal{P} \in \Pi_\mu}\sum_{i=1}^N \frac{2\,\mathrm{Var}\{X \mid X \in S_i\}}{n+1}$  (13)
$\le \frac{2B^2|\Pi_\mu|N}{n+1}$  (14)

where $B_i$ has a binomial distribution with parameters $n$ and $\mu(S_i)$, (13) holds by Devroye et al. [6, Lemma A.2], and $|\Pi_\mu|$ denotes the cardinality of the set $\Pi_\mu$.

We also need to bound the second term of (11). Pollard [21] proved that the set of optimal partitions is continuous with respect to the $L_2$ Wasserstein distance of the distributions (defined by $W(\mu, \mu') = \inf_{(X,Y)}(\mathbf{E}\|X - Y\|^2)^{1/2}$, where the infimum is taken over all joint distributions of $(X, Y)$ such that $X$ and $Y$ have distributions $\mu$ and $\mu'$, respectively), where a sequence of sets $\Pi_k$ of partitions converges to $\Pi$ if the corresponding characteristic functions converge, that is, $I_{\{\mathcal{P} \in \Pi_k\}} \to I_{\{\mathcal{P} \in \Pi\}}$ as $k \to \infty$ for each partition $\mathcal{P}$ of $S$. Moreover, it follows from [18] that the $L_2$ metric

$\rho(\mu, \mu') = \left(\sum_{i=1}^m (\mu(x_i) - \mu'(x_i))^2\right)^{1/2}$

for the family of distributions concentrated on $S$ is as strong as the Wasserstein distance, that is, $\rho(\mu_n, \mu) \to 0$ if and only if $W(\mu_n, \mu) \to 0$. Therefore, if $\rho(\mu_n, \mu) \to 0$ for the sequence $\{\mu_n\}$ of the empirical distributions, then $\Pi_{\mu_n} \to \Pi_\mu$. Since the number of possible partitions of $S$ is finite, this implies that there is a $\delta > 0$ such that

$\Pi_{\mu_n} = \Pi_\mu, \quad \text{for } \rho(\mu_n, \mu) < \delta.$

Since $\mathcal{P}_n^* \in \Pi_{\mu_n}$, this and the power constraint (3) imply

$\mathbf{E}\{I_{\{\mathcal{P}_n^* \notin \Pi_\mu\}}(D(Q_n^*) - D^*)\} \le 4B^2\,\mathbf{P}\{\mathcal{P}_n^* \notin \Pi_\mu\} \le 4B^2\,\mathbf{P}\{\rho(\mu_n, \mu) \ge \delta\}.$

Applying Markov's inequality we obtain

$\mathbf{P}\{\rho(\mu_n, \mu) \ge \delta\} = \mathbf{P}\left\{\sum_{i=1}^m (\mu_n(x_i) - p_i)^2 \ge \delta^2\right\} \le \frac{\mathbf{E}\{\sum_{i=1}^m (\mu_n(x_i) - p_i)^2\}}{\delta^2} = \frac{\sum_{i=1}^m p_i(1 - p_i)}{n\delta^2} \le \frac{1}{n\delta^2}.$

Thus, from (11), (12), and (14) we obtain

$\mathbf{E} D(Q_n^*) - D^* \le 2B^2\left(\frac{|\Pi_\mu|N}{n+1} + \frac{2}{n\delta^2}\right) \le \frac{c(\mu, N)}{n}$

where

$c(\mu, N) = 2B^2(|\Pi_\mu|N + 2/\delta^2).$

(Note that $|\Pi_\mu|$ is bounded above by some function of $N$ and $m$.)

The proof of Theorem 2 is based on the following lemma.

Lemma 1: Let $X_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, N$, be random variables such that for each fixed $j$, $X_{1j}, \ldots, X_{nj}$ are i.i.d. and for each $s_0 \ge s > 0$

$\mathbf{E}\{e^{sX_{ij}}\} \le e^{s^2\sigma_j^2}.$

For $\mu_j > 0$, put

$\rho = \min_{j \le N} \frac{\mu_j}{\sigma_j^2}.$

Then

$\mathbf{E}\left\{\max_{j \le N}\left(\frac{1}{n}\sum_{i=1}^n X_{ij} - \mu_j\right)\right\} \le \frac{\log N}{\min\{\rho, s_0\}\, n}.$  (15)

If

$\mathbf{E}\{X_{ij}\} = 0$

and

$|X_{ij}| \le K$

then, for any $L > 0$,

$\mathbf{E}\left\{\max_{j \le N}\left(\frac{1}{n}\sum_{i=1}^n X_{ij} - \mu_j\right)\right\} \le \frac{K\log N}{\min\left\{\frac{K\Sigma^* L^2}{e^L - 1 - L},\, L\right\} n}$  (16)

where

$\Sigma^* = \min_{j \le N} \frac{\mu_j}{\mathrm{Var}(X_{ij})}.$

Proof: With the notation

$Y_j = \frac{1}{n}\sum_{i=1}^n X_{ij} - \mu_j$

we have, for any $s_0 \ge s > 0$,

$\mathbf{E}\{e^{snY_j}\} = \mathbf{E}\{e^{sn(\frac{1}{n}\sum_i X_{ij} - \mu_j)}\} = e^{-sn\mu_j}(\mathbf{E}\{e^{sX_{ij}}\})^n \le e^{-sn\mu_j}e^{ns^2\sigma_j^2} = e^{-sn(\mu_j - s\sigma_j^2)}.$

Thus,

$e^{sn\mathbf{E}\{\max_j Y_j\}} \le \mathbf{E}\{e^{sn\max_j Y_j}\} = \mathbf{E}\left\{\max_{j \le N} e^{snY_j}\right\} \le \sum_{j \le N}\mathbf{E}\{e^{snY_j}\} \le \sum_{j \le N} e^{-sn(\mu_j - s\sigma_j^2)}.$

For $s = \min\{\rho, s_0\}$ this implies that

$\mathbf{E}\left\{\max_{j \le N} Y_j\right\} \le \frac{1}{sn}\log\sum_{j \le N} e^{-sn(\mu_j - s\sigma_j^2)} \le \frac{\log N}{\min\{\rho, s_0\}\, n}$

since $s \le \rho$ implies $\mu_j - s\sigma_j^2 \ge 0$ for every $j$. In order to prove the second half of the lemma, notice that for any $L > 0$ and $|x| \le L$ we have the inequality

$e^x = 1 + x + x^2\sum_{i=2}^\infty \frac{x^{i-2}}{i!} \le 1 + x + x^2\sum_{i=2}^\infty \frac{L^{i-2}}{i!} = 1 + x + x^2\,\frac{e^L - 1 - L}{L^2};$

therefore $0 < s \le s_0 = L/K$ implies $s|X_{ij}| \le L$, so

$e^{sX_{ij}} \le 1 + sX_{ij} + (sX_{ij})^2\,\frac{e^L - 1 - L}{L^2}.$

Thus,

$\mathbf{E}\{e^{sX_{ij}}\} \le 1 + s^2\mathrm{Var}(X_{ij})\,\frac{e^L - 1 - L}{L^2} \le e^{s^2\mathrm{Var}(X_{ij})\frac{e^L - 1 - L}{L^2}}$

so (16) follows from (15).

Remark 3: Equation (16) implies different bounds for different choices of $L$. If $L = 1$ then

$\frac{K\log N}{\min\{K\Sigma^*/(e-2),\, 1\}\, n} = \frac{K}{\min\{K\Sigma^*/(e-2),\, 1\}}\cdot\frac{\log N}{n} = \max\left\{\frac{e-2}{\Sigma^*},\, K\right\}\frac{\log N}{n}.$

Hamers and Kohler [11] derived a similar bound

$\left(\frac{1}{2\Sigma^*} + \frac{2K}{3}\right)\frac{\log N}{n}.$

There is a better choice of $L$. Introduce the function

$G(L) = \frac{e^L - L - 1}{L}, \quad \text{for } L > 0$

and choose $L$ such that

$\frac{K\Sigma^* L^2}{e^L - 1 - L} = L$

i.e.,

$K\Sigma^* = G(L)$

i.e.,

$L = G^{-1}(K\Sigma^*);$

then (16) implies the bound

$\frac{K\log N}{G^{-1}(K\Sigma^*)\, n}.$  (17)

Remark 4: In its spirit, Lemma 1 is similar to the inequality of Devroye and Lugosi [5], with the essential difference that they considered the maximum of zero-mean random variables:

$\mathbf{E}\left\{\max_{j \le N}\frac{1}{n}\sum_{i=1}^n X_{ij}\right\} \le \sqrt{\frac{2\max_{j \le N}\sigma_j^2\,\log N}{n}}.$
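A quick simulation sketch (assuming NumPy; the Rademacher variables and the values of $n$, $N$, and $\mu_j$ are illustrative) checks the $L = 1$ form of (16) from Remark 3, that is, that the empirical average of $\max_j[\frac{1}{n}\sum_i X_{ij} - \mu_j]$ stays below $\max\{(e-2)/\Sigma^*, K\}\log N/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, K, mu, trials = 100, 1000, 1.0, 0.1, 200
acc = 0.0
for _ in range(trials):
    X = rng.choice([-1.0, 1.0], size=(n, N))       # zero mean, |X_ij| <= K, Var = 1
    acc += (X.mean(axis=0) - mu).max()
lhs = acc / trials                                 # estimate of E max_j [mean_j - mu_j]
Sigma = mu / 1.0                                   # Sigma* = min_j mu_j / Var(X_ij)
rhs = max((np.e - 2) / Sigma, K) * np.log(N) / n
print(lhs, "<=", rhs)
```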

Proof of Theorem 2: Following the proof of [17], consider a cubic grid of width $\delta$ in $S(B)$ with the minimum number of grid points. It can be seen that if the origin of the grid is uniformly distributed in the $d$-dimensional cube with edge length $\delta$ centered at the origin of the sphere, then the expected number of grid points inside the ball is $V/\delta^d$. Thus, the grid in $S(B)$ with the minimum number of points has at most $V/\delta^d$ points. Let $\mathcal{Q}'_N$ denote the set of $N$-level nearest neighbor quantizers which have all their code points on this grid. Since for any $y \in S(B)$ there is a $y'$ on the grid such that $\|y - y'\| \le \sqrt{d}\,\delta$, for any quantizer $Q$ with code points inside $S(B)$ there is a $Q' \in \mathcal{Q}'_N$ such that

$\sup_{x \in S(B)} |\|x - Q(x)\|^2 - \|x - Q'(x)\|^2| \le 4B\sqrt{d}\,\delta.$

Letting $\epsilon = 4B\sqrt{d}\,\delta$, specifically there is a quantizer $Q'_n \in \mathcal{Q}'_N$ satisfying

$\sup_{x \in S(B)} |\|x - Q_n^*(x)\|^2 - \|x - Q'_n(x)\|^2| \le \epsilon$

with probability $1$, since the code points of an empirically optimal quantizer are almost surely concentrated in $S(B)$. Concerning the cardinality of $\mathcal{Q}'_N$ we have

$|\mathcal{Q}'_N| \le \left(\frac{V}{\delta^d}\right)^N = V^N(4B\sqrt{d})^{dN}\epsilon^{-dN}.$  (18)

For $Q = Q'_n$, let $\bar Q_n$ denote the optimal quantizer achieving the minimum in (7). Using the empirical optimality of $Q_n^*$ we proceed with the following decomposition:

$\mathbf{E}\{\|X - Q_n^*(X)\|^2 \mid X_1^n\} - D^*$
$\le \mathbf{E}\{\|X - Q_n^*(X)\|^2 \mid X_1^n\} - \mathbf{E}\{\|X - \bar Q_n(X)\|^2 \mid X_1^n\} - \frac{2}{n}\sum_{i=1}^n (\|X_i - Q_n^*(X_i)\|^2 - \|X_i - \bar Q_n(X_i)\|^2)$
$\le 3\epsilon + \mathbf{E}\{\|X - Q'_n(X)\|^2 \mid X_1^n\} - \mathbf{E}\{\|X - \bar Q_n(X)\|^2 \mid X_1^n\} - \frac{2}{n}\sum_{i=1}^n (\|X_i - Q'_n(X_i)\|^2 - \|X_i - \bar Q_n(X_i)\|^2)$  (19)

with probability $1$. Then

$\mathbf{E}\{\|X - Q_n^*(X)\|^2\} - D^*$
$\le \mathbf{E}\left\{\mathbf{E}\{\|X - Q'_n(X)\|^2 \mid X_1^n\} - \mathbf{E}\{\|X - \bar Q_n(X)\|^2 \mid X_1^n\} - \frac{2}{n}\sum_{i=1}^n (\|X_i - Q'_n(X_i)\|^2 - \|X_i - \bar Q_n(X_i)\|^2)\right\} + 3\epsilon$
$= \mathbf{E}\left\{\mathbf{E}\{\Delta_{Q'_n}(X) \mid X_1^n\} - \frac{2}{n}\sum_{i=1}^n \Delta_{Q'_n}(X_i)\right\} + 3\epsilon$
$\le \mathbf{E}\left\{\max_{Q \in \mathcal{Q}'_N}\left(\mathbf{E}\{\Delta_Q(X)\} - \frac{2}{n}\sum_{i=1}^n \Delta_Q(X_i)\right)\right\} + 3\epsilon$
$= \mathbf{E}\left\{\max_{Q \in \mathcal{Q}'_N}\left(\frac{1}{n}\sum_{i=1}^n 2(\mathbf{E}\{\Delta_Q(X)\} - \Delta_Q(X_i)) - \mathbf{E}\{\Delta_Q(X)\}\right)\right\} + 3\epsilon$
$\le 4\max\left\{\frac{e-2}{A},\, 4B^2\right\}\frac{\log|\mathcal{Q}'_N|}{n} + 3\epsilon$  (20)
$\le 4\max\left\{\frac{e-2}{A},\, 4B^2\right\}\frac{\log(V^N(4B\sqrt{d})^{dN}\epsilon^{-dN})}{n} + 3\epsilon$  (21)

where (20) follows from Lemma 1 with $L = 1$ and

$X_{i,Q} = 2(\mathbf{E}\{\Delta_Q(X)\} - \Delta_Q(X_i)) \quad \text{and} \quad \mu_Q = \mathbf{E}\{\Delta_Q(X)\},$

and (21) follows from (18). Choosing

$\epsilon = \frac{4\max\{(e-2)/A,\, 4B^2\}\, dN}{3n}$

completes the proof of the theorem.

The improved constants of Remark 2 can be obtained by using the improved version of Lemma 1, with bound (17) instead of the $L = 1$ bound, in (20).

Proof of Corollary 1: We only need to show that condition (8) is satisfied under the assumptions of the corollary; then the result follows by Theorem 2. It is easy to see that if (8) does not hold, then there is a sequence of strictly suboptimal quantizers $Q_n \in \mathcal{Q}_N$ converging to an optimal quantizer $Q^*$, in the sense that $y_{n,i} \to y_i^*$ for all $i$ where $\{y_{n,1}, \ldots, y_{n,N}\}$ denotes the codebook of $Q_n$ and $\{y_1^*, \ldots, y_N^*\}$ denotes the codebook of $Q^*$, such that

$\lim_{n\to\infty}\frac{\mathbf{E}\{\Delta_{Q_n}(X)\}}{\mathrm{Var}\{\Delta_{Q_n}(X)\}} = 0.$  (22)

In what follows, we will show that

$\liminf_{n\to\infty}\frac{\mathbf{E}\{\|X - Q_n(X)\|^2 - \|X - Q^*(X)\|^2\}}{\mathrm{Var}\{\|X - Q_n(X)\|^2 - \|X - Q^*(X)\|^2\}} > 0$  (23)

which readily implies that (22) cannot hold, proving the corollary.

Since

$D(y_1, \ldots, y_N) = \mathbf{E}\{\|X - Q(X)\|^2\}$

is twice differentiable with respect to the vector $\mathbf{y} = (y_1, \ldots, y_N) \in \mathbb{R}^{dN}$ in a neighborhood of $\mathbf{y}^* = (y_1^*, \ldots, y_N^*)$, where $Q$ is a nearest neighbor quantizer with codebook $\{y_1, \ldots, y_N\}$ [22], $D(y_1, \ldots, y_N)$ has the following Taylor expansion:

$D(y_1, \ldots, y_N) = D(y_1^*, \ldots, y_N^*) + \frac{dD(\mathbf{y}^*)}{d\mathbf{y}}(\mathbf{y} - \mathbf{y}^*) + \frac{1}{2}(\mathbf{y} - \mathbf{y}^*)^T\Gamma(\mathbf{y}^*)(\mathbf{y} - \mathbf{y}^*) + o(\|\mathbf{y} - \mathbf{y}^*\|^2)$

where $dD(\mathbf{y}^*)/d\mathbf{y}$ denotes the vector formed by the partial derivatives of $D$ with respect to its variables at $\mathbf{y}^*$. It is also shown in [22] that the derivative $dD(\mathbf{y})/d\mathbf{y}$ is made up of the $d$-vectors

$\partial D(y_1, \ldots, y_N)/\partial y_i = -2\,\mathbf{E}\{I_{\{X \in S_i\}}(X - y_i)\}.$

However, since $\mathbf{y}^*$ minimizes $D(\mathbf{y})$, we have $dD(\mathbf{y}^*)/d\mathbf{y} = 0$ (here $0$ denotes the zero vector), and so

$\mathbf{E}\{\|X - Q(X)\|^2 - \|X - Q^*(X)\|^2\} = D(y_1, \ldots, y_N) - D(y_1^*, \ldots, y_N^*) = \frac{1}{2}(\mathbf{y} - \mathbf{y}^*)^T\Gamma(\mathbf{y}^*)(\mathbf{y} - \mathbf{y}^*) + o(\|\mathbf{y} - \mathbf{y}^*\|^2).$

Furthermore, since $\Gamma(\mathbf{y}^*)$ is positive definite by assumption, its smallest eigenvalue $\gamma$ is positive, and for any $\mathbf{a} \in \mathbb{R}^{dN}$ we have

$\mathbf{a}^T\Gamma(\mathbf{y}^*)\mathbf{a} \ge \gamma\|\mathbf{a}\|^2$

which in turn implies that

$\mathbf{E}\{\|X - Q(X)\|^2 - \|X - Q^*(X)\|^2\} \ge \frac{\gamma}{2}\|\mathbf{y} - \mathbf{y}^*\|^2 + o(\|\mathbf{y} - \mathbf{y}^*\|^2).$  (24)

Next we bound the variance $\mathrm{Var}\{\|X - Q(X)\|^2 - \|X - Q^*(X)\|^2\}$. Notice that $\|x - Q(x)\|^2$ can be decomposed as

$\|x - Q(x)\|^2 = \sum_{i=1}^N \|x - y_i\|^2 I_{\{x \in S_i^*\}} + \sum_{i=1}^N \sum_{1 \le j \le N,\, j \ne i}(\|x - y_j\|^2 - \|x - y_i\|^2) I_{\{x \in S_j \cap S_i^*\}}$

where, as usual, $S_j$ denotes the cell of $Q$ corresponding to the code point $y_j$. Therefore,

$\|x - Q(x)\|^2 - \|x - Q^*(x)\|^2 = \sum_{i=1}^N (\|x - y_i\|^2 - \|x - y_i^*\|^2) I_{\{x \in S_i^*\}} + \sum_{i=1}^N \sum_{1 \le j \le N,\, j \ne i}(\|x - y_j\|^2 - \|x - y_i\|^2) I_{\{x \in S_j \cap S_i^*\}} = \sum_{i=1}^N J_i(x) + \sum_{i=1}^N \sum_{1 \le j \le N,\, j \ne i} K_{i,j}(x)$
