

Individual Convergence Rates in Empirical Vector Quantizer Design

András Antos, László Györfi, Fellow, IEEE, and András György, Member, IEEE

Abstract—We consider the rate of convergence of the expected distortion redundancy of empirically optimal vector quantizers. Earlier results show that the mean-squared distortion of an empirically optimal quantizer designed from independent and identically distributed (i.i.d.) source samples converges uniformly to the optimum at a rate of $O(1/\sqrt{n})$, and that this rate is sharp in the minimax sense. We prove that for any fixed distribution supported on a given finite set the convergence rate is $O(1/n)$ (faster than the minimax lower bound), where the corresponding constant depends on the source distribution. For more general source distributions we provide conditions implying a slightly worse $O(\log n/n)$ rate of convergence. Although these conditions are, in general, hard to verify, we show that sources with continuous densities satisfying certain regularity properties (similar to the ones of Pollard that were used to prove a central limit theorem for the code points of the empirically optimal quantizers) are included in the scope of this result. In particular, scalar distributions with strictly log-concave densities with bounded support (such as the truncated Gaussian distribution) satisfy these conditions.

Index Terms—Convergence rates, fixed-rate quantization, empirical design, individual convergence rate, log-concave densities.

Manuscript received October 2, 2003; revised July 28, 2005. This work was supported in part by the NATO Science Fellowship, a research grant from the Research Group for Informatics and Electronics of the Hungarian Academy of Sciences, and the NKFP-2/0017/2002 project Data Riddle. The material in this correspondence was presented in part at the IEEE International Symposium on Information Theory, Chicago, IL, June/July 2004. Part of this work was performed while A. Antos and A. György were also with the Department of Mathematics and Statistics, Queen's University, Kingston, ON K7L 3N6, Canada.

A. Antos and A. György are with the Informatics Laboratory, Computer and Automation Research Institute of the Hungarian Academy of Sciences, 1111 Budapest, Hungary (e-mail: antos@szit.bme.hu; gya@szit.bme.hu).

L. Györfi is with the Department of Computer Science and Information Theory, Budapest University of Technology and Economics, 1117 Budapest, Hungary (e-mail: gyorfi@szit.bme.hu).

Communicated by S. A. Savari, Associate Editor for Source Coding.

Digital Object Identifier 10.1109/TIT.2005.856976


I. INTRODUCTION

The problem of empirical vector quantizer design is an important issue in data compression, since in many practical situations good source models are not available, but it is possible to collect source samples, called the training data, to gain information about the source statistics. Then the goal is to design a quantizer of a given rate, based on this data, whose average distortion on the source is as close to the distortion of the optimal quantizer (that is, one with minimum distortion) of the same rate as possible.

The usual, quite intuitive approach to this problem is empirical error minimization, which is based on the concept that if the training data describes the source statistics accurately, then a quantizer that performs well on the training samples should also have a good performance on the real source. Most existing design algorithms employ this principle, and search for an empirically optimal quantizer, i.e., a quantizer minimizing the empirical error on the training data, expecting that it will have near-optimal performance when applied to the real source. (The reader is referred to Gersho and Gray [8] for a good summary of such algorithms.) Indeed, under general conditions on the source distribution, Pollard [20], [21] showed that this method is consistent when the training data consists of $n$ consecutive elements of a stationary and ergodic sequence drawn according to the source distribution: he proved that the mean-squared error (MSE) distortion $D(Q_n^*)$ of the empirically optimal quantizer $Q_n^*$ (when applied to the real source) converges with probability one to the minimum MSE $D^*$ achieved by an optimal quantizer.

Obviously, the above consistency result does not provide any information on how many training samples are needed to ensure that the distortion of the empirically optimal quantizer is close to the optimum.

This question can be answered by analyzing the rate of convergence in $D(Q_n^*) \to D^*$, that is, by giving finite sample upper bounds for the distortion redundancy $D(Q_n^*) - D^*$. Linder et al. [16] showed that the expected distortion redundancy (with respect to the training data) can be bounded as $\mathbf{E} D(Q_n^*) - D^* \le c/\sqrt{n}$ for some appropriate constant $c$ for all source distributions over a given bounded region. More precisely, in [16] only an $O(\sqrt{\log n/n})$ rate was shown, supported with a discussion on how to improve the convergence rate to $O(1/\sqrt{n})$, but in the latter case the resulting constant was impractically large. A practically applicable constant can be obtained by combining the results of [16] with recent results of Linder [14]. (See also [15] for a summary.) This result has been extended in many ways. An extension to vector quantizers designed for noisy channels or for "noisy" sources was given by Linder et al. [17], an extension to unbounded sources was provided by Merhav and Ziv [19], while the case of dependent (mixing) training data was examined by Zeevi [24].

Bartlett et al. [3] showed that the $O(1/\sqrt{n})$ bound on the expected distortion redundancy is tight in the minimax sense. They proved that (for at least three quantization levels) for any empirical quantizer design method, that is, when the resulting quantizer $Q_n$ is an arbitrary function of the training data, and for any $n$ large enough, there is a distribution in the class of distributions over a bounded region such that $\mathbf{E} D(Q_n) - D^* > c/\sqrt{n}$. These "bad" distributions are quite simple; e.g., the distributions used in the proof are concentrated on finitely many atoms. However, the minimax lower bound gives information about the maximum distortion within the class, but not about the behavior of the distortion for a single fixed source distribution as the sample size $n$ increases. Moreover, the chosen "bad" distributions in the proof of the above result are different for all $n$, allowing the possibility that the upper bound may be improved in an individual sense, that is, that a faster rate of convergence may be achievable, where the constant in the bound also depends on the (fixed) source distribution. Finding the best such individual rate (or weak rate) for the class of sources over a bounded region was labeled in [3] as an interesting and challenging problem.

There are some results suggesting that the convergence rate can be improved to $O(1/n)$: in the special case of a one-level quantizer, the code point of the empirically (MSE) optimal quantizer is simply the average of the training samples, and it is easy to see that in this case $\mathbf{E} D(Q_n^*) - D^* = c/n$, where $c$ is the variance of the source. Also, based on another result of Pollard [22] showing that for sources with continuous densities satisfying certain regularity properties the suitably scaled difference of the code points of the optimal and the empirically optimal quantizers has asymptotically a multidimensional normal distribution, Chou [4] pointed out that for such sources the distortion redundancy decreases as $O(1/n)$ in probability.

In this correspondence, we provide improved upper bounds on the convergence rate of the expected distortion redundancy individually for source distributions within the class of distributions over a bounded region. In Theorem 1, we show that $\mathbf{E} D(Q_n^*) - D^* \le c(\mu, N)/n$ for all source distributions $\mu$ concentrated on a finite set, where the constant $c(\mu, N)$ depends on the actual source distribution and the number of quantization levels $N$. The convergence rate for general source distributions is considered in Theorem 2. It is shown that for source distributions over a bounded region satisfying a certain regularity condition, the expected distortion redundancy can be upper-bounded by $c(\mu, N)\log n/n$, where the actual value of the constant again depends on the actual source distribution. In Corollary 1, we prove that source distributions with bounded support satisfying essentially the same conditions as in [22] satisfy the requirements of Theorem 2, and in Corollary 2 we show that the conditions of Corollary 1 hold for scalar sources having strictly log-concave densities with bounded support (such as the truncated Gaussian distribution), and for the uniform distribution, implying an $O(\log n/n)$ expected distortion redundancy. To give more insight into the problem we also illustrate that similar conditions of Hartigan [10] (valid only in one dimension) also imply the condition of Theorem 2.

Comparing our results with [3], it follows that the problem of empirical quantizer design is an interesting example of the unusual situation in which the orders of the minimax lower bound and the individual upper bound are different.

II. EMPIRICAL VECTOR QUANTIZER DESIGN

A $d$-dimensional $N$-level vector quantizer is a measurable mapping $Q: \mathbb{R}^d \to C$, where the codebook $C = \{y_1, \ldots, y_N\} \subset \mathbb{R}^d$ is a collection of $N$ distinct $d$-vectors, called the code points. The quantizer is completely characterized by its codebook and the sets

$S_i = \{x \in \mathbb{R}^d : Q(x) = y_i\}, \quad i = 1, \ldots, N$

called the cells or partition cells (as they form a partition of $\mathbb{R}^d$) via the rule

$Q(x) = y_i, \quad \text{if } x \in S_i.$

The set $\{S_1, \ldots, S_N\}$ is called the partition of $Q$. Throughout this correspondence, unless explicitly stated otherwise, all quantizers are assumed to be $d$-dimensional with $N$ code points.

The source to be quantized is a random vector $X \in \mathbb{R}^d$ with distribution $\mu$. We assume $\mathbf{E}\{\|X\|^2\} < \infty$, where $\|\cdot\|$ denotes the Euclidean norm. The performance of the quantizer $Q$ in quantizing $X$ is measured by the average (mean-squared) distortion

$D(Q) = \mathbf{E}\{\|X - Q(X)\|^2\}.$


A quantizer $Q^*$ achieving the minimum distortion $D^*$ is called optimal. Thus, in this case

$D^* = D(Q^*) \le D(Q), \quad \text{for all } Q \in \mathcal{Q}_N$

where $\mathcal{Q}_N$ denotes the set of all $N$-level quantizers. It is well known (see, e.g., [13], [8]) that any optimal quantizer satisfies the centroid and nearest neighbor conditions, also known as the Lloyd–Max conditions.

The quantizer $Q$ satisfies the centroid condition if each code point is chosen to minimize the distortion over its associated cell, that is,

$\mathbf{E}\{\|X - y_i\|^2 \mid X \in S_i\} = \min_y \mathbf{E}\{\|X - y\|^2 \mid X \in S_i\}$  (1)

and so

$y_i = \mathbf{E}\{X \mid X \in S_i\}$

for all $i = 1, \ldots, N$. A partition $\{S_1, \ldots, S_N\}$ is optimal if the quantizer $Q$ with cells $S_1, \ldots, S_N$ satisfying the centroid condition is optimal. $Q$ is called a nearest neighbor quantizer if it satisfies

$\|x - Q(x)\| = \min_i \|x - y_i\|, \quad \text{for all } x \in \mathbb{R}^d.$  (2)

Note that

i) a nearest neighbor quantizer is determined by its codebook $\{y_1, \ldots, y_N\}$ with ties arbitrarily broken;

ii) for any non-nearest-neighbor quantizer $Q'$, a nearest neighbor quantizer $Q$ with the same codebook has at most the same distortion as $Q'$, that is, $D(Q) \le D(Q')$, regardless of the distribution of $X$.

Thus, any optimal quantizer can be assumed to be a nearest neighbor quantizer, and so finding an optimal quantizer is equivalent to finding its codebook. Using this observation, Pollard [21] proved that if $\mathbf{E}\{\|X\|^2\} < \infty$, then there exists an optimal quantizer (which may not be unique).

In many situations, the distribution $\mu$ is unknown, and the only available information about it is given in the form of training data, that is, a sequence $X_1^n = X_1, \ldots, X_n$ of $n$ independent and identically distributed (i.i.d.) copies of $X$. The sequence $X_1^n$ is also assumed to be independent of $X$. $X_1^n$ is used to construct an empirically designed quantizer $Q_n(\cdot) = Q_n(\cdot; X_1, \ldots, X_n)$, which is a random function depending on the training data. The goal is to produce such quantizers with performance near $D^*$. The performance of $Q_n$ in quantizing $X$ is measured by the test distortion

$D(Q_n) = \mathbf{E}\{\|X - Q_n(X)\|^2 \mid X_1^n\} = \int \|x - Q_n(x)\|^2 \mu(dx).$

Note that $D(Q_n)$ is a random variable.

The empirical distortion (or training distortion) of any $Q$ is given by its MSE in quantizing the training data:

$D_n(Q) = \frac{1}{n}\sum_{i=1}^n \|X_i - Q(X_i)\|^2.$

Note that although $Q$ is a deterministic mapping, the empirical distortion $D_n(Q)$ is also a random variable depending on the training data $X_1^n$.

Assume that $Q_n^*$ minimizes the empirical distortion, that is,

$D_n(Q_n^*) = \min_{Q \in \mathcal{Q}_N} D_n(Q).$

Then $Q_n^*$ (which is a specific empirically designed quantizer) is called an empirically optimal vector quantizer. $Q_n^*$ is an optimal quantizer for the empirical distribution $\mu_n$ of the training data given as

$\mu_n(A) = \frac{1}{n}\sum_{i=1}^n I_{\{X_i \in A\}}$

for every Borel set $A \subset \mathbb{R}^d$, where $I_E$ denotes the indicator function of the event $E$. Note that $Q_n^*$ always exists (although it is not necessarily unique), and we can assume that it is a nearest neighbor quantizer.

Using $Q_n^*$ as an approximation of the optimal $Q^*$ is consistent in the sense that its test distortion converges to the optimal distortion, that is,

$\lim_{n\to\infty} D(Q_n^*) = D^*$

almost surely for any $N \ge 1$ if $\mathbf{E}\{\|X\|^2\} < \infty$; see [20], [21].

III. RATE OF CONVERGENCE

To determine the number of training samples necessary to achieve a preassigned level of distortion, the finite sample behavior of the expected distortion redundancy

$\mathbf{E} D(Q_n^*) - D^*$

has to be analyzed. To do this we assume the peak power constraint

$\mathbf{P}\{\|X\| \le B\} = 1$  (3)

and this assumption will be in effect throughout the correspondence. In other words, the distribution $\mu$ of the source $X$ is an element of $\mathcal{P}(B)$, the family of distributions supported on the sphere

$S(B) = \{x \in \mathbb{R}^d : \|x\| \le B\}.$

An important consequence of (3) is that it is sufficient for our purpose to consider only quantizers with code points in the sphere $S(B)$, since otherwise projecting a code point that is not in $S(B)$ to the surface of $S(B)$ clearly reduces the distortion.

It is of interest how fast $\mathbf{E} D(Q_n^*)$ converges to $D^*$. To our knowledge, the best accessible upper bound can be obtained by combining results of [16] with recent developments of [14], implying

$0 \le \sup_{\mu \in \mathcal{P}(B)} (\mathbf{E} D(Q_n^*) - D^*) \le c_u B^2 \sqrt{\frac{Nd}{n}}$  (4)

for all $n \ge 1$, where $c_u = 192$. A natural question is whether there exists a method, perhaps different from empirical distortion minimization, which provides an empirically designed quantizer with substantially smaller test distortion. In case of $N = 1$, it is easy to see that

$\mathbf{E} D(Q_n^*) - D^* = \frac{\mathrm{Var}(X)}{n}.$

Thus, the convergence rate is $O(1/n)$, substantially faster than the $O(1/\sqrt{n})$ rate above. However, for $N \ge 3$, the lower bound in [3] shows that the $O(1/\sqrt{n})$ convergence rate above cannot be improved in the minimax sense: there it is proved that if $N \ge 3$, then for any empirically designed quantizer $Q_n$ trained on $n \ge n_0 = 10^4 N$ samples, we have

$\sup_{\mu \in \mathcal{P}(B)} (\mathbf{E} D(Q_n) - D^*) \ge c_l B^2 \sqrt{\frac{N^{1-4/d}}{n}}$  (5)


where $c_l \approx 2.671\cdot 10^{-11}$. This result has recently been improved in [1], extending the results to the case $N = 2$, and improving the constant $c_l$ to approximately $1.68\cdot 10^{-4}$ and the constant $n_0$ to $8N$.

The results (4) and (5) imply that there exist positive constants $c_l(N, B, d)$ and $c_u(N, B, d)$ depending on $N$, $B$, and $d$ such that

$\frac{c_l(N, B, d)}{\sqrt{n}} \le \inf_{Q_n} \sup_{\mu \in \mathcal{P}(B)} (\mathbf{E} D(Q_n) - D^*) \le \frac{c_u(N, B, d)}{\sqrt{n}}$

where the infimum is taken over all empirically designed quantizers $Q_n$ (recall that $Q_n = Q_n(\cdot; X_1, \ldots, X_n)$ is a function $Q_n: \mathbb{R}^d \times \mathbb{R}^{dn} \to \mathbb{R}^d$). That is, the minimax bounds on the rate of convergence are $\Theta(1/\sqrt{n})$. The minimax lower bound expresses the minimum worst case error achievable for a given sample size $n$ over the distribution class $\mathcal{P}(B)$, by describing the behavior of any quantizer design method for a source distribution which is least suitable for the given $n$ and the given design method.

The "bad" distributions achieving the supremum of the expected distortion redundancy in (5) may be different for each $n$. Indeed, in the construction of [3] (or [1]), although the bad distributions are concentrated on the same finitely many atoms for each $n$, the exact probability mass function of the bad distribution depends on $n$. Thus, the bound does not describe the behavior of the distortion redundancy for a single fixed source distribution. For example, it does not exclude the possibility that for some sequence of empirical quantizers $\{Q_n\}$, $\mathbf{E} D(Q_n) - D^*$ converges to $0$ at an $O(1/n)$ rate for every fixed $\mu$, that is, it may be possible to get faster individual upper bounds of the form

$\mathbf{E} D(Q_n) - D^* \le \frac{c(\mu, N)}{n}$  (6)

for each $\mu \in \mathcal{P}(B)$ and $n \ge 1$, where the constant $c(\mu, N)$ depends on the source distribution. Upper bounds of this type are the main purpose of this correspondence. Next we show (6) for discrete distributions, and with an additional factor $\log n$ for general distributions satisfying some regularity condition. The proofs are deferred to the next section.

Our first result shows that the expected distortion redundancy converges to $0$ at a rate $O(1/n)$ for any fixed source distribution concentrated on a finite set of points in $\mathbb{R}^d$.

Theorem 1: Assume that the source distribution $\mu$ is concentrated on finitely many atoms. Then

$\mathbf{E} D(Q_n^*) - D^* \le \frac{c(\mu, N)}{n}$

where the constant $c(\mu, N)$ depends on $\mu$ and $N$.

The main idea in the proof is that with high probability, the empirical and the real source distributions are so close that the corresponding optimal quantizer partitions coincide, and in this case we only need to find the centroids of these partitions. Proving similar rate of convergence results for more general source distributions is significantly harder, since the partitions of optimal quantizers for "close" distributions are different in general.
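The following simulation sketch (assuming NumPy; the three-atom scalar source, $N = 2$, and all parameter values are hypothetical choices for illustration) is consistent with Theorem 1: $n(\mathbf{E} D(Q_n^*) - D^*)$ stays roughly constant as $n$ grows. For a scalar source and $N = 2$, empirical distortion minimization can be carried out exactly by comparing the two contiguous splits of the ordered support.

```python
import numpy as np

atoms = np.array([0.0, 1.0, 4.0])          # hypothetical 3-atom scalar source
probs = np.array([0.3, 0.4, 0.3])
rng = np.random.default_rng(0)

def split_distortion(w, k):
    """Distortion and centroid code points of the split {first k atoms | rest}
    under the weight vector w (a probability vector over the atoms)."""
    d, cbs = 0.0, []
    for lo, hi in ((0, k), (k, len(atoms))):
        ws, xs = w[lo:hi], atoms[lo:hi]
        c = (ws * xs).sum() / ws.sum() if ws.sum() > 0 else xs[0]
        cbs.append(c)
        d += (ws * (xs - c) ** 2).sum()
    return d, cbs

def empirically_optimal(w):
    """Exact empirical distortion minimization for d = 1, N = 2: optimal nearest
    neighbor cells are intervals, so only the contiguous splits compete."""
    return min((split_distortion(w, k) for k in (1, 2)), key=lambda t: t[0])[1]

def test_distortion(cbs):
    """True distortion of the nearest neighbor quantizer with code points cbs."""
    cbs = np.asarray(cbs)
    j = np.abs(atoms[:, None] - cbs[None, :]).argmin(axis=1)
    return (probs * (atoms - cbs[j]) ** 2).sum()

D_star = min(split_distortion(probs, k)[0] for k in (1, 2))
for n in (50, 100, 200, 400):
    reds = [test_distortion(empirically_optimal(rng.multinomial(n, probs) / n)) - D_star
            for _ in range(2000)]
    print(n, n * np.mean(reds))            # n * redundancy: roughly constant
```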

Next we give conditions on the source distribution which ensure that the expected distortion redundancy converges to $0$ at rate $O(\log n/n)$. For a nearest neighbor quantizer $Q$ let

$\Delta_Q(x) = \|x - Q(x)\|^2 - \|x - Q_Q^*(x)\|^2$

for all $x \in S(B)$, where $Q_Q^*$ is the "closest" optimal nearest neighbor quantizer to $Q$ in the sense that it achieves the minimum

$\min_{\hat Q: D(\hat Q) = D^*} \mathrm{Var}\{\|X - Q(X)\|^2 - \|X - \hat Q(X)\|^2\}.$  (7)

If the minimizing $\hat Q$ is not unique, $Q_Q^*$ can be chosen arbitrarily from among the optimal nearest neighbor quantizers realizing the above minimum. Note that the minimum can always be achieved, as can be seen by a continuity-compactness type argument. This minimization is introduced to avoid problems that occur if the optimal quantizer $Q^*$ is not unique.

Theorem 2: Assume that $\mathbf{P}\{\|X\| \le B\} = 1$, and let $Q_n^*$ be an empirically optimal quantizer. Assume furthermore that

$A = \inf_{Q: D(Q) > D^*} \frac{\mathbf{E}\{\Delta_Q(X)\}}{\mathrm{Var}\{\Delta_Q(X)\}} > 0$  (8)

where the infimum is taken over all nonoptimal nearest neighbor quantizers having all their code points in the sphere $S(B)$. Then

$\mathbf{E} D(Q_n^*) - D^* \le \frac{c_1 \log n}{n} + \frac{c_2}{n}$  (9)

with constants

$c_1 = 4dN \max\left\{\frac{e-2}{A},\, 4B^2\right\}$

and

$c_2 = 4N \max\left\{\frac{e-2}{A},\, 4B^2\right\} \log\left(V\left(\frac{3eB}{N\sqrt{d}\max\{\frac{e-2}{A},\, 4B^2\}}\right)^d\right)$

where $\log$ denotes the natural logarithm and $V$ denotes the volume of the sphere $S(B)$.

The proof of the theorem is based on a proof of [17] combined with a technique developed by Barron [2] and Lee et al. [12] (see also, e.g., [9, Ch. 16]). The essence of the latter, which is an interesting result in itself, is formulated in Lemma 1 in the next section.

Remark 1: It is expected that at the expense of a more complicated analysis, the $\log n$ term can be removed from the upper bound (9), giving the desired $O(1/n)$ rate.

Remark 2: The constants in the preceding theorem can be slightly improved to

$c_1 = \frac{16dNB^2}{G^{-1}(4AB^2)}$

and

$c_2 = \frac{16NB^2}{G^{-1}(4AB^2)} \log\left(V\left(\frac{3eG^{-1}(4AB^2)}{4BN\sqrt{d}}\right)^d\right)$

where $G^{-1}$ is the inverse of the function $G$ given as

$G(c) = \frac{e^c - c - 1}{c}, \quad \text{for } c > 0.$  (10)

This is shown at the end of the proof of the theorem.

Condition (8) is hard to check for general source distributions; therefore, the scope of Theorem 2 is not clear. The next corollary shows that the theorem is valid for sources with continuous densities satisfying certain regularity properties.

Let $\{y_1^*, \ldots, y_N^*\}$ be the code points of an optimal quantizer for $\mu$ and let $\{S_1^*, \ldots, S_N^*\}$ denote the corresponding nearest neighbor cells. It is known (see [22, Lemma C and Theorem]) that if $\mu$ has a continuous density $f$ with bounded support, then the distortion $D(y_1, \ldots, y_N) = D(Q)$ of the nearest neighbor quantizer $Q$ with code points $\{y_1, \ldots, y_N\}$ is a continuous function of the vector $(y_1, \ldots, y_N)$ which has at $(y_1^*, \ldots, y_N^*)$ a second derivative block matrix $\Gamma(y_1^*, \ldots, y_N^*) = [\Gamma_{ij}(y_1^*, \ldots, y_N^*)]$ made up of the $d \times d$ blocks

$\Gamma_{ij}(y_1^*, \ldots, y_N^*) = \begin{cases} 2\mu(S_i^*)I_d - 2\sum_{l \ne i}\int_{F_{il}} \frac{f(x)(x - y_i^*)(x - y_i^*)^T}{\|y_l^* - y_i^*\|}\,\lambda_{d-1}(dx), & j = i \\ 2\int_{F_{ij}} \frac{f(x)(x - y_i^*)(x - y_j^*)^T}{\|y_j^* - y_i^*\|}\,\lambda_{d-1}(dx), & j \ne i \end{cases} \quad (1 \le i, j \le N)$

where $F_{ij}$ is the (possibly empty) common face of $S_i^*$ and $S_j^*$ (it is a convex set in a $(d-1)$-dimensional hyperplane), $I_d$ is the $d \times d$ identity matrix, and $\lambda_{d-1}$ is the $(d-1)$-dimensional Lebesgue measure.¹ It is clear that since $\{y_1^*, \ldots, y_N^*\}$ is an optimal codebook, the matrix $\Gamma(y_1^*, \ldots, y_N^*)$ is positive semidefinite. The next corollary shows that if $\Gamma(y_1^*, \ldots, y_N^*)$ is also positive definite, then the desired $O(\log n/n)$ convergence rate can be established.

¹Note that Pollard [22] made a slight error in the derivation of $\Gamma_{ij}$ for $i \ne j$, and arrived at a formula with an incorrect sign.

Corollary 1: Assume that the random variable $X$ has a continuous density supported in $S(B)$, and the matrix $\Gamma(y_1^*, \ldots, y_N^*)$ is positive definite for all optimal codebooks. Then

$\mathbf{E} D(Q_n^*) - D^* = O\!\left(\frac{\log n}{n}\right).$

The conditions of Corollary 1 on the distribution are essentially the same as those of Pollard [22] (and of Chou [4]). The conditions in [22] are weaker in the sense that there the usual assumption of $X$ having a bounded support is replaced by a tail condition. This extension might be possible for Corollary 1 and Theorem 2 at the expense of some complication in the proof. On the other hand, while Pollard assumes the uniqueness of an optimal quantizer, we allow multiple optima. However, if the set of optimal quantizers (each represented by the $N$-vector of its codebook) has an accumulation point, then usually $\Gamma$ is not positive definite for all optimal codebooks. Thus, Corollary 1 is not applicable, for example, for a multidimensional truncated Gaussian distribution (although we suspect that the results can be sharpened to include such cases as well). Nevertheless, there are cases when the optimal quantizer is not unique and the set of optimal codebooks does not have an accumulation point. For example, for scalar sources with symmetric, not log-concave densities with a large spike in the middle, usually two asymmetric optimal two-level quantizers exist (if the density is log-concave, then the optimal quantizer is unique [7], [23]).

Note that Corollary 1 implies Chou's result for bounded distributions with an additional $\log n$ factor. However, the other direction would require some kind of uniform integrability of the random variables $\{n(D(Q_n^*) - D^*)\}_{n \ge 1}$, which can be arbitrarily large as $n$ goes to infinity, even for bounded source distributions.

Although it is not easy to determine in general whether or not the matrix $\Gamma(y_1^*, \ldots, y_N^*)$ is positive definite, sufficient conditions can be obtained easily in the scalar case. For example, it is easy to show that $\Gamma(y_1^*, \ldots, y_N^*)$ is positive definite for the uniform distribution, where the unique optimal quantizer is the $N$-level uniform quantizer. More general sufficient conditions can be given based on a result of Fleischer [7], who, while proving the uniqueness of an optimal quantizer, also showed that if the source has a density $f$ for which the derivative $d^2 \log f(x)/dx^2$ is negative over its total support, then the matrix $\Gamma(y_1^*, \ldots, y_N^*)$ is positive definite for all optimal codebooks, and hence for the unique optimal codebook. A slight modification of Fleischer's original proof allows us to replace the condition on the second derivative by the assumption that $\log f(x)$ is a strictly concave function (then $f$ is called a strictly log-concave function). Thus, we obtain the following result.

Corollary 2: Assume that the scalar random variable $X$ has a strictly log-concave density $f$ supported in the interval $[-B, B]$ (that is, $\log f(x)$ is strictly concave over its support), or that $X$ is uniformly distributed in $[-B, B]$. Then

$\mathbf{E} D(Q_n^*) - D^* = O\!\left(\frac{\log n}{n}\right).$

We note here that Fleischer's result proving that the matrix $\Gamma(y_1^*, \ldots, y_N^*)$ is positive definite shows that scalar sources with strictly log-concave densities and sufficiently light tails (such as a Gaussian source) satisfy the conditions of Pollard's central limit theorem [22] (a fact that escaped Pollard's attention), which also implies, by [4], that for such sources the distortion redundancy is $O(1/n)$ in probability.
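As an illustration of Corollary 2, consider the Gaussian density truncated to $[-B, B]$: there $\log f(x) = \mathrm{const} - x^2/(2\sigma^2)$ is strictly concave, so the corollary applies and the optimal quantizer is unique. The sketch below (assuming NumPy; the grid discretization, $B$, and $\sigma = 1$ are illustrative choices) finds the optimal two-level codebook by iterating the Lloyd–Max conditions on a discretized density:

```python
import numpy as np

B = 1.0
x = np.linspace(-B, B, 200001)
dx = x[1] - x[0]
f = np.exp(-x ** 2 / 2)            # Gaussian with sigma = 1, truncated to [-B, B]
f /= f.sum() * dx                  # normalize to a density on [-B, B]

y1, y2 = -0.5, 0.5                 # initial code points
for _ in range(200):
    t = (y1 + y2) / 2              # nearest neighbor cell boundary
    left, right = x < t, x >= t
    y1 = (x[left] * f[left]).sum() / f[left].sum()    # centroid condition
    y2 = (x[right] * f[right]).sum() / f[right].sum()
print(y1, y2)                      # converges to the unique optimum (-y*, y*)
```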

While Corollary 1 uses the nearest neighbor condition to capture the optimality of quantizers, another approach is to use the centroid condition instead. This approach was used by Hartigan [10], a precursor of [22] for one dimension, who applied differentiation with respect to the quantization thresholds instead of differentiation with respect to the code points. This method is illustrated in the next example for the scalar case when $N = 2$.

Example 1: For $d = 1$ and $N = 2$, define the split function

$D(t) = \mathrm{Var}\{X \mid X < t\}\mathbf{P}\{X < t\} + \mathrm{Var}\{X \mid X \ge t\}\mathbf{P}\{X \ge t\}$

that is, the minimal distortion corresponding to the partition $\{(-\infty, t), [t, \infty)\}$, and let $t^* = (y_1^* + y_2^*)/2$ denote the cell boundary (or threshold) of an optimal quantizer with codebook $\{y_1^*, y_2^*\}$. Then if the scalar random variable $X$ has a continuous density $f$ in the interval $[-B, B]$, and $\frac{d^2 D}{dt^2}(t^*)$ is positive for any optimal threshold $t^*$, then

$\mathbf{E} D(Q_n^*) - D^* = O\!\left(\frac{\log n}{n}\right).$

To see this, rewrite $D(t)$ as

$D(t) = \mathbf{E}\{X^2\} - \mathbf{E}^2\{X \mid X < t\}\,\mathbf{P}\{X < t\} - \mathbf{E}^2\{X \mid X \ge t\}\,\mathbf{P}\{X \ge t\}.$

Then

$\frac{d^2 D}{dt^2}(t^*) = f(t^*)(y_2^* - y_1^*)\left(2 - \frac{f(t^*)(y_2^* - y_1^*)}{2\,\mathbf{P}\{X < t^*\}\,\mathbf{P}\{X \ge t^*\}}\right)$

and the assumption $\frac{d^2 D}{dt^2}(t^*) > 0$ implies that

$2\Gamma_{1,1}(y_1^*, y_2^*) = 4\mathbf{P}\{X < t^*\} - f(t^*)(y_2^* - y_1^*) \ge 4\mathbf{P}\{X < t^*\}\mathbf{P}\{X \ge t^*\} - f(t^*)(y_2^* - y_1^*) = \det \Gamma(y_1^*, y_2^*) > 0$

and thus $\Gamma(y_1^*, y_2^*)$ is positive definite for any optimal codebook; the result then follows by Corollary 1.
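The split-function criterion of Example 1 is easy to check numerically. The sketch below (assuming NumPy; it reuses the truncated Gaussian of the previous sketch as an illustrative density) evaluates $D(t)$ on a grid, locates the optimal threshold $t^*$, and verifies that the second difference of $D$ at $t^*$ is positive:

```python
import numpy as np

B = 1.0
x = np.linspace(-B, B, 20001)
dx = x[1] - x[0]
f = np.exp(-x ** 2 / 2)
f /= f.sum() * dx                   # truncated Gaussian density on [-B, B]

def split_D(t):
    """Split function: within-cell variance times cell probability, summed."""
    d = 0.0
    for cell in (x < t, x >= t):
        p = f[cell].sum() * dx
        if p > 0:
            m = (x[cell] * f[cell]).sum() * dx / p
            d += ((x[cell] - m) ** 2 * f[cell]).sum() * dx
    return d

ts = np.linspace(-0.9, 0.9, 361)
D = np.array([split_D(t) for t in ts])
i = int(D.argmin())
h = ts[1] - ts[0]
print("t* ~", ts[i], "  D''(t*) ~", (D[i - 1] - 2 * D[i] + D[i + 1]) / h ** 2)
```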



Finally, a comparison of the results in this section with [3] shows that the problem of empirical quantizer design is an interesting example of the unusual situation in which the orders of the minimax lower bound and the individual upper bound are different (the latter being smaller).

IV. PROOFS

Proof of Theorem 1: Let $S = \{x_1, \ldots, x_m\} \subset \mathbb{R}^d$ denote the support of $\mu$ with corresponding probability masses $p_i = \mathbf{P}\{X = x_i\} > 0$, $i = 1, \ldots, m$. Assume $m \ge N + 1$, since otherwise

$\mathbf{E} D(Q_n^*) \le 4B^2 \sum_{i=1}^m (1 - p_i)^n$

by the power constraint (3).

Let $\Pi_\mu$ be the set of optimal partitions for $\mu$. Let $\mathcal{P}_n^* = \{S_{n,1}^*, \ldots, S_{n,N}^*\}$ denote the partition of an empirically optimal quantizer $Q_n^*$. Clearly, $\mathcal{P}_n^* \in \Pi_{\mu_n}$. By the centroid rule (1) the code point $y_{n,i}^*$ associated with the cell $S_{n,i}^*$ is the average of the samples falling into this cell. This can be computed whenever $\mu_n(S_{n,i}^*)$, the fraction of samples falling into the cell, is positive. Without loss of generality, we can assume that otherwise $S_{n,i}^* = \emptyset$, in which case the definition of $y_{n,i}^*$ is immaterial, since keeping only the code points corresponding to the nonempty cells can only increase the test distortion of $Q_n^*$, and hence increase the expected distortion redundancy.

Decompose the expected distortion redundancy of $Q_n^*$ in the following way:

$\mathbf{E} D(Q_n^*) - D^* = \mathbf{E}\{I_{\{\mathcal{P}_n^* \in \Pi_\mu\}}(D(Q_n^*) - D^*)\} + \mathbf{E}\{I_{\{\mathcal{P}_n^* \notin \Pi_\mu\}}(D(Q_n^*) - D^*)\}.$  (11)

Let $\bar y_{n,i}^* = \mathbf{E}\{X \mid X \in S_{n,i}^*\}$ for all $i$; now if $\mathcal{P}_n^* \in \Pi_\mu$ then the quantizer with partition $\mathcal{P}_n^*$ and codebook $\{\bar y_{n,1}^*, \ldots, \bar y_{n,N}^*\}$ is optimal for $\mu$, and so for the first term of (11) we have

$\mathbf{E}\{I_{\{\mathcal{P}_n^* \in \Pi_\mu\}}(D(Q_n^*) - D^*)\} = \mathbf{E}\left\{I_{\{\mathcal{P}_n^* \in \Pi_\mu\}}\sum_{i=1}^N \|y_{n,i}^* - \bar y_{n,i}^*\|^2 \mu(S_{n,i}^*)\right\} = \mathbf{E}\left\{I_{\{\mathcal{P}_n^* \in \Pi_\mu\}}\sum_{i=1}^N I_{\{\mu_n(S_{n,i}^*) > 0\}}\|y_{n,i}^* - \bar y_{n,i}^*\|^2 \mu(S_{n,i}^*)\right\}.$  (12)

For any partition $\mathcal{P} = \{S_1, \ldots, S_N\} \in \Pi_\mu$, let $y_{n,i}$ be the average of the samples falling into $S_i$ (it can be defined arbitrarily if $\mu_n(S_i) = 0$) and let $\bar y_i = \mathbf{E}\{X \mid X \in S_i\}$. Then (12) can be continued as

$\mathbf{E}\left\{I_{\{\mathcal{P}_n^* \in \Pi_\mu\}}\sum_{i=1}^N I_{\{\mu_n(S_{n,i}^*) > 0\}}\|y_{n,i}^* - \bar y_{n,i}^*\|^2 \mu(S_{n,i}^*)\right\}$
$\le \mathbf{E}\left\{\sum_{\mathcal{P}: \mathcal{P} \in \Pi_\mu}\sum_{i=1}^N I_{\{\mu_n(S_i) > 0\}}\|y_{n,i} - \bar y_i\|^2 \mu(S_i)\right\}$
$= \sum_{\mathcal{P}: \mathcal{P} \in \Pi_\mu}\sum_{i=1}^N \mathbf{E}\left\{I_{\{\mu_n(S_i) > 0\}}\,\mathbf{E}\{\|y_{n,i} - \bar y_i\|^2 \mid I_{\{X_1 \in S_i\}}, \ldots, I_{\{X_n \in S_i\}}\}\right\}\mu(S_i)$
$= \sum_{\mathcal{P}: \mathcal{P} \in \Pi_\mu}\sum_{i=1}^N \mathbf{E}\left\{I_{\{\mu_n(S_i) > 0\}}\frac{\mathrm{Var}\{X \mid X \in S_i\}}{n\mu_n(S_i)}\right\}\mu(S_i)$
$= \sum_{\mathcal{P}: \mathcal{P} \in \Pi_\mu}\sum_{i=1}^N \mathbf{E}\left\{\frac{I_{\{B_i > 0\}}}{B_i}\right\}\mathrm{Var}\{X \mid X \in S_i\}\mu(S_i)$
$\le \sum_{\mathcal{P}: \mathcal{P} \in \Pi_\mu}\sum_{i=1}^N \frac{2\,\mathrm{Var}\{X \mid X \in S_i\}}{n+1}$  (13)
$\le \frac{2B^2|\Pi_\mu|N}{n+1}$  (14)

where $B_i$ has a binomial distribution with parameters $n$ and $\mu(S_i)$, (13) holds by Devroye et al. [6, Lemma A.2], and $|\Pi_\mu|$ denotes the cardinality of the set $\Pi_\mu$.

We also need to bound the second term of (11). Pollard [21] proved that the set of optimal partitions is continuous with respect to the $L_2$ Wasserstein distance of the distributions (defined by $W(\mu, \mu') = \inf_{(X,Y)}(\mathbf{E}\|X - Y\|^2)^{1/2}$, where the infimum is taken over all joint distributions of $(X, Y)$ such that $X$ and $Y$ have distributions $\mu$ and $\mu'$, respectively), where a sequence of sets $\Pi_k$ of partitions converges to $\Pi$ if the corresponding characteristic functions converge, that is, $I_{\{\mathcal{P} \in \Pi_k\}} \to I_{\{\mathcal{P} \in \Pi\}}$ as $k \to \infty$ for each partition $\mathcal{P}$ of $S$. Moreover, it follows from [18] that the $L_2$ metric

$\rho(\mu, \mu') = \left(\sum_{i=1}^m (\mu(x_i) - \mu'(x_i))^2\right)^{1/2}$

for the family of distributions concentrated on $S$ is as strong as the Wasserstein distance, that is, $\rho(\mu_n, \mu) \to 0$ if and only if $W(\mu_n, \mu) \to 0$. Therefore, if $\rho(\mu_n, \mu) \to 0$ for the sequence $\{\mu_n\}$ of the empirical distributions, then $\Pi_{\mu_n} \to \Pi_\mu$. Since the number of possible partitions of $S$ is finite, this implies that there is a $\delta > 0$ such that

$\Pi_{\mu_n} = \Pi_\mu, \quad \text{for } \rho(\mu_n, \mu) < \delta.$

Since $\mathcal{P}_n^* \in \Pi_{\mu_n}$, this and the power constraint (3) imply

$\mathbf{E}\{I_{\{\mathcal{P}_n^* \notin \Pi_\mu\}}(D(Q_n^*) - D^*)\} \le 4B^2\,\mathbf{P}\{\mathcal{P}_n^* \notin \Pi_\mu\} \le 4B^2\,\mathbf{P}\{\rho(\mu_n, \mu) \ge \delta\}.$

Applying Markov's inequality we obtain

$\mathbf{P}\{\rho(\mu_n, \mu) \ge \delta\} = \mathbf{P}\left\{\sum_{i=1}^m (\mu_n(x_i) - p_i)^2 \ge \delta^2\right\} \le \frac{\mathbf{E}\{\sum_{i=1}^m (\mu_n(x_i) - p_i)^2\}}{\delta^2} = \frac{\sum_{i=1}^m p_i(1 - p_i)}{n\delta^2} \le \frac{1}{n\delta^2}.$

Thus, from (11), (12), and (14) we obtain

$\mathbf{E} D(Q_n^*) - D^* \le 2B^2\left(\frac{|\Pi_\mu|N}{n+1} + \frac{2}{n\delta^2}\right) \le \frac{c(\mu, N)}{n}$

where

$c(\mu, N) = 2B^2(|\Pi_\mu|N + 2/\delta^2).$

(Note that $|\Pi_\mu|$ is bounded above by some function of $N$ and $m$.)

The proof of Theorem 2 is based on the following lemma.

Lemma 1: Let $X_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, N$, be random variables such that for each fixed $j$, $X_{1j}, \ldots, X_{nj}$ are i.i.d. and for each $s_0 \ge s > 0$

$\mathbf{E}\{e^{sX_{ij}}\} \le e^{s^2\sigma_j^2}.$

For $\mu_j > 0$, put

$\rho = \min_{j \le N} \frac{\mu_j}{\sigma_j^2}.$

Then

$\mathbf{E}\left\{\max_{j \le N}\left(\frac{1}{n}\sum_{i=1}^n X_{ij} - \mu_j\right)\right\} \le \frac{\log N}{\min\{\rho, s_0\}\, n}.$  (15)

If

$\mathbf{E}\{X_{ij}\} = 0$

and

$|X_{ij}| \le K$

then, for any $L > 0$,

$\mathbf{E}\left\{\max_{j \le N}\left(\frac{1}{n}\sum_{i=1}^n X_{ij} - \mu_j\right)\right\} \le \frac{K\log N}{\min\left\{\frac{K\Sigma^* L^2}{e^L - 1 - L},\, L\right\} n}$  (16)

where

$\Sigma^* = \min_{j \le N} \frac{\mu_j}{\mathrm{Var}(X_{ij})}.$

Proof: With the notation

$Y_j = \frac{1}{n}\sum_{i=1}^n X_{ij} - \mu_j$

we have, for any $s_0 \ge s > 0$,

$\mathbf{E}\{e^{snY_j}\} = \mathbf{E}\{e^{sn(\frac{1}{n}\sum_i X_{ij} - \mu_j)}\} = e^{-sn\mu_j}(\mathbf{E}\{e^{sX_{ij}}\})^n \le e^{-sn\mu_j}e^{ns^2\sigma_j^2} = e^{-sn(\mu_j - s\sigma_j^2)}.$

Thus,

$e^{sn\mathbf{E}\{\max_j Y_j\}} \le \mathbf{E}\{e^{sn\max_j Y_j}\} = \mathbf{E}\left\{\max_{j \le N} e^{snY_j}\right\} \le \sum_{j \le N}\mathbf{E}\{e^{snY_j}\} \le \sum_{j \le N} e^{-sn(\mu_j - s\sigma_j^2)}.$

For $s = \min\{\rho, s_0\}$ this implies that

$\mathbf{E}\left\{\max_{j \le N} Y_j\right\} \le \frac{1}{sn}\log\sum_{j \le N} e^{-sn(\mu_j - s\sigma_j^2)} \le \frac{\log N}{\min\{\rho, s_0\}\, n}$

since $s \le \rho$ implies $\mu_j - s\sigma_j^2 \ge 0$ for every $j$. In order to prove the second half of the lemma, notice that for any $L > 0$ and $|x| \le L$ we have the inequality

$e^x = 1 + x + x^2\sum_{i=2}^\infty \frac{x^{i-2}}{i!} \le 1 + x + x^2\sum_{i=2}^\infty \frac{L^{i-2}}{i!} = 1 + x + x^2\,\frac{e^L - 1 - L}{L^2};$

therefore $0 < s \le s_0 = L/K$ implies $s|X_{ij}| \le L$, so

$e^{sX_{ij}} \le 1 + sX_{ij} + (sX_{ij})^2\,\frac{e^L - 1 - L}{L^2}.$

Thus,

$\mathbf{E}\{e^{sX_{ij}}\} \le 1 + s^2\mathrm{Var}(X_{ij})\,\frac{e^L - 1 - L}{L^2} \le e^{s^2\mathrm{Var}(X_{ij})\frac{e^L - 1 - L}{L^2}}$

so (16) follows from (15).

Remark 3: Equation (16) implies different bounds for different choices of $L$. If $L = 1$ then

$\frac{K\log N}{\min\{K\Sigma^*/(e-2),\, 1\}\, n} = \frac{K}{\min\{K\Sigma^*/(e-2),\, 1\}}\cdot\frac{\log N}{n} = \max\left\{\frac{e-2}{\Sigma^*},\, K\right\}\frac{\log N}{n}.$

Hamers and Kohler [11] derived a similar bound

$\left(\frac{1}{2\Sigma^*} + \frac{2K}{3}\right)\frac{\log N}{n}.$

There is a better choice of $L$. Introduce the function

$G(L) = \frac{e^L - L - 1}{L}, \quad \text{for } L > 0$

and choose $L$ such that

$\frac{K\Sigma^* L^2}{e^L - 1 - L} = L$

i.e.,

$K\Sigma^* = G(L)$

i.e.,

$L = G^{-1}(K\Sigma^*);$

then (16) implies the bound

$\frac{K\log N}{G^{-1}(K\Sigma^*)\, n}.$  (17)

Remark 4: In its spirit, Lemma 1 is similar to the inequality of Devroye and Lugosi [5], with the essential difference that they considered the maximum of zero-mean random variables:

$\mathbf{E}\left\{\max_{j \le N}\frac{1}{n}\sum_{i=1}^n X_{ij}\right\} \le \sqrt{\frac{2\max_{j \le N}\sigma_j^2\,\log N}{n}}.$
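A quick simulation sketch (assuming NumPy; the Rademacher variables and the values of $n$, $N$, and $\mu_j$ are illustrative) checks the $L = 1$ form of (16) from Remark 3, that is, that the empirical average of $\max_j[\frac{1}{n}\sum_i X_{ij} - \mu_j]$ stays below $\max\{(e-2)/\Sigma^*, K\}\log N/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, K, mu, trials = 100, 1000, 1.0, 0.1, 200
acc = 0.0
for _ in range(trials):
    X = rng.choice([-1.0, 1.0], size=(n, N))       # zero mean, |X_ij| <= K, Var = 1
    acc += (X.mean(axis=0) - mu).max()
lhs = acc / trials                                 # estimate of E max_j [mean_j - mu_j]
Sigma = mu / 1.0                                   # Sigma* = min_j mu_j / Var(X_ij)
rhs = max((np.e - 2) / Sigma, K) * np.log(N) / n
print(lhs, "<=", rhs)
```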

Proof of Theorem 2: Following the proof of [17], consider a cubic grid of width $\delta$ in $S(B)$ with the minimum number of grid points. It can be seen that if the origin of the grid is uniformly distributed in the $d$-dimensional cube with edge length $\delta$ centered at the origin of the sphere, then the expected number of grid points inside the ball is $V/\delta^d$. Thus, the grid in $S(B)$ with the minimum number of points has at most $V/\delta^d$ points. Let $\mathcal{Q}'_N$ denote the set of $N$-level nearest neighbor quantizers which have all their code points on this grid. Since for any $y \in S(B)$ there is a $y'$ on the grid such that $\|y - y'\| \le \sqrt{d}\,\delta$, for any quantizer $Q$ with code points inside $S(B)$ there is a $Q' \in \mathcal{Q}'_N$ such that

$\sup_{x \in S(B)} |\|x - Q(x)\|^2 - \|x - Q'(x)\|^2| \le 4B\sqrt{d}\,\delta.$

Letting $\epsilon = 4B\sqrt{d}\,\delta$, specifically there is a quantizer $Q'_n \in \mathcal{Q}'_N$ satisfying

$\sup_{x \in S(B)} |\|x - Q_n^*(x)\|^2 - \|x - Q'_n(x)\|^2| \le \epsilon$

with probability $1$, since the code points of an empirically optimal quantizer are almost surely concentrated in $S(B)$. Concerning the cardinality of $\mathcal{Q}'_N$ we have

$|\mathcal{Q}'_N| \le \left(\frac{V}{\delta^d}\right)^N = V^N(4B\sqrt{d})^{dN}\epsilon^{-dN}.$  (18)

For $Q = Q'_n$, let $\bar Q_n$ denote the optimal quantizer achieving the minimum in (7). Using the empirical optimality of $Q_n^*$ we proceed with the following decomposition:

$\mathbf{E}\{\|X - Q_n^*(X)\|^2 \mid X_1^n\} - D^*$
$\le \mathbf{E}\{\|X - Q_n^*(X)\|^2 \mid X_1^n\} - \mathbf{E}\{\|X - \bar Q_n(X)\|^2 \mid X_1^n\} - \frac{2}{n}\sum_{i=1}^n (\|X_i - Q_n^*(X_i)\|^2 - \|X_i - \bar Q_n(X_i)\|^2)$
$\le 3\epsilon + \mathbf{E}\{\|X - Q'_n(X)\|^2 \mid X_1^n\} - \mathbf{E}\{\|X - \bar Q_n(X)\|^2 \mid X_1^n\} - \frac{2}{n}\sum_{i=1}^n (\|X_i - Q'_n(X_i)\|^2 - \|X_i - \bar Q_n(X_i)\|^2)$  (19)

with probability $1$. Then

$\mathbf{E}\{\|X - Q_n^*(X)\|^2\} - D^*$
$\le \mathbf{E}\left\{\mathbf{E}\{\|X - Q'_n(X)\|^2 \mid X_1^n\} - \mathbf{E}\{\|X - \bar Q_n(X)\|^2 \mid X_1^n\} - \frac{2}{n}\sum_{i=1}^n (\|X_i - Q'_n(X_i)\|^2 - \|X_i - \bar Q_n(X_i)\|^2)\right\} + 3\epsilon$
$= \mathbf{E}\left\{\mathbf{E}\{\Delta_{Q'_n}(X) \mid X_1^n\} - \frac{2}{n}\sum_{i=1}^n \Delta_{Q'_n}(X_i)\right\} + 3\epsilon$
$\le \mathbf{E}\left\{\max_{Q \in \mathcal{Q}'_N}\left(\mathbf{E}\{\Delta_Q(X)\} - \frac{2}{n}\sum_{i=1}^n \Delta_Q(X_i)\right)\right\} + 3\epsilon$
$= \mathbf{E}\left\{\max_{Q \in \mathcal{Q}'_N}\left(\frac{1}{n}\sum_{i=1}^n 2(\mathbf{E}\{\Delta_Q(X)\} - \Delta_Q(X_i)) - \mathbf{E}\{\Delta_Q(X)\}\right)\right\} + 3\epsilon$
$\le 4\max\left\{\frac{e-2}{A},\, 4B^2\right\}\frac{\log|\mathcal{Q}'_N|}{n} + 3\epsilon$  (20)
$\le 4\max\left\{\frac{e-2}{A},\, 4B^2\right\}\frac{\log(V^N(4B\sqrt{d})^{dN}\epsilon^{-dN})}{n} + 3\epsilon$  (21)

where (20) follows from Lemma 1 with $L = 1$ and

$X_{i,Q} = 2(\mathbf{E}\{\Delta_Q(X)\} - \Delta_Q(X_i)) \quad \text{and} \quad \mu_Q = \mathbf{E}\{\Delta_Q(X)\},$

and (21) follows from (18). Choosing

$\epsilon = \frac{4\max\{(e-2)/A,\, 4B^2\}\, dN}{3n}$

completes the proof of the theorem.

The improved constants of Remark 2 can be obtained by using the improved version of Lemma 1, with bound (17) instead of the $L = 1$ bound, in (20).

Proof of Corollary 1: We only need to show that condition (8) is satisfied under the assumptions of the corollary; then the result follows by Theorem 2. It is easy to see that if (8) does not hold, then there is a sequence of strictly suboptimal quantizers $Q_n \in \mathcal{Q}_N$ converging to an optimal quantizer $Q^*$, in the sense that $y_{n,i} \to y_i^*$ for all $i$ where $\{y_{n,1}, \ldots, y_{n,N}\}$ denotes the codebook of $Q_n$ and $\{y_1^*, \ldots, y_N^*\}$ denotes the codebook of $Q^*$, such that

$\lim_{n\to\infty}\frac{\mathbf{E}\{\Delta_{Q_n}(X)\}}{\mathrm{Var}\{\Delta_{Q_n}(X)\}} = 0.$  (22)

In what follows, we will show that

$\liminf_{n\to\infty}\frac{\mathbf{E}\{\|X - Q_n(X)\|^2 - \|X - Q^*(X)\|^2\}}{\mathrm{Var}\{\|X - Q_n(X)\|^2 - \|X - Q^*(X)\|^2\}} > 0$  (23)

which readily implies that (22) cannot hold, proving the corollary.

Since

$D(y_1, \ldots, y_N) = \mathbf{E}\{\|X - Q(X)\|^2\}$

is twice differentiable with respect to the vector $\mathbf{y} = (y_1, \ldots, y_N) \in \mathbb{R}^{dN}$ in a neighborhood of $\mathbf{y}^* = (y_1^*, \ldots, y_N^*)$, where $Q$ is a nearest neighbor quantizer with codebook $\{y_1, \ldots, y_N\}$ [22], $D(y_1, \ldots, y_N)$ has the following Taylor expansion:

$D(y_1, \ldots, y_N) = D(y_1^*, \ldots, y_N^*) + \frac{dD(\mathbf{y}^*)}{d\mathbf{y}}(\mathbf{y} - \mathbf{y}^*) + \frac{1}{2}(\mathbf{y} - \mathbf{y}^*)^T\Gamma(\mathbf{y}^*)(\mathbf{y} - \mathbf{y}^*) + o(\|\mathbf{y} - \mathbf{y}^*\|^2)$

where $dD(\mathbf{y}^*)/d\mathbf{y}$ denotes the vector formed by the partial derivatives of $D$ with respect to its variables at $\mathbf{y}^*$. It is also shown in [22] that the derivative $dD(\mathbf{y})/d\mathbf{y}$ is made up of the $d$-vectors

$\partial D(y_1, \ldots, y_N)/\partial y_i = -2\,\mathbf{E}\{I_{\{X \in S_i\}}(X - y_i)\}.$

However, since $\mathbf{y}^*$ minimizes $D(\mathbf{y})$, we have $dD(\mathbf{y}^*)/d\mathbf{y} = 0$ (here $0$ denotes the zero vector), and so

$\mathbf{E}\{\|X - Q(X)\|^2 - \|X - Q^*(X)\|^2\} = D(y_1, \ldots, y_N) - D(y_1^*, \ldots, y_N^*) = \frac{1}{2}(\mathbf{y} - \mathbf{y}^*)^T\Gamma(\mathbf{y}^*)(\mathbf{y} - \mathbf{y}^*) + o(\|\mathbf{y} - \mathbf{y}^*\|^2).$

Furthermore, since $\Gamma(\mathbf{y}^*)$ is positive definite by assumption, its smallest eigenvalue $\gamma$ is positive, and for any $\mathbf{a} \in \mathbb{R}^{dN}$ we have

$\mathbf{a}^T\Gamma(\mathbf{y}^*)\mathbf{a} \ge \gamma\|\mathbf{a}\|^2$

which in turn implies that

$\mathbf{E}\{\|X - Q(X)\|^2 - \|X - Q^*(X)\|^2\} \ge \frac{\gamma}{2}\|\mathbf{y} - \mathbf{y}^*\|^2 + o(\|\mathbf{y} - \mathbf{y}^*\|^2).$  (24)

Next we bound the variance $\mathrm{Var}\{\|X - Q(X)\|^2 - \|X - Q^*(X)\|^2\}$. Notice that $\|x - Q(x)\|^2$ can be decomposed as

$\|x - Q(x)\|^2 = \sum_{i=1}^N \|x - y_i\|^2 I_{\{x \in S_i^*\}} + \sum_{i=1}^N \sum_{1 \le j \le N,\, j \ne i}(\|x - y_j\|^2 - \|x - y_i\|^2) I_{\{x \in S_j \cap S_i^*\}}$

where, as usual, $S_j$ denotes the cell of $Q$ corresponding to the code point $y_j$. Therefore,

$\|x - Q(x)\|^2 - \|x - Q^*(x)\|^2 = \sum_{i=1}^N (\|x - y_i\|^2 - \|x - y_i^*\|^2) I_{\{x \in S_i^*\}} + \sum_{i=1}^N \sum_{1 \le j \le N,\, j \ne i}(\|x - y_j\|^2 - \|x - y_i\|^2) I_{\{x \in S_j \cap S_i^*\}} = \sum_{i=1}^N J_i(x) + \sum_{i=1}^N \sum_{1 \le j \le N,\, j \ne i} K_{i,j}(x)$
