
1.3 Representation of joint distributions

1.3.3 Maximal correlation and optimal representations

From now on, we will intensively use analogues of the separation theorems for the singular values and eigenvalues of matrices, see [Rao]. In view of these, the SVD (1.27) gives the solution of the following task of maximal correlation, posed by Gebelein [Geb] and Rényi [Reny59a].

We are looking for ψ ∈ H and φ ∈ H' such that their correlation is maximal with respect to their joint distribution W. Using (1.25) and the separation theorems,
\[
\max_{\psi \in H,\ \varphi \in H'} \mathrm{Corr}_W(\psi, \varphi) \;=\; \max_{\|\psi\| = \|\varphi\| = 1} \mathrm{Cov}_W(\psi, \varphi) \;=\; s_1 ,
\]
and it is attained on the non-trivial pair ψ_1, φ_1. In the finite, symmetric case, maximal correlation is related to certain conditional probabilities in [Bol-Mol02].

The maximal correlation task is equivalent to the following:

\[
\min_{\|\psi\| = \|\varphi\| = 1} \|\psi - \varphi\|^2 \;=\; \min_{\|\psi\| = \|\varphi\| = 1} \bigl( \|\psi\|^2 + \|\varphi\|^2 - 2\,\mathrm{Cov}_W(\psi, \varphi) \bigr) \;=\; 2(1 - s_1). \tag{1.28}
\]

Correspondence analysis ([Benz, Green], and [Bol87b]) is, on the one hand, a special case of the maximal correlation problem, X and Y being finite sets; on the other hand, it is a generalization in that we successively find maximal correlations under certain orthogonality constraints.

The product space is now an m × n contingency table with row set X = {1, . . . , m} and column set Y = {1, . . . , n}, where the entries w_ij ≥ 0 (with Σ_{i=1}^m Σ_{j=1}^n w_ij = 1) embody the joint distribution over the product space, with the row-sums p_1, . . . , p_m and column-sums q_1, . . . , q_n as marginal distributions.

Hence, the effect of P_X : H' → H, P_X φ = ψ, is the following:
\[
\psi(i) \;=\; \frac{1}{p_i} \sum_{j=1}^{n} w_{ij}\, \varphi(j) \;=\; \sum_{j=1}^{n} \frac{w_{ij}}{p_i q_j}\, \varphi(j)\, q_j , \qquad i = 1, \dots, m. \tag{1.29}
\]
Therefore, P_X is an integral operator with kernel K_ij = w_ij/(p_i q_j) (instead of integration, we have summation with respect to the marginal measure Q).
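As a concrete illustration of (1.29), the following minimal Python sketch (the contingency table and the function φ are made-up examples, not taken from the text) forms the marginals, builds the kernel K_ij = w_ij/(p_i q_j), and checks that applying the kernel against the measure Q reproduces the conditional expectation:

```python
import numpy as np

# Hypothetical 3x4 contingency table; the entries sum to 1 (joint distribution W).
W = np.array([[0.10, 0.05, 0.05, 0.05],
              [0.05, 0.15, 0.10, 0.05],
              [0.05, 0.05, 0.10, 0.20]])

p = W.sum(axis=1)          # row marginals p_1, ..., p_m
q = W.sum(axis=0)          # column marginals q_1, ..., q_n
K = W / np.outer(p, q)     # kernel K_ij = w_ij / (p_i q_j)

phi = np.array([1.0, -0.5, 0.3, 0.7])   # an arbitrary function on the columns

# psi(i) = (1/p_i) sum_j w_ij phi(j) = sum_j K_ij phi(j) q_j,  cf. (1.29)
psi_direct = (W @ phi) / p
psi_kernel = K @ (phi * q)              # summation with respect to the measure Q
assert np.allclose(psi_direct, psi_kernel)
print(psi_direct)
```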

Consider the SVD
\[
P_X \;=\; \sum_{k=1}^{r-1} s_k \langle \cdot , \varphi_k \rangle_{H'}\, \psi_k ,
\]
where r is the now finite rank of the contingency table (r ≤ min{m, n}). The singular value s_0 = 1 with the trivial factor pair ψ_0 = 1, φ_0 = 1 is disregarded, as the expectations of ψ_0 and φ_0 are 1 with respect to the P- and Q-measures, respectively; therefore, the summation starts from 1. If we used the kernel K_ij − 1, we could eliminate the trivial factors. Assume that there is no other singular value 1, i.e., that our contingency table is non-degenerate. Then, by orthogonality, the subsequent left- and right-hand side singular functions have zero expectation with respect to the P- and Q-measures, and they solve the following successive maximal correlation problem: for k = 1, . . . , r−1, in step k we want to find max Corr_W(ψ, φ) subject to

\[
\mathrm{Var}_P(\psi) = \mathrm{Var}_Q(\varphi) = 1, \qquad \mathrm{Cov}_P(\psi, \psi_i) = \mathrm{Cov}_Q(\varphi, \varphi_i) = 0, \quad i = 0, 1, \dots, k-1.
\]

(Note that the condition for i = 0 is equivalent to E_P(ψ) = E_Q(φ) = 0.) By [Bol87b], the maximum is s_k, and it is attained on the pair ψ_k, φ_k.

Now we are able to define the joint representation of the general Hilbert spaces H and H' – introduced in Subsection 1.3.1 – with respect to the joint measure W in the following way.

Definition 12 We say that the pair (X, Y) of k-dimensional random vectors, with components in H and H', respectively, forms a k-dimensional representation of the product space endowed with the measure W if E_P XX^T = I_k and E_Q YY^T = I_k (i.e., the components of X and Y are uncorrelated with zero expectation and unit variance, respectively), and the joint distribution of X_i and Y_i is W (i = 1, . . . , k). Further, the cost of this representation is defined as

\[
Q_k(X, Y) = \mathbb{E}_W \| X - Y \|^2 .
\]

The pair (X, Y) is an optimal representation if it minimizes the above cost.

Analogously to the finite case, the following representation theorem was proved in [Bol13].

Theorem 11 ([Bol13]) Representation theorem for joint distributions. Let W be a joint distribution with marginal distributions P and Q. Assume that among the singular values of the conditional expectation operator P_X : H' → H (see (1.27)) there are at least k positive ones, and denote by 1 > s_1 ≥ s_2 ≥ · · · ≥ s_k > 0 the k largest ones. The minimum cost of a k-dimensional representation is 2 Σ_{i=1}^k (1 − s_i), and it is attained with X = (ψ_1, . . . , ψ_k) and Y = (φ_1, . . . , φ_k), where ψ_i, φ_i is the function pair corresponding to the singular value s_i (i = 1, . . . , k).
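For orientation, a short expansion shows where the cost formula of Theorem 11 comes from: under the constraints of Definition 12 (zero expectations, unit variances),
\[
Q_k(X, Y) \;=\; \sum_{i=1}^{k} \mathbb{E}_W (X_i - Y_i)^2 \;=\; \sum_{i=1}^{k} \bigl( \mathrm{Var}_P X_i + \mathrm{Var}_Q Y_i - 2\, \mathrm{Cov}_W(X_i, Y_i) \bigr) \;=\; 2k - 2 \sum_{i=1}^{k} \mathrm{Cov}_W(X_i, Y_i),
\]
so minimizing the cost amounts to maximizing the sum of the k covariances, which, by the separation theorems, is at most s_1 + · · · + s_k and is attained on the singular function pairs.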

We remark that when X and Y are finite sets, the solution corresponds to the SVD of the normalized contingency table. Although this matrix seemingly does not have the same normalization as the kernel, our numerical algorithm for the SVD of a rectangular matrix is capable of finding orthogonal singular vectors in the usual Euclidean norm, which corresponds to the Lebesgue measure and not to the P- or Q-measures. This is why, in correspondence analysis, we use the SVD of the matrix C_D and back-transform the singular vectors so as to get the representatives. Observe that if we have a non-degenerate contingency table, then s_i < 1 (i = 1, . . . , k), and therefore the minimum cost is strictly positive.
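A minimal numerical sketch of this recipe (Python/NumPy, with a hypothetical table; the normalization C = D_row^{-1/2} W D_col^{-1/2} is assumed here for the matrix whose SVD is taken): the trivial singular value 1 is discarded, and the singular vectors are back-transformed by the inverse square roots of the marginals, so that the resulting factors have zero expectation and unit variance with respect to the P- and Q-measures.

```python
import numpy as np

W = np.array([[0.10, 0.05, 0.05, 0.05],      # hypothetical joint distribution
              [0.05, 0.15, 0.10, 0.05],
              [0.05, 0.05, 0.10, 0.20]])
p, q = W.sum(axis=1), W.sum(axis=0)          # marginal distributions P and Q

# Normalized contingency table; its SVD is taken in the usual Euclidean norm.
C = np.diag(p**-0.5) @ W @ np.diag(q**-0.5)
U, s, Vt = np.linalg.svd(C, full_matrices=False)

k = 2                                        # s[0] = 1 is the trivial factor; skip it
s_k = s[1:k+1]
psi = (np.diag(p**-0.5) @ U)[:, 1:k+1]       # back-transformed left singular vectors
phi = (np.diag(q**-0.5) @ Vt.T)[:, 1:k+1]    # back-transformed right singular vectors

print(psi.T @ np.diag(p) @ psi)              # ~ I_k: uncorrelated, unit variance w.r.t. P
print(p @ psi, q @ phi)                      # ~ 0: zero expectations w.r.t. P and Q
print(2 * np.sum(1 - s_k))                   # minimum cost of Theorem 11
```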

In the symmetric case we can also define a representation. Now the X, X' pair is identically distributed, but usually not independent; they are connected by the symmetric joint measure W.

Definition 13 We say that the k-dimensional random vector X with components in H forms a k-dimensional representation of the product space H × H' (H and H' are isomorphic) endowed with the symmetric measure W (and marginal measure P) if E_P XX^T = I_k (i.e., the components of X are uncorrelated with zero expectation and unit variance). Further, the cost of this representation is defined as

\[
Q_k(X) = \mathbb{E}_W \| X - X' \|^2 ,
\]

where X and X' are identically distributed and the joint distribution of X_i and X'_i is W (i = 1, . . . , k). The random vector X is an optimal representation if it minimizes the above cost.

Theorem 12 ([Bol13]) Representation theorem for symmetric joint distributions.

Let W be a symmetric joint distribution with marginal P. Assume that among the eigenvalues of the conditional expectation operator P_X : H → H (H and H' are isomorphic) there are at least k positive ones, and denote by 1 > λ_1 ≥ λ_2 ≥ · · · ≥ λ_k > 0 the k largest ones. Then the minimum cost of a k-dimensional representation is 2 Σ_{i=1}^k (1 − λ_i), and it is attained by X = (ψ_1, . . . , ψ_k), where ψ_i is the eigenfunction corresponding to the eigenvalue λ_i (i = 1, . . . , k).

In the case of a finite X (the vertex set of an edge-weighted graph), we have a weighted graph with edge-weights w_ij (Σ_{i=1}^n Σ_{j=1}^n w_ij = 1). The operator P_X, deprived of the trivial factor, corresponds to its normalized modularity matrix with eigenvalues in the [−1, 1] interval (1 cannot be an eigenvalue if the underlying graph is connected), and with eigenfunctions which are the transformed eigenvectors. As the numerical algorithm gives an orthonormal set of eigenvectors in the Euclidean norm, some back-transformation is needed to get uncorrelated components with unit variance; therefore, we use the normalized modularity matrix instead of the kernel K_ij = w_ij/(d_i d_j) expected from (1.29), where d_i = Σ_{j∈X} w_ij is the generalized degree of vertex i (i ∈ X). We remark that the above formula for the kernel corresponds to the so-called copula transformation of the joint distribution W into the unit square. This idea appears when vertex- and edge-weighted graphs are transformed into step-functions over [0, 1] × [0, 1], see the definition of graphons (e.g., [Borgsetal1]). This transformation can be performed in the non-symmetric and non-finite cases too. Also observe that neither the kernel nor the contingency table or graph is changed under measure preserving transformations of X or Y, by the theory of exchangeable sequences and arrays. In particular, the labeling of the vertices or rows/columns is immaterial here.
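The following Python sketch illustrates this on a hypothetical symmetric weight matrix, assuming the normalized modularity matrix has the form D^{-1/2}(W − d d^T)D^{-1/2} (with the entries of W summing to 1 and d the vector of generalized degrees): the eigenvectors are orthonormal in the Euclidean norm, and back-transformation by D^{-1/2} yields components with zero expectation and unit variance with respect to the marginal measure D.

```python
import numpy as np

# Hypothetical 6-vertex graph: three tightly connected pairs, weakly joined;
# the edge-weights sum to 1.
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 0.15), (2, 3, 0.15), (4, 5, 0.15),
                (1, 2, 0.02), (3, 4, 0.02), (5, 0, 0.01)]:
    W[i, j] = W[j, i] = w
W /= W.sum()
d = W.sum(axis=1)                            # generalized degrees

# Normalized modularity matrix: the operator P_X deprived of the trivial factor.
M_D = np.diag(d**-0.5) @ (W - np.outer(d, d)) @ np.diag(d**-0.5)
lam, U = np.linalg.eigh(M_D)                 # orthonormal eigenvectors, Euclidean norm
lam, U = lam[::-1], U[:, ::-1]               # decreasing order of eigenvalues

k = 2                                        # for this graph, lambda_1, lambda_2 > 0
X = (np.diag(d**-0.5) @ U)[:, :k]            # back-transformed eigenvectors

print(lam[:k])                               # lambda_1 >= lambda_2 > 0
print(X.T @ np.diag(d) @ X)                  # ~ I_k: uncorrelated, unit variance w.r.t. D
print(d @ X)                                 # ~ 0: zero expectation w.r.t. D
print(2 * np.sum(1 - lam[:k]))               # minimum cost of Theorem 12
```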

In the framework of joint distributions, the Cheeger constant h(G) can be viewed as a conditional probability and related to the symmetric maximal correlation in the following way. The weight matrix W (with the sum of its entries being 1) defines a discrete symmetric joint distribution W with the same marginals D = {d_1, . . . , d_n}. Let H denote the Hilbert space of V → R random variables taking on at most n different values with probabilities d_1, . . . , d_n, and having zero expectation and finite variance. Let us take two identically distributed (i.d.) copies ψ, ψ' ∈ H with joint distribution W. Then, obviously,

\[
h(G) \;=\; \min_{\substack{B \subset \mathbb{R} \text{ Borel set},\ \psi, \psi' \in H \text{ i.d.} \\ P_D(\psi \in B) \le 1/2}} P_W(\psi' \in \bar{B} \mid \psi \in B).
\]
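For a finite graph the Borel sets reduce to vertex subsets, so for small examples the right-hand side can be evaluated by brute force: P_W(ψ' ∈ B̄ | ψ ∈ B) = w(B, B̄)/Vol(B), minimized over subsets B with Vol(B) ≤ 1/2. A sketch under these assumptions (the weight matrix is a made-up example):

```python
import numpy as np
from itertools import combinations

# Hypothetical symmetric weight matrix of a 4-vertex graph; entries sum to 1.
W = np.array([[0.00, 0.20, 0.03, 0.02],
              [0.20, 0.00, 0.05, 0.00],
              [0.03, 0.05, 0.00, 0.20],
              [0.02, 0.00, 0.20, 0.00]])
d = W.sum(axis=1)                     # marginal distribution D (generalized degrees)
n = len(d)

best = np.inf
for size in range(1, n):
    for B in combinations(range(n), size):
        B = list(B)
        vol = d[B].sum()              # P_D(psi in B)
        if vol <= 0.5:
            Bbar = [i for i in range(n) if i not in B]
            cond = W[np.ix_(B, Bbar)].sum() / vol   # P_W(psi' in Bbar | psi in B)
            best = min(best, cond)

print("h(G) =", best)
```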

The symmetric maximal correlation, defined by the symmetric joint distribution W, is the following (it was introduced in [Bol-Mol02]):

\[
r_1 \;=\; \max_{\psi, \psi' \in H \text{ i.d.}} \mathrm{Corr}_W(\psi, \psi') \;=\; \max_{\substack{\psi, \psi' \in H \text{ i.d.} \\ \mathrm{Var}_D \psi = 1}} \mathrm{Cov}_W(\psi, \psi').
\]

In view of (1.28) and Theorem 12, r_1 = 1 − λ_1, provided λ_1 ≤ 1.

With this notation, the result of Theorem 6 can be written in the following equivalent form.

Proposition 3 ([Bol-Mol02]) Let W be the symmetric joint distribution of two discrete random variables taking on at most n different values, where the joint probabilities of W are the entries of the n × n symmetric weight matrix W. If the symmetric maximal correlation r_1 is nonnegative, then the estimate
\[
\frac{1 - r_1}{2} \;\le\; \min_{\substack{B \subset \mathbb{R} \text{ Borel set},\ \psi, \psi' \in H \text{ i.d.} \\ P_D(\psi \in B) \le 1/2}} P_W(\psi' \in \bar{B} \mid \psi \in B) \;\le\; \sqrt{1 - r_1^2}
\]
holds, where we used the previous notation.
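A quick numerical check of these bounds under stated assumptions: r_1 is taken as the largest eigenvalue of the normalized modularity matrix D^{-1/2}(W − d d^T)D^{-1/2} (assumed to represent the symmetric maximal correlation when it is nonnegative, in line with Theorem 12), and the middle term is the brute-force minimum conditional probability from the previous sketch.

```python
import numpy as np
from itertools import combinations

W = np.array([[0.00, 0.20, 0.03, 0.02],     # same hypothetical weight matrix as above
              [0.20, 0.00, 0.05, 0.00],
              [0.03, 0.05, 0.00, 0.20],
              [0.02, 0.00, 0.20, 0.00]])
d = W.sum(axis=1)
n = len(d)

M_D = np.diag(d**-0.5) @ (W - np.outer(d, d)) @ np.diag(d**-0.5)
r1 = np.linalg.eigvalsh(M_D).max()          # assumed symmetric maximal correlation

# Minimum conditional probability over vertex subsets with volume at most 1/2.
mid = min(W[np.ix_(B, [j for j in range(n) if j not in B])].sum() / d[B].sum()
          for size in range(1, n)
          for B in map(list, combinations(range(n), size))
          if d[B].sum() <= 0.5)

print((1 - r1) / 2, "<=", mid, "<=", np.sqrt(1 - r1**2))
```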

Consequently, the symmetric maximal correlation regulates the minimum conditional probability that, provided a categorical random variable takes values in a category set (with probability at most 1/2), another copy of it (their joint distribution being W) takes values in the complementary category set. The larger r_1 is, the smaller this minimum conditional probability is. In particular, if r_1 is the largest absolute value eigenvalue of I − L_D (apart from the trivial 1), then r_1 is the usual maximal correlation of Gebelein and Rényi.

We also remark that either or both of the starting random variables ξ, η can as well be random vectors (with real components). For example, if they have p- and q-dimensional Gaussian distributions, respectively, then their maximal correlation is the largest canonical correlation between them, and it is realized by appropriate linear combinations of the components of ξ and η, respectively. Moreover, we can find the canonical correlations one after the other, with the corresponding function pairs (under some orthogonality constraints), as many as the rank of the cross-covariance matrix of ξ and η. In fact, the whole decomposition relies on the SVD of a matrix calculated from this cross-covariance matrix and the individual covariance matrices of ξ and η. Note that this SVD-based treatment of canonical correlation analysis was discussed in [Bol83] in detail.
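As an illustration of such an SVD-based treatment (a generic sketch on simulated data, not the specific algorithm of [Bol83]): with sample covariance estimates, the canonical correlations are the singular values of Σ_ξ^{-1/2} Σ_{ξη} Σ_η^{-1/2}, and the canonical variables are obtained by transforming the singular vectors back.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated p- and q-dimensional Gaussian vectors with some cross-dependence.
n, p, q = 2000, 3, 2
Z = rng.standard_normal((n, 1))
xi  = Z @ rng.standard_normal((1, p)) + rng.standard_normal((n, p))
eta = Z @ rng.standard_normal((1, q)) + rng.standard_normal((n, q))

# Centered sample covariance and cross-covariance matrices.
xi_c, eta_c = xi - xi.mean(0), eta - eta.mean(0)
S_xx, S_yy, S_xy = xi_c.T @ xi_c / n, eta_c.T @ eta_c / n, xi_c.T @ eta_c / n

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w**-0.5) @ V.T

# Canonical correlations: singular values of S_xx^{-1/2} S_xy S_yy^{-1/2}.
U, s, Vt = np.linalg.svd(inv_sqrt(S_xx) @ S_xy @ inv_sqrt(S_yy))
print("canonical correlations:", s)

# First canonical pair: linear combinations of the components of xi and eta.
a = inv_sqrt(S_xx) @ U[:, 0]
b = inv_sqrt(S_yy) @ Vt[0]
u, v = xi_c @ a, eta_c @ b
print("corr(u, v) =", np.corrcoef(u, v)[0, 1])   # ~ the largest canonical correlation
```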