2.2 Noisy contingency tables

2.2.4 Finding the blown-up skeleton

One might wonder where the singular values of an $m\times n$ matrix $A=(a_{ij})$ are located if $a:=\max_{i,j}|a_{ij}|$ is independent of $m$ and $n$. On the one hand, the maximum singular value cannot exceed $O(\sqrt{mn})$, as it is at most $\sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n}a_{ij}^{2}}$. On the other hand, let $Q$ be an $m\times n$ random matrix with entries $a$ or $-a$ (independently of each other). Consider the spectral norm of all such matrices and take the minimum of them: $\min_{Q\in\{-a,+a\}^{m\times n}}\|Q\|$.

This quantity measures the minimum linear structure that a matrix of the same size and magnitude as $A$ can possess. As the Frobenius norm of $Q$ is $a\sqrt{mn}$, in view of the inequalities between the spectral and Frobenius norms, the above minimum is at least $\frac{a}{\sqrt{2}}\sqrt{m+n}$, which is exactly the order of the spectral norm of a Wigner-noise. So an $m\times n$ random matrix with independent and uniformly bounded entries under very general circumstances has at least one singular value of order greater than $\sqrt{m+n}$. Assume there are $k$ such singular values and the representatives by means of the corresponding singular vector pairs can be well classified into $k$ clusters in terms of the $k$-variances. Under these conditions we can reconstruct a blown-up structure behind our matrix.
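The two norm bounds discussed at the beginning of this section can be illustrated numerically. The following is a minimal sketch, assuming NumPy is available; the function name `sign_matrix_norms` and the matrix sizes are hypothetical choices made only for illustration. It draws a single random sign matrix $Q$ and compares its spectral norm with the lower bound $\frac{a}{\sqrt{2}}\sqrt{m+n}$ and with the Frobenius norm $a\sqrt{mn}$. It does not compute the minimum over all sign matrices.

```python
import numpy as np

def sign_matrix_norms(m, n, a=1.0, seed=0):
    """Return (lower bound, spectral norm, Frobenius norm) for one random +-a matrix."""
    rng = np.random.default_rng(seed)
    Q = a * rng.choice([-1.0, 1.0], size=(m, n))   # independent entries a or -a
    spectral = np.linalg.norm(Q, 2)                # largest singular value of Q
    frobenius = np.linalg.norm(Q, "fro")           # equals a * sqrt(m * n)
    lower = a / np.sqrt(2.0) * np.sqrt(m + n)      # bound quoted in the text
    return lower, spectral, frobenius

# Hypothetical sizes, chosen only for the illustration.
lower, spectral, frobenius = sign_matrix_norms(200, 300)
print(lower, spectral, frobenius)
```

The ordering lower bound, spectral norm, Frobenius norm holds for every sign matrix, since $\|Q\|\ge\|Q\|_F/\sqrt{\min\{m,n\}}=a\sqrt{\max\{m,n\}}\ge a\sqrt{(m+n)/2}$.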

Theorem 23 ([Bol-Fr-Kr10]) Let $A_{m\times n}$ be a sequence of $m\times n$ matrices of uniformly bounded, nonnegative entries, where $m$ and $n$ tend to infinity. Assume that $A_{m\times n}$ has exactly $k$ singular values of order greater than $\sqrt{m+n}$ ($k$ is fixed). If there are integers $a\ge k$ and $b\ge k$ such that the $a$- and $b$-variances of the optimal row- and column-representatives are $O\left(\frac{m+n}{mn}\right)$, then there is an explicit construction for a blown-up matrix $B_{m\times n}$ (on $a\times b$ blocks) such that $A_{m\times n}=B_{m\times n}+E_{m\times n}$, with $\|E_{m\times n}\|=O(\sqrt{m+n})$.

Proof. In the sequel the subscripts $m$ and $n$ will be dropped, for notational convenience.

We will speak in terms of microarrays (genes and conditions). Let $y_1,\dots,y_k\in\mathbb{R}^m$ and $x_1,\dots,x_k\in\mathbb{R}^n$ denote the left- and right-hand side unit-norm singular vectors corresponding to $z_1,\dots,z_k$, the singular values of $A$ of order larger than $\sqrt{m+n}$. The $k$-dimensional representatives of the genes and conditions (the row vectors of the $m\times k$ matrix $Y=(y_1,\dots,y_k)$ and those of the $n\times k$ matrix $X=(x_1,\dots,x_k)$, respectively) form, by the assumption of the theorem, $a$ and $b$ clusters in $\mathbb{R}^k$, respectively, with sum of inner variances $O\left(\frac{m+n}{mn}\right)$. Reorder the rows and columns of $A$ according to their respective cluster memberships. Denote by $y^1,\dots,y^m\in\mathbb{R}^k$ and $x^1,\dots,x^n\in\mathbb{R}^k$ the Euclidean representatives of the genes and conditions (the rows of the reordered $Y$ and $X$), and let $\bar{y}^1,\dots,\bar{y}^a\in\mathbb{R}^k$ and $\bar{x}^1,\dots,\bar{x}^b\in\mathbb{R}^k$ denote the cluster centers, respectively. Now let us choose the following new representation of the genes and conditions. The genes' representatives are row vectors of the $m\times k$ matrix $\tilde{Y}$ such that the first $m_1$ rows of $\tilde{Y}$ are equal to $\bar{y}^1$, the next $m_2$ rows to $\bar{y}^2$, and so on; the last $m_a$ rows of $\tilde{Y}$ are equal to $\bar{y}^a$. Likewise, the conditions' representatives are row vectors of the $n\times k$ matrix $\tilde{X}$ such that the first $n_1$ rows of $\tilde{X}$ are equal to $\bar{x}^1$, the next $n_2$ rows to $\bar{x}^2$, and so on; the last $n_b$ rows of $\tilde{X}$ are equal to $\bar{x}^b$.
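The passage from the representatives to the step-wise matrix of cluster centers can be sketched as follows. This is a minimal illustration, assuming NumPy and assuming that the cluster labels are already available (for instance from a $k$-means run). The function name `step_representatives` and the toy data are hypothetical, and the inner variance is computed here as the unweighted sum of squared distances to the cluster centers, which is an assumption about the precise definition of $S_a^2(Y)$.

```python
import numpy as np

def step_representatives(Y, labels):
    """Sort rows by cluster label and replace each row by its cluster center."""
    order = np.argsort(labels, kind="stable")       # reordering by cluster membership
    Y_sorted, labels_sorted = Y[order], labels[order]
    Y_tilde = np.empty_like(Y_sorted)
    for c in np.unique(labels_sorted):
        members = labels_sorted == c
        Y_tilde[members] = Y_sorted[members].mean(axis=0)   # cluster center
    # Unweighted sum of squared distances to the centers (assumed form of S_a^2).
    inner_variance = np.sum((Y_sorted - Y_tilde) ** 2)
    return order, Y_tilde, inner_variance

# Hypothetical toy representatives: four rows in R^2 forming two clusters.
Y = np.array([[0.0, 1.0], [1.0, 0.0], [0.2, 1.0], [1.0, 0.2]])
labels = np.array([0, 1, 0, 1])
order, Y_tilde, inner_variance = step_representatives(Y, labels)
print(order, Y_tilde, inner_variance)
```

The columns of the returned matrix are step-vectors, that is, they are constant on each cluster.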

By the considerations of Theorem 20 and the assumption for the clusters,

\[
\sum_{i=1}^{k}\mathrm{dist}^2(y_i,F)=S_a^2(Y)=O\left(\frac{m+n}{mn}\right) \tag{2.44}
\]

and

\[
\sum_{i=1}^{k}\mathrm{dist}^2(x_i,G)=S_b^2(X)=O\left(\frac{m+n}{mn}\right) \tag{2.45}
\]

hold respectively, where the $k$-dimensional subspace $F\subset\mathbb{R}^m$ is spanned by the column vectors of $\tilde{Y}$, and the $k$-dimensional subspace $G\subset\mathbb{R}^n$ is spanned by the column vectors of $\tilde{X}$. We follow the construction given in Lemma 3 of a set $v_1,\dots,v_k$ of orthonormal vectors within $F$ and another set $u_1,\dots,u_k$ of orthonormal vectors within $G$ such that

\[
\sum_{i=1}^{k}\|y_i-v_i\|^2=\min_{v'_1,\dots,v'_k}\sum_{i=1}^{k}\|y_i-v'_i\|^2\le 2\sum_{i=1}^{k}\mathrm{dist}^2(y_i,F) \tag{2.46}
\]

and

\[
\sum_{i=1}^{k}\|x_i-u_i\|^2=\min_{u'_1,\dots,u'_k}\sum_{i=1}^{k}\|x_i-u'_i\|^2\le 2\sum_{i=1}^{k}\mathrm{dist}^2(x_i,G) \tag{2.47}
\]

hold, where the minimum is taken over orthonormal sets of vectors $v'_1,\dots,v'_k\in F$ and $u'_1,\dots,u'_k\in G$, respectively. The construction of the vectors $v_1,\dots,v_k$ is as follows ($u_1,\dots,u_k$ can be constructed in the same way). Let $v'_1,\dots,v'_k\in F$ be an arbitrary orthonormal system (obtained, e.g., by the Schmidt orthogonalization method; note that in Lemma 3 they were given at the beginning). Let $V'=(v'_1,\dots,v'_k)$ be an $m\times k$ matrix and let

\[
Y^TV'=QSZ^T
\]

be the SVD, where the matrix $S$ contains the singular values of the $k\times k$ matrix $Y^TV'$ in its main diagonal and zeros otherwise, while $Q$ and $Z$ are $k\times k$ orthogonal matrices (containing the corresponding unit-norm singular vector pairs in their columns). The orthogonal matrix $R=ZQ^T$ will give the convenient orthogonal rotation of the vectors $v'_1,\dots,v'_k$. That is, the column vectors of the matrix $V=V'R$ also form an orthonormal set, which is the desired set $v_1,\dots,v_k$. Define the error terms $r_i$ and $q_i$, respectively:

\[
r_i=y_i-v_i \quad\text{and}\quad q_i=x_i-u_i \qquad (i=1,\dots,k).
\]
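The rotation step just described can be sketched numerically. The following is a minimal illustration, assuming NumPy; the function name `rotate_basis_towards` and the randomly generated inputs are hypothetical and stand in for the singular vectors $y_1,\dots,y_k$ and for an arbitrary orthonormal basis $V'$ of a nearby subspace $F$. The sketch forms $R=ZQ^T$ from the SVD of $Y^TV'$ and compares the squared distances before and after the rotation with twice the sum of squared distances to $F$.

```python
import numpy as np

def rotate_basis_towards(Y, V_prime):
    """Return V = V' R, where R = Z Q^T comes from the SVD Y^T V' = Q S Z^T."""
    Q, S, Zt = np.linalg.svd(Y.T @ V_prime)    # Zt is Z^T
    R = Zt.T @ Q.T                             # k x k orthogonal rotation
    return V_prime @ R

rng = np.random.default_rng(0)
m, k = 50, 3                                   # hypothetical sizes
# Hypothetical orthonormal columns standing in for y_1, ..., y_k.
Y, _ = np.linalg.qr(rng.standard_normal((m, k)))
# Arbitrary orthonormal basis V' of a nearby k-dimensional subspace F.
V_prime, _ = np.linalg.qr(Y + 0.1 * rng.standard_normal((m, k)))
V = rotate_basis_towards(Y, V_prime)

dist2 = k - np.sum((V_prime.T @ Y) ** 2)       # sum_i dist^2(y_i, F)
before = np.sum((Y - V_prime) ** 2)            # sum_i ||y_i - v'_i||^2
after = np.sum((Y - V) ** 2)                   # sum_i ||y_i - v_i||^2
print(before, after, 2 * dist2)
```

The rotated columns stay orthonormal and remain in $F$, because $V$ is obtained from $V'$ by multiplication with an orthogonal $k\times k$ matrix.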

In view of (2.44)-(2.47),

\[
\sum_{i=1}^{k}\|r_i\|^2=O\left(\frac{m+n}{mn}\right) \quad\text{and}\quad \sum_{i=1}^{k}\|q_i\|^2=O\left(\frac{m+n}{mn}\right). \tag{2.48}
\]

Consider the following decomposition:

\[
A=\sum_{i=1}^{k}z_iy_ix_i^T+\sum_{i=k+1}^{\min\{m,n\}}z_iy_ix_i^T.
\]
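The split of $A$ into its leading $k$ singular triplets and the remainder can be checked with a short numerical sketch. It assumes NumPy; the function name `split_by_top_singular_values` and the random test matrix are hypothetical. The spectral norm of the remainder equals the $(k+1)$-th singular value of $A$.

```python
import numpy as np

def split_by_top_singular_values(A, k):
    """Return (head, tail, z): rank-k part, remainder, and all singular values."""
    Y, z, Xt = np.linalg.svd(A, full_matrices=False)
    head = (Y[:, :k] * z[:k]) @ Xt[:k]        # sum over i <= k of z_i y_i x_i^T
    tail = A - head                           # sum over i > k of z_i y_i x_i^T
    return head, tail, z

rng = np.random.default_rng(1)
A = rng.uniform(0.0, 1.0, size=(40, 60))      # hypothetical bounded, nonnegative data
head, tail, z = split_by_top_singular_values(A, k=1)
print(np.linalg.norm(tail, 2), z[1])
```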

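The spectral norm of the second term is at most of order $\sqrt{m+n}$. Now consider the first term:

\[
\sum_{i=1}^{k}z_iy_ix_i^T=\sum_{i=1}^{k}z_i(v_i+r_i)(u_i+q_i)^T=\sum_{i=1}^{k}z_iv_iu_i^T+\sum_{i=1}^{k}z_iv_iq_i^T+\sum_{i=1}^{k}z_ir_iu_i^T+\sum_{i=1}^{k}z_ir_iq_i^T. \tag{2.49}
\]

Since $v_i$ and $u_i$ are unit vectors, the last three terms can be estimated by means of the relations

\[
\|v_iu_i^T\|=1,\quad \|v_iq_i^T\|=\|q_i\|,\quad \|r_iu_i^T\|=\|r_i\|,\quad \|r_iq_i^T\|=\|r_i\|\cdot\|q_i\| \qquad (i=1,\dots,k).
\]

Taking into consideration that $z_i$ cannot exceed $\Theta(\sqrt{mn})$, while $k$ is fixed, and using (2.48), we get that the spectral norms of the last three terms in (2.49) (the triangle inequality is applicable to their finitely many subterms) are at most of order $\sqrt{m+n}$. Let $B$ be the first term, that is, $B=\sum_{i=1}^{k}z_iv_iu_i^T$. By construction, the vectors $v_1,\dots,v_k$ and $u_1,\dots,u_k$ are in the subspaces $F$ and $G$, respectively. Both spaces consist of step-vectors; thus the matrix $B$ is a blown-up matrix containing $a\times b$ blocks. The noise matrix is

\[
E=\sum_{i=1}^{k}z_iv_iq_i^T+\sum_{i=1}^{k}z_ir_iu_i^T+\sum_{i=1}^{k}z_ir_iq_i^T+\sum_{i=k+1}^{\min\{m,n\}}z_iy_ix_i^T,
\]

and $\|E\|=O(\sqrt{m+n})$ by the estimates above.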

Provided the conditions of Theorem 23 hold, the construction given in the proof above yields an algorithm that uses several SVDs and produces the blown-up matrix $B$. This $B$ can be considered as the best blown-up approximation of the microarray $A$.

At the same time, clusters of the genes and conditions are also obtained. More precisely, we first derive the clusters from the SVD of $A$, rearrange the rows and columns of $A$ accordingly, and afterwards use the above construction. If we decide to perform correspondence analysis on $A$, then by (2.35) and (2.39), $B_D$ will give a good approximation to $A_D$ and likewise, the correspondence vectors obtained by the SVD of $A_D$ will give representatives of the genes and conditions.
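The steps of the construction can be put together in a short sketch. It is a minimal illustration under stated assumptions, not a definitive implementation: NumPy is assumed, all function names are hypothetical, the cluster labels are taken as given (in practice they would come from clustering the representatives, e.g., by $k$-means), the rows and columns are left in place instead of being reordered, and the matrix of cluster centers is assumed to have rank $k$. The test matrix is a hypothetical $2\times 2$ block pattern with bounded noise added.

```python
import numpy as np

def cluster_center_matrix(M, labels):
    """Replace every row of M by the mean of its cluster (rows stay in place)."""
    M_tilde = np.empty_like(M)
    for c in np.unique(labels):
        M_tilde[labels == c] = M[labels == c].mean(axis=0)
    return M_tilde

def procrustes_basis(M, M_tilde):
    """Orthonormal basis of span(M_tilde), rotated to be closest to the columns of M."""
    V_prime, _ = np.linalg.qr(M_tilde)          # arbitrary orthonormal basis (rank k assumed)
    Q, _, Zt = np.linalg.svd(M.T @ V_prime)     # M^T V' = Q S Z^T
    return V_prime @ (Zt.T @ Q.T)               # V = V' R with R = Z Q^T

def blown_up_approximation(A, k, row_labels, col_labels):
    """Return (B, E) with A = B + E and B = sum_i z_i v_i u_i^T."""
    Y, z, Xt = np.linalg.svd(A, full_matrices=False)
    Y, z, X = Y[:, :k], z[:k], Xt[:k].T         # leading k singular triplets
    V = procrustes_basis(Y, cluster_center_matrix(Y, row_labels))
    U = procrustes_basis(X, cluster_center_matrix(X, col_labels))
    B = (V * z) @ U.T
    return B, A - B

# Hypothetical planted example: 2 row clusters, 2 column clusters, bounded noise.
rng = np.random.default_rng(0)
row_labels = np.repeat([0, 1], [30, 20])
col_labels = np.repeat([0, 1], [25, 35])
pattern = np.array([[0.9, 0.2], [0.2, 0.7]])
A = pattern[row_labels][:, col_labels] + rng.uniform(-0.1, 0.1, size=(50, 60))
B, E = blown_up_approximation(A, k=2, row_labels=row_labels, col_labels=col_labels)
print(np.linalg.norm(E, 2), np.linalg.norm(A, 2))
```

Since the columns of $V$ and $U$ are step-vectors over the row and column clusters, the resulting $B$ is constant on each block determined by the labels.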

Clustering microarray data via the $k$-means algorithm is also discussed in [Dhil], but with a different objective function. To find the SVD of large rectangular matrices, randomized algorithms are favored, e.g., [Ac-Mc]. In the case of random matrices with an underlying linear structure (outstanding singular values), the random noise of the algorithm is just added to the noise in our data, but their sum is also a Wigner-noise, so it does not change the effect of our algorithm in finding the clusters. Under the conditions of Theorem 23, the separated error matrix is comparable with the noise matrix, and this fact guarantees that the underlying block structure can be extracted.
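The remark that the randomness of such an algorithm acts as additional noise can be illustrated with a simple entry-sampling sketch. This is not the algorithm of [Ac-Mc]; it is a hypothetical stand-in, assuming NumPy, in which each entry is kept with probability $p$ and rescaled by $1/p$, so that the sampled matrix equals $A$ in expectation and the difference is a zero-mean random perturbation. The function name `sparsify_entries` and the test matrix are assumptions made for the illustration.

```python
import numpy as np

def sparsify_entries(A, p, rng):
    """Keep each entry with probability p (rescaled by 1/p), set it to zero otherwise."""
    mask = rng.random(A.shape) < p
    return np.where(mask, A / p, 0.0)

rng = np.random.default_rng(0)
m = n = 200
# Hypothetical test matrix: a 2 x 2 block pattern with bounded, nonnegative entries.
pattern = np.array([[0.9, 0.2], [0.2, 0.7]])
A = np.kron(pattern, np.ones((m // 2, n // 2)))
A_hat = sparsify_entries(A, p=0.5, rng=rng)

noise_norm = np.linalg.norm(A_hat - A, 2)        # spectral norm of the sampling noise
z = np.linalg.svd(A, compute_uv=False)           # singular values of A
z_hat = np.linalg.svd(A_hat, compute_uv=False)   # singular values of the sampled matrix
print(noise_norm, z[:2], z_hat[:2])
```

By Weyl's inequality, each singular value of the sampled matrix differs from the corresponding singular value of $A$ by at most the spectral norm of the sampling noise, so the outstanding singular values remain separated as long as that norm is of smaller order.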

Source: Clustering Graphs and Contingency Tables with Spectral Methods, Academic Doctoral Dissertation (pages 65-68)