
2.2 Noisy contingency tables

2.2.4 Finding the blown-up skeleton

One might wonder where the singular values of an $m\times n$ matrix $A=(a_{ij})$ are located if $a:=\max_{i,j}|a_{ij}|$ is independent of $m$ and $n$. On the one hand, the maximum singular value cannot exceed $O(\sqrt{mn})$, as it is at most $\sqrt{\sum_{i=1}^m\sum_{j=1}^n a_{ij}^2}$. On the other hand, let $Q$ be an $m\times n$ random matrix with entries $a$ or $-a$ (independently of each other). Consider the spectral norm of all such matrices and take the minimum of them: $\min_{Q\in\{-a,+a\}^{m\times n}}\|Q\|$.

This quantity measures the minimum linear structure that a matrix of the same size and magnitude as $A$ can possess. As the Frobenius norm of $Q$ is $a\sqrt{mn}$, in view of the inequalities between the spectral and Frobenius norms, the above minimum is at least $\frac{a}{\sqrt{2}}\sqrt{m+n}$ (indeed, $\|Q\|\ge \|Q\|_F/\sqrt{\min\{m,n\}} = a\sqrt{\max\{m,n\}}\ge a\sqrt{(m+n)/2}$), which is exactly the order of the spectral norm of a Wigner-noise. So an $m\times n$ random matrix with independent and uniformly bounded entries under very general circumstances has at least one singular value of order greater than $\sqrt{m+n}$. Assume there are $k$ such singular values and the representatives by means of the corresponding singular vector pairs can be well classified into $k$ clusters in terms of the $k$-variances. Under these conditions we can reconstruct a blown-up structure behind our matrix.
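The two bounds on the spectral norm can be checked numerically. The following is a minimal NumPy sketch, assuming illustrative sizes `m`, `n` and magnitude `a` that are not taken from the text; it draws one random sign matrix and compares its spectral norm with the lower bound $\frac{a}{\sqrt{2}}\sqrt{m+n}$ and the upper bound $a\sqrt{mn}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, a = 60, 90, 2.0          # illustrative sizes and magnitude (assumptions)

# Random matrix with independent entries +a or -a, as in the text.
Q = a * rng.choice([-1.0, 1.0], size=(m, n))

spectral = np.linalg.norm(Q, 2)        # largest singular value
frobenius = np.linalg.norm(Q, "fro")   # equals a * sqrt(m * n) for sign matrices

lower = a * np.sqrt((m + n) / 2)       # (a / sqrt(2)) * sqrt(m + n)
upper = a * np.sqrt(m * n)             # Frobenius norm bounds the spectral norm

print(f"{lower:.2f} <= {spectral:.2f} <= {upper:.2f}")
```

Both inequalities hold for every sign matrix, not only for this random draw, because they follow from the comparison of the spectral and Frobenius norms alone.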

Theorem 23 ([Bol-Fr-Kr10]) Let $A_{m\times n}$ be a sequence of $m\times n$ matrices of uniformly bounded, nonnegative entries, where $m$ and $n$ tend to infinity. Assume that $A_{m\times n}$ has exactly $k$ singular values of order greater than $\sqrt{m+n}$ ($k$ is fixed). If there are integers $a\ge k$ and $b\ge k$ such that the $a$- and $b$-variances of the optimal row- and column-representatives are $O\left(\frac{m+n}{mn}\right)$, then there is an explicit construction for a blown-up matrix $B_{m\times n}$ (on $a\times b$ blocks) such that $A_{m\times n}=B_{m\times n}+E_{m\times n}$, with $\|E_{m\times n}\|=O(\sqrt{m+n})$.

Proof. In the sequel the subscripts $m$ and $n$ will be dropped, for notational convenience.

We will speak in terms of microarrays (genes and conditions). Let $\mathbf{y}_1,\dots,\mathbf{y}_k\in\mathbb{R}^m$ and $\mathbf{x}_1,\dots,\mathbf{x}_k\in\mathbb{R}^n$ denote the left- and right-hand side unit-norm singular vectors corresponding to $z_1,\dots,z_k$, the singular values of $A$ of order larger than $\sqrt{m+n}$. The $k$-dimensional representatives of the genes and conditions (that is, the row vectors of the $m\times k$ matrix $Y=(\mathbf{y}_1,\dots,\mathbf{y}_k)$ and those of the $n\times k$ matrix $X=(\mathbf{x}_1,\dots,\mathbf{x}_k)$, respectively) form, by the assumption of the theorem, $a$ and $b$ clusters in $\mathbb{R}^k$, respectively, with sum of inner variances $O\left(\frac{m+n}{mn}\right)$. Reorder the rows and columns of $A$ according to their respective cluster memberships. Denote by $\mathbf{y}^1,\dots,\mathbf{y}^m\in\mathbb{R}^k$ and $\mathbf{x}^1,\dots,\mathbf{x}^n\in\mathbb{R}^k$ the Euclidean representatives of the genes and conditions (the rows of the reordered $Y$ and $X$), and let $\bar{\mathbf{y}}^1,\dots,\bar{\mathbf{y}}^a\in\mathbb{R}^k$ and $\bar{\mathbf{x}}^1,\dots,\bar{\mathbf{x}}^b\in\mathbb{R}^k$ denote the cluster centers, respectively. Now let us choose the following new representation of the genes and conditions. The genes’ representatives are row vectors of the $m\times k$ matrix $\tilde{Y}$ such that the first $m_1$ rows of $\tilde{Y}$ are equal to $\bar{\mathbf{y}}^1$, the next $m_2$ rows to $\bar{\mathbf{y}}^2$, and so on; the last $m_a$ rows of $\tilde{Y}$ are equal to $\bar{\mathbf{y}}^a$. Likewise, the conditions’ representatives are row vectors of the $n\times k$ matrix $\tilde{X}$ such that the first $n_1$ rows of $\tilde{X}$ are equal to $\bar{\mathbf{x}}^1$, the next $n_2$ rows to $\bar{\mathbf{x}}^2$, and so on; the last $n_b$ rows of $\tilde{X}$ are equal to $\bar{\mathbf{x}}^b$.
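The new representation replaces every row representative by the center of its cluster. A minimal NumPy sketch of this step follows, assuming a toy matrix `Y` and a given label vector `labels` (both made up for illustration); the helper names `step_representation` and `inner_variance` are hypothetical and do not come from the source.

```python
import numpy as np

def step_representation(Y, labels):
    """Replace each row of Y by the center (mean) of its cluster."""
    Y = np.asarray(Y, dtype=float)
    labels = np.asarray(labels)
    Y_tilde = np.empty_like(Y)
    for c in np.unique(labels):
        members = labels == c
        Y_tilde[members] = Y[members].mean(axis=0)
    return Y_tilde

def inner_variance(Y, labels):
    """Sum of squared distances of the rows from their cluster centers,
    for the fixed clustering given by `labels`."""
    Y = np.asarray(Y, dtype=float)
    return float(np.sum((Y - step_representation(Y, labels)) ** 2))

# Toy data: five representatives in R^2, already ordered by cluster.
Y = np.array([[1.0, 0.0], [1.2, 0.0], [0.0, 2.0], [0.0, 2.4], [0.0, 2.2]])
labels = np.array([0, 0, 1, 1, 1])

Y_tilde = step_representation(Y, labels)
print(Y_tilde)                    # rows: [1.1, 0] twice, then [0, 2.2] three times
print(inner_variance(Y, labels))  # approximately 0.10
```

The columns of `Y_tilde` are step-vectors: they are constant on each cluster, which is the property the blown-up construction relies on.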

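By the considerations of Theorem 20 and the assumption for the clusters,

$$\sum_{i=1}^{k}\operatorname{dist}^2(\mathbf{y}_i, F) = S_a^2(Y) = O\left(\frac{m+n}{mn}\right) \tag{2.44}$$

and

$$\sum_{i=1}^{k}\operatorname{dist}^2(\mathbf{x}_i, G) = S_b^2(X) = O\left(\frac{m+n}{mn}\right) \tag{2.45}$$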

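hold respectively, where the $k$-dimensional subspace $F\subset\mathbb{R}^m$ is spanned by the column vectors of $\tilde{Y}$, and the $k$-dimensional subspace $G\subset\mathbb{R}^n$ is spanned by the column vectors of $\tilde{X}$. We follow the construction given in Lemma 3 of a set $\mathbf{v}_1,\dots,\mathbf{v}_k$ of orthonormal vectors within $F$ and another set $\mathbf{u}_1,\dots,\mathbf{u}_k$ of orthonormal vectors within $G$ such that

$$\sum_{i=1}^{k}\|\mathbf{y}_i-\mathbf{v}_i\|^2 = \min_{\mathbf{v}_1,\dots,\mathbf{v}_k}\sum_{i=1}^{k}\|\mathbf{y}_i-\mathbf{v}_i\|^2 \le 2\sum_{i=1}^{k}\operatorname{dist}^2(\mathbf{y}_i, F) \tag{2.46}$$

and

$$\sum_{i=1}^{k}\|\mathbf{x}_i-\mathbf{u}_i\|^2 = \min_{\mathbf{u}_1,\dots,\mathbf{u}_k}\sum_{i=1}^{k}\|\mathbf{x}_i-\mathbf{u}_i\|^2 \le 2\sum_{i=1}^{k}\operatorname{dist}^2(\mathbf{x}_i, G) \tag{2.47}$$

hold, where the minimum is taken over orthonormal sets of vectors $\mathbf{v}_1,\dots,\mathbf{v}_k\in F$ and $\mathbf{u}_1,\dots,\mathbf{u}_k\in G$, respectively. The construction of the vectors $\mathbf{v}_1,\dots,\mathbf{v}_k$ is as follows ($\mathbf{u}_1,\dots,\mathbf{u}_k$ can be constructed in the same way). Let $\mathbf{v}_1,\dots,\mathbf{v}_k\in F$ be an arbitrary orthonormal system (obtained, e.g., by the Schmidt orthogonalization method; note that in Lemma 3 they were given at the beginning). Let $V=(\mathbf{v}_1,\dots,\mathbf{v}_k)$ be an $m\times k$ matrix and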

$$Y^TV = QSZ^T$$

be an SVD, where the matrix $S$ contains the singular values of the $k\times k$ matrix $Y^TV$ in its main diagonal and zeros otherwise, while $Q$ and $Z$ are $k\times k$ orthogonal matrices (containing the corresponding unit-norm singular vector pairs in their columns). The orthogonal matrix $R=ZQ^T$ will give the convenient orthogonal rotation of the vectors $\mathbf{v}_1,\dots,\mathbf{v}_k$. That is, the column vectors of the matrix $VR$ also form an orthonormal set, and this is the desired set $\mathbf{v}_1,\dots,\mathbf{v}_k$. Define the error terms $\mathbf{r}_i$ and $\mathbf{q}_i$, respectively:

$$\mathbf{r}_i=\mathbf{y}_i-\mathbf{v}_i \quad\text{and}\quad \mathbf{q}_i=\mathbf{x}_i-\mathbf{u}_i \qquad (i=1,\dots,k).$$
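The rotation $R=ZQ^T$ solves an orthogonal Procrustes problem: among all orthonormal systems spanning the same subspace, it is the one closest to the singular vectors in the sum-of-squares sense. Below is a minimal NumPy sketch under stated assumptions: `Y` is a synthetic stand-in for the matrix of singular vectors, and `V0` is an arbitrary orthonormal basis of a nearby subspace playing the role of $F$ (neither comes from real data).

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 40, 3                     # illustrative sizes (assumptions)

# Y: orthonormal columns, standing in for y_1, ..., y_k.
Y, _ = np.linalg.qr(rng.standard_normal((m, k)))
# V0: an arbitrary orthonormal system spanning a nearby subspace F.
V0, _ = np.linalg.qr(Y + 0.1 * rng.standard_normal((m, k)))

# SVD of the k x k matrix Y^T V = Q S Z^T, then the rotation R = Z Q^T.
Q, s, Zt = np.linalg.svd(Y.T @ V0)
R = Zt.T @ Q.T
V = V0 @ R                       # rotated orthonormal system v_1, ..., v_k

residuals = Y - V                # columns are the error terms r_i = y_i - v_i
err = np.sum(residuals ** 2)     # sum_i ||y_i - v_i||^2

# sum_i dist^2(y_i, F) = k - ||V0^T Y||_F^2, since the y_i have unit norm.
dist2 = k - np.sum((V0.T @ Y) ** 2)

print(err, 2 * dist2)            # err does not exceed 2 * dist2
```

In terms of the singular values $s_i$ of $Y^TV$, the rotated system attains $\sum_i\|\mathbf{y}_i-\mathbf{v}_i\|^2 = 2\sum_i(1-s_i)$, while $\sum_i\operatorname{dist}^2(\mathbf{y}_i,F)=\sum_i(1-s_i^2)$; since $0\le s_i\le 1$, the factor 2 in the bound follows.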

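In view of (2.44)–(2.47),

$$\sum_{i=1}^{k}\|\mathbf{r}_i\|^2 = O\left(\frac{m+n}{mn}\right) \quad\text{and}\quad \sum_{i=1}^{k}\|\mathbf{q}_i\|^2 = O\left(\frac{m+n}{mn}\right). \tag{2.48}$$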

Consider the following decomposition:

$$A=\sum_{i=1}^{k} z_i\mathbf{y}_i\mathbf{x}_i^T + \sum_{i=k+1}^{\min\{m,n\}} z_i\mathbf{y}_i\mathbf{x}_i^T.$$

The spectral norm of the second term is at most of order $\sqrt{m+n}$ … estimated by means of the relations

$\|\mathbf{v}_i\mathbf{u}_i^T\| =$ …

Taking into consideration that $z_i$ cannot exceed $\Theta(\sqrt{mn})$, that $k$ is fixed, and using (2.48), we get that the spectral norms of the last three terms in (2.49) (for their finitely many subterms the triangle inequality is applicable) are at most of order $\sqrt{m+n}$. Let $B$ be the … $F$ and $G$, respectively. Both spaces consist of step-vectors; thus the matrix $B$ is a blown-up matrix containing $a\times b$ blocks. The noise matrix is

E=

Then, provided the conditions of Theorem 23 hold, the construction given in the proof above yields an algorithm that uses several SVDs and produces the blown-up matrix $B$. This $B$ can be considered as the best blown-up approximation of the microarray $A$.

At the same time, clusters of the genes and conditions are also obtained. More precisely, first we infer the clusters from the SVD of $A$, rearrange the rows and columns of $A$ accordingly, and afterwards we use the above construction. If we decide to perform correspondence analysis on $A$, then by (2.35) and (2.39), BD will give a good approximation to AD and likewise, the correspondence vectors obtained by the SVD of AD will give representatives of the genes and conditions.
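The first steps of this procedure (SVD of $A$, clustering of the row and column representatives) can be sketched on simulated data. The following NumPy sketch makes several assumptions that are not in the text: the block pattern `P`, the sizes, the uniform noise, and a plain Lloyd-type `kmeans` helper with farthest-point initialization are all made up for illustration. The final estimate `B_hat` uses simple block averages as a stand-in; it is not the construction of the proof.

```python
import numpy as np

def kmeans(points, n_clusters, n_iter=50):
    """Plain Lloyd iteration with farthest-point initialization (illustrative)."""
    centers = [points[0]]
    for _ in range(1, n_clusters):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels

rng = np.random.default_rng(2)
m, n, k = 120, 150, 2
P = np.array([[0.8, 0.2], [0.2, 0.8]])        # hypothetical 2 x 2 block pattern
row_true = np.repeat([0, 1], m // 2)
col_true = np.repeat([0, 1], n // 2)
B_true = P[np.ix_(row_true, col_true)]        # blown-up matrix
A = B_true + rng.uniform(-0.1, 0.1, size=(m, n))   # bounded, nonnegative entries

# Representatives: rows of the leading k left and right singular vectors.
U, z, Vt = np.linalg.svd(A, full_matrices=False)
Y, X = U[:, :k], Vt[:k].T

row_labels = kmeans(Y, 2)
col_labels = kmeans(X, 2)

# Stand-in estimate of B: average of A over each (row cluster, column cluster) block.
B_hat = np.empty_like(A)
for r in range(2):
    for c in range(2):
        block = np.ix_(row_labels == r, col_labels == c)
        B_hat[block] = A[block].mean()

noise_norm = np.linalg.norm(A - B_hat, 2)
print(z[:3], np.sqrt(m + n), noise_norm)
```

In this simulation exactly two singular values exceed $\sqrt{m+n}$, the clusters coincide with the planted blocks, and the spectral norm of the residual stays below $\sqrt{m+n}$, in line with the statement of Theorem 23.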

Clustering microarray data via the $k$-means algorithm is also discussed in [Dhil], but with a different objective function. To find the SVD for large rectangular matrices, randomized algorithms are favored, e.g., [Ac-Mc]. In case of random matrices with an underlying linear structure (outstanding singular values), the random noise of the algorithm is just added to the noise in our data, but their sum is also a Wigner-noise, so it does not change the effect of our algorithm in finding the clusters. Under the conditions of Theorem 23, the separated error matrix is comparable with the noise matrix, and this fact guarantees that the underlying block structure can be extracted.
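To illustrate why a randomized SVD is adequate when a few singular values stand out, here is a minimal NumPy sketch of a Gaussian range-finder with power iterations. This is a generic randomized scheme swapped in for illustration; it is not the method of [Ac-Mc]. The helper `randomized_svd`, its parameters, and the simulated matrix are assumptions of this sketch.

```python
import numpy as np

def randomized_svd(A, rank, oversample=8, n_power=2, seed=0):
    """Approximate leading singular triplets via a random range finder."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((A.shape[1], rank + oversample))
    Qb, _ = np.linalg.qr(A @ G)              # orthonormal basis of the sketch
    for _ in range(n_power):                 # power iterations sharpen the basis
        Qb, _ = np.linalg.qr(A.T @ Qb)
        Qb, _ = np.linalg.qr(A @ Qb)
    U_small, z, Vt = np.linalg.svd(Qb.T @ A, full_matrices=False)
    return (Qb @ U_small)[:, :rank], z[:rank], Vt[:rank]

# Simulated blown-up matrix plus bounded noise (illustrative data).
rng = np.random.default_rng(3)
m, n, k = 120, 150, 2
P = np.array([[0.8, 0.2], [0.2, 0.8]])
B = P[np.ix_(np.repeat([0, 1], m // 2), np.repeat([0, 1], n // 2))]
A = B + rng.uniform(-0.1, 0.1, size=(m, n))

U_r, z_r, Vt_r = randomized_svd(A, rank=k)
z_exact = np.linalg.svd(A, compute_uv=False)

print(z_r, z_exact[:k])
```

Because the two structural singular values are far above the noise level, the randomized estimates agree closely with the exact ones, so the representatives used for clustering are essentially unchanged.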