Doctoral School of Mathematics and Computer Science
Stochastic Days in Szeged 27.07.2012.
Reproducing kernels
and correspondence matrices
Marianna Bolla
(Budapest University of Technology and Economics)
TÁMOP‐4.2.2/B‐10/1‐2010‐0012 project
Reproducing kernels and correspondence matrices
Marianna Bolla, Institute of Mathematics, Budapest University of Technology and Economics
(Research supported by the TÁMOP-4.2.2.C-11/1/KONV-2012-0001 project)
marib@math.bme.hu
András Krámli is 70, Szeged
July 27, 2013
Three dialogs of Hacsek and Sajó in a coffee-house
First dialog: about ML estimation in exponential families
S: Did you know that in exponential families the ML equation boils down to
$$E_\theta(t(X)) = t(\text{sample}),$$
where $X = (X_1,\dots,X_n)$ is an i.i.d. sample and $t(X)$ is a sufficient statistic for the unknown parameter $\theta \in \mathbb{R}^k$?
H: How can it boil down, and what kind of a family is your exponential?
S: You are stupid, the likelihood function of the sample $X = (X_1,\dots,X_n)$ in an exponential family looks like
$$L_\theta(X) = \frac{1}{a(\theta)} \cdot e^{t(X)\theta^T} \cdot b(X),$$
where $\theta = (\theta_1,\dots,\theta_k)$ is the natural parameter, $t(X) = (t_1(X),\dots,t_k(X))$ is a sufficient statistic for it, and $T$ stands for transposition.
H: And what about the a(θ)?
S: It is the normalizing constant, but it can be written as
$$a(\theta) = \int_{\mathcal{X}} e^{t(x)\theta^T} \cdot b(x)\,dx,$$
where $\mathcal{X} \subset \mathbb{R}^n$ is the sample space. This formula will play a crucial role in our subsequent calculations.
H: Haha, let us see those famous calculations!
S: As you know, the likelihood equation is
$$\nabla_\theta \ln L_\theta(X) = 0,$$
that is,
$$-\nabla_\theta \ln a(\theta) + \nabla_\theta \big(t(X)\theta^T\big) = 0. \qquad (1)$$
Under certain regularity conditions,
$$\nabla_\theta \ln a(\theta) = \frac{1}{a(\theta)}\int_{\mathcal{X}} t(x)\, e^{t(x)\theta^T} \cdot b(x)\,dx = E_\theta(t(X)).$$
Therefore, (1) is equivalent to
$$-E_\theta(t(X)) + t(X) = 0,$$
which finishes the proof.
H: But this is the idea of moment estimation. Is it true that in exponential families the ML-estimator is the same as the moment-estimator?
S: More or less. In particular, when
$$t_1(X) = \frac{1}{n}\sum_{i=1}^n X_i, \quad \dots, \quad t_k(X) = \frac{1}{n}\sum_{i=1}^n X_i^k,$$
then it is. This is the case when our underlying distribution is Poisson, exponential, or Gaussian.
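The Poisson case of this claim can be checked numerically. The following is my own sketch (not part of the slides): for a Poisson sample the sufficient statistic is the sample mean, so the ML equation $E_\lambda(t(X)) = t(\text{sample})$ gives $\hat\lambda = \bar X$, the moment estimator; the code confirms that the log-likelihood peaks at the sample mean.

```python
import numpy as np

# Sketch: for a Poisson(lambda) sample, the ML estimator solves
# E_lambda(t(X)) = t(sample) with t = sample mean, so it coincides
# with the moment estimator lambda_hat = mean(sample).
rng = np.random.default_rng(0)
sample = rng.poisson(lam=3.0, size=10_000)

def log_likelihood(lam, x):
    # Poisson log-likelihood up to the additive constant -sum(log(x_i!))
    return np.sum(x * np.log(lam) - lam)

lam_hat = sample.mean()  # moment estimator = candidate ML estimator

# The likelihood at the sample mean beats nearby values of lambda:
for lam in (lam_hat - 0.1, lam_hat + 0.1):
    assert log_likelihood(lam_hat, sample) > log_likelihood(lam, sample)
```

The strict inequality reflects that the Poisson log-likelihood is strictly concave in $\lambda$ with its maximum exactly at the sample mean.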
H: I will tell it to our colleague, Mogyoró (Imre Tóth), who asked whether these two estimators are the same.
S: Not always; think of the continuous uniform distribution, which does not belong to the exponential family. Anyway, we had a fruitful discussion.
H: Will you come in tomorrow?
Second dialog: about Correspondence Analysis
H: My dear Sajó, you told me about correspondence analysis. I have studied it, but believe me, it is a stupid method. How come the best low-rank approximation of the table has negative entries in most cases?
S: You are right if you consider the best $L_2$-norm approximation. Nonetheless, I am able to slightly adjust that approximation to obtain a low-rank approximation with positive entries, under very general conditions. My method also reveals the block structure of the table. I spoke about these facts at the EMS2013, but I will repeat the most important notions now.
H: OK, let us see.
SVD of contingency tables and correspondence matrices
$C = (c_{ij})$: $n\times m$ contingency table, $c_{ij} \ge 0$.
Row set: $Row = \{1,\dots,n\}$; column set: $Col = \{1,\dots,m\}$.
$$d_{row,i} = \sum_{j=1}^m c_{ij} \quad (i = 1,\dots,n), \qquad d_{col,j} = \sum_{i=1}^n c_{ij} \quad (j = 1,\dots,m),$$
$$D_{row} = \mathrm{diag}(d_{row,1},\dots,d_{row,n}), \qquad D_{col} = \mathrm{diag}(d_{col,1},\dots,d_{col,m}).$$
Representation
For a given integer $1 \le k \le \min\{n,m\}$, we are looking for $k$-dimensional representatives $r_1,\dots,r_n$ of the rows and $c_1,\dots,c_m$ of the columns that minimize the objective function
$$Q_k = \sum_{i=1}^n \sum_{j=1}^m c_{ij}\,\|r_i - c_j\|^2 \qquad (2)$$
subject to
$$\sum_{i=1}^n d_{row,i}\, r_i r_i^T = I_k, \qquad \sum_{j=1}^m d_{col,j}\, c_j c_j^T = I_k. \qquad (3)$$
When minimized, the objective function $Q_k$ favors a $k$-dimensional placement of the rows and columns in which representatives of highly associated rows and columns are forced close to each other. As we will see, this is equivalent to the problem of correspondence analysis.
Solution
$$X := (x_1,\dots,x_k) = (r_1^T,\dots,r_n^T)^T \in \mathbb{R}^{n\times k}, \qquad Y := (y_1,\dots,y_k) = (c_1^T,\dots,c_m^T)^T \in \mathbb{R}^{m\times k}.$$
Expanding $\|r_i - c_j\|^2$ and using the constraints (3),
$$Q_k = 2k - 2\,\mathrm{tr}\big[(D_{row}^{1/2}X)^T C_{corr} (D_{col}^{1/2}Y)\big] \to \min$$
subject to
$$X^T D_{row} X = I_k, \qquad Y^T D_{col} Y = I_k,$$
where $C_{corr} = D_{row}^{-1/2} C D_{col}^{-1/2}$ is the correspondence matrix (normalized contingency table) belonging to the table $C$.
Representation theorem
Let $C_{corr} = \sum_{i=1}^r s_i v_i u_i^T$ be the SVD, where $r \le \min\{n,m\}$ is the rank of $C_{corr}$, or equivalently (since there are no identically zero rows or columns), the rank of $C$, and $1 = s_1 \ge s_2 \ge \cdots \ge s_r > 0$. Here
$$v_1 = \big(\sqrt{d_{row,1}},\dots,\sqrt{d_{row,n}}\big)^T, \qquad u_1 = \big(\sqrt{d_{col,1}},\dots,\sqrt{d_{col,m}}\big)^T.$$
Let $k \le r$ be a positive integer such that $s_k > s_{k+1}$. Then
$$\min Q_k = 2k - 2\sum_{i=1}^k s_i,$$
and it is attained with $X^* = D_{row}^{-1/2}(v_1,\dots,v_k)$ and $Y^* = D_{col}^{-1/2}(u_1,\dots,u_k)$.
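The representation theorem can be verified numerically. The following is my own numpy sketch (table, sizes, and seed are arbitrary): note that expanding $\|r_i - c_j\|^2$ in (2) under constraints (3) yields $\min Q_k = 2k - 2\sum_{i=1}^k s_i$, with a factor 2 on the trace term, which the code confirms.

```python
import numpy as np

# Sketch: random contingency table of total volume 1, its correspondence
# matrix C_corr = D_row^{-1/2} C D_col^{-1/2}, and a check that the top
# singular value is 1 (with singular vector sqrt(d_row)) and that Q_k,
# evaluated at the optimal representatives, equals 2k - 2*(s_1+...+s_k).
rng = np.random.default_rng(1)
C = rng.random((6, 4))
C /= C.sum()                      # total volume 1, as assumed later

d_row, d_col = C.sum(axis=1), C.sum(axis=0)
C_corr = C / np.sqrt(np.outer(d_row, d_col))

U, s, Vt = np.linalg.svd(C_corr)
assert np.isclose(s[0], 1.0)
assert np.allclose(np.abs(U[:, 0]), np.sqrt(d_row))

k = 2
X = U[:, :k] / np.sqrt(d_row)[:, None]      # optimal row representatives
Y = Vt[:k, :].T / np.sqrt(d_col)[:, None]   # optimal column representatives

# Q_k = sum_ij c_ij * ||r_i - c_j||^2 at the optimum
Q_k = sum(C[i, j] * np.sum((X[i] - Y[j]) ** 2)
          for i in range(C.shape[0]) for j in range(C.shape[1]))
assert np.isclose(Q_k, 2 * k - 2 * s[:k].sum())
```

The constraints (3) hold automatically here, since $X^T D_{row} X = V_k^T V_k = I_k$ for the orthonormal singular vectors.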
Regular row–column cluster pairs
The Expander Mixing Lemma for edge-weighted graphs naturally extends to this situation (Butler): for all $R \subset Row$ and $C \subset Col$,
$$\big|c(R,C) - \mathrm{Vol}(R)\,\mathrm{Vol}(C)\big| \le s_2 \sqrt{\mathrm{Vol}(R)\,\mathrm{Vol}(C)},$$
where $s_2$ is the second largest singular value of $C_{corr}$ and
$$\mathrm{Vol}(R) = \sum_{i\in R} d_{row,i}, \qquad \mathrm{Vol}(C) = \sum_{j\in C} d_{col,j}.$$
Since the spectral gap of $C_{corr}$ is $1 - s_2$, in view of the above Expander Mixing Lemma, a 'large' spectral gap is an indication of 'small' discrepancy: the weighted cut between any row and column subset of the contingency table is near to what is expected in a random table.
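On a small table the mixing bound can be checked by brute force over every row and column subset. This is my own sketch (the table and seed are arbitrary), assuming the table is normalized to total volume 1 so the bound takes the stated form.

```python
import numpy as np
from itertools import combinations

# Sketch: verify |c(R,C) - Vol(R)Vol(C)| <= s2 * sqrt(Vol(R)Vol(C))
# for every nonempty proper row subset R and column subset Cl.
rng = np.random.default_rng(2)
C = rng.random((5, 4))
C /= C.sum()                      # total volume 1

d_row, d_col = C.sum(axis=1), C.sum(axis=0)
C_corr = C / np.sqrt(np.outer(d_row, d_col))
s2 = np.linalg.svd(C_corr, compute_uv=False)[1]  # second largest singular value

for r in range(1, 5):
    for R in combinations(range(5), r):
        for c in range(1, 4):
            for Cl in combinations(range(4), c):
                cut = C[np.ix_(R, Cl)].sum()                      # c(R, Cl)
                vol = d_row[list(R)].sum() * d_col[list(Cl)].sum()
                assert abs(cut - vol) <= s2 * np.sqrt(vol) + 1e-12
```

The bound holds exactly here because $c(R,C) = (D_{row}^{1/2}\mathbf{1}_R)^T C_{corr} (D_{col}^{1/2}\mathbf{1}_C)$, and subtracting the top singular term leaves a matrix of spectral norm $s_2$.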
Volume-regular cluster pairs
We extend the notion of discrepancy to volume-regular pairs.
Definition
The row–column cluster pair $R \subset Row$, $C \subset Col$ of the contingency table $C$ of total volume 1 is $\gamma$-volume regular if for all $X \subset R$ and $Y \subset C$ the relation
$$\big|c(X,Y) - \rho(R,C)\,\mathrm{Vol}(X)\,\mathrm{Vol}(Y)\big| \le \gamma \sqrt{\mathrm{Vol}(R)\,\mathrm{Vol}(C)}$$
holds, where $\rho(R,C) = \frac{c(R,C)}{\mathrm{Vol}(R)\,\mathrm{Vol}(C)}$ is the relative inter-cluster density of the row–column pair $R, C$.
We will show that for a given $k$, if the clusters are formed by applying the weighted $k$-means algorithm to the optimal row and column representatives, respectively, then the row–column cluster pairs so obtained are homogeneous in the sense that they form equally dense parts of the contingency table.
Weighted k-variance
The weighted $k$-variance of the $k$-dimensional row representatives is defined by
$$S_k^2(X) = \min_{(R_1,\dots,R_k)} \sum_{a=1}^k \sum_{j\in R_a} d_{row,j}\,\|r_j - \bar r_a\|^2,$$
where $\bar r_a = \frac{1}{\mathrm{Vol}(R_a)} \sum_{j\in R_a} d_{row,j}\, r_j$ is the weighted center of cluster $R_a$ $(a = 1,\dots,k)$. Similarly, the weighted $k$-variance of the $k$-dimensional column representatives is
$$S_k^2(Y) = \min_{(C_1,\dots,C_k)} \sum_{a=1}^k \sum_{j\in C_a} d_{col,j}\,\|c_j - \bar c_a\|^2,$$
where $\bar c_a = \frac{1}{\mathrm{Vol}(C_a)} \sum_{j\in C_a} d_{col,j}\, c_j$ is the weighted center of cluster $C_a$ $(a = 1,\dots,k)$. Observe that the trivial vector components can be omitted, and the $k$-variance of the so obtained $(k-1)$-dimensional representatives will be the same.
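For a handful of representatives the weighted $k$-variance can be computed exactly by enumerating all partitions. This is my own toy sketch (the representatives, weights, and the helper name `weighted_k_variance` are illustrative, not from the slides).

```python
import numpy as np
from itertools import product

def weighted_k_variance(reps, weights, k):
    """Brute-force S_k^2 over all assignments of 1-D representatives
    into k clusters, using d_row-weighted cluster centers."""
    n = len(reps)
    best = np.inf
    for labels in product(range(k), repeat=n):   # all cluster assignments
        labels = np.array(labels)
        total = 0.0
        for a in range(k):
            mask = labels == a
            if not mask.any():
                continue
            # weighted center of cluster a
            center = (weights[mask] * reps[mask]).sum() / weights[mask].sum()
            total += (weights[mask] * (reps[mask] - center) ** 2).sum()
        best = min(best, total)
    return best

# two tight groups: the optimal 2-partition separates {0, 0.1} from {2.0, 2.2}
reps = np.array([0.0, 0.1, 2.0, 2.2])
w = np.array([0.3, 0.2, 0.3, 0.2])
S2 = weighted_k_variance(reps, w, k=2)
```

In practice the minimization is done by the weighted $k$-means algorithm, which approximates this exhaustive search.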
Thm (B, European Journal of Combinatorics 2013)
Theorem
Let $C$ be a contingency table with $n$-element row set $Row$ and $m$-element column set $Col$, with row and column sums $d_{row,1},\dots,d_{row,n}$ and $d_{col,1},\dots,d_{col,m}$, respectively. Suppose that $\sum_{i=1}^n \sum_{j=1}^m c_{ij} = 1$ and there are no dominant rows and columns: $d_{row,i} = \Theta(1/n)$ $(i = 1,\dots,n)$ and $d_{col,j} = \Theta(1/m)$ $(j = 1,\dots,m)$ as $n,m\to\infty$. Let the singular values of $C_{corr}$ be
$$1 = s_1 > s_2 \ge \cdots \ge s_k > \varepsilon \ge s_i, \quad i \ge k+1.$$
The partitions $(R_1,\dots,R_k)$ of $Row$ and $(C_1,\dots,C_k)$ of $Col$ are defined so that they minimize the weighted $k$-variances $S_k^2(X)$ and $S_k^2(Y)$ of the row and column representatives. Suppose that there are constants $0 < K_1, K_2 \le \frac{1}{k}$ such that $|R_i| \ge K_1 n$ and $|C_i| \ge K_2 m$ $(i = 1,\dots,k)$, respectively. Then the $R_i, C_j$ pairs are $O\big(\sqrt{2k}\,(S_k(X) + S_k(Y)) + \varepsilon\big)$-volume regular $(i,j = 1,\dots,k)$.
The proof
H: But how does the positivity of the rank-$k$ approximation follow from this?
S: I am afraid you have to understand some basic steps of the proof for this, as follows.
Recall that the largest singular value of $C_{corr}$ is 1 with corresponding singular vector pair $v_0 = D_{row}^{1/2}\mathbf{1}_n$ and $u_0 = D_{col}^{1/2}\mathbf{1}_m$, respectively. The optimal $k$-dimensional representatives of the rows and columns are the row vectors of the matrices $X = (x_0,\dots,x_{k-1})$ and $Y = (y_0,\dots,y_{k-1})$, where $x_i = D_{row}^{-1/2} v_i$ and $y_i = D_{col}^{-1/2} u_i$, respectively $(i = 0,\dots,k-1)$. (Note that the first columns, which have equal coordinates, can as well be omitted.)
Assume that the minimum $k$-variance is attained on the $k$-partition $(R_1,\dots,R_k)$ of the rows and $(C_1,\dots,C_k)$ of the columns, respectively. Then
$$S_k^2(X) = \sum_{i=0}^{k-1} \mathrm{dist}^2(v_i, F), \qquad S_k^2(Y) = \sum_{i=0}^{k-1} \mathrm{dist}^2(u_i, G),$$
where $F = \mathrm{Span}\{D_{row}^{1/2}w_1,\dots,D_{row}^{1/2}w_k\}$ and $G = \mathrm{Span}\{D_{col}^{1/2}z_1,\dots,D_{col}^{1/2}z_k\}$, with the so-called normalized row partition vectors $w_1,\dots,w_k$ of coordinates $w_{ji} = \frac{1}{\sqrt{\mathrm{Vol}(R_i)}}$ if $j \in R_i$ and 0 otherwise, and column partition vectors $z_1,\dots,z_k$ of coordinates $z_{ji} = \frac{1}{\sqrt{\mathrm{Vol}(C_i)}}$ if $j \in C_i$ and 0 otherwise $(i = 1,\dots,k)$.
Note that the vectors $D_{row}^{1/2}w_1,\dots,D_{row}^{1/2}w_k$ and $D_{col}^{1/2}z_1,\dots,D_{col}^{1/2}z_k$ form orthonormal systems in $\mathbb{R}^n$ and $\mathbb{R}^m$, respectively (but they are usually not complete).
However, we can find orthonormal systems $\tilde v_0,\dots,\tilde v_{k-1} \in F$ and $\tilde u_0,\dots,\tilde u_{k-1} \in G$ such that
$$S_k^2(X) \le \sum_{i=0}^{k-1} \|v_i - \tilde v_i\|^2 \le 2 S_k^2(X), \qquad S_k^2(Y) \le \sum_{i=0}^{k-1} \|u_i - \tilde u_i\|^2 \le 2 S_k^2(Y).$$
Let $C_{corr} = \sum_{i=0}^{r-1} s_i v_i u_i^T$ be the SVD, where $r = \mathrm{rank}(C) = \mathrm{rank}(C_{corr})$. We approximate $C_{corr}$ by the rank-$k$ matrix $\sum_{i=0}^{k-1} s_i \tilde v_i \tilde u_i^T$.
The vectors $\hat v_i := D_{row}^{-1/2}\tilde v_i$ are stepwise constant on the partition $(R_1,\dots,R_k)$ of the rows, whereas the vectors $\hat u_i := D_{col}^{-1/2}\tilde u_i$ are stepwise constant on the partition $(C_1,\dots,C_k)$ of the columns, $i = 0,\dots,k-1$. The matrix
$$\sum_{i=0}^{k-1} s_i \hat v_i \hat u_i^T$$
is therefore an $n\times m$ block matrix of $k\times k$ blocks corresponding to the above partitions of the rows and columns. Let $\hat c_{ab}$ denote its entries in the $ab$ block $(a,b = 1,\dots,k)$.
This is the rank-$k$ approximation of the matrix $C$ with a block matrix.
The point is: the entries $\hat c_{ab}$ of the block matrix will already be positive provided the weighted $k$-variances $S_k^2(X)$ and $S_k^2(Y)$ are 'small' enough. Let us discuss this issue more precisely.
In accord with the notation used in the proof, denote by $ab$ in the lower index the matrix restricted to the $R_a\times C_b$ block (otherwise it has zero entries). Then for the squared Frobenius norm of the rank-$k$ approximation of $D_{row}^{-1} C D_{col}^{-1}$, restricted to the $ab$ block, we have
$$\Big\| D_{row,a}^{-1} C_{ab} D_{col,b}^{-1} - \Big(\sum_{i=0}^{k-1} s_i \hat v_i \hat u_i^T\Big)_{ab} \Big\|_F^2 = \sum_{i\in R_a}\sum_{j\in C_b} \Big(\frac{c_{ij}}{d_{row,i}\, d_{col,j}} - \hat c_{ab}\Big)^2 = \sum_{i\in R_a}\sum_{j\in C_b} \Big(\frac{c_{ij}}{d_{row,i}\, d_{col,j}} - \bar c_{ab}\Big)^2 + |R_a|\,|C_b|\,(\bar c_{ab} - \hat c_{ab})^2.$$
Here we used the Steiner equality with the average $\bar c_{ab}$ of the entries of $D_{row}^{-1} C D_{col}^{-1}$ in the $ab$ block. We can estimate the above Frobenius norm by a constant multiple of the spectral norm.
In this way,
$$(\bar c_{ab} - \hat c_{ab})^2 \le \frac{1}{\max\{|R_a|, |C_b|\}} \cdot \max_{i\in R_a} \frac{1}{d_{row,i}} \cdot \max_{j\in C_b} \frac{1}{d_{col,j}} \cdot \big[\sqrt{2k}\,(S_k(X) + S_k(Y)) + \varepsilon\big]^2.$$
But using the conditions on the block sizes and the row and column sums of the Theorem, provided
$$\sqrt{2k}\,(S_k(X) + S_k(Y)) + \varepsilon = O\Big(\frac{1}{(\min\{m,n\})^{\frac{1}{2}+\tau}}\Big)$$
holds with some 'small' $\tau > 0$, the relation $\bar c_{ab} - \hat c_{ab} \to 0$ also holds as $n,m\to\infty$. Therefore, both $\hat c_{ab}$ and $\hat c_{ab}\, d_{row,i}\, d_{col,j}$ are positive over such blocks that are not constantly zero in the original table, if $m$ and $n$ are large enough.
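A toy illustration of the positivity claim, in the spirit of the proof (my own sketch, not from the slides): I use an exactly block-constant planted table, so the singular vectors are already stepwise constant on the clusters, the cluster-averaged block approximation reproduces $D_{row}^{-1} C D_{col}^{-1}$ exactly, and its entries are all positive. In general, positivity requires the $k$-variances to be small, as the Theorem states.

```python
import numpy as np

# Sketch: planted two-block table (dense within blocks, sparse across).
rng = np.random.default_rng(3)
C = np.full((10, 8), 0.01)
C[:5, :4] += 1.0        # rows 0-4 associate with cols 0-3
C[5:, 4:] += 1.0        # rows 5-9 associate with cols 4-7
C /= C.sum()            # total volume 1
d_row, d_col = C.sum(axis=1), C.sum(axis=0)
C_corr = C / np.sqrt(np.outer(d_row, d_col))
U, s, Vt = np.linalg.svd(C_corr)

k = 2
rows = [list(range(5)), list(range(5, 10))]   # planted row clusters
cols = [list(range(4)), list(range(4, 8))]    # planted column clusters

def stepwise(vecs, d, clusters):
    """D^{-1/2}-scaled singular vectors, averaged (with weights d)
    over each cluster -> stepwise-constant hat-vectors."""
    rep = vecs / np.sqrt(d)[:, None]
    out = np.empty_like(rep)
    for cl in clusters:
        out[cl] = (d[cl, None] * rep[cl]).sum(axis=0) / d[cl].sum()
    return out

V_hat = stepwise(U[:, :k], d_row, rows)
U_hat = stepwise(Vt[:k].T, d_col, cols)
# block matrix sum_i s_i * v_hat_i u_hat_i^T approximating D_row^{-1} C D_col^{-1}
approx = (V_hat * s[:k]) @ U_hat.T
assert (approx > 0).all()
```

Since this table has exactly two distinct row (and column) types, $C_{corr}$ has rank 2 and the rank-2 block approximation coincides with $D_{row}^{-1} C D_{col}^{-1}$, whose entries $c_{ij}/(d_{row,i}d_{col,j})$ are all positive.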
S:This is the end of the long story.
H: Will you come in tomorrow?
Third dialog: about Reproducing Kernel Hilbert Spaces
H: My dear Sajó, there are fairy tales about some fictitious spaces where everything is 'smooth' and 'linear'.
S: Such spaces really exist; the hard part is that we have to adapt them to our data. The good news is that it is not necessary to actually map our data into them.
H: Then how can we use them?
S: It suffices to work with a kernel function only, but the bad news is that the kernel must be appropriately selected so that the underlying nonlinearity can be detected.
H: They must be theReproducing Kernel Hilbert Spaces. Tell me more about them!
History
S: Reproducing Kernel Hilbert Spaces were introduced in the middle of the 20th century by Aronszajn, Parzen, and others, but the theory itself is an elegant application of already known theorems of functional analysis, first of all the Riesz–Fréchet representation theorem and the theory of integral operators (see Fréchet, Riesz, Szőkefalvi-Nagy), tracing back to the beginning of the 20th century. Later on, in the last decades of the 20th century and even in our days, Reproducing Kernel Hilbert Spaces have been reinvented several times and applied in modern statistical methods and data mining (for example, Bach, Baker).
H: But what is the mystery of reproducing kernels and what is the diabolic kernel trick?
Definition of an RKHS
A stronger condition imposed on a Hilbert spaceHof functions X →R(whereX is an arbitrary set, for the time being) is that the following so-calledevaluation mapping be a continuous, or
equivalently, a bounded linear functional. The evaluation mapping Lx : H →Rworks on an f ∈ H such thatLx(f) =f(x).
Definition
A Hilbert spaceHof (real) functions on the set X is an RKHS if the point evaluation functionalLx exists and is continuous for all x∈ X.
H: And where does the name RKHS come from?
S: From the Riesz–Fréchet representation theorem. This theorem states that a Hilbert space (in our case $\mathcal{H}$) and its dual (in our case the set of continuous linear functionals $\mathcal{H}\to\mathbb{R}$, e.g. $L_x$) are isometrically isomorphic. Therefore, to any $L_x$ there uniquely corresponds a $K_x \in \mathcal{H}$ such that
$$L_x(f) = \langle f, K_x\rangle_{\mathcal{H}}, \quad \forall f \in \mathcal{H}. \qquad (4)$$
Since $K_x$ is itself an $\mathcal{X}\to\mathbb{R}$ function, it can be evaluated at any point $y \in \mathcal{X}$. We define the bivariate function $K : \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ as
$$K(x,y) := K_x(y) \qquad (5)$$
and call it the reproducing kernel of the Hilbert space $\mathcal{H}$.
H: But why is this kernel positive semidefinite?
S: Because by using formulas (4) and (5), we get, on the one hand,
$$K(x,y) = K_x(y) = L_y(K_x) = \langle K_x, K_y\rangle_{\mathcal{H}},$$
and on the other hand,
$$K(y,x) = K_y(x) = L_x(K_y) = \langle K_y, K_x\rangle_{\mathcal{H}}.$$
By the symmetry of the (real) inner product, it follows that the reproducing kernel is symmetric, and it is also reproduced as the inner product of special functions in the RKHS:
$$K(x,y) = \langle K_x, K_y\rangle_{\mathcal{H}} = \langle K(x,\cdot), K(\cdot,y)\rangle_{\mathcal{H}};$$
hence, $K$ is positive semidefinite.
RKHS belonging to a kernel
S: Vice versa, if we are given a positive definite kernel function on $\mathcal{X}\times\mathcal{X}$ at the beginning, then there exists an RKHS such that the inner product relation holds with appropriate elements of it.
Definition
A symmetric bivariate function $K : \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is called a positive definite kernel (equivalently, admissible, valid, or Mercer kernel) if for any $n\in\mathbb{N}$ and $x_1,\dots,x_n\in\mathcal{X}$, the symmetric matrix with entries $K(x_i,x_j) = K(x_j,x_i)$ $(i,j = 1,\dots,n)$ is positive semidefinite.
We remark that a symmetric real matrix is positive semidefinite if and only if it is a Gram matrix; hence its entries become inner products, but usually not of the entries in its arguments. However, the simplest kernel function, the so-called linear kernel, does this job: $K_{lin}(x,y) = \langle x,y\rangle_{\mathcal{X}}$, where $\mathcal{X}$ is a subset of a Euclidean space.
H: Show me other positive definite kernels!
S: You can get a lot of them with the following operations:
1. If $K_1, K_2 : \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ are positive definite kernels, then the kernel $K$ defined by $K(x,y) = K_1(x,y) + K_2(x,y)$ is also positive definite.
2. If $K_1, K_2 : \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ are positive definite kernels, then the kernel $K$ defined by $K(x,y) = K_1(x,y)\,K_2(x,y)$ is also positive definite. In particular, if $K$ is a positive definite kernel, then so is $cK$ for any $c > 0$.
Consequently, if $h$ is a polynomial with positive coefficients and $K : \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is a positive definite kernel, then the kernel $K_h : \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ defined by
$$K_h(x,y) = h(K(x,y))$$
is also positive definite. Since the exponential function can be approximated by polynomials with positive coefficients and positive definiteness is closed under pointwise convergence, the same is true if $h$ is the exponential function $h(x) = e^x$, or some transformation of it.
H: Then putting these facts together and using the formula
$$\|x-y\|^2 = \langle x,x\rangle + \langle y,y\rangle - 2\langle x,y\rangle,$$
we can easily verify that the Gaussian kernel is positive definite:
$$K_{Gauss}(x,y) = e^{-\frac{\|x-y\|^2}{2\sigma^2}},$$
where $\sigma > 0$ is a parameter.
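Hacsek's claim is easy to check numerically: the Gram matrix of the Gaussian kernel on any finite point set should have no negative eigenvalues. A quick sketch (points, $\sigma$, and seed are arbitrary):

```python
import numpy as np

# Sketch: build the Gaussian kernel Gram matrix on 20 random points in R^3
# and confirm it is positive semidefinite (all eigenvalues >= 0 up to
# floating-point error).
rng = np.random.default_rng(4)
X = rng.normal(size=(20, 3))
sigma = 1.5

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise ||x - y||^2
K = np.exp(-sq / (2 * sigma**2))

eig = np.linalg.eigvalsh(K)
assert eig.min() > -1e-10   # Gram matrix of a positive definite kernel
```

For distinct points the Gaussian kernel matrix is in fact strictly positive definite, so the smallest eigenvalue is genuinely positive, not just numerically nonnegative.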
S:You are getting more and more clever, Hacsek. Now we are able to formulate the converse statement.
Theorem
For any positive definite kernel $K : \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ there exists a unique, possibly infinite-dimensional Hilbert space $\mathcal{H}$ of functions on $\mathcal{X}$ for which $K$ is a reproducing kernel.
If we want to emphasize that the RKHS corresponds to the kernel $K$, we will denote it by $\mathcal{H}_K$.
H: Can we not realize the elements of $\mathcal{H}_K$ in a more straightforward Hilbert space?
S: Oh yes, this is the feature space $F$. Assume that there is a (usually not linear) map $\phi : \mathcal{X}\to F$ such that when $x \in \mathcal{X}$ is mapped into $\phi(x) \in F$, then
$$K(x,y) = \langle \phi(x), \phi(y)\rangle_F$$
is the desired positive definite kernel.
Let $T$ be a linear operator from $F$ to the space of functions $\mathcal{X}\to\mathbb{R}$ defined by
$$(Tf)(y) = \langle f, \phi(y)\rangle_F, \quad y \in \mathcal{X},\ f \in F.$$
Then
$$T\phi(x) = K_x, \quad \forall x \in \mathcal{X},$$
and hence $\mathcal{H}_K$ becomes the range of $T$.
H: Could you tell me an example of an RKHS? What kind of animal is it?
S: Yes, a couple. I'll give the theoretical construction for $\mathcal{H}_K$ and $F$, together with the functions $K_x$ and the features $\phi(x)$.
First example
Let $K$ be the continuous kernel of a positive definite Hilbert–Schmidt operator, i.e., an integral operator acting on the $L^2(\mathcal{X})$ space, where $\mathcal{X}$ is a compact set in $\mathbb{R}$ for simplicity.
Due to the positive definiteness of $K$ and the Mercer theorem, $K$ can be expanded into the following uniformly convergent series:
$$K(x,y) = \sum_{i=1}^\infty \lambda_i \psi_i(x)\psi_i(y), \quad \forall x,y\in\mathcal{X},$$
by the eigenfunctions and the eigenvalues of the integral operator.
The RKHS defined by $K$ is the following:
$$\mathcal{H}_K = \Big\{ f : \mathcal{X}\to\mathbb{R} \;:\; f(x) = \sum_{i=1}^\infty c_i\psi_i(x) \Big\}$$
such that $\sum_{i=1}^\infty \frac{c_i^2}{\lambda_i} < \infty$. Here
$$K_x = K(x,\cdot) = \sum_{i=1}^\infty \lambda_i\psi_i(x)\,\psi_i.$$
Therefore,
$$\langle K_x, K_y\rangle_{\mathcal{H}_K} = \sum_{i=1}^\infty \frac{\lambda_i\psi_i(x)\cdot\lambda_i\psi_i(y)}{\lambda_i} = K(x,y).$$
Here the feature space $F$ is the following:
$$\phi(x) = \big(\sqrt{\lambda_1}\psi_1(x), \sqrt{\lambda_2}\psi_2(x), \dots\big), \quad x \in \mathcal{X},$$
and the inner product is naturally defined by
$$\langle\phi(x),\phi(y)\rangle_F = \sum_{i=1}^\infty \lambda_i\psi_i(x)\psi_i(y) = \langle K_x, K_y\rangle_{\mathcal{H}_K}.$$
The functions $\sqrt{\lambda_1}\psi_1, \sqrt{\lambda_2}\psi_2, \dots$ form an orthonormal basis in $\mathcal{H}_K$, and a function $f \in L^2(\mathcal{X})$ belongs to $\mathcal{H}_K$ if $\|f\|_{\mathcal{H}_K}^2 = \sum_{i=1}^\infty \frac{c_i^2}{\lambda_i} < \infty$.
Second example
Now $\mathcal{X}$ is a Hilbert space of finite dimension, say $\mathbb{R}^p$, and its elements will be denoted by boldface $\mathbf{x}$, stressing that they are vectors.
If we used $K_{lin}$ on $\mathcal{X}\times\mathcal{X}$, then $K_{\mathbf{x}} = \langle\mathbf{x},\cdot\rangle_{\mathcal{X}}$, and by the Riesz–Fréchet representation theorem, $\phi(\mathbf{x}) = \mathbf{x}$ would reproduce the kernel, as $K_{lin}(\mathbf{x},\mathbf{y}) = \langle\mathbf{x},\mathbf{y}\rangle_{\mathcal{X}}$ for all $\mathbf{x},\mathbf{y}\in\mathcal{X}$. The RKHS induced by $K_{lin}$ is identified with the feature space, which is $\mathcal{X} = \mathbb{R}^p$ itself.
In case of more sophisticated kernels, $\mathcal{H}_K$ contains non-linear functions, and therefore the features $\phi(\mathbf{x})$ can usually be realized only in a much higher dimension than that of $\mathcal{X}$.
For $\mathbf{x} = (x_1,x_2)\in\mathcal{X}=\mathbb{R}^2$, let
$$\phi(\mathbf{x}) := \big(x_1^2,\, x_2^2,\, \sqrt{2}\,x_1x_2\big) \in F \subset \mathbb{R}^3.$$
We want to separate data points allocated along two concentric circles, and therefore quadratic functions $\mathbb{R}^2\to\mathbb{R}$ are applied. Then
$$\langle\phi(\mathbf{x}),\phi(\mathbf{y})\rangle_F = x_1^2y_1^2 + x_2^2y_2^2 + 2x_1x_2y_1y_2 = (x_1y_1 + x_2y_2)^2 = \langle\mathbf{x},\mathbf{y}\rangle_{\mathcal{X}}^2;$$
hence, the new kernel is the square of the linear one, which is also positive definite (polynomial kernel).
The RKHS $\mathcal{H}_K$ corresponding to the feature space $F$ now consists of homogeneous quadratic functions $\mathbb{R}^2\to\mathbb{R}$, with the functions $f_1 : (x_1,x_2)\mapsto x_1^2$, $f_2 : (x_1,x_2)\mapsto x_2^2$, and $f_3 : (x_1,x_2)\mapsto \sqrt{2}\,x_1x_2$ forming an orthonormal basis in $\mathcal{H}_K$ such that $K(\mathbf{x},\mathbf{y}) = \sum_{i=1}^3 f_i(\mathbf{x})f_i(\mathbf{y})$.
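The kernel identity of this second example can be verified directly: the explicit feature map reproduces the squared linear kernel on random test points (my own sketch; the helper name `phi` is illustrative).

```python
import numpy as np

def phi(x):
    """Explicit feature map R^2 -> R^3 for the degree-2 polynomial kernel."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

# check <phi(x), phi(y)> = <x, y>^2 on random pairs
rng = np.random.default_rng(5)
for _ in range(100):
    x, y = rng.normal(size=2), rng.normal(size=2)
    assert np.isclose(phi(x) @ phi(y), (x @ y) ** 2)
```

The $\sqrt{2}$ factor on the cross term is exactly what makes the three coordinate functions orthonormal in $\mathcal{H}_K$ and the inner product come out as $\langle\mathbf{x},\mathbf{y}\rangle^2$.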
S: Observe that in both examples $\phi(x)$ is a vector whose coordinates are the basis functions of the RKHS evaluated at $x$. In the first example $\phi(x)$ is an infinite-dimensional, whereas in the second one a finite-dimensional vector. Note that $\mathcal{H}_K$ is an affine and sparsified version of the Hilbert space of $\mathcal{X}\to\mathbb{R}$ functions, in which the inner product is adapted to the requirement that it reproduce the kernel.
H: For me it means that non-linear features can be represented in a more complicated Hilbert space, and not the original one. Am I right, my dear Sajó?
S:Not exactly, but probably this is the point of the whole RKHS story.