
The heterogeneity of biomedical data and knowledge is frequently expressed in a dual form. The earlier of the two approaches is to utilize the language of multiple pairwise similarities, which provides a unifying, but ultimately "descriptive" picture, as these similarities are usually computed from various descriptions and representations of the entities. This approach has been evolving together with its more technical counterpart in the field of machine learning, where MKL methods were being developed to integrate multiple sources of information, e.g. in sensor fusion problems. Since then, a staggering number of MKL-based algorithms have been proposed to solve biomedical problems using a wide array of similarity measures over genes, proteins, drugs etc. [95, 97].

A more recent attempt to cope with heterogeneity utilizes equivalence relations, and began to renew bioinformatics research under the aliases of systems biology and network biology. More "relational" than "descriptive" in nature, these methods focused not on the properties of the entities themselves but on their complex interaction patterns. This viewpoint also has its technical counterpart in the much more mature framework of graph theory, which has since become indispensable in modern biomedical (or social, telecommunications etc.) research.

These trends are especially prevalent in the field of in silico drug discovery. Multiple chemical similarities have been employed in virtual compound screening studies for more than a decade [21]. Equivalence relations also arise naturally in a number of scenarios, e.g. drug combinations, interactions or shared indications. However, combining these fundamentally different sources of information in a mathematically sound framework remained an open challenge. In this Chapter, we propose a Distance Metric Learning framework which is capable of incorporating such dual priors.

The author’s contributions are as follows. Section 3.2 presents the primal and the derivation of the dual optimization problem (Thesis II.1). Section 3.3 describes the details of the algorithm and its implementation on specialized hardware, namely graphics processors (Thesis II.2). Section 3.4 presents an application and evaluation in drug repositioning including a comparison with earlier methods (Thesis II.3). The material in this Chapter is based on the following publication [101]:

• B. Bolgár and P. Antal. Towards Multipurpose Drug Repositioning: Fusion of Multiple Kernels and Partial Equivalence Relations Using GPU-accelerated Metric Learning, pages 36–39. Springer Singapore, Singapore, 2015. ISBN 978-981-287-573-0. doi: 10.1007/978-981-287-573-0_9. URL https://doi.org/10.1007/978-981-287-573-0_9.


[Figure 3.1 (diagram): multiple similarities (kernels) and incomplete, inconsistent relations, drawn from systems biology, multiple 'omic levels and biomedical knowledge, are combined by Distance Metric Learning ("pre-clustering", RKHS embedding, GPU/CPU hardware) into a unified representation used for prediction and quantitative evaluation.]

Figure 3.1. Overview of the work presented in this thesis. Two types of background information, equivalence relations and similarities, are combined in the joint framework of Distance Metric Learning and subsequently used for drug repositioning by drug prioritization.

3.1.1 Distance Metric Learning

Definition 3.1.1. Let $\mathcal{X}$ be a set together with a function $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_{\geq 0}$ which satisfies

• $d(x_i, x_j) \geq 0$ and $d(x_i, x_j) = 0 \Leftrightarrow x_i = x_j$ (positive definite),

• $d(x_i, x_j) = d(x_j, x_i)$ (symmetric),

• $d(x_i, x_j) + d(x_j, x_k) \geq d(x_i, x_k)$ (triangle inequality)

for all $x_i, x_j, x_k \in \mathcal{X}$. Then $\mathcal{X}$ is called a metric space with the metric $d$.

Example 3.1.1. Let $\mathcal{X}$ be a Banach space (i.e. a complete normed vector space). The metric induced by the norm $\|\cdot\|$ on $\mathcal{X}$ is
$$d(x_i, x_j) = \|x_i - x_j\|.$$
$\ell_p$ spaces are special cases of the former with the $p$-norm defined as in Section 2.1.1, and the $\ell_p$ distance is
$$d(x_i, x_j) = \|x_i - x_j\|_p \quad \text{with } p > 1.$$

Example 3.1.2. Let $\mathcal{X}$ denote the vector space of $m \times n$ matrices $\mathbb{R}^{m \times n}$ with the Frobenius inner product
$$\langle A, B \rangle_F = \operatorname{Tr}(AB^T)$$
for $A, B \in \mathcal{X}$. The induced metric is given by
$$d(A, B) = \sqrt{\langle A - B, A - B \rangle_F}.$$

Example 3.1.3. Let $\mathcal{X}$ be the vector space $\mathbb{R}^D$. Using the Frobenius inner product, the Mahalanobis distance of $x_i, x_j \in \mathcal{X}$ can be written as
$$d(x_i, x_j) = \|L x_i - L x_j\| = \sqrt{\left\langle (x_i - x_j),\, W (x_i - x_j) \right\rangle} = \sqrt{\left\langle W,\, (x_i - x_j) \otimes (x_i - x_j) \right\rangle_F}$$
with $W = L^T L$, hence $W$ is positive definite. The Mahalanobis distance is the standard Euclidean distance after applying a linear transformation $L$.
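As a numerical sanity check of Example 3.1.3, the following Python/NumPy sketch (illustrative only, not part of the implementation described later in this Chapter) evaluates the three equivalent forms of the Mahalanobis distance for a randomly drawn $L$:

import numpy as np

rng = np.random.default_rng(0)
D = 5
L = rng.standard_normal((D, D))   # linear transformation (assumed full rank)
W = L.T @ L                       # induced positive definite matrix
xi, xj = rng.standard_normal(D), rng.standard_normal(D)
diff = xi - xj

# Form 1: Euclidean distance after applying L
d1 = np.linalg.norm(L @ xi - L @ xj)
# Form 2: quadratic form <(xi - xj), W (xi - xj)>
d2 = np.sqrt(diff @ W @ diff)
# Form 3: Frobenius inner product <W, (xi - xj) (x) (xi - xj)>_F
d3 = np.sqrt(np.sum(W * np.outer(diff, diff)))

assert np.allclose(d1, d2) and np.allclose(d2, d3)  # all three forms agree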

Remark. In general, $\mathcal{X}$ can carry much more structure than a vector space, which can also be exploited by the distance functions (e.g. graphs, trees, strings etc.). However, in this Chapter we restrict our attention to $\mathbb{R}^D$ and, ultimately, the linear setting, although non-linearization tricks will also be used.

In a sense, the Mahalanobis distance has the most general form, as it can be seen as the Euclidean version of the arc length on a Riemannian manifold $\mathcal{X}$,
$$L(\gamma) = \int_0^1 \sqrt{\langle \dot\gamma(t), \dot\gamma(t) \rangle_W}\, dt = \int_0^1 \sqrt{\sum_i \sum_j W_{ij}(t)\, \dot\gamma_i(t)\, \dot\gamma_j(t)}\, dt,$$
where $\gamma : [0,1] \to \mathcal{X}$ is a curve on $\mathcal{X}$, $\dot\gamma(t) \in T_{\gamma(t)}\mathcal{X}$ is its tangent vector at $\gamma(t)$ and $W$ is the metric tensor (i.e. a smooth symmetric positive definite covariant rank-2 tensor field on $\mathcal{X}$). If we take $\mathcal{X}$ to be $\mathbb{R}^D$ and $\gamma$ to be a geodesic curve (i.e. a straight line), and choose a global coordinate system, we get exactly the formula of the Mahalanobis distance. Learning a metric of this form is by far the most popular approach to DML. Recently, there has been some work on local DML methods which keep the "infinitesimal" Mahalanobis metrics given at each point [102], or change the topology and geometry of $\mathcal{X}$ in such a way that the integral remains tractable [103].
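The reduction from arc length to the Mahalanobis formula can also be verified numerically: with a constant metric tensor $W$ and a straight-line curve, a discretized version of the integral above reproduces the closed-form distance. The sketch below is a minimal illustration with hypothetical helper names, assuming $\mathcal{X} = \mathbb{R}^D$:

import numpy as np

def arc_length(gamma, W, n_steps=1000):
    # Approximate L(gamma) = int_0^1 sqrt(<gamma'(t), gamma'(t)>_W) dt
    # for a constant metric tensor W, by summing over small segments.
    t = np.linspace(0.0, 1.0, n_steps + 1)
    pts = np.array([gamma(s) for s in t])
    segs = np.diff(pts, axis=0)          # finite-difference tangents times dt
    return sum(np.sqrt(seg @ W @ seg) for seg in segs)

rng = np.random.default_rng(1)
D = 4
A = rng.standard_normal((D, D))
W = A.T @ A + np.eye(D)                  # constant positive definite metric tensor
x, y = rng.standard_normal(D), rng.standard_normal(D)

line = lambda t: (1 - t) * x + t * y     # geodesic (straight line) in flat space
print(arc_length(line, W), np.sqrt((x - y) @ W @ (x - y)))  # the two values coincide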

The goal of DML, with its roots in dimensionality reduction, Sammon mapping and topology-preserving mapping methods, is to learn a problem-specific optimal metric (usually the matrix of the Mahalanobis distance), tailored to the machine learning task at hand. In most cases, this is conceptualized as a supervised learning problem and solved using constrained optimization methods. From a more practical viewpoint, it can also be thought of as "inverse clustering", where a combined measure of distance is learned from a set of base metrics with respect to prior pairwise equivalence and inequivalence relations. In the next Section, we review earlier approaches directly relevant to our work; an extensive review of the literature can be found in [104]. We adapted the notations for convenience and easier comparability.

3.1.2 Earlier works

In most supervised DML formulations, training examples enter into the problem as a priori constraints. These can take the form of

• Pairwise constraints: equivalence relations (must-link and cannot-link constraints between entities),

• Triplet-based constraints: relative distances (e.g. $x_i$ should be closer to $x_j$ than to $x_k$).

This approach is often used in conjunction with a suitable regularization term on the Mahalanobis matrix, yielding an SVM-like optimization problem [105]. This resemblance is even more obvious when the constraints are "softened" by introducing slack variables, very much like in soft-margin SVM formulations. However, unlike in SVMs, extra care must be taken to ensure the positive definite property of the metric, which is the most challenging part. Solutions include projecting back onto the PSD cone in each iteration [106], using carefully chosen divergence measures in the regularizer [107], or neglecting the property altogether [108].
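As an illustration of the first of these strategies, projection onto the PSD cone is typically implemented by clipping negative eigenvalues to zero; the following sketch shows the standard construction (the cited methods differ in their details):

import numpy as np

def project_to_psd(W):
    # Project a symmetric matrix onto the PSD cone (Frobenius-nearest point)
    # by zeroing out its negative eigenvalues.
    W = (W + W.T) / 2                      # symmetrize against numerical noise
    eigvals, eigvecs = np.linalg.eigh(W)
    eigvals = np.clip(eigvals, 0.0, None)  # clip negative eigenvalues
    return eigvecs @ np.diag(eigvals) @ eigvecs.T

# After a gradient step W may leave the PSD cone; project it back:
W = np.array([[2.0, 0.0], [0.0, -0.5]])
W_psd = project_to_psd(W)                  # the eigenvalue -0.5 is clipped to 0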

3.1.2.1 Large-Margin Nearest Neighbors

Using ideas from $k$NN classification, Large-Margin Nearest Neighbors (LMNN) utilizes both types of constraints and $k$-neighborhood memberships [109]:
$$\mathcal{S} = \{(x_i, x_j) : y_i = y_j \text{ and } x_j \in k\mathrm{NN}(x_i)\},$$
$$\mathcal{R} = \{(x_i, x_j, x_k) : (x_i, x_j) \in \mathcal{S} \text{ and } y_i \neq y_k\}.$$
The objective function is

$$\min_{W} \; (1 - \lambda) \sum_{x_i, x_j \in \mathcal{S}} d(x_i, x_j) + \lambda \sum_i \sum_j \sum_k \xi_{ijk}$$
$$\text{s.t.} \quad d(x_i, x_k) - d(x_i, x_j) \geq 1 - \xi_{ijk}, \quad \forall x_i, x_j, x_k \in \mathcal{R},$$

where $d$ denotes the squared Mahalanobis distance with matrix $W$ and $\xi_{ijk}$ are slack variables. Intuitively, the objective is to pull desired neighbors together while keeping "impostors" away; the trade-off is controlled by $\lambda$. The problem is solved using subgradient descent.
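The pull/push structure of the objective becomes explicit when the slacks are taken at their optimal hinge values, $\xi_{ijk} = \max(0,\, 1 - d(x_i, x_k) + d(x_i, x_j))$; the following sketch (hypothetical helper code, not the reference implementation of [109]) evaluates the LMNN loss for a fixed $W$:

import numpy as np

def sq_mahalanobis(W, xi, xj):
    d = xi - xj
    return d @ W @ d

def lmnn_objective(W, X, S, R, lam):
    # LMNN loss for a fixed Mahalanobis matrix W.
    # S: list of (i, j) target-neighbor pairs, R: list of (i, j, k) impostor triplets.
    pull = sum(sq_mahalanobis(W, X[i], X[j]) for i, j in S)
    push = sum(max(0.0, 1.0 - sq_mahalanobis(W, X[i], X[k])
                        + sq_mahalanobis(W, X[i], X[j])) for i, j, k in R)
    return (1.0 - lam) * pull + lam * push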

3.1.2.2 Information Theoretic Metric Learning

Information Theoretic Metric Learning (ITML) utilizes pairwise constraints and the LogDet divergence in the regularizer, which is rank-preserving and therefore can be used to enforce the positive definite property automatically [107]. The primal is

$$\min_{W} \; \lambda D(W, W_0) + \sum_i \sum_j \xi_{ij}$$
$$\text{s.t.} \quad d(x_i, x_j) \leq u + \xi_{ij} \quad \forall x_i, x_j \in \mathcal{S},$$
$$\qquad\;\; d(x_i, x_k) \geq v - \xi_{ik} \quad \forall x_i, x_k \in \mathcal{D},$$

where $D(W, W_0) = \operatorname{Tr}(W W_0^{-1}) - \log\det(W W_0^{-1}) - \dim(W)$ is the LogDet divergence, $\xi_{ij}$ are the slack variables, $d$ is the squared Mahalanobis distance with matrix $W$, $\mathcal{S}$ and $\mathcal{D}$ are the sets of must-link and cannot-link pairs, respectively, and $\lambda, u, v$ are parameters. In the experiments, the prior metric $W_0$ is chosen to be the identity matrix, which ensures that the learned metric is full-rank and avoids overfitting while conforming to the constraints as much as possible. The parameter $\lambda$ controls the trade-off between these two goals. The problem is solved using iterative Bregman projections.
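The LogDet term itself is simple to evaluate; a minimal sketch, assuming both arguments are positive definite, is:

import numpy as np

def logdet_divergence(W, W0):
    # D(W, W0) = Tr(W W0^{-1}) - log det(W W0^{-1}) - dim(W);
    # finite only for positive definite W, which is how ITML keeps W in the PSD cone.
    M = W @ np.linalg.inv(W0)
    sign, logdet = np.linalg.slogdet(M)
    if sign <= 0:
        return np.inf
    return np.trace(M) - logdet - W.shape[0]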

3.1.2.3 The KPCA trick

There has been a significant amount of work on non-linearizing earlier DML methods, most notably via the kernel trick. Analogously to kernel PCA, by parameterizing $W = (F\Phi)^T F\Phi$, we get a nonlinear version of the algorithm which can be expressed solely in terms of inner products. In particular,
$$d^2(x_i, x_j) = \left\| F\Phi\left( \phi(x_i) - \phi(x_j) \right) \right\|^2 = \left\langle F^T F,\, (\tilde{x}_i - \tilde{x}_j) \otimes (\tilde{x}_i - \tilde{x}_j) \right\rangle_F,$$
where $\Phi$ is the matrix $[\phi(x_1)\ \phi(x_2)\ \cdots\ \phi(x_n)]^T$ and thus the transformed samples $\tilde{x}$ correspond to the rows of a suitable Gram matrix, where $\phi : \mathbb{R}^D \to \mathcal{X}$ and $\mathcal{X}$ is the corresponding Reproducing Kernel Hilbert Space. This trick is called the KPCA trick and comes with its own version of the Kimeldorf–Wahba Representer Theorem, proven by Chatpatanasiri et al. [110]:

Theorem 3.1.1 (Representer Theorem). Let $\mathcal{X}, \mathcal{Y}$ be separable Hilbert spaces and $\psi_i \in \mathcal{X}$ such that $\operatorname{span}(\{\psi_i\}_{i=1}^P) = \operatorname{span}(\{\phi(x_i)\}_{i=1}^P)$. Let $f$ be an objective function which depends solely on inner products of the form $\langle \tilde{L}\phi(x_i), \tilde{L}\phi(x_j) \rangle$ with $\tilde{L} \in \mathcal{B}(\mathcal{X}, \mathcal{Y})$, the space of bounded linear operators. Then
$$\min_{\tilde{L} \in \mathcal{B}(\mathcal{X}, \mathcal{Y})} f\left( \langle \tilde{L}\phi(x_1), \tilde{L}\phi(x_1) \rangle, \ldots, \langle \tilde{L}\phi(x_i), \tilde{L}\phi(x_j) \rangle, \ldots \right) \tag{3.1}$$
has the same optimal value as
$$\min_{L \in \mathbb{R}^{P \times P}} f\left( \tilde{x}_1^T L^T L \tilde{x}_1, \ldots, \tilde{x}_i^T L^T L \tilde{x}_j, \ldots \right), \tag{3.2}$$
where $\tilde{x}_i = [\langle \phi(x_i), \psi_1 \rangle \; \cdots \; \langle \phi(x_i), \psi_P \rangle]^T$.

Remark. The vector $\tilde{x}_i \in \mathbb{R}^P$ is called the representer of $\phi(x_i)$. The Representer Theorem allows one to express the solution of (3.1) in terms of a $P \times P$-dimensional matrix, despite the fact that $\mathcal{X}$ may be infinite-dimensional. However, DML formulations involving certain types of regularizers need a stronger version of this theorem:

Theorem 3.1.2 (Strong Representer Theorem). Let $f$ and $\{\psi_i\}_{i=1}^P$ be as before and introduce a regularized objective function
$$h(\tau, \phi) = f\left( \langle \phi(x_1), \tau_1 \rangle, \ldots, \langle \phi(x_i), \tau_j \rangle, \ldots \right) + \sum_{l=1}^{d} g_l(\|\tau_l\|),$$
where $g_l$ is monotonically increasing. Then any solution to
$$\arg\min_{\tau \in \mathcal{B}^{\times d}(\mathcal{X}, \mathbb{R})} h(\tau, \phi)$$
must admit a representation
$$\tau_l = \sum_{i=1}^{P} \alpha_{il} \psi_i.$$

Remark. If we take $L$ to be $[\tau_1 \; \cdots \; \tau_d]^T$ and $\psi_i$ to be $\phi(x_i)$, we have that any $\tau_l$ in the solution must take the form $\Phi \alpha_l$, which provides the justification for the KPCA trick.
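In practice, the KPCA trick amounts to replacing each sample by the corresponding row of the Gram matrix and learning an ordinary $P \times P$ Mahalanobis matrix on these representers. The following sketch illustrates this with an RBF kernel chosen arbitrarily for the example (all variable names are hypothetical):

import numpy as np

def rbf_gram(X, sigma=1.0):
    # Gram matrix of an RBF kernel; its rows are the representers x~_i.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 3))
K = rbf_gram(X)

F = rng.standard_normal((10, 10))   # stand-in for a learned transformation
M = F.T @ F                         # plays the role of L^T L in the P x P representer space

def kernelized_sq_dist(M, K, i, j):
    d = K[i] - K[j]                 # difference of representers
    return d @ M @ d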

3.1.2.4 Multiple Kernel Distance Metric Learning

As mentioned before, many DML approaches bear a striking resemblance to SVMs, where Multiple Kernel Learning techniques have been investigated for a long time. Adapting these to improve predictive performance is a natural extension to earlier DML methods. Independently of our work, several formulations were developed in the past few years. The primal of the MKL-SVM algorithm developed by Lu et al. [105] is

$$\min_{W, m, b, \xi} \; \frac{1}{2} \sum_k \frac{\|W_k\|_F^2}{m_k} + C \sum_i \xi_i$$
$$\text{s.t.} \quad y_i \left( \sum_k m_k \left\langle W_k,\, (\tilde{x}_{ik} - \tilde{z}_{ik}) \otimes (\tilde{x}_{ik} - \tilde{z}_{ik}) \right\rangle + b \right) \geq 1 - \xi_i,$$
$$\qquad\;\; \xi_i \geq 0, \quad W_k \succeq 0,$$

where $W_k$ is the matrix of the $k$th Mahalanobis distance, $\xi_i$ are the slack variables and $C$ controls complexity. The training set consists of pairwise constraints $\{(x_{ik}, z_{ik}, y_i)\}_{i=1,k=1}^{P,R}$, where $k$ indexes the information sources and $i$ indexes the training pairs. The label $y_i = +1$ if the $i$th training pair should be pulled together and $y_i = -1$ if they should be pushed apart.
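The decision value inside the constraint is a weighted sum of per-kernel quadratic forms; the sketch below (hypothetical variable names, not the authors' code) shows how a single pairwise constraint would be evaluated:

import numpy as np

def pairwise_margin(Ws, ms, b, x_tilde, z_tilde):
    # sum_k m_k <W_k, (x_ik - z_ik) (x) (x_ik - z_ik)> + b for one training pair;
    # Ws, ms, x_tilde and z_tilde are all indexed by the kernel index k.
    score = b
    for W_k, m_k, xk, zk in zip(Ws, ms, x_tilde, z_tilde):
        d = xk - zk
        score += m_k * (d @ W_k @ d)   # <W_k, d (x) d> = d^T W_k d
    return score

# The pair (x_i, z_i) violates its constraint if y_i * pairwise_margin(...) < 1.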

Another similar formulation is by Wang et al. [108], which utilizes the divergence idea from ITML, triplet constraints and an $\ell_p$ regularizer on the weights:
$$\min_{W, m, \xi} \; \frac{1}{2} \sum_k \frac{\|W_k - W_k^0\|_F^2}{m_k} + C \sum_i \sum_j \sum_l \xi_{ijl} + \frac{\lambda}{2} \|m\|_p^2$$
$$\text{s.t.} \quad d(x_i, x_l) - d(x_i, x_j) \geq 1 - \xi_{ijl}, \quad \xi_{ijl} \geq 0, \quad m_k > 0$$
with $p > 1$; however, this formulation neglects the positive definite property of the Mahalanobis matrix.
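For completeness, the objective of this triplet-based variant can be evaluated in the same style (a sketch under the same conventions as above, with the slacks again taken at their hinge values and the combined squared distance passed in as a callable):

import numpy as np

def wang_objective(Ws, W0s, ms, triplets, dist, C, lam, p):
    # (1/2) sum_k ||W_k - W_k^0||_F^2 / m_k + C * sum of hinge slacks
    # + (lam/2) * ||m||_p^2; dist(i, j) returns the combined squared distance.
    reg = 0.5 * sum(np.linalg.norm(W - W0, 'fro')**2 / m
                    for W, W0, m in zip(Ws, W0s, ms))
    slacks = sum(max(0.0, 1.0 - dist(i, l) + dist(i, j)) for i, j, l in triplets)
    weight_reg = 0.5 * lam * np.linalg.norm(np.asarray(ms), ord=p)**2
    return reg + C * slacks + weight_reg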