
Computational drug repositioning refers to finding new indications for an existing drug by means of in silico methods. In this Chapter, we utilize the machinery introduced in Sections 1.2 and 1.3 and present a kernel-based methodology for computational drug repositioning. In particular, we utilize multiple representations of drug molecules and extend the kernel framework to handle such multi-view data in a statistically optimal way. We adapt one-class Support Vector Machines to perform prioritization or ranking of drugs and explore the benefits and limits of this strategy. Analogously to ligand-based virtual screening, each indication is characterized by a training set consisting of the molecules applied in it; the resulting prioritized lists can then be utilized to predict novel drugs for an indication or novel indications for a drug.

The author’s contributions are as follows. Section 2.2 presents the adaptation of $\ell_p$-regularized MKL to the prioritization problem in drug repositioning (Thesis I.1). Section 2.3 includes a theoretical analysis of the one-class MKL framework and experiments on chemical structural data (Thesis I.2).

Section 2.4 describes the suggested workflow for computational drug repositioning. Results obtained in collaboration with other researchers are not presented in this dissertation; we refer to the original papers instead. The material in this Chapter is based on the following publications [36, 91, 92]:

• B. Bolgar, A. Arany, G. Temesi, B. Balogh, P. Antal, and P. Matyus. Drug repositioning for treatment of movement disorders: from serendipity to rational discovery strategies. Curr. Top. Med. Chem., 13(18):2337–2363, 2013.

• A. Arany, B. Bolgar, B. Balogh, P. Antal, and P. Matyus. Multi-aspect candidates for repositioning: data fusion methods using heterogeneous information sources. Curr. Med. Chem., 20(1):95–107, 2013.

• G. Temesi, B. Bolgar, A. Arany, C. Szalai, P. Antal, and P. Mátyus. Early repositioning through compound set enrichment analysis: a knowledge-recycling strategy. Future Med. Chem., 6(5):563–575, 2014.

2.1.1 Data fusion through linear kernel combinations

In Section 1.2.1.1, we gave an overview of Reproducing Kernel Hilbert Spaces and kernel methods in general; here we extend the framework to multiple information sources. There has been a large amount of work on developing Multiple Kernel Learning techniques with improved accuracy and favorable scaling properties, with a special focus on data fusion. Fixed combination strategies, such as summing or multiplying kernels, were applied as early as the work of Pavlidis et al. [3], who coined the term “intermediate” data fusion. Another strategy, kernel alignment, and its centered version have attracted considerable attention lately, as they were shown to outperform uniform and linear kernel combinations [93]. However, the central challenge is incorporating the combination scheme into the learning process, which is usually done through regularized risk minimization or Bayesian approaches. In this Chapter, we take the former route; in Chapter 4, we will develop a Bayesian technique as well. A comprehensive overview and detailed comparison of Multiple Kernel Learning methods can be found in [94].

Regularized risk minimization casts the learning problem into a constrained optimization framework, which now includes both the base learner and the kernel combination strategy. Despite their limitations, which we will touch on briefly in Section 2.3, the overwhelming majority of the algorithms use linear kernel combinations and make the kernel weights a part of the optimization problem. From the theoretical viewpoint, this is a perfectly reasonable choice. Suppose the combined kernel function $k$ is given as a linear combination of base kernel functions $k_k$ with non-negative weights $m_k$:

$$k(x_i, x_j) = \sum_{k=1}^{R} m_k\, k_k(x_i, x_j).$$

Using the combined reproducing kernel map

$$\varphi(x) = \left[\sqrt{m_1}\,\varphi_1(x),\; \sqrt{m_2}\,\varphi_2(x),\; \ldots,\; \sqrt{m_R}\,\varphi_R(x)\right]^T$$

we see that the linear combination corresponds to a weighted concatenation of the images of the samples $x_i, x_j \in \mathcal{X}$ under the individual reproducing kernel maps $\varphi_k : \mathcal{X} \to \mathcal{F}_k$:

$$\langle \varphi(x_i), \varphi(x_j) \rangle_{\mathcal{F}} = \sum_{k=1}^{R} \sqrt{m_k}^{\,2}\, \langle \varphi_k(x_i), \varphi_k(x_j) \rangle_{\mathcal{F}_k} = \sum_{k=1}^{R} m_k\, k_k(x_i, x_j).$$

In other words, by concatenation, we get the combined Reproducing Kernel Hilbert Space $\mathcal{F} = \mathcal{F}_1 \oplus \mathcal{F}_2 \oplus \cdots \oplus \mathcal{F}_R$.

This is theoretically appealing, as it comes with its own version of the Representer Theorem with essentially the same proof.
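To make the combination concrete, the following minimal NumPy sketch builds a combined kernel matrix as a weighted sum of precomputed base kernel matrices; the toy data, the Gaussian bandwidths, the weights, and the helper name are placeholders chosen purely for illustration.

```python
import numpy as np

def combine_kernels(base_kernels, weights):
    """Return the combined kernel matrix K = sum_k m_k * K_k for
    precomputed base kernel matrices K_k and non-negative weights m_k."""
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0), "kernel weights must be non-negative"
    return sum(m * K for m, K in zip(weights, base_kernels))

# Toy example: two Gaussian kernels with different bandwidths on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                                   # 5 samples, 3 features
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K1 = np.exp(-sq_dists / 1.0)                                  # narrow bandwidth
K2 = np.exp(-sq_dists / 10.0)                                 # wide bandwidth
K = combine_kernels([K1, K2], [0.3, 0.7])                     # combined kernel matrix
```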

The only remaining ingredient is the regularization term on the kernel weights. Usually, the goal is to formulate the optimization problem as a convex objective function over a convex set of base kernels. A fairly general regularizer which covers most of the algorithms is the $\ell_p$-norm

$$\|m\|_p = \left(\sum_k m_k^p\right)^{1/p}.$$

Finally, we state Hölder’s inequality here, as its application is usually the key step in the derivation of the dual objective.

Theorem 2.1.1 (Hölder’s inequality). Let $f, g : \mathbb{R}^n \to \mathbb{R}$ be Lebesgue-measurable and $p, q \in [1, \infty]$ with $\frac{1}{p} + \frac{1}{q} = 1$. Then
$$\|fg\|_1 \le \|f\|_p\, \|g\|_q.$$

2.1.1.1 QCQP formulation

Lanckriet et al. were the first to consider MKL for knowledge and data fusion, termed genomic data fusion, by deriving kernels from a priori background knowledge. They investigated kernel fusion as a linear combination of base kernels, where the SVM model and the kernel weights are learned simultaneously in the same optimization problem [95]. For positive definite kernels, this leads to a semidefinite programming (SDP) problem, and the dual is modified as

$$\min_m \max_\alpha \;\; \mathbf{1}^T\alpha - \frac{1}{2}\, \alpha^T \Big(\sum_k m_k Q_k\Big)\, \alpha \qquad \text{s.t.} \quad 0 \le \alpha \le C, \quad y^T\alpha = 0, \quad \operatorname{tr}\Big(\sum_k m_k Q_k\Big) = c, \quad \sum_k m_k Q_k \succeq 0,$$

where $c$ is a constant. Restricting to convex combinations, a special case of the former can be formulated as a quadratically constrained quadratic program (QCQP):

$$\max_\alpha \;\; \mathbf{1}^T\alpha - c\,\tau \qquad \text{s.t.} \quad \tau \ge \frac{1}{2}\, \alpha^T Q_k \alpha \;\;\forall k, \quad 0 \le \alpha \le C, \quad y^T\alpha = 0,$$
which is easily solved by many out-of-the-box optimization packages.
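As an illustration of how such a package can be used, the sketch below expresses this convex-combination QCQP in CVXPY; the matrices $Q_k = \operatorname{diag}(y) K_k \operatorname{diag}(y)$ are assumed symmetric positive semidefinite, and the constant $c$, the default parameters, and the helper name are placeholders rather than part of any published implementation.

```python
import cvxpy as cp
import numpy as np

def mkl_qcqp(kernels, y, C=1.0, c=1.0):
    """Illustrative QCQP for MKL with convex kernel combinations.
    kernels: list of symmetric PSD kernel matrices K_k; y: labels in {-1, +1}."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # label-weighted kernel matrices Q_k = diag(y) K_k diag(y)
    Qs = [np.diag(y) @ K @ np.diag(y) for K in kernels]
    alpha = cp.Variable(n)
    tau = cp.Variable()
    constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
    # one quadratic constraint per base kernel
    constraints += [tau >= 0.5 * cp.quad_form(alpha, 0.5 * (Q + Q.T)) for Q in Qs]
    problem = cp.Problem(cp.Maximize(cp.sum(alpha) - c * tau), constraints)
    problem.solve()
    return alpha.value, tau.value
```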

2.1.1.2 $\ell_1$ formulation

SimpleMKL [96] takes a different approach by re-formulating the primal as a convex optimization problem:

$$\min_{w_k, b, \xi, m} \;\; \frac{1}{2} \sum_k \frac{\|w_k\|^2}{m_k} + C \sum_i \xi_i \qquad \text{s.t.} \quad y_i \Big(\sum_k w_k^T \varphi_k(x_i) + b\Big) \ge 1 - \xi_i, \quad \xi \ge 0, \quad \sum_k m_k = 1, \quad m \ge 0.$$

The corresponding dual is difficult to optimize; hence, the authors propose a gradient-based algorithm that solves the primal for the weights and calls an inner routine to solve a standard SVM in each iteration.
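The following Python sketch conveys the spirit of this alternating scheme, using scikit-learn's SVC with a precomputed kernel as the inner routine; it takes a crude projected-gradient step on the weights rather than SimpleMKL's actual reduced-gradient update with line search, so it should be read as an illustration with placeholder names and step sizes, not as a faithful reimplementation.

```python
import numpy as np
from sklearn.svm import SVC

def simple_mkl_sketch(kernels, y, C=1.0, step=0.1, n_iter=20):
    """Alternating scheme in the spirit of SimpleMKL: fix the weights m,
    train a standard SVM on the combined kernel, then update m along the
    negative gradient of the dual objective and renormalize onto the simplex."""
    R = len(kernels)
    m = np.full(R, 1.0 / R)                                  # uniform initial weights
    for _ in range(n_iter):
        K = sum(mk * Kk for mk, Kk in zip(m, kernels))       # combined kernel
        svm = SVC(C=C, kernel="precomputed").fit(K, y)
        a = np.zeros(len(y))
        a[svm.support_] = svm.dual_coef_.ravel()             # a_i = alpha_i * y_i
        # dJ/dm_k = -0.5 * alpha^T Q_k alpha = -0.5 * a^T K_k a
        grad = np.array([-0.5 * a @ Kk @ a for Kk in kernels])
        m = np.clip(m - step * grad, 0.0, None)              # descend on J(m)
        m /= m.sum()                                         # crude simplex renormalization
    return m
```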

2.1.1.3 $\ell_p$ formulations

Yu et al. generalize the problem to $\ell_p$ regularization on the kernel weights [97]:

$$\min_m \max_\alpha \;\; \mathbf{1}^T\alpha - \frac{1}{2} \sum_k m_k\, \alpha^T Q_k \alpha \qquad \text{s.t.} \quad 0 \le \alpha \le C, \quad y^T\alpha = 0, \quad \|m\|_p = 1, \quad m \ge 0.$$

For $Q_k \succeq 0$, this leads to a dual very similar to the earlier QCQP formulation, but now with the $\ell_q$ regularizer

$$\max_\alpha \;\; \mathbf{1}^T\alpha - \tau \qquad \text{s.t.} \quad \tau \ge \Big(\sum_k \big(\alpha^T Q_k \alpha\big)^q\Big)^{1/q}, \quad 0 \le \alpha \le C, \quad y^T\alpha = 0,$$
where $\frac{1}{p} + \frac{1}{q} = 1$. Importantly, they also observe that $\ell_2$ regularization offers superior predictive performance in biomedical problems, as opposed to sparse combinations.
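A CVXPY sketch of this $\ell_q$-constrained dual follows; the per-kernel quadratic forms are lifted into auxiliary variables $t_k$ so that the norm constraint stays DCP-compliant (the lifting is tight at the optimum), the $Q_k$ matrices are again assumed symmetric positive semidefinite, and the helper name and defaults are illustrative only.

```python
import cvxpy as cp
import numpy as np

def lq_mkl_dual(kernels, y, C=1.0, q=2.0):
    """Illustrative l_q-regularized MKL dual with lifted quadratic forms."""
    y = np.asarray(y, dtype=float)
    Qs = [np.diag(y) @ K @ np.diag(y) for K in kernels]      # assumed symmetric PSD
    alpha = cp.Variable(len(y))
    t = cp.Variable(len(kernels), nonneg=True)               # t_k >= alpha^T Q_k alpha
    tau = cp.Variable()
    constraints = [alpha >= 0, alpha <= C, y @ alpha == 0, tau >= cp.norm(t, q)]
    constraints += [t[k] >= cp.quad_form(alpha, 0.5 * (Q + Q.T)) for k, Q in enumerate(Qs)]
    cp.Problem(cp.Maximize(cp.sum(alpha) - tau), constraints).solve()
    return alpha.value
```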

Vishwanathan et al. slightly modify the primal by switching to Tikhonov regularization and the squared $\ell_p$ norm, which gives [98]

$$\min_{w, b, \xi, m} \;\; \frac{1}{2} \sum_k \frac{\|w_k\|^2}{m_k} + C \sum_i \xi_i + \frac{\lambda}{2} \|m\|_p^2 \qquad \text{s.t.} \quad y_i \Big(\sum_k w_k^T \varphi_k(x_i) + b\Big) \ge 1 - \xi_i, \quad \xi \ge 0, \quad m \ge 0,$$
where $\lambda$ is introduced as a trade-off parameter. The dual reads

$$\max_\alpha \;\; \mathbf{1}^T\alpha - \frac{1}{8\lambda} \Big(\sum_k \big(\alpha^T Q_k \alpha\big)^q\Big)^{2/q} \qquad \text{s.t.} \quad 0 \le \alpha \le C, \quad y^T\alpha = 0,$$
which is differentiable in $\alpha$, enabling the application of the very efficient SMO algorithm.
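For illustration, the sketch below evaluates this smooth dual objective and its gradient with respect to $\alpha$, as a gradient-based or SMO-style solver would require; the function names are hypothetical, and the $Q_k$ are assumed to be precomputed label-weighted kernel matrices.

```python
import numpy as np

def squared_lp_dual_objective(alpha, Qs, lam, q):
    """Value of 1^T alpha - (1/(8*lam)) * (sum_k (alpha^T Q_k alpha)^q)^(2/q)."""
    quad = np.array([alpha @ Q @ alpha for Q in Qs])         # alpha^T Q_k alpha >= 0
    return alpha.sum() - (quad ** q).sum() ** (2.0 / q) / (8.0 * lam)

def squared_lp_dual_gradient(alpha, Qs, lam, q):
    """Gradient of the objective above with respect to alpha (chain rule)."""
    quad = np.array([alpha @ Q @ alpha for Q in Qs])
    outer = (2.0 / q) * (quad ** q).sum() ** (2.0 / q - 1.0) / (8.0 * lam)
    inner = sum(q * qk ** (q - 1.0) * (2.0 * Q @ alpha) for qk, Q in zip(quad, Qs))
    return np.ones_like(alpha) - outer * inner
```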

2.1.2 Prioritization with one-class SVMs

One-class Support Vector Machines were introduced by Schölkopf et al. to estimate small regions in a high-dimensional vector space where the training samples lie [9]. This is done in the RKHS by computing the hyperplane which separates these points from the origin with maximum margin, as illustrated in Figure 2.1. In particular, they solve the quadratic program

$$\min_{w, \xi, \rho} \;\; \frac{1}{2}\|w\|^2 - \rho + \frac{1}{\nu P} \sum_i \xi_i \qquad \text{s.t.} \quad w^T \varphi(x_i) \ge \rho - \xi_i, \quad \xi \ge 0,$$

Figure 2.1. Prioritization with one-class SVMs. Let us assume that the kernel matrix has a constant diagonal, i.e. the images of the samples lie on the surface of a hypersphere. Training samples, denoted by $Q$, are separated from the origin with maximum margin. Test samples, denoted by $E$, can be ranked according to the distance between the origin and their respective projections onto the hyperplane normal in the RKHS $\mathcal{H}$, yielding the ranking $E_2, E_3, E_1$.

where the margin is denoted by $\rho$ and the model complexity is controlled by $\nu$ (which is also a lower bound on the fraction of support vectors and an upper bound on the fraction of outliers). The dual reads

$$\max_\alpha \;\; -\frac{1}{2}\, \alpha^T K \alpha \qquad \text{s.t.} \quad 0 \le \alpha \le 1, \quad \mathbf{1}^T\alpha = \nu P.$$

Moreau et al. proposed an intuitive measure of the overall “similarity” of any sample to the training set as the distance between the origin and its projection onto the hyperplane normal:

$$f(x) = \frac{\sum_i \alpha_i\, k(x_i, x)}{\sqrt{\alpha^T K \alpha}},$$

which can be used to prioritize entities, e.g. for gene prioritization [97, 99].
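A minimal sketch of this prioritization scheme, using scikit-learn's OneClassSVM with a precomputed kernel, is given below; note that scikit-learn's $\nu$-parameterization and dual-coefficient scaling differ slightly from the formulation above, which does not affect the resulting ranking. The helper name and default $\nu$ are placeholders.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def prioritize(K_train, K_test_train, nu=0.5):
    """Rank test samples by f(x) = sum_i alpha_i k(x_i, x) / sqrt(alpha^T K alpha).
    K_train: kernel among training samples; K_test_train: kernel between test
    samples (rows) and training samples (columns)."""
    clf = OneClassSVM(kernel="precomputed", nu=nu).fit(K_train)
    alpha = clf.dual_coef_.ravel()                           # nonzero dual coefficients
    sv = clf.support_                                        # support vector indices
    denom = np.sqrt(alpha @ K_train[np.ix_(sv, sv)] @ alpha)
    scores = K_test_train[:, sv] @ alpha / denom
    return np.argsort(-scores), scores                       # best-ranked entities first
```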

2.1.3 Enrichment analysis

In the last decade, set enrichment analysis methods have become hugely popular in interpreting high-dimensional, correlated experimental results. The idea is to assess whether the elements of a set, called an annotation, are significantly over-represented (“enriched”) among the entities in the results. A plethora of biomedical ontologies are readily available, which provide annotations over a wide range of entities, e.g. functional annotations over genes, classifications over drug molecules, etc. Prioritized lists are especially well-suited for enrichment analyses; in this case, the question becomes whether entities corresponding to a given annotation are enriched among the top-ranked hits. Built on a solid statistical foundation, the SaddleSum algorithm is capable of taking scores into account and of providing accurate corrected p-values [100].

Let $\mathcal{E}$ denote the entities in the prioritized list and $\mathcal{A}$ denote the set of annotations. Each $a \in \mathcal{A}$ maps to a subset of entities via $E : \mathcal{A} \to 2^{\mathcal{E}}$, which selects entities in the list which are annotated by $a$. Let $S : \mathcal{E} \to \mathbb{R}$ be the scoring function of the prioritizer. A natural question to ask is whether the sum of scores for a given annotation $a \in \mathcal{A}$ exceeds the sum of scores of $|E(a)|$ randomly selected entities:

$$\hat{\sigma} := \sum_{e \in E(a)} S(e) \;\ge\; \sum_{s \in R_{|E(a)|}(S)} s =: \sigma,$$

where $R_{|E(a)|}$ is a uniform random sampling operator. The $p$-value for $\hat{\sigma}$ is given by
$$p(\sigma \ge \hat{\sigma}) = \int_{\hat{\sigma}}^{\infty} p(\sigma)\, d\sigma.$$

The SaddleSum algorithm approximates this quantity via a saddle point approximation of the tail probabilities, where the probability density function is given by the Fourier inversion formula

$$p(\sigma) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \exp\{|E(a)|\, K(it) - it\sigma\}\, dt,$$

where $K(t) = \ln \rho(t)$ is the cumulant generating function of $p$. SaddleSum first solves
$$\hat{\sigma} = |E(a)|\, K'(\lambda)$$

for $\lambda$ using Newton’s method and obtains $p$-values from the Lugannani–Rice approximation
$$p(\sigma \ge \hat{\sigma}) = \Phi(z) + \left(\frac{1}{y} - \frac{1}{z}\right) \varphi(z) + O\big(|E(a)|^{-3/2}\big), \qquad y = \lambda \sqrt{|E(a)|\, K''(\lambda)}, \qquad z = \operatorname{sgn}(\lambda) \sqrt{2\big(\lambda \hat{\sigma} - |E(a)|\, K(\lambda)\big)},$$
where $\varphi(x) = \exp(-x^2/2)/\sqrt{2\pi}$ and $\Phi(x) = \int_x^{\infty} \varphi(t)\, dt$. The enrichment value for $a$ is computed from its $p$-value by applying a Bonferroni correction.
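To make the quantity being approximated explicit, the sketch below estimates the same enrichment $p$-value by direct Monte Carlo resampling of entity scores and applies the Bonferroni correction; SaddleSum replaces this resampling with the analytic saddle-point approximation described above. Sampling with replacement, the function name, and the default sample count are simplifications for illustration.

```python
import numpy as np

def enrichment_pvalue(scores, annotated_idx, n_annotations, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the enrichment p-value: the probability that the
    summed score of |E(a)| uniformly sampled entities reaches the observed sum
    for annotation a, followed by a Bonferroni correction over all annotations."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    sigma_hat = scores[annotated_idx].sum()                  # observed score sum
    k = len(annotated_idx)
    random_sums = rng.choice(scores, size=(n_samples, k)).sum(axis=1)
    p = (np.count_nonzero(random_sums >= sigma_hat) + 1) / (n_samples + 1)
    return min(1.0, p * n_annotations)                       # Bonferroni-corrected
```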