
The growth of drug–target interaction (DTI) datasets is limited by the time-consuming and costly nature of the experiments. Such restrictions do not apply to the cross-linked repertoire of complementary "side information", which continues to expand at an enormous rate, including chemical structure, patient-reported and official side effects, off-label drug usage patterns, protein–protein networks and other chemo- and bioinformatics resources. Similarly to drug repositioning, DTI prediction can also benefit greatly from utilizing these extra information sources, which makes computational data and knowledge fusion approaches especially relevant. In this Chapter, we investigate Bayesian multiple kernel-based fusion approaches to the DTI task from a computational fusion perspective, and evaluate them using benchmark datasets, implementations and evaluation methodologies from Yamanishi et al. [115], Gönen [116], Pahikkala et al. [68] and Liu et al. [89].

The author's contributions are as follows. Section 4.2 presents a Bayesian logistic matrix factorization method, which unifies multiple kernel learning, importance weighting for positive observations, Laplacian regularization and explicit modeling of probabilities of drug–target interactions (Thesis III.1a). It also introduces a Bayesian variational approximation scheme for efficient inference (Thesis III.1b). Section 4.3 describes an alternative model which incorporates multiple sources of side information via Gaussian process priors and contains a novel missing not at random (MNAR) model for the pharmaceutical data (Thesis III.2). Section 4.4 demonstrates that the methods significantly outperform earlier solutions and investigates the reasons behind their top performance (Thesis III.3). Finally, Section 4.5 shows that the logistic matrix factorization method can be utilized to predict promiscuity and druggability (Thesis III.4). The material in this Chapter is based on the following publications [70]:

• B. Bolgár and P. Antal. Bayesian matrix factorization with non-random missing data using informative Gaussian process priors and soft evidences. In Alessandro Antonucci, Giorgio Corani, and Cassio Polpo Campos, editors, Proceedings of the Eighth International Conference on Probabilistic Graphical Models, pages 25–36, 2016.

• B. Bolgár and P. Antal. VB-MK-LMF: Variational Bayesian Multiple Kernel Logistic Matrix Factorization. BMC Bioinformatics, 2017, in press.


[Figure 4.1 diagram omitted; box labels in the figure: Drugs/compounds (descriptions: chemical, target-based, side-effect, MoA; similarities: kernel type/params; prior expectations: drug-likeness, promiscuity, number of hits in task); Targets (descriptions: sequence, GO, PPI, pockets, binding sites; similarities: kernel type/params; prior expectations: druggability, docking, other predictions, private data); Interactions/DTI data (public/benchmark, private, commercial); Prior knowledge and Parameters (multiple kernels, hyperparameters, augmentation, latent dimensions, weight prior, iterations); Probabilistic model (Laplacian regularization, interaction probabilities, kernel weights, missingness); Solver (MCMC, variational Bayes, GPU acceleration); Prediction (probabilistic predictions, promiscuity, credibility regions); Visualization (interaction graph, parallel coordinates, annotations, large-dimensional tools); Evaluation (AUROC/AUPRC, effects of priors, number of latent factors, kernel weights).]

Figure 4.1. Overview of the matrix factorization workflow for DTI prediction. A priori information (left) is combined with DTI data through a Bayesian model (middle). Learning the latent factors and optimal kernel weights is carried out using approximate inference methods. The models provide quantitative predictions of interactions (right). Gray color indicates functionalities which may also be utilized in the probabilistic models but are not explored in this dissertation.

4.1.1 Matrix factorization for the analysis of dyadic data

In machine learning, matrix factorization-based methods have become a well-established and powerful approach to analyzing dyadic data. The idea is to find a low-rank approximation of a matrix of observations $R \in \mathbb{R}^{I \times J}$ as a product of factors $U \in \mathbb{R}^{L \times I}$ and $V \in \mathbb{R}^{L \times J}$, such that

$$R \approx U^T V,$$

where $L \ll \operatorname{rank}(R)$ is called the dimension of the latent representations. The usual interpretation is to think of the columns of $U$ as representations of $I$ "row-entities" (e.g. drugs), the columns of $V$ as representations of $J$ "column-entities" (e.g. targets), while $R$ contains $I \times J$ interaction scores (e.g. bioactivity values). In most works, the Frobenius distance is employed in the loss function, i.e. the objective is

$$\min_{U,V} \left\| R - U^T V \right\|_F^2, \tag{4.1}$$

although it can be replaced with other matrix divergences, e.g. for non-negative representations [117].

Singular value decomposition (SVD) provides the unique and optimal solution to this problem, and it is easily tweaked to handle missing observations by reformulating it in an element-wise manner and keeping only the terms where $R_{ij}$ is known:

$$\min_{U,V} \sum_{\{(i,j)\,:\,R_{ij}\ \text{known}\}} \left(R_{ij} - u_i^T v_j\right)^2,$$

where $u_i$ and $v_j$ denote the $i$th column of $U$ and the $j$th column of $V$, respectively. As long as no zero rows or columns are present in $R$, the missing entries can be predicted by simply multiplying the corresponding latent representations:

$$R_{ij} \approx u_i^T v_j.$$

Figure 4.2. Matrix factorization with side information. The product of the factors $U$ and $V$ forms a low-rank approximation to the matrix $R$. The columns of the factors are learned in a way that takes a priori knowledge, in the form of kernel matrices, into account.
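To make the factorization and the masked objective above concrete, the following minimal NumPy sketch (not part of the original text; problem sizes, observation rate and learning rate are arbitrary choices) computes the rank-$L$ truncated-SVD approximation of a fully observed $R$, then fits the element-wise masked objective by plain gradient descent and predicts a missing entry as $u_i^T v_j$.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, L = 30, 20, 4                       # drugs, targets, latent dimension

# Ground-truth low-rank interaction matrix R = U^T V
U_true = rng.normal(size=(L, I))
V_true = rng.normal(size=(L, J))
R = U_true.T @ V_true

# (1) Fully observed case: truncated SVD gives the optimal rank-L approximation
u, s, vt = np.linalg.svd(R, full_matrices=False)
R_hat = u[:, :L] @ np.diag(s[:L]) @ vt[:L, :]
print("SVD reconstruction error:", np.linalg.norm(R - R_hat))

# (2) Masked case: minimize sum over known (i,j) of (R_ij - u_i^T v_j)^2
mask = rng.random((I, J)) < 0.5           # indicator of the known entries
U = rng.normal(scale=0.1, size=(L, I))
V = rng.normal(scale=0.1, size=(L, J))
lr = 0.01
for _ in range(3000):
    E = mask * (R - U.T @ V)              # residuals on the observed entries only
    U += lr * (V @ E.T)                   # gradient step for U
    V += lr * (U @ E)                     # gradient step for V

# Predict a missing entry from the learned latent representations
i, j = np.argwhere(~mask)[0]
print("true:", R[i, j], "predicted:", U[:, i] @ V[:, j])
```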

Remark. This idea has much more to it than it seems at first glance. Originating in the recommender systems community, where they were applied to movie recommendation [118], matrix factorization methods have revolutionized the industry, despite their apparent simplicity. It is still a mystery, at least to the author, why they work so well. It seems that the basic mathematical idea lies at the heart of many recent breakthroughs not only in machine learning but in quantum chemistry, quantum field theory and some theories of quantum gravity as well. In particular, matrix factorization can be straightforwardly extended to higher-order complex tensors (under the alias of Schmidt decomposition in the quantum computing world). Then the exact same idea can be applied to calculate quantum states in strongly interacting quantum many-body systems, using a variational approximation. Applying the decomposition repeatedly amounts to constructing a tensor network where the edges denote tensor contractions, and the number of latent dimensions corresponds to the bond dimension; in the physical sense, this process can also be viewed as a numerical renormalization technique. Computing ground states in such systems is essentially the same as "learning" the values in the tensors of the network, very much like training a deep learning model (see e.g. TensorFlow [119]; for a concrete application of deep learning to represent quantum many-body states, see e.g. [120]). Recently, one particular type of tensor network, the Multi-scale Entanglement Renormalization Ansatz (MERA), has been shown to correspond to a discrete version of an Anti-de Sitter (AdS) spacetime, with entanglement entropy corresponding to geodesics in the network, suggesting a deep connection coined the AdS/MERA correspondence, which also connects the theory to conformal field theories. Verlinde extended this notion to de Sitter space, hinting at the holographic emergence of spacetime from quantum entanglement [121].

4.1.2 Earlier works

The basic matrix factorization model faces multiple challenges. First, as the values in the factors can get arbitrarily large, the basic model is prone to overfitting; second, it is often the case that one has additional information about the row- and column-entities alongside their interaction values, which should be incorporated in the objective function. The overfitting issue can be remedied in two ways: either by introducing a generative model with appropriate parameter priors, or by casting the problem (4.1) into the regularized risk minimization framework, i.e. by adding a regularization term. In fact, the latter approach can be seen as a MAP approximation to the former, as we have illustrated in Example 1.2.5. The probabilistic framework also presents opportunities to incorporate additional information ("side information") about the entities. In this Section, we overview earlier approaches to probabilistic matrix factorization and the utilization of side information.

4.1.2.1 Bayesian Probabilistic Matrix Factorization

Probabilistic Matrix Factorization (PMF) by Salakhutdinov et al. [122] is the probabilistic version of the basic MF model. It can also be seen as a "two-sided" extension of the Bayesian linear regression model; formally, the dot products of the columns of the factor matrices, $u_i^T v_j$, play the role of $w^T \phi(x)$, but in this case in a symmetric fashion, i.e. both sides of the product are learned, not just the weight vector $w$. In particular, PMF places a Gaussian noise model on $R$:

$$R_{ij} = u_i^T v_j + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(\varepsilon \mid 0, \beta^{-1}),$$

and treats the columns of $U$ and $V$ as zero-mean multivariate normal variables. Similarly to Examples 1.2.4 and 1.2.5, the model can be fully specified as

$$p(R \mid U, V) = \prod_i \prod_j \mathcal{N}\!\left(R_{ij} \mid u_i^T v_j, \beta^{-1}\right)^{I_{ij}}, \tag{4.2}$$
$$p(U \mid \alpha_u) = \prod_i p(u_i \mid 0, \alpha_u^{-1} I), \tag{4.3}$$
$$p(V \mid \alpha_v) = \prod_j p(v_j \mid 0, \alpha_v^{-1} I), \tag{4.4}$$

where $I_{ij} = 1$ whenever $R_{ij}$ is known and $0$ otherwise. Using the negative log-likelihood, the MAP solution corresponds to solving the regularized risk minimization problem

$$\min_{U,V}\; \frac{1}{2} \sum_i \sum_j I_{ij} \left(R_{ij} - u_i^T v_j\right)^2 + \frac{\alpha_u}{2\beta} \sum_i \|u_i\|^2 + \frac{\alpha_v}{2\beta} \sum_j \|v_j\|^2.$$
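The correspondence between this regularized objective and the MAP solution can be checked numerically. The sketch below is illustrative only: it assumes the regularization weights $\alpha_u/(2\beta)$ and $\alpha_v/(2\beta)$ as written above, and uses arbitrary sizes and hyperparameters. It evaluates the negative log-joint of (4.2)–(4.4) and verifies that it equals $\beta$ times the regularized risk plus a constant that does not depend on $U$ and $V$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
I, J, L = 8, 6, 3
beta, alpha_u, alpha_v = 4.0, 2.0, 2.0

U = rng.normal(size=(L, I))
V = rng.normal(size=(L, J))
R = U.T @ V + rng.normal(scale=beta**-0.5, size=(I, J))
Iij = rng.random((I, J)) < 0.5                        # observation indicators

# Negative log-joint of the PMF model (4.2)-(4.4)
nll = -np.sum(Iij * norm.logpdf(R, loc=U.T @ V, scale=beta**-0.5))
nlp = (-norm.logpdf(U, scale=alpha_u**-0.5).sum()
       - norm.logpdf(V, scale=alpha_v**-0.5).sum())

# Regularized risk objective as written in the text
risk = (0.5 * np.sum(Iij * (R - U.T @ V)**2)
        + alpha_u / (2 * beta) * np.sum(U**2)
        + alpha_v / (2 * beta) * np.sum(V**2))

# Constant Gaussian log-normalizers (independent of U and V)
const = (0.5 * np.log(2 * np.pi / beta) * Iij.sum()
         + 0.5 * np.log(2 * np.pi / alpha_u) * U.size
         + 0.5 * np.log(2 * np.pi / alpha_v) * V.size)

print(np.isclose(nll + nlp, beta * risk + const))     # True
```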

The objective function is optimized using gradient descent. PMF was later extended to a hierarchical Bayesian model (BPMF) by putting Normal–Wishart hyperpriors on the parameters of PMF [123]:

$$p(U \mid \mu_u, S_u) = \prod_i \mathcal{N}\!\left(u_i \mid \mu_u, S_u^{-1}\right), \tag{4.5}$$
$$p(\mu_u, S_u \mid \mu_{u0}, \kappa_u, S_{u0}, \nu_u) = \mathcal{NW}(\mu_u, S_u \mid \mu_{u0}, \kappa_u, S_{u0}, \nu_u), \tag{4.6}$$

and symmetrically for $V$. Since this model is conditionally conjugate, Gibbs sampling can be utilized to obtain the posteriors of $U$ and $V$, and all updates are available in closed form.


Figure 4.3. Graphical models corresponding to PMF (left) and BPMF (right). Nodes represent random variables and edges represent conditional dependence relations.

In particular, by the usual update rules for the conjugate priors (which we will utilize later), the conditionals are given by

$$p(U \mid \Theta \setminus U) = \prod_i \mathcal{N}\!\left(u_i \mid \mu_i, S_i^{-1}\right), \qquad S_i = S_u + \beta \sum_j I_{ij}\, v_j v_j^T, \tag{4.7}$$
$$\mu_i = S_i^{-1} \Big( \beta \sum_j I_{ij} R_{ij} v_j + S_u \mu_u \Big), \tag{4.8}$$
$$p(\mu_u, S_u \mid \Theta \setminus \{\mu_u, S_u\}) = \mathcal{NW}(\mu_u, S_u \mid \mu_0', \kappa', S_0', \nu'), \qquad \mu_0' = \frac{\kappa_{u0} \mu_{u0} + I \bar{u}}{\kappa_{u0} + I}, \tag{4.9}$$
$$\kappa' = \kappa_{u0} + I, \tag{4.10}$$
$$S_0'^{-1} = S_{u0}^{-1} + I \bar{U} + \frac{I \kappa_{u0}}{I + \kappa_{u0}} \left(\mu_{u0} - \bar{u}\right) \left(\mu_{u0} - \bar{u}\right)^T, \tag{4.11}$$
$$\nu' = \nu_{u0} + I, \tag{4.12}$$
$$\bar{u} = \frac{1}{I} \sum_i u_i, \qquad \bar{U} = \frac{1}{I} \sum_i u_i u_i^T,$$

where $\Theta$ denotes all variables in the model. $V$ can be updated similarly. PMF and BPMF are illustrated in Figure 4.3.
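As an illustration of these closed-form conditionals, the following sketch draws every column of $U$ from the Gaussian conditional (4.7)–(4.8) given $V$ and the current $\mu_u$, $S_u$. Sizes, hyperparameters and the fixed hyperprior-level values are arbitrary; this is a single conditional update, not the full BPMF sampler.

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, L = 12, 9, 3
beta = 2.0

R = rng.normal(size=(I, J))
Iij = rng.random((I, J)) < 0.6            # observation indicators
V = rng.normal(size=(L, J))
mu_u = np.zeros(L)
S_u = np.eye(L)                           # current precision from the hyperprior level

def sample_u_column(i):
    """Draw u_i from its Gaussian conditional, Eqs. (4.7)-(4.8)."""
    obs = np.where(Iij[i])[0]                         # targets j with known R_ij
    Vo = V[:, obs]                                    # (L, |obs|)
    S_i = S_u + beta * Vo @ Vo.T                      # conditional precision (4.7)
    mu_i = np.linalg.solve(S_i, beta * Vo @ R[i, obs] + S_u @ mu_u)   # (4.8)
    # Sample from N(mu_i, S_i^{-1}) via the Cholesky factor of the precision
    z = rng.normal(size=L)
    return mu_i + np.linalg.solve(np.linalg.cholesky(S_i).T, z)

U = np.column_stack([sample_u_column(i) for i in range(I)])
print(U.shape)                            # (L, I)
```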

4.1.2.2 Neighborhood Regularized Logistic Matrix Factorization

Neighborhood Regularized Logistic Matrix Factorization (NRLMF) by Liu et al. [89] also uses the MAP approximation, but with a binary interaction matrix, which enables using the logistic matrix factorization technique by Johnson [124]. Instead of directly modeling real-valued interaction scores, logistic matrix factorization estimates the probability of binary interactions between the entities. The distribution of $R$ is given by an augmented version of the Bernoulli distribution which scales positive observations by a factor $c$:

$$p(R \mid U, V) = \prod_i \prod_j \sigma\!\left(u_i^T v_j\right)^{c R_{ij}} \left(1 - \sigma\!\left(u_i^T v_j\right)\right)^{1 - R_{ij}},$$

where $\sigma$ is the logistic sigmoid function (cf. Example 1.2.6). Similarly to PMF, NRLMF puts a zero-mean Gaussian prior on the columns of the factors as in (4.3), (4.4). Building on the work of Belkin et al. [125] on manifold regularization and its application to DTI prediction by Xia et al. [126], NRLMF utilizes the Laplacian regularizer to give the regularized risk minimization problem

$$\min_{U,V}\; \sum_i \sum_j \left[ \left(1 + c R_{ij} - R_{ij}\right) \ln\!\left(1 + \exp\!\left(u_i^T v_j\right)\right) - c R_{ij}\, u_i^T v_j \right] + \operatorname{tr}\!\left( U^T \Big( \tfrac{\alpha_u}{2} I + \tfrac{\gamma_u}{2} L_u \Big) U \right) + \operatorname{tr}\!\left( V^T \Big( \tfrac{\alpha_v}{2} I + \tfrac{\gamma_v}{2} L_v \Big) V \right),$$

where $L_u$, $L_v$ are the graph Laplacian matrices computed from a priori similarity matrices $K_u$ and $K_v$. In particular, $L_u = D_u - K_u$, where $D_u$ is a diagonal matrix containing the column sums of the symmetrized similarity matrix $K_u$ on its diagonal. Intuitively, this forces the latent representations to follow a priori defined similarities. In the NRLMF algorithm, the Laplacian is truncated to contain only a small number of nearest neighbors, which has been found to improve predictive performance.

After the training is complete, interaction probabilities are either estimated using the latent factors as $\sigma(u_i^T v_j)$ or, if no positive observations are present for the drug and target corresponding to a particular interaction, a weighted average is calculated from the latent representations of the most similar drugs and targets.
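The construction of the neighborhood-truncated Laplacian and the penalty it induces on the latent representations can be sketched as follows. This uses random similarities and an arbitrary $k$; the exact truncation and symmetrization details of the NRLMF implementation may differ.

```python
import numpy as np

rng = np.random.default_rng(3)
I, L, k = 15, 4, 5

# A priori drug-drug similarity matrix (symmetrized), e.g. from chemical structure
K = rng.random((I, I)); K = 0.5 * (K + K.T); np.fill_diagonal(K, 1.0)

# Keep only the k nearest neighbours of each row (truncated similarity)
N = np.zeros_like(K)
for i in range(I):
    nn = np.argsort(K[i])[::-1][1:k + 1]         # k most similar, excluding i itself
    N[i, nn] = K[i, nn]
N = 0.5 * (N + N.T)                              # re-symmetrize after truncation

# Graph Laplacian L_u = D_u - K_u, with D_u holding the column sums of N
D = np.diag(N.sum(axis=0))
L_u = D - N

# The Laplacian penalty (written here for column-wise latent vectors as tr(U L_u U^T))
# equals a weighted sum of squared distances between representations of similar drugs
U = rng.normal(size=(L, I))
reg = np.trace(U @ L_u @ U.T)
pairwise = 0.5 * sum(N[i, j] * np.sum((U[:, i] - U[:, j])**2)
                     for i in range(I) for j in range(I))
print(np.isclose(reg, pairwise))                 # True: same quantity, two forms
```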

Remark. This regularization scheme has been extensively studied under the alias of manifold regularization, which has its own version of the Representer Theorem. This regularization approach utilizes the Dirichlet energy for regularizing $f$:

$$E(f) = \int_X \langle \nabla f, \nabla f \rangle \, dP(x) = \int_X \langle f, \Delta_K f \rangle \, dx.$$

The graph Laplacian $L$ is a discrete version of the Laplace–Beltrami operator $\Delta_K$, which one gets e.g. by looking for weak solutions of the Poisson equation using the Galerkin method. In particular, $f^T L f$ is just a discrete version of the Dirichlet energy of a function $f$ defined at some vertices indexed by $i$:

$$\frac{1}{2} \sum_i \sum_j K_{ij} (f_i - f_j)^2 = \sum_i f_i^2 \sum_j K_{ij} - \sum_i \sum_j K_{ij} f_i f_j = f^T L f \sim \langle f, \Delta_K f \rangle, \tag{4.13}$$

where $L = D - K$ as before. The Laplace–Beltrami operator, and its generalization to differential forms on Riemannian manifolds, the Laplace–de Rham operator, always has interesting things to tell us about the shape of the underlying manifold. Harmonic forms, i.e. those for which $\Delta_K \omega = 0$, are in direct correspondence with de Rham cohomology groups, which detect "holes" in the manifold, very much like their discrete counterparts in topological data analysis mentioned in Section 1.2. Spectral properties of the graph Laplacian are also heavily used in machine learning, such as in diffusion kernel methods [127] and spectral clustering [128]; in the continuous realm, a related interesting question has been raised by Kac as "Can one hear the shape of a drum?" [129] (answer: in general, no, as the shape is not uniquely determined by the spectrum of the Laplace operator).
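For completeness, the intermediate step behind (4.13), assuming a symmetric $K$ with row sums collected in the diagonal matrix $D$, is

$$\frac{1}{2}\sum_{i,j} K_{ij}(f_i - f_j)^2 = \frac{1}{2}\sum_{i,j} K_{ij}\left(f_i^2 + f_j^2 - 2 f_i f_j\right) = \sum_i f_i^2 \sum_j K_{ij} - \sum_{i,j} K_{ij} f_i f_j = f^T D f - f^T K f = f^T L f.$$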

4.1.2.3 Kronecker regularized least squares approach with Multiple Kernel Learning

Instead of matrix factorization, the KronRLS-MKL algorithm by Nascimento et al. [76] takes the pairwise kernel approach (Section 1.3.3.1). It starts from the Regularized Least Squares (RLS) problem which we already described in Example 1.2.5, and uses the RLS method in conjunction with a pairwise kernel, constructed from the combined drug–drug kernel $K_d = \sum_m \gamma_m K_d^m$ and the combined target–target kernel $K_t$ using their Kronecker product. To make this process feasible, they apply a computational trick involving their eigendecompositions. The decision function takes the form

$$f(X) = K_t Q_t \left(\Lambda_d \otimes \Lambda_t + \lambda I\right)^{-1} \operatorname{vec}\!\left\{ Q_t^T Y^T Q_d \right\} K_d Q_d^T,$$

where $X$ is the matrix of samples, $Y$ is the matrix of their respective values, $K_d = Q_d \Lambda_d Q_d^T$ is the eigendecomposition of the drug–drug kernel, and similarly for the target–target kernel.

They propose a two-step optimization procedure which solves the corresponding regularized risk minimization problem in an alternating fashion, i.e. with respect to the dual variables and the kernel weights separately.
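The eigendecomposition trick that makes the Kronecker system tractable can be sketched generically as follows. This is the standard identity for Kronecker-structured regularized least squares under stated assumptions, not the exact KronRLS-MKL decision function or its MKL weight updates; the kernels here are random positive definite stand-ins.

```python
import numpy as np

rng = np.random.default_rng(4)
nd, nt, lam = 7, 5, 0.5

def random_psd(n):
    A = rng.normal(size=(n, n))
    return A @ A.T + n * np.eye(n)

Kd, Kt = random_psd(nd), random_psd(nt)          # drug-drug / target-target kernels
Y = rng.normal(size=(nt, nd))                    # labels arranged targets x drugs

# Eigendecompositions of the two small kernels
ld, Qd = np.linalg.eigh(Kd)
lt, Qt = np.linalg.eigh(Kt)

# Solve (Kd (x) Kt + lam*I) vec(A) = vec(Y) without forming the Kronecker product
C = Qt.T @ Y @ Qd
A = Qt @ (C / (np.outer(lt, ld) + lam)) @ Qd.T
F_fast = Kt @ A @ Kd                             # fitted values, vec(F) = (Kd (x) Kt) vec(A)

# Reference solution with the explicit Kronecker system (feasible only for tiny sizes)
K = np.kron(Kd, Kt)
a = np.linalg.solve(K + lam * np.eye(nd * nt), Y.reshape(-1, order="F"))
F_ref = (K @ a).reshape(nt, nd, order="F")
print(np.allclose(F_fast, F_ref))                # True
```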

4.1.2.4 Kernelized Bayesian Matrix Factorization with twin Multiple Kernel Learning

Kernelized Bayesian Matrix Factorization with twin Multiple Kernel Learning (KBMF2MKL) by Gönen et al. [33] proposes a fully Bayesian probabilistic model consisting of four modules.

1. The kernel-based dimensionality reduction module utilizes the KPCA trick and learns a projection matrix which projects the columns of the kernel matrix into a lower-dimensional space.

We overview the module focusing on only one side (e.g. drugs); the "twin" side, as indicated by the name, is formulated exactly the same way. The module utilizes conjugate priors, i.e. the elements of the projection matrix $A$ come from a zero-mean Normal distribution with a Gamma prior on the precision:

$$p(A_{ik} \mid \beta_{ik}) = \mathcal{N}\!\left(A_{ik} \mid 0, \beta_{ik}^{-1}\right), \qquad p(\beta_{ik} \mid a, b) = \mathcal{G}a(\beta_{ik} \mid a, b),$$

and the projections are given by

$$p(P_{m,il} \mid A, K) = \mathcal{N}\!\left(P_{m,il} \mid K_{m,l}^T a_i, \sigma^2\right),$$

where $a_i$ is the $i$th column of the projection matrix and $K_{m,l}$ is the $l$th column of the $m$th kernel matrix.

2. The MKL module introduces weights $\gamma$ on the projections coming from different kernels. The weights also follow zero-mean Normal–Gamma distributions

$$p(\gamma_m \mid \beta_m') = \mathcal{N}\!\left(\gamma_m \mid 0, \beta_m'^{-1}\right), \qquad p(\beta_m' \mid a', b') = \mathcal{G}a(\beta_m' \mid a', b'),$$

where $a'$ and $b'$ can be used to control kernel-level sparsity. The weighted projections $H$ are given by

$$p(H_{il} \mid P, \gamma) = \mathcal{N}\!\Big(H_{il} \,\Big|\, \sum_m \gamma_m P_{m,il},\, \sigma'^2\Big).$$

3. The matrix factorization module computes predicted outputs as inner products of the weighted projections:

$$p(f_{ij} \mid H, H') = \mathcal{N}\!\left(f_{ij} \mid h_i^T h_j', 1\right),$$

where $h_i$ and $h_j'$ are columns of the matrices containing the weighted projections, coming from different sides of the model.

4. The classification module gives $y_{ij} = +1$ whenever $f_{ij} > \nu$ and $-1$ otherwise.

Since the model is conditionally conjugate, Gibbs sampling or mean-field variational inference can be straightforwardly utilized, as the updates are available in closed form. For computational efficiency, Gönen et al. choose the variational method.
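The four modules can be read as a generative pipeline. The sketch below samples forward through a simplified version of it (random stand-in kernels, the Gamma hyperpriors on the precisions replaced by fixed noise scales, arbitrary sizes); it is meant only to show how the pieces connect, not to reproduce Gönen et al.'s implementation.

```python
import numpy as np

rng = np.random.default_rng(5)
Nd, Nt, M, D, sigma, sigma_h, nu = 10, 8, 3, 4, 0.1, 0.1, 0.0

def psd(n):
    X = rng.normal(size=(n, n))
    return X @ X.T / n + np.eye(n)

# Side-information kernels for drugs and targets (random PSD stand-ins)
Kd = np.stack([psd(Nd) for _ in range(M)])       # (M, Nd, Nd)
Kt = np.stack([psd(Nt) for _ in range(M)])

# 1) Dimensionality reduction: per-kernel projections P_m = A^T K_m + noise
Ad, At = rng.normal(size=(Nd, D)), rng.normal(size=(Nt, D))
Pd = np.stack([Ad.T @ Kd[m] + sigma * rng.normal(size=(D, Nd)) for m in range(M)])
Pt = np.stack([At.T @ Kt[m] + sigma * rng.normal(size=(D, Nt)) for m in range(M)])

# 2) MKL: kernel weights gamma combine the projections, H = sum_m gamma_m P_m
gamma = rng.normal(size=M)
Hd = np.tensordot(gamma, Pd, axes=1) + sigma_h * rng.normal(size=(D, Nd))
Ht = np.tensordot(gamma, Pt, axes=1) + sigma_h * rng.normal(size=(D, Nt))

# 3) Matrix factorization: scores are inner products of the weighted projections
F = Hd.T @ Ht + rng.normal(size=(Nd, Nt))

# 4) Classification: threshold the scores at nu
Y = np.where(F > nu, 1, -1)
print(Y.shape, Y.mean())
```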

4.1.2.5 Macau

Macau by Simm et al. [69] extends the matrix factorization framework to tensor factorization by replacing $R$ with a higher-order tensor. The distribution of $R$ is given by

$$p(R \mid U_1, U_2, \ldots, U_M, \alpha) = \prod_{R_I\ \text{known}} \mathcal{N}\!\left(R_I \mid \delta(u_I), \alpha^{-1}\right),$$

where $I$ is a multi-index $(i_1, i_2, \ldots, i_M)$, $\delta$ is a multilinear map which sums the diagonal components of $u_I = u_{i_1} \otimes \cdots \otimes u_{i_M}$, and $u_{i_m}$ is the $i_m$th column of the factor $U_m$. The latent representations come from a hierarchical Bayesian linear model

$$p(u_{i_m} \mid x_i, W, b, \Lambda) = \mathcal{N}\!\left(u_{i_m} \mid W^T \phi_m(x_i) + b, \Lambda^{-1}\right),$$

with a Normal–Wishart prior on the parameters $b$ and $\Lambda$, as in Equation (4.6). Intuitively, this ensures that whenever there are only a handful of observations (or none at all) in the interaction matrix corresponding to a particular entity, its latent representation is heavily influenced by the features $\phi_m(x)$; their contribution is more modest if there are many observations. The same idea can be utilized to incorporate features about the interactions themselves, coined "relation features". Additionally, Macau imposes a zero-mean Gaussian prior on the columns of the weight matrix $W$ as

$$p(W \mid \Lambda, \lambda) = \mathcal{N}\!\left(\operatorname{vec}(W) \mid 0, (\Lambda \otimes \lambda I)^{-1}\right), \qquad p(\lambda \mid a, b) = \mathcal{G}a(\lambda \mid a, b).$$

Similarly to BPMF, Macau uses Gibbs sampling for full Bayesian inference.
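The effect of the side-information-driven prior can be sketched in isolation (arbitrary sizes and a fixed precision matrix; the relation features, the Normal–Wishart level and the Gibbs updates of the actual Macau sampler are omitted): each latent vector is drawn around a mean predicted linearly from the entity's features.

```python
import numpy as np

rng = np.random.default_rng(6)
N, L, F_dim = 12, 4, 6                       # entities, latent dim, feature dim

phi = rng.normal(size=(N, F_dim))            # side-information features phi_m(x_i)
W = rng.normal(scale=0.3, size=(F_dim, L))   # link matrix from features to latents
b = rng.normal(scale=0.1, size=L)
Lam = 4.0 * np.eye(L)                        # precision of the latent prior
cov = np.linalg.inv(Lam)

# u_{i_m} ~ N(W^T phi_m(x_i) + b, Lambda^{-1}): the prior mean is feature-driven
U = np.stack([rng.multivariate_normal(W.T @ phi[i] + b, cov) for i in range(N)],
             axis=1)

# With few or no observed interactions for an entity, its posterior stays close to
# this feature-driven mean; with many observations the likelihood dominates instead.
print(U.shape)                               # (L, N)
```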

4.2 Variational Bayesian Multiple Kernel Logistic Matrix Factorization