
3. The matrix factorization module computes predicted outputs as inner products of the weighted projections:

$$p(f_{ij} \mid \mathbf{H}, \mathbf{H}') = \mathcal{N}\bigl(f_{ij} \mid \mathbf{h}_i^T \mathbf{h}'_j,\, 1\bigr),$$

where $\mathbf{h}_i$ and $\mathbf{h}'_j$ are columns of the matrices containing the weighted projections, coming from different sides of the model.

4. The classification module gives $y_{ij} = +1$ whenever $f_{ij} > \nu$ and $-1$ otherwise; a small numerical sketch of these two modules follows below.

Since the model is conditionally conjugate, Gibbs sampling or mean field variational inference can be utilized straightforwardly, as the updates are available in closed form. For computational efficiency, Gönen et al. choose the variational method.
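As a concrete illustration of the last two modules, the following minimal numpy sketch (hypothetical variable names; $\mathbf{H}$ and $\mathbf{H}'$ stand in for the weighted projections produced by the earlier modules) computes the predicted scores and the thresholded labels:

```python
import numpy as np

# Hypothetical weighted projections from the two sides of the model:
# columns h_i of H and h'_j of H_prime live in a shared low-dimensional subspace.
rng = np.random.default_rng(0)
rank, n_rows, n_cols = 5, 10, 8
H = rng.normal(size=(rank, n_rows))
H_prime = rng.normal(size=(rank, n_cols))
nu = 0.0  # decision threshold

# Matrix factorization module: predicted outputs f_ij = h_i^T h'_j.
F = H.T @ H_prime            # shape (n_rows, n_cols)

# Classification module: y_ij = +1 whenever f_ij > nu, and -1 otherwise.
Y = np.where(F > nu, 1, -1)
```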

4.1.2.5 Macau

Macau by Simm et al. [69] extends the matrix factorization framework to tensor factorization by replacing $\mathbf{R}$ with a higher-order tensor. The distribution of $\mathbf{R}$ is given by

$$p(\mathbf{R} \mid \mathbf{U}_1, \mathbf{U}_2, \ldots, \mathbf{U}_M, \alpha) = \prod_{\mathbf{R}_I\ \mathrm{known}} \mathcal{N}\bigl(\mathbf{R}_I \mid \delta(\mathbf{u}_I),\, \alpha^{-1}\bigr),$$

where $I$ is a multi-index $(i_1, i_2, \ldots, i_M)$, $\delta$ is a multilinear map which sums the diagonal components of $\mathbf{u}_I = \mathbf{u}_{i_1} \otimes \cdots \otimes \mathbf{u}_{i_M}$, and $\mathbf{u}_{i_m}$ is the $i_m$th column of the factor $\mathbf{U}_m$. The latent representations come from a hierarchical Bayesian linear model

$$p(\mathbf{u}_{i_m} \mid \mathbf{x}_i, \mathbf{W}, \mathbf{b}, \boldsymbol{\Lambda}) = \mathcal{N}\bigl(\mathbf{u}_{i_m} \mid \mathbf{W}^T \phi_m(\mathbf{x}_i) + \mathbf{b},\, \boldsymbol{\Lambda}^{-1}\bigr),$$

with a Normal–Wishart prior on the parameters $\mathbf{b}$ and $\boldsymbol{\Lambda}$, as in Equation (4.6). Intuitively, this ensures that whenever there are only a handful of observations (or none at all) in the interaction matrix corresponding to a particular entity, the latent representations are heavily influenced by the features $\phi_m(\mathbf{x})$; their contribution is more modest when there are many observations. The same idea can be used to incorporate features about the interactions themselves, coined "relation features". Additionally, Macau imposes a zero-mean Gaussian prior on the columns of the weight matrix $\mathbf{W}$ as

$$p(\mathbf{W} \mid \boldsymbol{\Lambda}, \lambda) = \mathcal{N}\bigl(\mathrm{vec}(\mathbf{W}) \mid \mathbf{0},\, (\boldsymbol{\Lambda} \otimes \lambda\mathbf{I})^{-1}\bigr), \qquad p(\lambda \mid a, b) = \mathrm{Ga}(\lambda \mid a, b).$$

Similarly to BPMF, Macau uses Gibbs sampling for full Bayesian inference.
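To make the multilinear map $\delta$ concrete, the toy sketch below (an illustration in numpy, not Macau's actual implementation) evaluates $\delta(\mathbf{u}_I)$ for a single multi-index by summing, over the latent dimensions, the elementwise product of the selected factor columns:

```python
import numpy as np

def delta(factors, multi_index):
    """delta(u_I): sum of the diagonal components of u_{i_1} x ... x u_{i_M}.

    factors: list of M factor matrices U_m, each with shape (L, dim_m)
    multi_index: tuple (i_1, ..., i_M) selecting one column from each factor
    """
    # Elementwise product of the selected latent columns, then sum over L.
    prod = np.ones(factors[0].shape[0])
    for U_m, i_m in zip(factors, multi_index):
        prod = prod * U_m[:, i_m]
    return prod.sum()

# Toy example with a 3-mode tensor and L = 4 latent dimensions.
rng = np.random.default_rng(1)
U1, U2, U3 = (rng.normal(size=(4, d)) for d in (6, 5, 3))
print(delta([U1, U2, U3], (2, 0, 1)))  # predicted mean of R[2, 0, 1]
```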

4.2 Variational Bayesian Multiple Kernel Logistic Matrix Factorization

[Figure 4.4: plate diagram with drug-side nodes $a_u$, $b_u$, $\boldsymbol{\gamma}_u$, $\alpha_u$, $\mathbf{K}^n_u$, $\mathbf{U}$, $\mathbf{m}_u$ (plate over $n$), target-side nodes $a_v$, $b_v$, $\boldsymbol{\gamma}_v$, $\alpha_v$, $\mathbf{K}^m_v$, $\mathbf{V}$, $\mathbf{m}_v$ (plate over $m$), and the observed interaction matrix $\mathbf{R}$ with the importance weight $c$.]

Figure 4.4. Probabilistic graphical representation of the Bayesian multiple kernel logistic matrix factorization model.

4.2.1 Probabilistic model

Let $\mathbf{R} \in \{0,1\}^{I \times J}$ denote the matrix of interactions, where $R_{ij} = 1$ indicates a known interaction between the $i$th drug and the $j$th target. As in Bayesian logistic regression, we put a Bernoulli distribution on each $R_{ij}$ with parameter $\sigma(\mathbf{u}_i^T \mathbf{v}_j)$, where $\sigma$ is the logistic sigmoid function and $\mathbf{u}_i$, $\mathbf{v}_j$ are the $i$th and $j$th columns of the respective factor matrices $\mathbf{U} \in \mathbb{R}^{L \times I}$ and $\mathbf{V} \in \mathbb{R}^{L \times J}$. One can think of $\mathbf{u}_i$ and $\mathbf{v}_j$ as $L$-dimensional latent representations of the $i$th drug and the $j$th target, and the a posteriori probability of an interaction between them is modeled by $\sigma(\mathbf{u}_i^T \mathbf{v}_j)$.

Similarly to NRLMF, we utilize an augmented version of the Bernoulli distribution parameterized by $c \geq 1$, which assigns higher importance to positive examples. As we mentioned in Section 4.1.2.2, NRLMF uses a post-training weighted average to infer interactions corresponding to empty rows and columns in $\mathbf{R}$ (i.e., these would have to be estimated without using any corresponding observations).

We account for these cases by introducing variables $\mathbf{m}_u \in \{0,1\}^I$, $\mathbf{m}_v \in \{0,1\}^J$ indicating whether the row or column is empty. In such cases, we only want to use the side information in the prediction. Therefore, we formulate the conditional distribution of the interactions, given the factors and indicators, as

$$p(\mathbf{R} \mid \mathbf{U}, \mathbf{V}, c, \mathbf{m}_u, \mathbf{m}_v) \propto \prod_i \prod_j \Bigl[ \sigma\bigl(\mathbf{u}_i^T \mathbf{v}_j\bigr)^{cR_{ij}} \bigl(1 - \sigma\bigl(\mathbf{u}_i^T \mathbf{v}_j\bigr)\bigr)^{1 - R_{ij}} \Bigr]^{m^u_i m^v_j}. \tag{4.14}$$
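As a concrete reading of Equation (4.14), the following sketch (assuming numpy and hypothetical variable names) evaluates the log of the unnormalized likelihood, showing how $c$ up-weights positive entries and how the indicators $\mathbf{m}_u$, $\mathbf{m}_v$ switch off entries belonging to empty rows or columns:

```python
import numpy as np

def log_likelihood(R, U, V, c, m_u, m_v):
    """Log of the unnormalized likelihood in Eq. (4.14).

    R: (I, J) binary interaction matrix, U: (L, I), V: (L, J) factor matrices,
    c >= 1: importance weight, m_u: (I,), m_v: (J,) 0/1 indicator vectors.
    """
    Z = U.T @ V                                   # Z[i, j] = u_i^T v_j
    log_sig = -np.logaddexp(0.0, -Z)              # log sigma(z), numerically stable
    log_one_minus_sig = -np.logaddexp(0.0, Z)     # log(1 - sigma(z)) = log sigma(-z)
    per_entry = c * R * log_sig + (1.0 - R) * log_one_minus_sig
    mask = np.outer(m_u, m_v)                     # m_i^u * m_j^v
    return np.sum(mask * per_entry)
```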

Specifying priors on $\mathbf{U}$ and $\mathbf{V}$ presents an opportunity to incorporate multiple sources of side information. In particular, we can use a Gaussian distribution with a weighted linear combination of kernel matrices $\mathbf{K}^n$, $n = 1, 2, \ldots$, in the precision matrix, which is the probabilistic version of a combined Laplacian–$L_2$ regularization scheme (cf. Eq. (4.13)):

$$p(\mathbf{U} \mid \alpha_u, \boldsymbol{\gamma}_u, \mathbf{K}_u) \propto \prod_i \prod_k \exp\Bigl\{ -\tfrac{1}{2} \sum_n \gamma^u_n K^u_{n,ik} \|\mathbf{u}_i - \mathbf{u}_k\|^2 \Bigr\} \cdot \prod_i \exp\Bigl\{ -\tfrac{\alpha_u}{2} \|\mathbf{u}_i\|^2 \Bigr\}. \tag{4.15}$$

The prior on $\mathbf{V}$ can be written similarly. Similarly to KBMF2MKL, to automate the learning of the optimal values of the kernel weights $\gamma^u_n$, we impose Gamma priors on them:

$$p(\gamma^u_n \mid a, b) = \frac{b^a (\gamma^u_n)^{a-1} e^{-b\gamma^u_n}}{\Gamma(a)}. \tag{4.16}$$
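The following sketch (illustrative only, with assumed array shapes) evaluates the log of the unnormalized prior in Equation (4.15) for a given set of kernel weights, making explicit that the first factor is a weighted graph-Laplacian-style penalty over the kernels and the second an $L_2$ penalty:

```python
import numpy as np

def log_prior_U(U, K_u, gamma_u, alpha_u):
    """Log of the unnormalized prior in Eq. (4.15).

    U: (L, I) factor matrix, K_u: (N, I, I) stack of kernel matrices,
    gamma_u: (N,) kernel weights, alpha_u: L2 precision.
    """
    # Squared distances ||u_i - u_k||^2 for all pairs (i, k).
    sq_norms = np.sum(U * U, axis=0)                       # (I,)
    D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * U.T @ U
    K_combined = np.tensordot(gamma_u, K_u, axes=1)        # sum_n gamma_n^u K_n^u
    laplacian_term = -0.5 * np.sum(K_combined * D)
    l2_term = -0.5 * alpha_u * np.sum(sq_norms)
    return laplacian_term + l2_term
```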

4.2.2 Variational approximation

In this section, we apply the machinery developed in Section 1.2.2.1. We are interested in the approximation to the posterior

$$p(\mathbf{U}, \mathbf{V}, \boldsymbol{\gamma}_u, \boldsymbol{\gamma}_v \mid \mathbf{R}, \mathbf{K}^n_u, a_u, b_u, \mathbf{K}^n_v, a_v, b_v, \alpha_u, \alpha_v, c)$$

by the variational distribution $q(\mathbf{U}, \mathbf{V}, \boldsymbol{\gamma}_u, \boldsymbol{\gamma}_v)$.

We can obtain such an approximation by maximizing a lower bound on the evidence

$$p(\mathbf{R}) = \int p(\mathbf{R} \mid \mathbf{U}, \mathbf{V})\, p(\mathbf{U} \mid \boldsymbol{\gamma}_u)\, p(\mathbf{V} \mid \boldsymbol{\gamma}_v)\, p(\boldsymbol{\gamma}_u)\, p(\boldsymbol{\gamma}_v)\, d\mathbf{U}\, d\mathbf{V}\, d\boldsymbol{\gamma}_{u,v},$$

where the integration runs over $\mathbf{U}$, $\mathbf{V}$, $\boldsymbol{\gamma}_{u,v}$ and we suppressed the hyperparameters for notational simplicity.

We utilize the mean field approach by introducing the factorized variational distribution

$$q(\mathbf{U}, \mathbf{V}, \boldsymbol{\gamma}_u, \boldsymbol{\gamma}_v) = q(\mathbf{U})\, q(\mathbf{V})\, q(\boldsymbol{\gamma}_u)\, q(\boldsymbol{\gamma}_v).$$

Applying the decomposition as in Equation (1.4),

$$\ln p(\mathbf{R}) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p),$$

where $\mathrm{KL}(\cdot\|\cdot)$ is the Kullback–Leibler divergence and $\mathcal{L}(\cdot)$ is the evidence lower bound. In particular,

$$\mathcal{L}(q) = \int q(\mathbf{U})\, q(\mathbf{V})\, q(\boldsymbol{\gamma}_u)\, q(\boldsymbol{\gamma}_v) \ln \frac{p(\mathbf{R}, \mathbf{U}, \mathbf{V}, \boldsymbol{\gamma}_u, \boldsymbol{\gamma}_v)}{q(\mathbf{U})\, q(\mathbf{V})\, q(\boldsymbol{\gamma}_u)\, q(\boldsymbol{\gamma}_v)}\, d\mathbf{U}\, d\mathbf{V}\, d\boldsymbol{\gamma}_{u,v},$$

which is the quantity we aim to maximize with respect to $q$, as it means an improved approximation to the posterior, as measured by the Kullback–Leibler divergence.

The optimal distribution satisfies

$$\ln q(\mathbf{U}) = \mathbb{E}_{\mathbf{V}, \boldsymbol{\gamma}_{u,v}}\bigl[\ln\{p(\mathbf{R} \mid \mathbf{U}, \mathbf{V})\, p(\mathbf{U} \mid \boldsymbol{\gamma}_u)\, p(\mathbf{V} \mid \boldsymbol{\gamma}_v)\, p(\boldsymbol{\gamma}_u)\, p(\boldsymbol{\gamma}_v)\}\bigr] + \mathrm{const.},$$

which is non-conjugate due to the form of $p(\mathbf{R} \mid \mathbf{U}, \mathbf{V})$, and therefore the integral is intractable. However, by using a Taylor approximation of the symmetrized logistic function (Jaakkola's bound [12, 87])

$$\sigma(z) \geq \tilde{\sigma}(z, \xi) = \sigma(\xi) \exp\Bigl\{ \frac{z - \xi}{2} - \frac{1}{2\xi}\Bigl(\sigma(\xi) - \frac{1}{2}\Bigr)\bigl(z^2 - \xi^2\bigr) \Bigr\},$$

we can lower bound $p(\mathbf{R} \mid \mathbf{U}, \mathbf{V})$ at the cost of introducing local variational parameters $\xi_{ij}$, yielding a new bound $\tilde{\mathcal{L}}$ which contains at most quadratic terms.
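For intuition, the minimal sketch below (purely illustrative, assuming numpy) evaluates $\ln\tilde{\sigma}(z,\xi)$ and checks numerically that it never exceeds $\ln\sigma(z)$ and is tight at $z = \xi$:

```python
import numpy as np

def log_sigmoid(z):
    return -np.logaddexp(0.0, -z)

def lam(xi):
    # lambda(xi) = (sigma(xi) - 1/2) / (2 xi), the curvature of Jaakkola's bound
    return (1.0 / (1.0 + np.exp(-xi)) - 0.5) / (2.0 * xi)

def log_jaakkola_bound(z, xi):
    # ln sigma(z) >= ln sigma(xi) + (z - xi)/2 - lambda(xi) * (z^2 - xi^2)
    return log_sigmoid(xi) + 0.5 * (z - xi) - lam(xi) * (z**2 - xi**2)

z = np.linspace(-6, 6, 13)
xi = 2.5
assert np.all(log_jaakkola_bound(z, xi) <= log_sigmoid(z) + 1e-12)  # valid lower bound
assert np.isclose(log_jaakkola_bound(xi, xi), log_sigmoid(xi))      # tight at z = xi
```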

Collecting the terms quadratic and linear in $\mathbf{U}$ gives

$$
\begin{aligned}
\ln q(\mathbf{U}) &= \mathbb{E}_{\mathbf{V},\boldsymbol{\gamma}_{u,v}}\bigl[\ln\{p(\mathbf{R}\mid\mathbf{U},\mathbf{V})\,p(\mathbf{U}\mid\boldsymbol{\gamma}_u)\}\bigr] \\
&= -\frac{\mathbb{E}[\gamma_u]}{2}\sum_i\sum_k K_{ik}\,\|\mathbf{u}_i-\mathbf{u}_k\|^2 - \frac{\alpha_u}{2}\sum_i \mathbf{u}_i^T\mathbf{u}_i + c\sum_i\sum_j m^u_i m^v_j R_{ij}\,\mathbf{u}_i^T\,\mathbb{E}[\mathbf{v}_j] \\
&\qquad + \sum_i\sum_j \mathbb{E}_{\mathbf{V}}\Bigl[m^u_i m^v_j\bigl((c-1)R_{ij}+1\bigr)\ln\sigma\bigl(-\mathbf{u}_i^T\mathbf{v}_j\bigr)\Bigr] + \mathrm{const.} \\
&\geq -\frac{\mathbb{E}[\gamma_u]}{2}\sum_i\sum_k K_{ik}\,\|\mathbf{u}_i-\mathbf{u}_k\|^2 - \frac{\alpha_u}{2}\sum_i \mathbf{u}_i^T\mathbf{u}_i + c\sum_i\sum_j m^u_i m^v_j R_{ij}\,\mathbf{u}_i^T\,\mathbb{E}[\mathbf{v}_j] \\
&\qquad + \sum_i\sum_j m^u_i m^v_j\bigl((c-1)R_{ij}+1\bigr)\Bigl[-\frac{\mathbf{u}_i^T\,\mathbb{E}[\mathbf{v}_j]}{2} - \frac{1}{2\xi_{ij}}\Bigl(\sigma(\xi_{ij})-\frac{1}{2}\Bigr)\mathbf{u}_i^T\,\mathbb{E}\bigl[\mathbf{v}_j\mathbf{v}_j^T\bigr]\mathbf{u}_i\Bigr] \\
&= -\frac{1}{2}\operatorname{tr}\bigl(\mathbf{U}\mathbf{Q}_u\mathbf{U}^T\bigr) + \sum_i \mathbf{u}_i^T\Bigl(\sum_j \hat{R}_{ij}\hat{\xi}_{ij}\,\mathbb{E}\bigl[\mathbf{v}_j\mathbf{v}_j^T\bigr]\Bigr)\mathbf{u}_i + \sum_i \mathbf{u}_i^T\Bigl(\sum_j R'_{ij}\,\mathbb{E}[\mathbf{v}_j]\Bigr),
\end{aligned}
$$

where

$$
\mathbf{Q}_u = \frac{\mathbb{E}[\gamma_u]}{2}\bigl(\operatorname{diag}(\mathbf{K}_u^T\mathbf{1}) - \mathbf{K}_u\bigr) + \frac{\alpha_u}{2}\,\mathbf{I}, \qquad
\hat{\xi}_{ij} = -\frac{1}{2\xi_{ij}}\Bigl(\sigma(\xi_{ij})-\frac{1}{2}\Bigr),
$$
$$
\hat{R}_{ij} = m^u_i m^v_j\bigl((c-1)R_{ij}+1\bigr), \qquad
R'_{ij} = m^u_i m^v_j\,c\,R_{ij} - \tfrac{1}{2}\hat{R}_{ij}.
$$

Since this expression is quadratic in $\mathrm{vec}(\mathbf{U})$, we conclude that $q$ is Gaussian and the parameters can be found by completing the square. In particular,

$$q(\mathrm{vec}(\mathbf{U})) = \mathcal{N}\bigl(\mathrm{vec}(\mathbf{U}) \mid \boldsymbol{\phi}, \boldsymbol{\Lambda}^{-1}\bigr),$$

$$\boldsymbol{\Lambda} = \mathbf{Q}_u \otimes \mathbf{I} - 2\cdot\operatorname{blkdg}_i\Bigl(\sum_j \hat{R}_{ij}\hat{\xi}_{ij}\,\mathbb{E}\bigl[\mathbf{v}_j\mathbf{v}_j^T\bigr]\Bigr), \tag{4.17}$$

$$\boldsymbol{\phi} = \boldsymbol{\Lambda}^{-1}\operatorname{vec}_i\Bigl(\sum_j R'_{ij}\,\mathbb{E}[\mathbf{v}_j]\Bigr), \tag{4.18}$$

where $\operatorname{blkdg}_i$ denotes the operator creating an $L \cdot I \times L \cdot I$ block-diagonal matrix from $I$ blocks of size $L \times L$. Figure 4.5 illustrates the structure of the precision matrix $\boldsymbol{\Lambda}$. The variational update for $q(\mathbf{V})$ can be derived similarly. The most computationally intensive operation is computing

$$\mathbb{E}\bigl[\mathbf{v}_j\mathbf{v}_j^T\bigr] = \mathrm{Cov}(\mathbf{v}_j) + \mathbb{E}[\mathbf{v}_j]\,\mathbb{E}[\mathbf{v}_j]^T. \tag{4.19}$$

[Figure 4.5: the $i$th diagonal $L \times L$ block of $\boldsymbol{\Lambda}$, illustrated for $i = 3$, combines $Q_{33}\cdot\mathbf{I}$ with the term $-2\sum_j \hat{R}_{3j}\hat{\xi}_{3j}\,\mathbb{E}[\mathbf{v}_j\mathbf{v}_j^T]$; the off-diagonal blocks come from $\mathbf{Q}_u \otimes \mathbf{I}$ alone.]

Figure 4.5. Structure of the precision matrix $\boldsymbol{\Lambda}$ as indicated by Equation (4.17).
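A dense-algebra sketch of the $q(\mathbf{U})$ update in Equations (4.17)–(4.19) (illustrative only, assuming numpy; the variable names for the current moments of $q(\mathbf{V})$ and for $\hat{R}$, $R'$, $\hat{\xi}$ are hypothetical) could look as follows:

```python
import numpy as np

def update_q_U(Q_u, E_v, E_vvT, R_hat, R_prime, xi_hat):
    """Build Lambda and phi as in Eqs. (4.17)-(4.18) and return the new moments of q(U).

    Q_u: (I, I), E_v: (L, J) columnwise means E[v_j],
    E_vvT: (J, L, L) second moments E[v_j v_j^T] (Eq. 4.19),
    R_hat, R_prime, xi_hat: (I, J) matrices defined in the text.
    """
    I, J = R_hat.shape
    L = E_v.shape[0]
    # Block-diagonal correction: i-th L x L block is sum_j Rhat_ij * xihat_ij * E[v_j v_j^T].
    blocks = np.einsum('ij,jlm->ilm', R_hat * xi_hat, E_vvT)      # (I, L, L)
    Lam = np.kron(Q_u, np.eye(L))                                  # Q_u (x) I
    for i in range(I):
        Lam[i*L:(i+1)*L, i*L:(i+1)*L] -= 2.0 * blocks[i]
    # Linear term: vec_i( sum_j R'_ij E[v_j] ), stacked block by block (column order of U).
    b = (E_v @ R_prime.T).T.reshape(I * L)
    phi = np.linalg.solve(Lam, b)                                  # Eq. (4.18)
    cov = np.linalg.inv(Lam)                                       # dense inverse (see the remark below)
    E_U = phi.reshape(I, L).T                                      # (L, I) columnwise means E[u_i]
    return E_U, cov
```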

Remark. When computing $\boldsymbol{\phi}$, one can get away without explicitly inverting $\boldsymbol{\Lambda}$ by using the conjugate gradient (CG) algorithm for solving linear systems. In fact, the sparsity and special structure of $\boldsymbol{\Lambda}$ can be exploited to make the algorithm faster. However, computing $\mathbb{E}[\mathbf{v}_j\mathbf{v}_j^T]$ requires knowing the diagonal blocks of the inverse; later, Equation (4.22) needs all blocks. Unfortunately, the inverse usually comes out dense, which rules out inversion methods exploiting sparsity, e.g. those involving Givens rotations. Inverting matrices with this special structure remains an open question; we settle for a blocked Cholesky decomposition implemented on the GPU, which makes the algorithm applicable in practical dimensions.
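As the remark suggests, $\boldsymbol{\phi}$ alone can be obtained without forming $\boldsymbol{\Lambda}^{-1}$; below is a sketch using SciPy's conjugate gradient solver on a sparse representation of $\boldsymbol{\Lambda}$ (illustrative only; note that it does not provide the blocks of the inverse needed by Equations (4.19) and (4.22)):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def solve_phi_cg(Q_u, blocks, b, L):
    """Solve Lambda phi = b with CG, exploiting the structure of Eq. (4.17).

    Q_u: (I, I) matrix, blocks: (I, L, L) block-diagonal correction terms,
    b: (I*L,) right-hand side, L: latent dimension.
    """
    I = Q_u.shape[0]
    kron_part = sp.kron(sp.csr_matrix(Q_u), sp.identity(L), format='csr')
    block_part = sp.block_diag([-2.0 * blocks[i] for i in range(I)], format='csr')
    Lam = kron_part + block_part                      # sparse precision matrix
    phi, info = cg(Lam, b)                            # CG works since Lambda is SPD
    if info != 0:
        raise RuntimeError("CG did not converge")
    return phi
```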

The optimal value of the local variational parameters $\xi_{ij}$ can be computed by writing the expectation of the joint distribution in terms of $\boldsymbol{\xi}$ and setting its derivative to zero. In particular,

$$\tilde{\mathcal{L}}(\boldsymbol{\xi}) = \sum_i \sum_j \hat{R}_{ij}\Bigl[\ln\sigma(\xi_{ij}) - \frac{\xi_{ij}}{2} - \frac{1}{2\xi_{ij}}\Bigl(\sigma(\xi_{ij}) - \frac{1}{2}\Bigr)\Bigl(\mathbb{E}\bigl[(\mathbf{u}_i^T\mathbf{v}_j)^2\bigr] - \xi_{ij}^2\Bigr)\Bigr],$$

from which [4, 87]

$$\xi_{ij}^2 = \mathbb{E}\bigl[(\mathbf{u}_i^T\mathbf{v}_j)^2\bigr] = \bigl(\mathbb{E}[\mathbf{u}_i]^T\mathbb{E}[\mathbf{v}_j]\bigr)^2 + \sum_l \mathbb{E}[U_{li}]^2\,\mathbb{V}[V_{lj}] + \mathbb{V}[U_{li}]\,\mathbb{E}[V_{lj}]^2 + \mathbb{V}[U_{li}]\,\mathbb{V}[V_{lj}]. \tag{4.20}$$
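A minimal sketch of the $\boldsymbol{\xi}$ update in Equation (4.20), given the elementwise means and variances of the factors (hypothetical variable names, assuming numpy):

```python
import numpy as np

def update_xi(E_U, V_U, E_V, V_V):
    """xi_ij = sqrt(E[(u_i^T v_j)^2]) per Eq. (4.20).

    E_U, V_U: (L, I) elementwise means and variances of U,
    E_V, V_V: (L, J) elementwise means and variances of V.
    """
    mean_sq = (E_U.T @ E_V) ** 2                               # (E[u_i]^T E[v_j])^2
    cross = (E_U**2).T @ V_V + V_U.T @ (E_V**2) + V_U.T @ V_V  # variance contributions
    return np.sqrt(mean_sq + cross)                            # (I, J) matrix of xi_ij
```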

Since the model is conjugate with respect to the kernel weights, we can use the standard update formulas for the Gamma distribution:

$$q(\gamma^u_n) = \mathrm{Ga}(\gamma^u_n \mid a', b'), \qquad a' = a + \frac{I^2}{2}, \tag{4.21}$$

$$b' = b + \frac{1}{2}\,\mathbb{E}_{\mathbf{U}}\Bigl[\sum_i \sum_k K^u_{n,ik}\,\|\mathbf{u}_i - \mathbf{u}_k\|^2\Bigr] = b + \frac{1}{2}\sum_i \sum_k K^u_{n,ik}\Bigl(\mathbb{E}\bigl[\mathbf{u}_i^T\mathbf{u}_i\bigr] - 2\,\mathbb{E}\bigl[\mathbf{u}_i^T\mathbf{u}_k\bigr] + \mathbb{E}\bigl[\mathbf{u}_k^T\mathbf{u}_k\bigr]\Bigr). \tag{4.22}$$
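The Gamma update in Equations (4.21)–(4.22) amounts to accumulating expected squared distances under $q(\mathbf{U})$; the sketch below (illustrative, assuming numpy; as the remark above notes, the cross terms $\mathbb{E}[\mathbf{u}_i^T\mathbf{u}_k]$ require the off-diagonal blocks of the covariance of $\mathrm{vec}(\mathbf{U})$):

```python
import numpy as np

def update_q_gamma(K_u, E_U, Cov_U, a, b):
    """Update q(gamma_n^u) = Ga(a', b') following Eqs. (4.21)-(4.22).

    K_u: (N, I, I) stack of kernels, E_U: (L, I) columnwise means E[u_i],
    Cov_U: (I*L, I*L) covariance of vec(U), a, b: prior hyperparameters.
    """
    L, I = E_U.shape
    # E[u_i^T u_k] = E[u_i]^T E[u_k] + tr(Cov(u_i, u_k)); the trace uses the
    # (i, k) block of Cov_U, which is why all blocks of the inverse are needed.
    blocks = Cov_U.reshape(I, L, I, L)
    tr_cov = np.einsum('iljl->ij', blocks)
    E_inner = E_U.T @ E_U + tr_cov                             # (I, I) matrix of E[u_i^T u_k]
    diag = np.diag(E_inner)
    sq_dist = diag[:, None] + diag[None, :] - 2.0 * E_inner    # E[||u_i - u_k||^2]
    a_new = a + I**2 / 2.0                                     # Eq. (4.21)
    b_new = b + 0.5 * np.einsum('nik,ik->n', K_u, sq_dist)     # Eq. (4.22), one b' per kernel
    return a_new, b_new
```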

Algorithm 2. Pseudocode of the VB-MK-LMF algorithm.

Require:
  Data: $\mathbf{R} \in \{0,1\}^{I \times J}$, $\mathbf{K}^n_u \in \mathbb{R}^{I \times I}$, $\mathbf{K}^m_v \in \mathbb{R}^{J \times J}$.
  Parameters: $L, D \in \mathbb{N}$, $a_u, b_u, a_v, b_v, \alpha_u, \alpha_v, c \in \mathbb{R}_+$.

  Calculate $\mathbf{m}_u \in \{0,1\}^I$ and $\mathbf{m}_v \in \{0,1\}^J$. Initialize $q(\boldsymbol{\gamma}_u)$ and $q(\boldsymbol{\gamma}_v)$ using Eq. (4.16).
  Initialize $q(\mathbf{U})$ and $q(\mathbf{V})$ using Eq. (4.15).
  for $d$ in $1 \ldots D$ do
    Update Jaakkola's bound $\boldsymbol{\xi}$ using Eq. (4.20).
    Update $q(\mathbf{U})$ using Eqs. (4.17) and (4.18).
    Update Jaakkola's bound $\boldsymbol{\xi}$ using Eq. (4.20).
    Update $q(\mathbf{V})$ the same way as $q(\mathbf{U})$.
    Update $q(\boldsymbol{\gamma}_u)$ and $q(\boldsymbol{\gamma}_v)$ using Eqs. (4.21) and (4.22).
  return $q(\mathbf{U})$, $q(\mathbf{V})$, $q(\boldsymbol{\gamma}_u)$, $q(\boldsymbol{\gamma}_v)$.

4.3 A Bayesian matrix factorization model with non-random