
3. The matrix factorization module computes predicted outputs as inner products of the weighted projections:

$$p(f_{ij} \mid \mathbf{H}, \mathbf{H}') = \mathcal{N}\bigl(f_{ij} \mid \mathbf{h}_i^T \mathbf{h}'_j,\, 1\bigr),$$

where $\mathbf{h}_i$ and $\mathbf{h}'_j$ are columns of the matrices containing the weighted projections, coming from different sides of the model.

4. The classification module gives $y_{ij} = +1$ whenever $f_{ij} > \nu$ and $-1$ otherwise; a small numerical sketch of these two modules follows below.

Since the model is conditionally conjugate, Gibbs sampling or mean field variational inference can be utilized straightforwardly, as the updates are available in closed form. For computational efficiency, Gönen et al. choose the variational method.
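As a concrete illustration of the last two modules, the following minimal numpy sketch (hypothetical variable names; $\mathbf{H}$ and $\mathbf{H}'$ stand in for the weighted projections produced by the earlier modules) computes the predicted scores and the thresholded labels:

```python
import numpy as np

# Hypothetical weighted projections from the two sides of the model:
# columns h_i of H and h'_j of H_prime live in a shared low-dimensional subspace.
rng = np.random.default_rng(0)
rank, n_rows, n_cols = 5, 10, 8
H = rng.normal(size=(rank, n_rows))
H_prime = rng.normal(size=(rank, n_cols))
nu = 0.0  # decision threshold

# Matrix factorization module: predicted outputs f_ij = h_i^T h'_j.
F = H.T @ H_prime            # shape (n_rows, n_cols)

# Classification module: y_ij = +1 whenever f_ij > nu, and -1 otherwise.
Y = np.where(F > nu, 1, -1)
```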

4.1.2.5 Macau

Macau by Simm et al. [69] extends the matrix factorization framework to tensor factorization by replacing $\mathbf{R}$ with a higher-order tensor. The distribution of $\mathbf{R}$ is given by

$$p(\mathbf{R} \mid \mathbf{U}_1, \mathbf{U}_2, \ldots, \mathbf{U}_M, \alpha) = \prod_{\mathbf{R}_I\ \mathrm{known}} \mathcal{N}\bigl(\mathbf{R}_I \mid \delta(\mathbf{u}_I),\, \alpha^{-1}\bigr),$$

where $I$ is a multi-index $(i_1, i_2, \ldots, i_M)$, $\delta$ is a multilinear map which sums the diagonal components of $\mathbf{u}_I = \mathbf{u}_{i_1} \otimes \cdots \otimes \mathbf{u}_{i_M}$, and $\mathbf{u}_{i_m}$ is the $i_m$th column of the factor $\mathbf{U}_m$. The latent representations come from a hierarchical Bayesian linear model

$$p(\mathbf{u}_{i_m} \mid \mathbf{x}_i, \mathbf{W}, \mathbf{b}, \boldsymbol{\Lambda}) = \mathcal{N}\bigl(\mathbf{u}_{i_m} \mid \mathbf{W}^T \phi_m(\mathbf{x}_i) + \mathbf{b},\, \boldsymbol{\Lambda}^{-1}\bigr),$$

with a Normal–Wishart prior on the parameters $\mathbf{b}$ and $\boldsymbol{\Lambda}$, as in Equation (4.6). Intuitively, this ensures that whenever there are only a handful of observations (or none at all) in the interaction matrix corresponding to a particular entity, the latent representations are heavily influenced by the features $\phi_m(\mathbf{x})$; their contribution is more modest when there are many observations. The same idea can be used to incorporate features about the interactions themselves, coined "relation features". Additionally, Macau imposes a zero-mean Gaussian prior on the columns of the weight matrix $\mathbf{W}$ as

$$p(\mathbf{W} \mid \boldsymbol{\Lambda}, \lambda) = \mathcal{N}\bigl(\mathrm{vec}(\mathbf{W}) \mid \mathbf{0},\, (\boldsymbol{\Lambda} \otimes \lambda\mathbf{I})^{-1}\bigr), \qquad p(\lambda \mid a, b) = \mathrm{Ga}(\lambda \mid a, b).$$

Similarly to BPMF, Macau uses Gibbs sampling for full Bayesian inference.
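To make the multilinear map $\delta$ concrete, the toy sketch below (an illustration in numpy, not Macau's actual implementation) evaluates $\delta(\mathbf{u}_I)$ for a single multi-index by summing, over the latent dimensions, the elementwise product of the selected factor columns:

```python
import numpy as np

def delta(factors, multi_index):
    """delta(u_I): sum of the diagonal components of u_{i_1} x ... x u_{i_M}.

    factors: list of M factor matrices U_m, each with shape (L, dim_m)
    multi_index: tuple (i_1, ..., i_M) selecting one column from each factor
    """
    # Elementwise product of the selected latent columns, then sum over L.
    prod = np.ones(factors[0].shape[0])
    for U_m, i_m in zip(factors, multi_index):
        prod = prod * U_m[:, i_m]
    return prod.sum()

# Toy example with a 3-mode tensor and L = 4 latent dimensions.
rng = np.random.default_rng(1)
U1, U2, U3 = (rng.normal(size=(4, d)) for d in (6, 5, 3))
print(delta([U1, U2, U3], (2, 0, 1)))  # predicted mean of R[2, 0, 1]
```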

4.2 Variational Bayesian Multiple Kernel Logistic Matrix Factorization

[Figure 4.4: plate diagram with drug-side nodes $a_u$, $b_u$, $\boldsymbol{\gamma}_u$, $\alpha_u$, $\mathbf{K}^n_u$, $\mathbf{U}$, $\mathbf{m}_u$ (plate over $n$), target-side nodes $a_v$, $b_v$, $\boldsymbol{\gamma}_v$, $\alpha_v$, $\mathbf{K}^m_v$, $\mathbf{V}$, $\mathbf{m}_v$ (plate over $m$), and the observed interaction matrix $\mathbf{R}$ with the importance weight $c$.]

Figure 4.4. Probabilistic graphical representation of the Bayesian multiple kernel logistic matrix factorization model.

4.2.1 Probabilistic model

Let $\mathbf{R} \in \{0,1\}^{I \times J}$ denote the matrix of interactions, where $R_{ij} = 1$ indicates a known interaction between the $i$th drug and the $j$th target. As in Bayesian logistic regression, we put a Bernoulli distribution on each $R_{ij}$ with parameter $\sigma(\mathbf{u}_i^T \mathbf{v}_j)$, where $\sigma$ is the logistic sigmoid function and $\mathbf{u}_i$, $\mathbf{v}_j$ are the $i$th and $j$th columns of the respective factor matrices $\mathbf{U} \in \mathbb{R}^{L \times I}$ and $\mathbf{V} \in \mathbb{R}^{L \times J}$. One can think of $\mathbf{u}_i$ and $\mathbf{v}_j$ as $L$-dimensional latent representations of the $i$th drug and the $j$th target, and the a posteriori probability of an interaction between them is modeled by $\sigma(\mathbf{u}_i^T \mathbf{v}_j)$.

Similarly to NRLMF, we utilize an augmented version of the Bernoulli distribution parameterized by $c \geq 1$, which assigns higher importance to positive examples. As we mentioned in Section 4.1.2.2, NRLMF uses a post-training weighted average to infer interactions corresponding to empty rows and columns in $\mathbf{R}$ (i.e., these would have to be estimated without using any corresponding observations).

We account for these cases by introducing variables $\mathbf{m}_u \in \{0,1\}^I$, $\mathbf{m}_v \in \{0,1\}^J$ indicating whether the row or column is empty. In such cases, we only want to use the side information in the prediction. Therefore, we formulate the conditional distribution of the interactions, given the factors and indicators, as

$$p(\mathbf{R} \mid \mathbf{U}, \mathbf{V}, c, \mathbf{m}_u, \mathbf{m}_v) \propto \prod_i \prod_j \Bigl[ \sigma\bigl(\mathbf{u}_i^T \mathbf{v}_j\bigr)^{cR_{ij}} \bigl(1 - \sigma\bigl(\mathbf{u}_i^T \mathbf{v}_j\bigr)\bigr)^{1 - R_{ij}} \Bigr]^{m^u_i m^v_j}. \tag{4.14}$$
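As a concrete reading of Equation (4.14), the following sketch (assuming numpy and hypothetical variable names) evaluates the log of the unnormalized likelihood, showing how $c$ up-weights positive entries and how the indicators $\mathbf{m}_u$, $\mathbf{m}_v$ switch off entries belonging to empty rows or columns:

```python
import numpy as np

def log_likelihood(R, U, V, c, m_u, m_v):
    """Log of the unnormalized likelihood in Eq. (4.14).

    R: (I, J) binary interaction matrix, U: (L, I), V: (L, J) factor matrices,
    c >= 1: importance weight, m_u: (I,), m_v: (J,) 0/1 indicator vectors.
    """
    Z = U.T @ V                                   # Z[i, j] = u_i^T v_j
    log_sig = -np.logaddexp(0.0, -Z)              # log sigma(z), numerically stable
    log_one_minus_sig = -np.logaddexp(0.0, Z)     # log(1 - sigma(z)) = log sigma(-z)
    per_entry = c * R * log_sig + (1.0 - R) * log_one_minus_sig
    mask = np.outer(m_u, m_v)                     # m_i^u * m_j^v
    return np.sum(mask * per_entry)
```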

Specifying priors on $\mathbf{U}$ and $\mathbf{V}$ presents an opportunity to incorporate multiple sources of side information. In particular, we can use a Gaussian distribution with a weighted linear combination of kernel matrices $\mathbf{K}^n$, $n = 1, 2, \ldots$, in the precision matrix, which is the probabilistic version of a combined Laplacian–$L_2$ regularization scheme (cf. Eq. (4.13)):

$$p(\mathbf{U} \mid \alpha_u, \boldsymbol{\gamma}_u, \mathbf{K}_u) \propto \prod_i \prod_k \exp\Bigl\{ -\tfrac{1}{2} \sum_n \gamma^u_n K^u_{n,ik} \|\mathbf{u}_i - \mathbf{u}_k\|^2 \Bigr\} \cdot \prod_i \exp\Bigl\{ -\tfrac{\alpha_u}{2} \|\mathbf{u}_i\|^2 \Bigr\}. \tag{4.15}$$

The prior on $\mathbf{V}$ can be written similarly. Similarly to KBMF2MKL, to automate the learning of the optimal values of the kernel weights $\gamma^u_n$, we impose Gamma priors on them:

$$p(\gamma^u_n \mid a, b) = \frac{b^a (\gamma^u_n)^{a-1} e^{-b\gamma^u_n}}{\Gamma(a)}. \tag{4.16}$$
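The following sketch (illustrative only, with assumed array shapes) evaluates the log of the unnormalized prior in Equation (4.15) for a given set of kernel weights, making explicit that the first factor is a weighted graph-Laplacian-style penalty over the kernels and the second an $L_2$ penalty:

```python
import numpy as np

def log_prior_U(U, K_u, gamma_u, alpha_u):
    """Log of the unnormalized prior in Eq. (4.15).

    U: (L, I) factor matrix, K_u: (N, I, I) stack of kernel matrices,
    gamma_u: (N,) kernel weights, alpha_u: L2 precision.
    """
    # Squared distances ||u_i - u_k||^2 for all pairs (i, k).
    sq_norms = np.sum(U * U, axis=0)                       # (I,)
    D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * U.T @ U
    K_combined = np.tensordot(gamma_u, K_u, axes=1)        # sum_n gamma_n^u K_n^u
    laplacian_term = -0.5 * np.sum(K_combined * D)
    l2_term = -0.5 * alpha_u * np.sum(sq_norms)
    return laplacian_term + l2_term
```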

4.2.2 Variational approximation

In this section, we apply the machinery developed in Section 1.2.2.1. We are interested in the approximation to the posterior

$$p(\mathbf{U}, \mathbf{V}, \boldsymbol{\gamma}_u, \boldsymbol{\gamma}_v \mid \mathbf{R}, \mathbf{K}^n_u, a_u, b_u, \mathbf{K}^n_v, a_v, b_v, \alpha_u, \alpha_v, c)$$

by the variational distribution $q(\mathbf{U}, \mathbf{V}, \boldsymbol{\gamma}_u, \boldsymbol{\gamma}_v)$.

We can obtain such an approximation by maximizing a lower bound on the evidence

$$p(\mathbf{R}) = \int p(\mathbf{R} \mid \mathbf{U}, \mathbf{V})\, p(\mathbf{U} \mid \boldsymbol{\gamma}_u)\, p(\mathbf{V} \mid \boldsymbol{\gamma}_v)\, p(\boldsymbol{\gamma}_u)\, p(\boldsymbol{\gamma}_v)\, d\mathbf{U}\, d\mathbf{V}\, d\boldsymbol{\gamma}_{u,v},$$

where the integration runs over $\mathbf{U}$, $\mathbf{V}$, $\boldsymbol{\gamma}_{u,v}$ and we suppressed the hyperparameters for notational simplicity.

We utilize the mean field approach by introducing the factorized variational distribution

$$q(\mathbf{U}, \mathbf{V}, \boldsymbol{\gamma}_u, \boldsymbol{\gamma}_v) = q(\mathbf{U})\, q(\mathbf{V})\, q(\boldsymbol{\gamma}_u)\, q(\boldsymbol{\gamma}_v).$$

Applying the decomposition as in Equation (1.4),

$$\ln p(\mathbf{R}) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p),$$

where $\mathrm{KL}(\cdot\|\cdot)$ is the Kullback–Leibler divergence and $\mathcal{L}(\cdot)$ is the evidence lower bound. In particular,

$$\mathcal{L}(q) = \int q(\mathbf{U})\, q(\mathbf{V})\, q(\boldsymbol{\gamma}_u)\, q(\boldsymbol{\gamma}_v) \ln \frac{p(\mathbf{R}, \mathbf{U}, \mathbf{V}, \boldsymbol{\gamma}_u, \boldsymbol{\gamma}_v)}{q(\mathbf{U})\, q(\mathbf{V})\, q(\boldsymbol{\gamma}_u)\, q(\boldsymbol{\gamma}_v)}\, d\mathbf{U}\, d\mathbf{V}\, d\boldsymbol{\gamma}_{u,v},$$

which is the quantity we aim to maximize with respect to $q$, as it means an improved approximation to the posterior, as measured by the Kullback–Leibler divergence.

The optimal distribution satisfies

$$\ln q(\mathbf{U}) = \mathbb{E}_{\mathbf{V}, \boldsymbol{\gamma}_{u,v}}\bigl[\ln\{p(\mathbf{R} \mid \mathbf{U}, \mathbf{V})\, p(\mathbf{U} \mid \boldsymbol{\gamma}_u)\, p(\mathbf{V} \mid \boldsymbol{\gamma}_v)\, p(\boldsymbol{\gamma}_u)\, p(\boldsymbol{\gamma}_v)\}\bigr] + \mathrm{const.},$$

which is non-conjugate due to the form of $p(\mathbf{R} \mid \mathbf{U}, \mathbf{V})$, and therefore the integral is intractable. However, by using a Taylor approximation of the symmetrized logistic function (Jaakkola's bound [12, 87])

$$\sigma(z) \geq \tilde{\sigma}(z, \xi) = \sigma(\xi) \exp\Bigl\{ \frac{z - \xi}{2} - \frac{1}{2\xi}\Bigl(\sigma(\xi) - \frac{1}{2}\Bigr)\bigl(z^2 - \xi^2\bigr) \Bigr\},$$

we can lower bound $p(\mathbf{R} \mid \mathbf{U}, \mathbf{V})$ at the cost of introducing local variational parameters $\xi_{ij}$, yielding a new bound $\tilde{\mathcal{L}}$ which contains at most quadratic terms.
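For intuition, the minimal sketch below (purely illustrative, assuming numpy) evaluates $\ln\tilde{\sigma}(z,\xi)$ and checks numerically that it never exceeds $\ln\sigma(z)$ and is tight at $z = \xi$:

```python
import numpy as np

def log_sigmoid(z):
    return -np.logaddexp(0.0, -z)

def lam(xi):
    # lambda(xi) = (sigma(xi) - 1/2) / (2 xi), the curvature of Jaakkola's bound
    return (1.0 / (1.0 + np.exp(-xi)) - 0.5) / (2.0 * xi)

def log_jaakkola_bound(z, xi):
    # ln sigma(z) >= ln sigma(xi) + (z - xi)/2 - lambda(xi) * (z^2 - xi^2)
    return log_sigmoid(xi) + 0.5 * (z - xi) - lam(xi) * (z**2 - xi**2)

z = np.linspace(-6, 6, 13)
xi = 2.5
assert np.all(log_jaakkola_bound(z, xi) <= log_sigmoid(z) + 1e-12)  # valid lower bound
assert np.isclose(log_jaakkola_bound(xi, xi), log_sigmoid(xi))      # tight at z = xi
```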

Collecting the terms quadratic and linear in $\mathbf{U}$ gives

$$
\begin{aligned}
\ln q(\mathbf{U}) &= \mathbb{E}_{\mathbf{V},\boldsymbol{\gamma}_{u,v}}\bigl[\ln\{p(\mathbf{R}\mid\mathbf{U},\mathbf{V})\,p(\mathbf{U}\mid\boldsymbol{\gamma}_u)\}\bigr] \\
&= -\frac{\mathbb{E}[\gamma_u]}{2}\sum_i\sum_k K_{ik}\,\|\mathbf{u}_i-\mathbf{u}_k\|^2 - \frac{\alpha_u}{2}\sum_i \mathbf{u}_i^T\mathbf{u}_i + c\sum_i\sum_j m^u_i m^v_j R_{ij}\,\mathbf{u}_i^T\,\mathbb{E}[\mathbf{v}_j] \\
&\qquad + \sum_i\sum_j \mathbb{E}_{\mathbf{V}}\Bigl[m^u_i m^v_j\bigl((c-1)R_{ij}+1\bigr)\ln\sigma\bigl(-\mathbf{u}_i^T\mathbf{v}_j\bigr)\Bigr] + \mathrm{const.} \\
&\geq -\frac{\mathbb{E}[\gamma_u]}{2}\sum_i\sum_k K_{ik}\,\|\mathbf{u}_i-\mathbf{u}_k\|^2 - \frac{\alpha_u}{2}\sum_i \mathbf{u}_i^T\mathbf{u}_i + c\sum_i\sum_j m^u_i m^v_j R_{ij}\,\mathbf{u}_i^T\,\mathbb{E}[\mathbf{v}_j] \\
&\qquad + \sum_i\sum_j m^u_i m^v_j\bigl((c-1)R_{ij}+1\bigr)\Bigl[-\frac{\mathbf{u}_i^T\,\mathbb{E}[\mathbf{v}_j]}{2} - \frac{1}{2\xi_{ij}}\Bigl(\sigma(\xi_{ij})-\frac{1}{2}\Bigr)\mathbf{u}_i^T\,\mathbb{E}\bigl[\mathbf{v}_j\mathbf{v}_j^T\bigr]\mathbf{u}_i\Bigr] \\
&= -\frac{1}{2}\operatorname{tr}\bigl(\mathbf{U}\mathbf{Q}_u\mathbf{U}^T\bigr) + \sum_i \mathbf{u}_i^T\Bigl(\sum_j \hat{R}_{ij}\hat{\xi}_{ij}\,\mathbb{E}\bigl[\mathbf{v}_j\mathbf{v}_j^T\bigr]\Bigr)\mathbf{u}_i + \sum_i \mathbf{u}_i^T\Bigl(\sum_j R'_{ij}\,\mathbb{E}[\mathbf{v}_j]\Bigr),
\end{aligned}
$$

where

$$
\mathbf{Q}_u = \frac{\mathbb{E}[\gamma_u]}{2}\bigl(\operatorname{diag}(\mathbf{K}_u^T\mathbf{1}) - \mathbf{K}_u\bigr) + \frac{\alpha_u}{2}\,\mathbf{I}, \qquad
\hat{\xi}_{ij} = -\frac{1}{2\xi_{ij}}\Bigl(\sigma(\xi_{ij})-\frac{1}{2}\Bigr),
$$
$$
\hat{R}_{ij} = m^u_i m^v_j\bigl((c-1)R_{ij}+1\bigr), \qquad
R'_{ij} = m^u_i m^v_j\,c\,R_{ij} - \tfrac{1}{2}\hat{R}_{ij}.
$$

Since this expression is quadratic in $\mathrm{vec}(\mathbf{U})$, we conclude that $q$ is Gaussian and the parameters can be found by completing the square. In particular,

$$q(\mathrm{vec}(\mathbf{U})) = \mathcal{N}\bigl(\mathrm{vec}(\mathbf{U}) \mid \boldsymbol{\phi}, \boldsymbol{\Lambda}^{-1}\bigr),$$

$$\boldsymbol{\Lambda} = \mathbf{Q}_u \otimes \mathbf{I} - 2\cdot\operatorname{blkdg}_i\Bigl(\sum_j \hat{R}_{ij}\hat{\xi}_{ij}\,\mathbb{E}\bigl[\mathbf{v}_j\mathbf{v}_j^T\bigr]\Bigr), \tag{4.17}$$

$$\boldsymbol{\phi} = \boldsymbol{\Lambda}^{-1}\operatorname{vec}_i\Bigl(\sum_j R'_{ij}\,\mathbb{E}[\mathbf{v}_j]\Bigr), \tag{4.18}$$

where $\operatorname{blkdg}_i$ denotes the operator creating an $L \cdot I \times L \cdot I$ block-diagonal matrix from $I$ blocks of size $L \times L$. Figure 4.5 illustrates the structure of the precision matrix $\boldsymbol{\Lambda}$. The variational update for $q(\mathbf{V})$ can be derived similarly. The most computationally intensive operation is computing

$$\mathbb{E}\bigl[\mathbf{v}_j\mathbf{v}_j^T\bigr] = \mathrm{Cov}(\mathbf{v}_j) + \mathbb{E}[\mathbf{v}_j]\,\mathbb{E}[\mathbf{v}_j]^T. \tag{4.19}$$

[Figure 4.5: the $i$th diagonal $L \times L$ block of $\boldsymbol{\Lambda}$, illustrated for $i = 3$, combines $Q_{33}\cdot\mathbf{I}$ with the term $-2\sum_j \hat{R}_{3j}\hat{\xi}_{3j}\,\mathbb{E}[\mathbf{v}_j\mathbf{v}_j^T]$; the off-diagonal blocks come from $\mathbf{Q}_u \otimes \mathbf{I}$ alone.]

Figure 4.5. Structure of the precision matrix $\boldsymbol{\Lambda}$ as indicated by Equation (4.17).
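A dense-algebra sketch of the $q(\mathbf{U})$ update in Equations (4.17)–(4.19) (illustrative only, assuming numpy; the variable names for the current moments of $q(\mathbf{V})$ and for $\hat{R}$, $R'$, $\hat{\xi}$ are hypothetical) could look as follows:

```python
import numpy as np

def update_q_U(Q_u, E_v, E_vvT, R_hat, R_prime, xi_hat):
    """Build Lambda and phi as in Eqs. (4.17)-(4.18) and return the new moments of q(U).

    Q_u: (I, I), E_v: (L, J) columnwise means E[v_j],
    E_vvT: (J, L, L) second moments E[v_j v_j^T] (Eq. 4.19),
    R_hat, R_prime, xi_hat: (I, J) matrices defined in the text.
    """
    I, J = R_hat.shape
    L = E_v.shape[0]
    # Block-diagonal correction: i-th L x L block is sum_j Rhat_ij * xihat_ij * E[v_j v_j^T].
    blocks = np.einsum('ij,jlm->ilm', R_hat * xi_hat, E_vvT)      # (I, L, L)
    Lam = np.kron(Q_u, np.eye(L))                                  # Q_u (x) I
    for i in range(I):
        Lam[i*L:(i+1)*L, i*L:(i+1)*L] -= 2.0 * blocks[i]
    # Linear term: vec_i( sum_j R'_ij E[v_j] ), stacked block by block (column order of U).
    b = (E_v @ R_prime.T).T.reshape(I * L)
    phi = np.linalg.solve(Lam, b)                                  # Eq. (4.18)
    cov = np.linalg.inv(Lam)                                       # dense inverse (see the remark below)
    E_U = phi.reshape(I, L).T                                      # (L, I) columnwise means E[u_i]
    return E_U, cov
```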

Remark. When computing $\boldsymbol{\phi}$, one can get away without explicitly inverting $\boldsymbol{\Lambda}$ by using the conjugate gradient (CG) algorithm for solving linear systems. In fact, the sparsity and special structure of $\boldsymbol{\Lambda}$ can be exploited to make the algorithm faster. However, computing $\mathbb{E}[\mathbf{v}_j\mathbf{v}_j^T]$ requires knowing the diagonal blocks of the inverse; later, Equation (4.22) needs all blocks. Unfortunately, the inverse usually comes out dense, which rules out inversion methods exploiting sparsity, e.g. those involving Givens rotations. Inverting matrices with this special structure remains an open question; we settle for a blocked Cholesky decomposition implemented on the GPU, which makes the algorithm applicable in practical dimensions.
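As the remark suggests, $\boldsymbol{\phi}$ alone can be obtained without forming $\boldsymbol{\Lambda}^{-1}$; below is a sketch using SciPy's conjugate gradient solver on a sparse representation of $\boldsymbol{\Lambda}$ (illustrative only; note that it does not provide the blocks of the inverse needed by Equations (4.19) and (4.22)):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def solve_phi_cg(Q_u, blocks, b, L):
    """Solve Lambda phi = b with CG, exploiting the structure of Eq. (4.17).

    Q_u: (I, I) matrix, blocks: (I, L, L) block-diagonal correction terms,
    b: (I*L,) right-hand side, L: latent dimension.
    """
    I = Q_u.shape[0]
    kron_part = sp.kron(sp.csr_matrix(Q_u), sp.identity(L), format='csr')
    block_part = sp.block_diag([-2.0 * blocks[i] for i in range(I)], format='csr')
    Lam = kron_part + block_part                      # sparse precision matrix
    phi, info = cg(Lam, b)                            # CG works since Lambda is SPD
    if info != 0:
        raise RuntimeError("CG did not converge")
    return phi
```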

The optimal value of the local variational parameters $\xi_{ij}$ can be computed by writing the expectation of the joint distribution in terms of $\boldsymbol{\xi}$ and setting its derivative to zero. In particular,

$$\tilde{\mathcal{L}}(\boldsymbol{\xi}) = \sum_i \sum_j \hat{R}_{ij}\Bigl[\ln\sigma(\xi_{ij}) - \frac{\xi_{ij}}{2} - \frac{1}{2\xi_{ij}}\Bigl(\sigma(\xi_{ij}) - \frac{1}{2}\Bigr)\Bigl(\mathbb{E}\bigl[(\mathbf{u}_i^T\mathbf{v}_j)^2\bigr] - \xi_{ij}^2\Bigr)\Bigr],$$

from which [4, 87]

$$\xi_{ij}^2 = \mathbb{E}\bigl[(\mathbf{u}_i^T\mathbf{v}_j)^2\bigr] = \bigl(\mathbb{E}[\mathbf{u}_i]^T\mathbb{E}[\mathbf{v}_j]\bigr)^2 + \sum_l \mathbb{E}[U_{li}]^2\,\mathbb{V}[V_{lj}] + \mathbb{V}[U_{li}]\,\mathbb{E}[V_{lj}]^2 + \mathbb{V}[U_{li}]\,\mathbb{V}[V_{lj}]. \tag{4.20}$$
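A minimal sketch of the $\boldsymbol{\xi}$ update in Equation (4.20), given the elementwise means and variances of the factors (hypothetical variable names, assuming numpy):

```python
import numpy as np

def update_xi(E_U, V_U, E_V, V_V):
    """xi_ij = sqrt(E[(u_i^T v_j)^2]) per Eq. (4.20).

    E_U, V_U: (L, I) elementwise means and variances of U,
    E_V, V_V: (L, J) elementwise means and variances of V.
    """
    mean_sq = (E_U.T @ E_V) ** 2                               # (E[u_i]^T E[v_j])^2
    cross = (E_U**2).T @ V_V + V_U.T @ (E_V**2) + V_U.T @ V_V  # variance contributions
    return np.sqrt(mean_sq + cross)                            # (I, J) matrix of xi_ij
```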

Since the model is conjugate with respect to the kernel weights, we can use the standard update formulas for the Gamma distribution:

$$q(\gamma^u_n) = \mathrm{Ga}(\gamma^u_n \mid a', b'), \qquad a' = a + \frac{I^2}{2}, \tag{4.21}$$

$$b' = b + \frac{1}{2}\,\mathbb{E}_{\mathbf{U}}\Bigl[\sum_i \sum_k K^u_{n,ik}\,\|\mathbf{u}_i - \mathbf{u}_k\|^2\Bigr] = b + \frac{1}{2}\sum_i \sum_k K^u_{n,ik}\Bigl(\mathbb{E}\bigl[\mathbf{u}_i^T\mathbf{u}_i\bigr] - 2\,\mathbb{E}\bigl[\mathbf{u}_i^T\mathbf{u}_k\bigr] + \mathbb{E}\bigl[\mathbf{u}_k^T\mathbf{u}_k\bigr]\Bigr). \tag{4.22}$$
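The Gamma update in Equations (4.21)–(4.22) amounts to accumulating expected squared distances under $q(\mathbf{U})$; the sketch below (illustrative, assuming numpy; as the remark above notes, the cross terms $\mathbb{E}[\mathbf{u}_i^T\mathbf{u}_k]$ require the off-diagonal blocks of the covariance of $\mathrm{vec}(\mathbf{U})$):

```python
import numpy as np

def update_q_gamma(K_u, E_U, Cov_U, a, b):
    """Update q(gamma_n^u) = Ga(a', b') following Eqs. (4.21)-(4.22).

    K_u: (N, I, I) stack of kernels, E_U: (L, I) columnwise means E[u_i],
    Cov_U: (I*L, I*L) covariance of vec(U), a, b: prior hyperparameters.
    """
    L, I = E_U.shape
    # E[u_i^T u_k] = E[u_i]^T E[u_k] + tr(Cov(u_i, u_k)); the trace uses the
    # (i, k) block of Cov_U, which is why all blocks of the inverse are needed.
    blocks = Cov_U.reshape(I, L, I, L)
    tr_cov = np.einsum('iljl->ij', blocks)
    E_inner = E_U.T @ E_U + tr_cov                             # (I, I) matrix of E[u_i^T u_k]
    diag = np.diag(E_inner)
    sq_dist = diag[:, None] + diag[None, :] - 2.0 * E_inner    # E[||u_i - u_k||^2]
    a_new = a + I**2 / 2.0                                     # Eq. (4.21)
    b_new = b + 0.5 * np.einsum('nik,ik->n', K_u, sq_dist)     # Eq. (4.22), one b' per kernel
    return a_new, b_new
```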

Algorithm 2. Pseudocode of the VB-MK-LMF algorithm.

Require:
  Data: $\mathbf{R} \in \{0,1\}^{I \times J}$, $\mathbf{K}^n_u \in \mathbb{R}^{I \times I}$, $\mathbf{K}^m_v \in \mathbb{R}^{J \times J}$.
  Parameters: $L, D \in \mathbb{N}$, $a_u, b_u, a_v, b_v, \alpha_u, \alpha_v, c \in \mathbb{R}_+$.

  Calculate $\mathbf{m}_u \in \{0,1\}^I$ and $\mathbf{m}_v \in \{0,1\}^J$. Initialize $q(\boldsymbol{\gamma}_u)$ and $q(\boldsymbol{\gamma}_v)$ using Eq. (4.16).
  Initialize $q(\mathbf{U})$ and $q(\mathbf{V})$ using Eq. (4.15).
  for $d$ in $1 \ldots D$ do
    Update Jaakkola's bound $\boldsymbol{\xi}$ using Eq. (4.20).
    Update $q(\mathbf{U})$ using Eqs. (4.17) and (4.18).
    Update Jaakkola's bound $\boldsymbol{\xi}$ using Eq. (4.20).
    Update $q(\mathbf{V})$ the same way as $q(\mathbf{U})$.
    Update $q(\boldsymbol{\gamma}_u)$ and $q(\boldsymbol{\gamma}_v)$ using Eqs. (4.21) and (4.22).
  return $q(\mathbf{U})$, $q(\mathbf{V})$, $q(\boldsymbol{\gamma}_u)$, $q(\boldsymbol{\gamma}_v)$.

4.3 A Bayesian matrix factorization model with non-random