
Dataset            Drugs   Targets   Interactions
Nuclear Receptor      54        26             90
GPCR                 223        95            635
Ion Channel          210       204           1476
Enzyme               445       604           2926
Kinase                69       443          30567
Psychoanaleptics      37        82            446

Table 4.1. Dimensions of the benchmark datasets utilized in the experiments.

The joint distribution with respect to $R_{ij}$ is

$$p(R_{ij}, \Theta \setminus R_{ij}) = \mathcal{N}\!\left(R_{ij} \mid \mathbf{u}_i^{T}\mathbf{v}_j, (\gamma_i^{c}\gamma_j^{r})^{-1}\right) \prod_{n=1}^{N} \mathcal{N}\!\left(B_{ijn} \mid R_{ij}, (\sigma_{in}^{c}\sigma_{jn}^{r})^{-1}\right) \mathcal{B}\!\left(X_{ij} \mid f(R_{ij}, s_1, s_2, \mu)\right),$$

where $\mathcal{B}$ is the Bernoulli distribution. Sampling from the conditional is not trivial, since it does not follow any well-known distribution in $R_{ij}$. Hence, we had to resort to a slice sampling step within the Gibbs sampler. The product of the first two terms gives a normal distribution on $R_{ij}$, which can be calculated straightforwardly from the formulas for products of Gaussians. Separating the unimodal and bimodal cases corresponding to $X_{ij} = 0$ and $X_{ij} = 1$, the boundaries of any slice can be obtained using the Newton–Raphson method.
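For concreteness, a minimal sketch of such a slice sampling step for the unimodal case is given below, assuming access to the log conditional density and its derivative; the function names, the starting offsets for Newton–Raphson and the shrinkage fallback are illustrative choices rather than the thesis implementation.

```python
import numpy as np

def newton_root(g, dg, x0, tol=1e-8, max_iter=100):
    """Find a root of g with Newton-Raphson, starting from x0 (assumed away from stationary points of g)."""
    x = x0
    for _ in range(max_iter):
        step = g(x) / dg(x)
        x -= step
        if abs(step) < tol:
            break
    return x

def slice_sample_step(x_curr, log_p, dlog_p, offset=5.0, rng=None):
    """One slice-sampling update for a scalar variable with a unimodal conditional.

    log_p / dlog_p: log-density (up to an additive constant) and its derivative.
    The slice boundaries are the two solutions of log_p(x) = log_y, located by
    Newton-Raphson from points to the left and right of the current state.
    """
    rng = np.random.default_rng() if rng is None else rng
    log_y = log_p(x_curr) + np.log(rng.uniform())        # height of the slice
    g = lambda x: log_p(x) - log_y
    left = newton_root(g, dlog_p, x_curr - offset)
    right = newton_root(g, dlog_p, x_curr + offset)
    left, right = min(left, right), max(left, right)      # guard against a badly ordered bracket
    # Sample uniformly on the slice, shrinking the bracket if a draw falls outside.
    for _ in range(100):
        x_new = rng.uniform(left, right)
        if log_p(x_new) >= log_y:
            return x_new
        if x_new < x_curr:
            left = x_new
        else:
            right = x_new
    return x_curr
```

As a quick sanity check, calling the step repeatedly with log_p = lambda x: -0.5 * x**2 and dlog_p = lambda x: -x reproduces draws from a standard normal.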

VB-MK-LMF was used with 3 neighbors in each kernel, $\alpha_u = \alpha_v = 0.1$, $a_u = a_v = 1$, $b_u = b_v = 10^3$ and $c = 10$. The number of latent factors was set to L = 10 in the Nuclear Receptor dataset and L = 15 in the others. We also present a more detailed investigation of this parameter and compare its value to those used in earlier works. The number of iterations was manually set to 20, since the variational parameters usually converged within 20–50 iterations.

In the multiple-kernel scenario, we compared the predictive performance of VB-MK-LMF to KBMF2MKL and KronRLS-MKL using MACCS and Morgan fingerprints with RBF and Tanimoto similarities. Target kernels provided by KronRLS-MKL did not improve the results in either case, thus only the ones computed by Yamanishi et al. were utilized. We also investigated the weights assigned to the kernels and tested robustness by introducing kernels with random values.
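For reference, drug kernels of this kind can be computed from binary fingerprint matrices as sketched below; the RBF bandwidth heuristic and the exact form of neighborhood truncation are assumptions for illustration, not necessarily the settings used in the experiments.

```python
import numpy as np

def tanimoto_kernel(F):
    """Tanimoto similarity between rows of a binary fingerprint matrix F (n_drugs x n_bits)."""
    F = F.astype(float)
    inner = F @ F.T                               # number of shared set bits
    bits = np.diag(inner)                         # number of set bits per fingerprint
    denom = bits[:, None] + bits[None, :] - inner
    K = np.where(denom > 0, inner / np.maximum(denom, 1e-12), 0.0)
    np.fill_diagonal(K, 1.0)                      # a fingerprint is fully similar to itself
    return K

def rbf_kernel(F, gamma=None):
    """Gaussian RBF kernel on fingerprints; gamma defaults to 1/n_bits (an assumed heuristic)."""
    F = F.astype(float)
    gamma = 1.0 / F.shape[1] if gamma is None else gamma
    sq = (F ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * F @ F.T, 0.0)
    return np.exp(-gamma * d2)

def truncate_neighbors(K, k):
    """Keep the k largest similarities per row and symmetrize (one plausible form of neighborhood truncation)."""
    K_t = np.zeros_like(K)
    idx = np.argsort(-K, axis=1)[:, :k]
    rows = np.arange(K.shape[0])[:, None]
    K_t[rows, idx] = K[rows, idx]
    return (K_t + K_t.T) / 2.0
```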

For the evaluation of the MK-BMF-MNAR method treating real-valued interaction data, binding affinities were collected from the public ChEMBL database [132]. We restricted the set of drugs to psychoanaleptics belonging to the ATC class N06*. The dataset was constructed by logarithmically transforming $K_i$ values, aggregating multiple measurements by their median and discarding inexact measurements. Chemical fingerprints were computed using the CDK library; in particular, MACCS and Klekota–Roth fingerprints were used with the Tanimoto similarity measure. The dimensions of the benchmark datasets are shown in Table 4.1.
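A minimal pandas sketch of this preprocessing is given below; the column names and the pKi-style log transform are assumptions about the ChEMBL export rather than the exact pipeline (the fingerprints themselves were computed with CDK, which is not reproduced here).

```python
import numpy as np
import pandas as pd

def prepare_affinity_matrix(activities: pd.DataFrame) -> pd.DataFrame:
    """Build a drug x target matrix of log-transformed Ki values from raw ChEMBL rows.

    Expected columns: 'drug_id', 'target_id', 'standard_value' (Ki in nM) and
    'standard_relation'; these names are assumptions about the export format.
    """
    df = activities.copy()
    df = df[df["standard_relation"] == "="]              # discard inexact (censored) measurements
    df = df[df["standard_value"] > 0]
    df["pki"] = -np.log10(df["standard_value"] * 1e-9)   # logarithmic transformation (pKi-style)
    # Aggregate replicate measurements of the same drug-target pair by their median.
    agg = df.groupby(["drug_id", "target_id"])["pki"].median().reset_index()
    # Pivot into an interaction matrix; NaN marks unobserved pairs.
    return agg.pivot(index="drug_id", columns="target_id", values="pki")
```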

The predictive performance was evaluated in a 10× 80%-20% cross-validation setting by computing root mean squared errors (RMSE) and comparing them to those of BPMF and Macau. We also present a cumulative evaluation of the submodules of MK-BMF-MNAR. In this experiment, we used $\mathcal{NW}(0, 1000, \mathbf{I}, L)$ as the prior of $\mathbf{V}$ and $\mathcal{N}(0, \mathbf{S}_u)$ for $\mathbf{U}$; Gamma priors were parameterized with $a = 10$, $b = 1$, Inverse Gamma priors with $a = 1$, $b = 2$, and 8 latent factors were used (4 for each kernel). Since BPMF cannot incorporate a background knowledge model, we also omitted the use of kernels for a fair comparison. We used 500 burn-in steps, which was sufficient for the convergence of Gibbs sampling, as indicated by the Geweke convergence diagnostic.
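The cross-validation protocol can be sketched as follows; fit_predict is a hypothetical placeholder standing in for any of the compared models, and the way observed entries are masked is an assumption for illustration.

```python
import numpy as np

def cv_rmse(R, fit_predict, n_repeats=10, test_frac=0.2, seed=0):
    """Repeated 80%-20% cross-validation over the observed entries of R.

    R: drug x target matrix with NaN marking unobserved entries.
    fit_predict: callable(R_train) -> full predicted matrix; a placeholder for
    any of the compared models (MK-BMF-MNAR, Macau, BPMF).
    """
    rng = np.random.default_rng(seed)
    observed = np.argwhere(~np.isnan(R))
    rmses = []
    for _ in range(n_repeats):
        perm = rng.permutation(len(observed))
        test = observed[perm[: int(test_frac * len(observed))]]
        R_train = R.copy()
        R_train[test[:, 0], test[:, 1]] = np.nan         # hold out the test entries
        R_pred = fit_predict(R_train)
        err = R_pred[test[:, 0], test[:, 1]] - R[test[:, 0], test[:, 1]]
        rmses.append(np.sqrt(np.mean(err ** 2)))
    return float(np.mean(rmses)), float(np.std(rmses))
```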

4.4.2 Results

Single-kernel AUROC and AUPRC values of VB-MK-LMF, NRLMF and one-kernel KBMF are shown in Table 4.2. According to a pairwise t-test, the method significantly outperforms NRLMF and one-kernel KBMF in terms of both measures in most cases. Although still significant in some experiments, the improvement is generally more modest on the Enzyme dataset, which is by far the largest. This is not particularly surprising, as larger datasets typically mitigate the benefits of Bayesian model averaging and of using side information; the vanishing effect of priors with increasing sample size is investigated systematically later. On average, VB-MK-LMF achieves 4.7% higher AUPRC values in the CVS1 setting than the second best method. In the CVS2 and CVS3 scenarios, the improvements are 2% and 7.6%, respectively. The lower AUROC and AUPRC values in these settings are explained by the lack of observations for the test drugs or test targets in the training set, resulting in a significantly harder task than in the CVS1 scenario.
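The significance statements rest on paired t-tests over repeated cross-validation runs; a minimal sketch with made-up AUPRC values (purely illustrative, not the reported numbers) is shown below.

```python
import numpy as np
from scipy.stats import ttest_rel

# Paired t-test on per-run AUPRC values of two methods. The arrays below are
# placeholders; in the experiments they would hold the 5x10 run results.
auprc_vb_mk_lmf = np.array([0.78, 0.76, 0.79, 0.77, 0.80])
auprc_nrlmf     = np.array([0.72, 0.70, 0.73, 0.71, 0.74])

t_stat, p_value = ttest_rel(auprc_vb_mk_lmf, auprc_nrlmf)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")   # a small p-value indicates a significant difference
```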

The number of latent dimensions plays a crucial role from computational, statistical and interpretational aspects. Earlier works recommend values between 50 and 100 on this DTI benchmark [89, 133]; however, it is conceptually unclear what is to be gained by going beyond the rank of the interaction matrix, for which $\mathrm{rank}(\mathbf{R}) \leq \min(I, J)$ holds. In the SVD context, choosing $L = \mathrm{rank}(\mathbf{R})$ corresponds to a perfect reconstruction of the interaction matrix (at least in the known values) and leads to serious overfitting in unregularized cases [123, 134]. Although overfitting is usually less of an issue with variational Bayesian approximations, a large number of latent factors still increases the computational time significantly. We found that large numbers do not yield better results; in fact, the AUPRC values

Table 4.2. Single-kernel results on binary datasets. CV indicates the cross-validation setting (pairwise, drug and target, respectively). AUROC and AUPRC values were averaged over 5×10 runs and 95% confidence intervals were computed. In most cases, VB-MK-LMF significantly outperforms the other methods according to a t-test.

AUROC (CV1)
                   VB-MK-LMF       NRLMF           KBMF
Nuclear Receptor   0.957±0.010     0.949±0.011     0.860±0.024
GPCR               0.976±0.003     0.960±0.004     0.911±0.004
Ion Channel        0.989±0.001     0.984±0.002     0.941±0.003
Enzyme             0.987±0.001     0.976±0.002     0.887±0.003
Kinase             0.921±0.002     0.919±0.001     0.916±0.001

AUPRC (CV1)
                   VB-MK-LMF       NRLMF           KBMF
Nuclear Receptor   0.773±0.030     0.723±0.042     0.533±0.047
GPCR               0.777±0.016     0.703±0.023     0.541±0.012
Ion Channel        0.916±0.007     0.863±0.012     0.763±0.009
Enzyme             0.890±0.006     0.876±0.007     0.656±0.008
Kinase             0.850±0.003     0.845±0.003     0.844±0.003

AUROC (CV2)
                   VB-MK-LMF       NRLMF           KBMF
Nuclear Receptor   0.939±0.021     0.896±0.023     0.845±0.023
GPCR               0.878±0.014     0.883±0.012     0.847±0.018
Ion Channel        0.812±0.026     0.800±0.026     0.785±0.021
Enzyme             0.851±0.021     0.811±0.024     0.718±0.028
Kinase             0.894±0.004     0.891±0.004     0.838±0.004

AUPRC (CV2)
                   VB-MK-LMF       NRLMF           KBMF
Nuclear Receptor   0.593±0.058     0.547±0.053     0.447±0.048
GPCR               0.368±0.023     0.363±0.023     0.365±0.024
Ion Channel        0.345±0.035     0.343±0.033     0.287±0.035
Enzyme             0.349±0.042     0.360±0.041     0.269±0.037
Kinase             0.803±0.009     0.797±0.010     0.735±0.009

AUROC (CV3)
                   VB-MK-LMF       NRLMF           KBMF
Nuclear Receptor   0.917±0.026     0.847±0.029     0.735±0.050
GPCR               0.941±0.009     0.920±0.014     0.839±0.020
Ion Channel        0.966±0.007     0.958±0.008     0.911±0.012
Enzyme             0.962±0.005     0.947±0.006     0.859±0.012
Kinase             0.767±0.018     0.763±0.018     0.740±0.022

AUPRC (CV3)
                   VB-MK-LMF       NRLMF           KBMF
Nuclear Receptor   0.601±0.081     0.456±0.079     0.352±0.070
GPCR               0.596±0.040     0.553±0.040     0.437±0.047
Ion Channel        0.826±0.021     0.788±0.028     0.695±0.024
Enzyme             0.794±0.017     0.808±0.018     0.573±0.028
Kinase             0.608±0.039     0.597±0.038     0.594±0.039

[Figure 4.9: three panels (Nuclear Receptor, GPCR, Ion Channel) plotting AUPRC against the number of latent dimensions, 0–50.]

Figure 4.9. AUPRC values on the three smallest datasets with a varying number of latent factors. The results become saturated around 10 dimensions.

quickly become saturated around 10–15 dimensions in the smaller datasets (Figure 4.9). Due to the rapidly increasing runtime, the Enzyme and Kinase datasets were excluded from this experiment.
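To illustrate the rank argument, the following synthetic sketch (the matrix sizes and true rank are arbitrary assumptions) shows that the truncated-SVD reconstruction error vanishes once L reaches rank(R), so larger values of L add nothing in the unregularized case.

```python
import numpy as np

# Truncated-SVD reconstruction of a synthetic low-rank "interaction matrix":
# the relative error drops to numerical zero once L reaches rank(R), so larger
# L only adds computation in the unregularized case.
rng = np.random.default_rng(0)
I, J, true_rank = 50, 40, 12
R = rng.normal(size=(I, true_rank)) @ rng.normal(size=(true_rank, J))

U, s, Vt = np.linalg.svd(R, full_matrices=False)
for L in (5, 10, 12, 20, 30):
    R_hat = (U[:, :L] * s[:L]) @ Vt[:L]        # rank-L reconstruction
    print(f"L = {L:2d}  relative error = {np.linalg.norm(R - R_hat) / np.linalg.norm(R):.2e}")
```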

Multi-kernel AUPRC values of VB-MK-LMF, KBMF2MKL and KronRLS-MKL are shown in Table 4.3. Compared to Table 4.2, it is clear that both VB-MK-LMF and KBMF benefit from using multiple kernels. Moreover, as indicated by the columns, there is also an improvement in predictive performance when instances of the same kernel are combined with different neighborhood truncation values. However, when both of these combination schemes are used simultaneously, the advantages seem to disappear, as the AUPRC values usually do not improve or even decrease (except for the Kinase dataset). This is a known property of linear kernel combinations, observed e.g. by Cortes et al.: large linear kernel combinations usually do not improve predictive performance beyond that of the best individual kernel in the combination, or of a simple unweighted average [93].

Table 4.4 shows the kernel weights in each dataset, using the CVS1 setting and subsequent normalization of the weights. For the sake of illustration, we also included a random unit-diagonal positive definite kernel matrix. In the Nuclear Receptor, GPCR, Ion Channel and Enzyme datasets, the algorithm assigned more or less uniform weights to the real kernels and a lower one to the random kernel. In the Kinase dataset, the weight of the random kernel is almost zero. This underlines the validity of VB-MK-LMF's kernel combination scheme. Setting L to I (the rank of the kernel matrices) yields an almost zero weight for the random kernel. An intuitive geometrical explanation is that in a higher-dimensional vector space, the volume of the enclosing hypersphere is generally greater, allowing for a better separation of the latent representations. This makes it easier for the algorithm to spot and down-weight kernels with erroneous values. This property might justify increasing the number of latent dimensions beyond the rank of the interaction matrix, but only in the multi-kernel setting.
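A unit-diagonal positive definite random kernel of the kind used in this robustness test can be generated, for instance, as a normalized Gram matrix of random vectors; this is one possible construction, not necessarily the one used in the experiments.

```python
import numpy as np

def random_unit_diagonal_kernel(n, seed=0):
    """Random positive definite matrix with unit diagonal.

    One possible construction (not necessarily the one used in the experiments):
    form the Gram matrix of random vectors and rescale it to have ones on the diagonal.
    """
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n, n + 1))    # n+1 columns make the Gram matrix full rank almost surely
    G = A @ A.T                        # positive definite
    d = np.sqrt(np.diag(G))
    return G / np.outer(d, d)          # D^{-1/2} G D^{-1/2}: still PD, unit diagonal
```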

The effect of priors, or side information, seems especially pronounced at smaller sample sizes in terms of improved predictive performance, which is in line with intuitive expectations. To explore this effect quantitatively, we computed the differences in AUPRC and AUROC values with and without kernels, at varying training set sizes. The results seem to indicate the existence of a "small sample size" region where using side information offers significant gains, and after which the effect of

Table 4.3. AUPRC values on binary datasets in the multiple-kernel setting in CVS1. Rows correspond to the neighborhood truncation values and the combinations of the resulting kernels. Columns correspond to individual kernels, with Mrg denoting the Morgan fingerprint, Mcs the MACCS keys, Rbf the Gaussian radial basis function and Tan the Tanimoto similarity. The original kernels provided by Yamanishi et al. are denoted by Orig. The last column was obtained by combining all kernels.

Nuclear Receptor (KBMF2MKL: 0.566, KronRLS-MKL: 0.522)
Neighbors   MrgRbf   MrgTan   McsRbf   McsTan   Orig    All
2           0.749    0.758    0.742    0.735    0.754   0.779
3           0.744    0.771    0.761    0.734    0.773   0.775
5           0.732    0.757    0.739    0.724    0.755   0.756
2+3         0.750    0.765    0.754    0.736    0.757   0.758
2+3+5       0.760    0.765    0.740    0.738    0.764   0.760

GPCR (KBMF2MKL: 0.622, KronRLS-MKL: 0.696)
Neighbors   MrgRbf   MrgTan   McsRbf   McsTan   Orig    All
2           0.743    0.759    0.754    0.762    0.764   0.793
3           0.755    0.774    0.772    0.780    0.777   0.802
5           0.762    0.787    0.782    0.783    0.787   0.796
2+3         0.763    0.782    0.781    0.786    0.785   0.802
2+3+5       0.777    0.798    0.793    0.789    0.796   0.800

Ion Channel (KBMF2MKL: 0.826, KronRLS-MKL: 0.885)
Neighbors   MrgRbf   MrgTan   McsRbf   McsTan   Orig    All
2           0.909    0.911    0.910    0.911    0.910   0.909
3           0.911    0.914    0.915    0.914    0.912   0.916
5           0.915    0.914    0.913    0.916    0.916   0.917
2+3         0.912    0.914    0.916    0.914    0.913   0.909
2+3+5       0.912    0.915    0.915    0.915    0.916   0.906

Enzyme (KBMF2MKL: 0.704, KronRLS-MKL: 0.893)
Neighbors   MrgRbf   MrgTan   McsRbf   McsTan   Orig    All
2           0.885    0.887    0.879    0.883    0.888   0.884
3           0.885    0.890    0.885    0.882    0.890   0.895
5           0.883    0.886    0.880    0.881    0.884   0.883
2+3         0.888    0.889    0.880    0.881    0.888   0.881
2+3+5       0.887    0.889    0.881    0.878    0.888   0.875

Kinase (KBMF2MKL: 0.846, KronRLS-MKL: 0.561)
Neighbors   2D       3D       ECFP     All
2           0.850    0.849    0.849    0.850
3           0.850    0.848    0.850    0.851
5           0.850    0.849    0.850    0.851
2+3         0.850    0.850    0.850    0.853
2+3+5       0.851    0.851    0.850    0.854

Table 4.4. Normalized kernel weights with an extra positive definite, unit-diagonal, random-valued kernel matrix. The number of latent dimensions was not altered in this experiment. Setting the number of latent factors to I (the rank of the kernel matrices) zeroes out the weight of the random kernel.

                   MrgRbf   MrgTan   McsRbf   McsTan   Yam     Random
Nuclear Receptor   0.175    0.176    0.175    0.175    0.175   0.123
GPCR               0.173    0.173    0.172    0.172    0.172   0.138
Ion Channel        0.176    0.176    0.176    0.176    0.176   0.120
Enzyme             0.176    0.176    0.176    0.176    0.176   0.119

                   2D       3D       ECFP     Random
Kinase             0.300    0.283    0.398    0.019

[Figure 4.10: two panels plotting the difference of AUPRC values and the difference of AUROC values against the training size, 0.25–0.75.]

Figure 4.10. The effect of priors on predictive performance with increasing sample sizes. The difference between the values obtained using and omitting kernels gradually vanishes as the training size increases.

priors gradually vanishes (Figure 4.10).

The MK-BMF-MNAR method was evaluated in a cumulative manner. The baseline performance was established using no side information (i.e. using the identity matrix in the Gaussian process priors) and disabling the missing data model. The MKL module was first tested using MACCS fingerprints, then with MACCS and Klekota–Roth fingerprints. Finally, the missing data model was also enabled. The RMSE values of MK-BMF-MNAR, Macau and BPMF are shown in Table 4.5. MK-BMF-MNAR is on par with Macau using only MACCS fingerprints, and both outperform BPMF, which does not utilize side information. Note that MK-BMF-MNAR outperforms BPMF even with no side information, which underlines the benefits of its more sophisticated noise model: a product of per-entity noise variables $\gamma^c$ and $\gamma^r$ is used instead of the single Gaussian noise hyperparameter employed in BPMF. The inclusion of the second kernel yields an additional improvement in the RMSE value, which indicates the advantage of Multiple Kernel Learning. Finally, using the missing data model

Table 4.5. RMSE values in the drug–target interaction prediction task in an 80%-20% cross-validation setting. "nK" denotes the number of kernels used (in the case n = 0, the identity matrix was used). "MDM" denotes the use of the missing data model.

         MK-BMF-MNAR                            Macau    BPMF
         2K+MDM    2K       1K       0K
Mean     0.669     0.698    0.733    0.767     0.749    0.817
StDev    0.041     0.017    0.032    0.075     0.058    0.132
Diff     0.126     0.050    0.087    0.176     0.159    0.392

[Figure 4.11: Geweke–Brooks plots (Z-score versus the first iteration, 500–1500) and the corresponding trace plots (value versus iteration, 0–3000).]

Figure 4.11. Geweke–Brooks plots demonstrating the convergence of the Gibbs inference for high and low affinities in $R$. On the right, the corresponding trace plots are shown.

gives the best results in terms of RMSE, which demonstrates the applicability of MNAR-type models in the DTI prediction context.

To investigate the convergence of the Gibbs sampler, we computed the Geweke convergence diagnostic and trace plots. The former splits the Markov chain into two windows (usually the first 10% and the last 50%) and computes the sample mean and asymptotic variance in both windows; convergence is assessed on the basis of a standard Z-score, i.e. whether there is a significant difference between the sample means. The Geweke–Brooks plot is obtained by sliding the first window along the chain and computing the Z-score repeatedly. We found that 500 burn-in steps were sufficient for convergence for both high and low values in $R$ (Figure 4.11). Overall, 3000 iterations were used; the trace plots showed sufficient mixing. We also examined the resulting low-dimensional representations of the drugs, i.e. the columns of $\mathbf{U}$. Figure 4.12 illustrates the correlation between column–column similarities and the corresponding kernel values, and underlines the validity of incorporating side information via Gaussian process priors.
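A simplified version of the diagnostic is sketched below; the windowed Z-score here uses plain sample variances, whereas standard implementations (and presumably the one behind Figure 4.11) employ spectral estimates of the asymptotic variance.

```python
import numpy as np

def geweke_z(chain, first=0.1, last=0.5):
    """Geweke Z-score for one scalar chain: compares the mean of the first `first`
    fraction with the mean of the last `last` fraction of the samples."""
    chain = np.asarray(chain, dtype=float)
    n = len(chain)
    a = chain[: int(first * n)]
    b = chain[int((1.0 - last) * n):]
    return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

def geweke_brooks(chain, n_points=20, **kwargs):
    """Z-scores computed from successively later starting points, as plotted in Figure 4.11."""
    chain = np.asarray(chain, dtype=float)
    starts = np.linspace(0, len(chain) // 2, n_points, dtype=int)
    return [(int(s), geweke_z(chain[s:], **kwargs)) for s in starts]
```

Z-scores staying within roughly ±2 as the starting point moves along the chain, as in Figure 4.11, indicate that the early and late window means agree.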

Figure 4.12. Correlation between $\mathbf{U}$ and the inner product values of $\mathbf{K}$ after a run with 15 latent factors and one kernel. Similarities of $\mathbf{u}_i$ to the fixed column $\mathbf{u}_1$ for all $i$ are illustrated with black dots; the corresponding kernel values $K_{1i}$ are denoted by a red line. The pairs $\{(\mathrm{sim}(\mathbf{u}_1, \mathbf{u}_i), K_{1i})\}_{i=1}^{I}$ were sorted with respect to $K_{1i}$.