
Here we present a practical analysis of the ℓp-MKL one-class SVM-based ranking strategy. Although some theoretical results on generalization error bounds of one-class SVMs are available, they are of little practical use, and to our knowledge, no improvements have been made since. In particular, current generalization error bounds do not involve the margin ρ or the kernel, and do not translate to the ranking scenario.

2.3.1 Effect of normalization on kernel weights

By optimizing 2.5, the algorithm actively seeks out at least νP support vectors which are as “dissimilar” as possible, to account for every region in the support of the underlying distribution. In particular, it assigns non-zero values to support vectors in a way that the sum of their inner products in the combined RKHS,

    Σᵢ Σⱼ ⟨αᵢφ(x̄ᵢ), αⱼφ(x̄ⱼ)⟩ = αᵀK̄α,

is minimal in some sense. From Equation 2.6, the kernel weights are tied to the same inner products, i.e. given α, the weight of a kernel is a monotonically increasing function of its values corresponding to support vectors (also reflected in the saddle point problem 2.2). Intuitively, kernels are weighted on the basis of how “self-similar” the training set is according to that particular kernel, or, more precisely, how large the radius of the enclosing ball in the RKHS is (Figure 2.2). This also shows that significant biases can be introduced through the magnitude of the values in the kernels, highlighting the need for proper normalization (most commonly, a unit trace normalization step is taken before training).
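The unit trace normalization step is a one-liner in practice. The sketch below (plain NumPy, with small synthetic kernel matrices standing in for real ones) illustrates how it removes a pure scale difference between two kernels, so that no weight bias can arise from magnitude alone:

```python
import numpy as np

def unit_trace(K):
    """Rescale a kernel matrix so that its trace equals 1."""
    return K / np.trace(K)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K1 = X @ X.T          # a small synthetic PSD kernel matrix
K2 = 100.0 * K1       # same similarity structure, inflated magnitude

# The raw magnitudes differ by two orders of magnitude, which would bias
# the learned kernel weights; after unit-trace normalization the two
# kernels are indistinguishable.
print(np.allclose(unit_trace(K1), unit_trace(K2)))  # True
```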

The argument above also underlines the observation that the strategy of introducing a large number of kernels without careful selection and relying on the MKL algorithm to select the best performing ones is largely unfounded.

2.3.2 Effect of heterogeneity on ranking performance

The heterogeneity of the training set can also seriously degrade ranking performance. Intuitively, whenever the training set is inhomogeneous, training entities occupy a ball with a large radius in the RKHS, enclosing a greater fraction of the test samples. Equivalently, the support of the estimated distribution is quite large, with training samples situated on the boundary. This essentially inverts the ranking strategy: entities “similar” to the training set receive lower scores and the training set gets pushed to the bottom of the list. For a more formal analysis, we first recover the margin from the dual objective.

Proposition 2.3.1. The margin for the ℓp-regularized one-class SVM is proportional to the dual objective and has the form

    ρ = (1/νP) αᵀK̄α = (2λ/νP) ‖m‖ₚ² = −(4/νP) D(α),

where K̄ = Σₖ mₖKₖ.

Proof. For any support vector with 0 < αᵢ < 1 it holds that

    ∂D(α)/∂αᵢ = Σₖ mₖ(Kₖα)ᵢ = ρ,


Figure 2.2. Illustration of the connection between kernel weights and the size of the region in which training samples lie in the RKHS. The training set G occupies a small region in the RKHS corresponding to the first kernel, but is much more scattered in the second, resulting in a large weight for K1 and a small weight for K2. The reverse applies to training set B. The “heterogeneity” of the training sets is inversely related to the margin ρ.

where D denotes the dual objective. Using the first constraint in (2.5), one can write

    Σₖ mₖ αᵀKₖα = ρ · 1ᵀα = νPρ,

from which the Proposition follows by (2.3) and (2.4).

The unnormalized average score of the entities is

    (1/T) Σₖ Σᵢ αᵢ Σₓ mₖ kₖ(xᵢ, x) = (1/T) 1ᵀK̄′α,

where K̄′ consists of the columns of the full combined kernel matrix corresponding to the training samples. In other words, K̄′ contains inner products of all samples, training and test, with the training


Figure 2.3. Left side: the ISS/UAS ratio predicts low AUROC values in an experiment conducted on 1041 chemical structural descriptors in 135 drug classes. Right side: estimating the real ρ/average score ratio with the ISS/UAS ratio (2.7) shows a strong correlation.

set. T denotes the overall number of training and test entities. Intuitively, heterogeneity starts causing problems when the average score exceeds the margin, i.e.

    (1/T) 1ᵀK̄′α ≥ (1/νP) αᵀK̄α.

This suggests an easy heuristic to spot heterogeneous training sets. Keeping in mind that α ≤ 1, we define the ratio

    ISS/UAS := (T · 1ᵀK̄1) / (νP · 1ᵀK̄′1),    (2.7)

where ISS and UAS stand for Intra-Set Similarity and Universal Average Similarity, respectively.
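The heuristic is cheap to evaluate from the kernel matrices alone. The following NumPy sketch (synthetic Gaussian kernels with arbitrarily chosen ν and bandwidth, not the fingerprint kernels used in the experiment) contrasts a tight training set with a scattered one:

```python
import numpy as np

def iss_uas_ratio(K_train, K_cross, nu):
    """ISS/UAS heuristic as in (2.7): (T * 1'K1) / (nu*P * 1'K'1).
    K_train is the (P, P) combined training kernel; K_cross is the (T, P)
    matrix of similarities between all T samples and the P training ones."""
    T, P = K_cross.shape
    return (T * K_train.sum()) / (nu * P * K_cross.sum())

def rbf(A, B, gamma):
    # Gaussian kernel between the rows of A and the rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
universe = rng.normal(0.0, 5.0, size=(40, 4))   # scattered test samples

# Homogeneous training set: a tight cluster inside the universe.
tight = rng.normal(0.0, 0.1, size=(10, 4))
r_tight = iss_uas_ratio(rbf(tight, tight, 0.005),
                        rbf(np.vstack([tight, universe]), tight, 0.005),
                        nu=0.5)

# Heterogeneous training set: as scattered as the universe itself.
wide = rng.normal(0.0, 5.0, size=(10, 4))
r_wide = iss_uas_ratio(rbf(wide, wide, 0.005),
                       rbf(np.vstack([wide, universe]), wide, 0.005),
                       nu=0.5)

# The tight (homogeneous) set yields the larger ratio, flagging the
# scattered set as heterogeneous.
print(r_tight, r_wide)
```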

We found that this ratio approximates the real margin/average score ratio fairly well in a numerical experiment involving 1041 FDA-approved drugs with their structural descriptions in 135 drug classes, as obtained from the ATC drug classification. In particular, MACCS, MolconnZ and 3D pharmacophore-based fingerprints were used. ATC classes with 5 or fewer members were excluded from the study. The results are shown on the right side of Figure 2.3.

We also investigated the connection between the ISS/UAS ratio and predictive performance. We conducted a 100× 80%–20% cross-validation experiment for each ATC class, which involved

• Splitting the class 80%–20% in a random manner,

• Training using the 80% set,

• Assessing ranking performance using the 20% set and the AUROC measure.

This scheme was repeated 100 times for each class, with average AUROC values reported.
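The scheme above can be sketched as follows. Since the full ℓp-MKL solver is out of scope here, a hypothetical stand-in scorer (mean combined-kernel similarity to the training split) plays the role of the trained one-class decision function, and the data are synthetic:

```python
import numpy as np

def auroc(pos_scores, neg_scores):
    """AUROC as the probability that a positive outranks a negative
    (ties counted as half), via explicit pair counting."""
    pos = np.asarray(pos_scores, float)
    neg = np.asarray(neg_scores, float)
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def mean_similarity_score(K_all, train_idx):
    """Stand-in scorer: average combined-kernel similarity to the
    training split (not the actual one-class SVM decision function)."""
    return K_all[:, train_idx].mean(axis=1)

def repeated_holdout(K_all, class_idx, n_reps=100, holdout=0.2, seed=0):
    """100x 80%-20% scheme: split the class, score with the 80% part,
    and measure AUROC on the held-out 20% vs. all out-of-class samples."""
    rng = np.random.default_rng(seed)
    class_idx = np.asarray(class_idx)
    out_idx = np.setdiff1d(np.arange(K_all.shape[0]), class_idx)
    aurocs = []
    for _ in range(n_reps):
        perm = rng.permutation(class_idx)
        n_test = max(1, int(round(holdout * len(perm))))
        test_idx, train_idx = perm[:n_test], perm[n_test:]
        scores = mean_similarity_score(K_all, train_idx)
        aurocs.append(auroc(scores[test_idx], scores[out_idx]))
    return float(np.mean(aurocs))

# Toy demo: a compact in-class cluster inside scattered out-of-class points.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.3, size=(20, 5)),    # class members
               rng.normal(0, 2.0, size=(80, 5))])   # everything else
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2)
avg_auc = repeated_holdout(K, class_idx=np.arange(20))
print(avg_auc)   # close to 1 for this well-separated toy class
```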

Remark. The AUROC value is the probability of assigning a higher score to an in-class sample than to a randomly selected out-of-class sample, with the entities of the training set excluded. In other words,

Table 2.1. ATC classes with the lowest AUROC values (< 0.5), demonstrating that the ranking strategy breaks down with highly heterogeneous training sets (as indicated by their names).

Class   Name                                        AUROC

N05CM   Other hypnotics and sedatives               0.128
N01AX   Other general anesthetics                   0.183
D01AE   Other antifungals for topical use           0.260
D09AA   Medicated dressings with antiinfectives     0.263
R05CB   Mucolytics                                  0.264
D11AX   Other dermatologicals                       0.310
R02AA   Antiseptics                                 0.340
A02BX   Other drugs for peptic ulcer and GORD       0.370
L01XX   Other antineoplastic agents                 0.413
B05CA   Antiinfectives                              0.415

it is the probability that the predicted pairwise ranking of a randomly selected pair is correct. The maximum value is 1, which corresponds to a perfect ranking; 0.5 corresponds to a completely random ranking, and values below 0.5 indicate that the ranking algorithm is “reversed”.
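The pairwise-probability definition in the Remark coincides with the usual rank-based (Mann–Whitney U) computation of AUROC; a quick NumPy check on synthetic scores makes the equivalence concrete:

```python
import numpy as np

rng = np.random.default_rng(7)
in_scores = rng.normal(1.0, 1.0, size=30)    # scores of in-class samples
out_scores = rng.normal(0.0, 1.0, size=50)   # scores of out-of-class samples

# Definition from the Remark: probability that a random in-class /
# out-of-class pair is ranked correctly (ties counted as half).
pairwise = ((in_scores[:, None] > out_scores[None, :]).mean()
            + 0.5 * (in_scores[:, None] == out_scores[None, :]).mean())

# The same number via the rank-sum (Mann-Whitney U) formulation.
all_scores = np.concatenate([in_scores, out_scores])
ranks = all_scores.argsort().argsort() + 1.0   # ranks 1..N (no ties here)
n_in, n_out = len(in_scores), len(out_scores)
u = ranks[:n_in].sum() - n_in * (n_in + 1) / 2
rank_based = u / (n_in * n_out)

print(np.isclose(pairwise, rank_based))  # True: the formulations agree
```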

The left side of Figure 2.3 demonstrates that the ratio is indeed a good indicator of heterogeneity and can predict low AUROC values. It is worth mentioning that a significant number of classes have an AUROC value below 0.5, where the rankings are essentially inverted. Table 2.1 shows some of these classes, all of which correspond to highly heterogeneous classes, as indicated by their names.