
Multiple kernel data and knowledge fusion methods in drug discovery


PhD dissertation by

Bence Bolgár

Advisor

Péter Antal, PhD (BME)


Budapesti Műszaki és Gazdaságtudományi Egyetem
Villamosmérnöki és Informatikai Kar
Méréstechnika és Információs Rendszerek Tanszék

Budapest University of Technology and Economics
Faculty of Electrical Engineering and Informatics
Department of Measurement and Information Systems
H-1117 Budapest, Magyar tudósok körútja 2.


Declaration of own work and references

I, Bence Bolgár, hereby declare that this dissertation and all results claimed therein are my own work and rely solely on the references given. All segments taken word for word, or with the same meaning, from other sources have been clearly marked as citations and included in the references.

Nyilatkozat önálló munkáról, hivatkozások átvételéről

Alulírott Bolgár Bence kijelentem, hogy ezt a doktori értekezést magam készítettem és abban csak a megadott forrásokat használtam fel. Minden olyan részt, amelyet szó szerint, vagy azonos tartalomban, de átfogalmazva más forrásból átvettem, egyértelműen, a forrás megadásával megjelöltem.

Budapest, 2017. 09. 22.

Bolgár Bence


Acknowledgements

First and foremost I would like to express my gratitude to my advisor Dr. Péter Antal for his continuous support and guidance. His ideas, friendly advice and genuine research enthusiasm helped me greatly in widening my horizon on my way to becoming a researcher. I am also grateful to my family for their love and understanding through the years. A special thanks goes to my colleagues at the COMBINE workgroup and in the Department of Organic Chemistry, Ádám Arany, Dr. Balázs Balogh and Prof. Péter Mátyus, without whom this work would not have been possible. I thank my teacher, Edit Kézdy, for inspiring me towards the natural sciences. And finally, I thank all my friends in my boy scout troop, for making these years much more exciting than they should have been.


Summary

Large-scale data and knowledge integration is a central challenge in modern artificial intelligence and machine learning research. A prominent application area with staggering amounts of heterogeneous data and knowledge is drug discovery, especially drug repositioning and drug–target interaction prediction. Chemical structure, target profiles, side effects, drug usage patterns or gene expression measurements can become essential information sources aiding prediction, however, their joint utilization in computational models is still an open question. In this dissertation, we explore computational data and knowledge fusion methods to cope with these challenges; in particular, we investigate learning with multiple kernels, which, imitating analogical reasoning, utilizes the language of similarities.

The first part of the dissertation focuses on the drug repositioning task, in which we develop a general workflow for finding already approved drugs for application in new indications. The basis of the algorithm is the one-class adaptation of $\ell_p$-regularized multiple kernel Support Vector Machines (Thesis I.1), for which we present a theoretical analysis, numerical evaluation (Thesis I.2) and application methodology, clarifying its benefits and limitations.

The second part explores integrating numerical descriptive and logical network-based knowledge in biomedical knowledge fusion. We propose a solution which uses pairwise distances as a common ground; in particular, we describe a multiple kernel-based distance metric learning algorithm, capable of incorporating pairwise equivalence relations and entity-wise similarity matrices (Thesis II.1). We derive an efficient strategy to optimize the convex objective function of the method using a stochastic gradient projection algorithm, along with its GPU implementation (Thesis II.2). Qualitative and quantitative results show that this approach is on par with earlier one-class models and outperforms earlier distance metric learning algorithms in terms of predictive performance, while providing a consistent multi-class solution (Thesis II.3).

In the third part, we investigate the fusion of heterogeneous information sources in the case of predicting a large number of drug–target interactions simultaneously. In particular, we extend the Bayesian matrix factorization technique along domain-specific, knowledge-intensive directions, e.g. integrating entity-wise background knowledge using Gaussian Process priors, integrating interaction-wise background knowledge and using explicit non-random missingness models (Thesis III.2). We also develop a Bayesian multiple kernel adaptation of an earlier logistic matrix factorization method, which unifies the advantages of multiple kernel learning, weighted observations, Laplacian regularization, and explicit modeling of probabilities of binary drug–target interactions (Thesis III.1). We derive a variational approximation scheme and implement it on the GPU, leading to a significant increase in computational performance. We show that both methods achieve better predictive performance in standard benchmarks compared to earlier methods (Thesis III.3) and discuss their ability to predict promiscuity and druggability (Thesis III.4).


Összefoglaló

A nagyléptékű adat- és tudásintegráció a modern mesterséges intelligencia és gépi tanulási kutatások egyik központi kihívása, amely kiemelkedő fontossággal bír a bioinformatikában és az in silico gyógyszerkutatásban is. Különösen a gyógyszer-újrapozicionálás és gyógyszer-célpont interakciók predikciójának területén jelent meg hatalmas, heterogén adat- és tudásmennyiség, amely magába foglal például kémiai szerkezeti leírásokat, mellékhatás- és célpontprofilokat, gyógyszerhasználati adatokat, génexpressziós vizsgálatok eredményeit. Ezen heterogén „tudásdarabkák” kombinációja, együttes felhasználása számítógépes modellekben azonban máig megoldatlan kérdés. Az értekezésben adat- és tudásfúziós eljárásokat, pontosabban többkerneles tanulási módszereket vizsgálunk meg, amelyek az analogikus gondolkodás mintájára a hasonlóságokat használják fel közös nyelvként.

A disszertáció első része egy általános munkafolyamatot ad a gyógyszer-újrapozicionálás, vagyis már törzskönyvezett gyógyszerek új indikációkban történő alkalmazhatóságának jóslására, amelynek alapját az $\ell_p$-regularizált többkerneles szupportvektor-gépek egyosztályos adaptációja adja (I.1. tézis).

Megmutatjuk az algoritmus előnyeit és alkalmazhatóságának határait elméleti szempontból és numerikus kísérletek útján is (I.2. tézis), valamint alkalmazását egy általános munkafolyamatban.

A második rész a leíró és hálózat-jellegű tudás integrációját tárgyalja az orvosbiológiai tudásfúzióban. A javasolt megoldásunk a fúziót páronkénti hasonlóságokon keresztül valósítja meg. Ebben a fejezetben egy többkerneles metrika-tanulási algoritmust fejlesztünk, amely képes hasonlósági mátrixokat és páronkénti ekvivalencia-relációkat befogadni (II.1. tézis). A feladat egy konvex optimalizációs problémára vezet, amelynek megoldására egy sztochasztikus gradiens-projekciós eljárást dolgozunk ki, amelynek bemutatjuk és kiértékeljük GPU-alapú implementációját is (II.2. tézis). A kvalitatív és kvantitatív kiértékelés során megmutatjuk, hogy az algoritmus prediktív teljesítmény szempontjából felveszi a versenyt a korábbi egyosztályos modellekkel, miközben egy konzisztens, többosztályos megoldást ad; teljesítménye egyúttal felülmúlja a korábbi metrika-tanulási algoritmusokét (II.3. tézis).

A harmadik rész a heterogén információforrások fúzióját tárgyalja nagy számú gyógyszer–célpont interakció együttes jóslásában. Kiterjesztjük a bayesi mátrixfaktorizációs módszertant entitásszintű háttértudás befogadására Gauss-folyamatok felhasználásával, valamint megoldást adunk az interakció-szintű háttértudás integrációjára és nem véletlenszerű hiányzás kezelésére (III.2. tézis). Leírjuk továbbá egy korábbi logisztikus mátrix-faktorizációs módszer többkerneles, bayesi adaptációját, amely egyesíti a többkerneles tanulás, súlyozott megfigyelések, Laplace-típusú regularizáció előnyeit, valamint lehetővé teszi bináris gyógyszer-célpont interakciók valószínűségének explicit modellezését (III.1. tézis). A modellben való hatékony következtetéshez kidolgozunk egy variációs approximációs sémát, amelynek GPU-alapú implementációja a számítások jelentős gyorsulását eredményezi. A numerikus kiértékelés során megmutatjuk, hogy mindkét módszer prediktív teljesítménye meghaladja a korábbi módszerekét standard benchmark adathalmazokon (III.3. tézis), valamint lehetővé teszik gyógyszer-promiszkuitás jóslását (III.4. tézis).


Contents

List of Figures
List of Tables
Notation

1 Introduction
1.1 Knowledge and data fusion
1.2 Machine learning
1.2.1 Supervised learning
1.2.2 Inference in Bayesian models
1.3 Computational drug discovery
1.3.1 Representation of compounds
1.3.2 Similarity and virtual screening
1.3.3 Drug–target interaction prediction

2 Prioritization with Multiple Kernel Support Vector Machines
2.1 Background
2.1.1 Data fusion through linear kernel combinations
2.1.2 Prioritization with one-class SVMs
2.1.3 Enrichment analysis
2.2 Prioritization using $\ell_p$-regularized MKL
2.3 Analysis
2.3.1 Effect of normalization on kernel weights
2.3.2 Effect of heterogeneity on ranking performance
2.4 Application in drug repositioning
2.5 Summary

3 Fusion of multiple kernels and equivalence relations
3.1 Background
3.1.1 Distance Metric Learning
3.1.2 Earlier works
3.2 The MKL-DML algorithm
3.2.1 Derivation of the MKL-DML dual
3.2.2 Additional bounds on $\alpha$
3.2.3 Regression formulation
3.3 Solving by parallel gradient projection
3.3.1 Implementation
3.3.2 Computational performance
3.4 Experiments
3.5 Summary

4 Bayesian matrix factorization with multiple kernels
4.1 Background
4.1.1 Matrix factorization for the analysis of dyadic data
4.1.2 Earlier works
4.2 Variational Bayesian Multiple Kernel Logistic Matrix Factorization
4.2.1 Probabilistic model
4.2.2 Variational approximation
4.3 A Bayesian matrix factorization model with non-random missing data
4.3.1 Probabilistic model
4.3.2 Inference using Gibbs Sampling
4.4 Experiments
4.4.1 Benchmark datasets and settings
4.4.2 Results
4.5 Discussion
4.6 Summary

Bibliography

List of Figures

1.1 Knowledge fusion systems in bioinformatics
1.2 Geometric interpretation of the two-class Support Vector Machine
1.3 Mean field variational Bayesian inference
1.4 Illustration of slice sampling
2.1 Prioritization with one-class SVMs
2.2 Kernel weights and the enclosing ball in the RKHS
2.3 ISS/UAS ratio and AUROC values
2.4 A general prioritization workflow
3.1 Overview of DML-MKL
3.2 Profile data for CPU and GPU implementations
3.3 Scaling properties of CPU and GPU implementations
4.1 Overview of the matrix factorization workflow for DTI prediction
4.2 Matrix factorization with side information
4.3 Graphical models corresponding to PMF and BPMF
4.4 Probabilistic graphical representation of the logistic matrix factorization model
4.5 Structure of the precision matrix $\Lambda$
4.6 Matrix factorization with Gaussian Processes
4.7 Bayesian matrix factorization with non-random missing data
4.8 Bump function
4.9 AUPR values on the three smallest datasets with varying number of latent factors
4.10 The effect of priors on predictive performance with increasing sample sizes
4.11 Geweke–Brooks plots demonstrating the convergence
4.12 Correlation between $U$ and the inner product values of $K$
4.13 Latent representations of drugs and targets in the Ion Channel dataset
4.14 Parallel coordinates visualization of the latent representations with $L = 10$ in the Ion Channel dataset
4.15 Number of known targets of each drug in the datasets vs. the expected number of interactions as predicted by VB-MK-LMF
4.16 Expected number of interactions as predicted by VB-MK-LMF for each drug in the GPCR dataset
4.17 Runtime of the GPU and CPU implementations in terms of the number of latent factors in a 200×200 matrix factorization task

List of Tables

1.1 Availability of information sources in computational drug repositioning compared to HTS
2.1 ATC classes with the lowest AUROC values
3.1 AUC values for various drug classes
4.1 Dimensions of the benchmark datasets utilized in the experiments
4.2 Single-kernel results on binary data sets
4.3 AUPRC values on binary data sets in the multiple kernel setting
4.4 Normalized kernel weights with an extra positive definite, unit-diagonal, random valued kernel matrix
4.5 RMSE values in the drug–target interaction prediction task
4.6 Top 5 predicted interactions which are not present in the datasets

Notation

Throughout this dissertation, scalars are denoted by lowercase letters ($y$), vectors are denoted by bold lowercase letters ($\mathbf{x}$) and matrices are denoted by bold uppercase letters ($\mathbf{K}$). Sets and vector spaces are denoted by calligraphic letters ($\mathcal{X}$).

$\mathbf{x}_i$   Feature vector of the $i$th training sample
$y_i$   Label of the $i$th training sample
$d_i$   Squared distance of the $i$th training sample pair in metric learning
$P$   The number of training samples
$\mathbf{X}$   The collection of all training feature vectors
$\mathbf{Y}$   The collection of all training labels
$\mathbf{0}$   Zero vector
$\mathbf{1}$   Vector of ones
$\|\cdot\|_p$   $\ell_p$ norm
$\|\cdot\|_F$   Frobenius norm
$d(\cdot,\cdot)$   Distance function
$\mathcal{F}$   Reproducing Kernel Hilbert Space (RKHS)
$\langle\cdot,\cdot\rangle_{\mathcal{F}}$   Inner product in the RKHS $\mathcal{F}$
$\|\cdot\|_{\mathcal{F}}$   Norm in the RKHS $\mathcal{F}$
$k(\cdot,\cdot)$   Kernel function
$\phi(\cdot)$   Reproducing kernel map
$\ell$   Loss function
$\mathcal{N}$   Normal distribution
$\mathrm{Ga}$   Gamma distribution
$\mathrm{IG}$   Inverse Gamma distribution
$\mathcal{W}$   Wishart distribution
$\mathrm{Bern}$   Bernoulli distribution
$\mathbb{E}$   Expected value
$\mathbb{V}$   Variance
$q(\cdot)$   Variational distribution
$\mathcal{L}$   Evidence lower bound
$\mathrm{KL}(\cdot\|\cdot)$   Kullback–Leibler divergence


1 Introduction

There is little doubt that information pervades our everyday lives. We experience unprecedented amounts of information, e.g. through the Internet or social media. Moreover, knowingly or unknowingly, we also share immense amounts of information about ourselves and those around us. In the academic world, measurement technologies in physics and biomedical research produce an exponentially growing amount of experimental results, deposited in publicly or commercially available databases. Aiming to analyze these huge pools of heterogeneous data, data science has grown into an industry in its own right, holding great promise and also great threats. A central challenge of this industry is integrating heterogeneous information sources. Computational platforms for integration are constantly being developed and utilized in various contexts, from biomedical research through targeted advertising to mundane applications, such as self-driving cars, or more sinister uses.

However, there is a direct (arguably, the only) way for humans to experience information fusion, namely through the sensory-motoric data integration performed by the brain. The brain is remarkably good at integrating these heterogeneous data into a coherent subjective experience. Moreover, it is particularly good at inductively creating a hierarchy of higher-level concepts by finding common structures between experiences, or in other words, analogies. It seems that this kind of reasoning is more central to our everyday thinking and creative scientific discoveries than deductive arguments.

Coining the terms "demonstrative reasoning" (deduction) and "plausible reasoning" (a general concept including induction and analogy), the Hungarian mathematician George Pólya writes in Mathematics and Plausible Reasoning, page 5:

The difference between the two kinds of reasoning is great and manifold. Demonstrative reasoning is safe, beyond controversy, and final. Plausible reasoning is hazardous, controversial and provisional. Demonstrative reasoning penetrates the sciences just as far as mathematics does, but it is in itself (as mathematics is in itself) incapable of yielding essentially new knowledge about the world around us. Anything new that we learn about the world involves plausible reasoning, which is the only kind of reasoning for which we care in everyday affairs. Demonstrative reasoning has rigid standards, codified and clarified by logic (formal or demonstrative logic), which is the theory of demonstrative reasoning. The standards of plausible reasoning are fluid, and there is no theory of such reasoning that could be compared to demonstrative logic in clarity or would command comparable consensus.

Pólya is quick to point out the inherently uncertain nature of inductive and analogical arguments and pinpoints it as the real source of mathematical discovery. Undoubtedly, many (if not most) great discoveries have been made not through formal arguments but through insight and guesswork, and proved rigorously afterwards. Ironically, these ideas have a tendency to appear suddenly and with a sense of certainty. As the French mathematician Jacques Hadamard writes in The Psychology of Invention in the Mathematical Field, page 18, about Poincaré:

At the moment when I put my foot on the step, the idea came to me, without anything in my former thoughts seeming to have paved the way for it, that the transformations I had used to define the Fuchsian functions were identical with those of non-Euclidian geometry. I did not verify the idea; I should not have had time, as, upon taking my seat in the omnibus, I went on with a conversation already commenced, but I felt a perfect certainty. On my return to Caen, for conscience sake, I verified the result at my leisure. [. . .] Then I turned my attention to the study of some arithmetical questions apparently without much success and without a suspicion of any connection with my preceding researches. Disgusted with my failure, I went to spend a few days at the seaside and thought of something else.

One morning, walking on the bluff, the idea came to me, with just the same characteristics of brevity, suddenness and immediate certainty, that the arithmetic transformations of indefinite ternary quadratic forms were identical with those of non-Euclidean geometry.

He also gives similar accounts from other fields, by great minds like Langevin, Ostwald, Kekulé, Mozart and many others. Conforming to the views of psychology in that era, Hadamard ties these discoveries to unconscious thinking processes. It would seem that the most successful computational approaches to discovery exhibit similar characteristics in order to produce "new" knowledge, namely induction and analogical reasoning, handling uncertainty, similarities and integration. In this dissertation, we explore technical counterparts of these ideas in the realm of machine learning, with a special focus on the integration of knowledge and data, and we apply them in the field of drug discovery. The rest of this introduction proceeds as follows. Section 1.1 presents a quick overview of knowledge and data fusion in artificial intelligence. Section 1.2 reviews the concepts and machinery of machine learning which will be applied throughout this dissertation. Section 1.3 provides a brief introduction to chemoinformatics and drug discovery.

1.1 Knowledge and data fusion

The problem of integrating heterogeneous data and knowledge has been appearing in many guises in the last sixty years, with a large number of disciplines developing their own theoretical and practical tools to solve it. This includes control theory, the database community, machine learning, artificial intelligence and many other fields. In particular, building large-scale knowledge bases which integrate heterogeneous knowledge from several sources has been a focal point of artificial intelligence research for many years, involving topics like question answering, automatic reasoning, expert or decision support systems.

A central question in knowledge fusion is that of representation, i.e. the development of a "common language" with sufficient expressive power to encode various types of knowledge while still allowing efficient inference on the knowledge base. Depending on the requirements, many formal systems have been devised to reach this goal, which differ in their capability to represent complex relations, handle incompleteness, contradictions and uncertainty, or offer computationally efficient inference algorithms. For example, a widely utilized framework is formal logic, which is at the heart of many knowledge-based, domain-specific expert systems, such as legal or medical expert systems. The data and knowledge fusion frameworks directly relevant to our work are Bayesian statistics, semantic integration and computational approaches.


Figure 1.1. Knowledge fusion systems in bioinformatics. The colors correspond to (from top to bottom) pathway analysis, data mining systems, general workflow systems, programming environments, semantic integration and computational data fusion systems.

Probabilistic approaches. The Bayesian framework utilizes the language of probabilities to provide a normative way of combining uncertain background knowledge with observations. In this setting, background knowledge is translated into a priori probability distributions and the model takes the form of a joint probability distribution (cf. generative models in Section 1.2). Besides handling uncertainty in a direct and axiomatic way, Bayesian models can be utilized to perform all kinds of reasoning: deductive, inductive and especially non-monotonic reasoning (i.e. handling incomplete information, as earlier conclusions can be invalidated by new evidence). Theoretically, inference in Bayesian models is straightforward, since the marginal of any variable can be computed from the joint distribution. In practice, however, this usually leads to intractable integrals which cannot be evaluated directly. As we will show in Section 1.2.2, there are still efficient algorithms for approximate inference.

Chapter 4 applies a Bayesian approach to fusion in the context of drug–target interaction prediction.

Semantic graphs. Semantic graphs are somewhat of a black sheep among knowledge representation schemes since they do not come with inference algorithms by default; however, they provide a much more flexible way of encoding complex relations. In a way, they resemble graphs in the original graph-theoretic sense, but their vertices and edges are augmented with semantic information, i.e. they can belong to different types and can have multiple attributes. In practice, vertices represent arbitrary concepts and edges represent relations between them, with the database schema defining the relations which can exist in the database. A commonly used semantic framework for knowledge management is the Resource Description Framework, which utilizes subject–predicate–object triplets as building blocks; the graph database (also called a triple store) can be queried using semantic query languages, such as SPARQL. An example for a graph database in the drug discovery context is the Open PHACTS database [1], which will be discussed in more detail in Section 1.3.

Case-based reasoning using similarities. Besides probabilities, the most often used unifying language in machine learning is that of similarities (or distances), first appearing in case-based reasoning and nearest neighbor schemes [2], and currently popular under the guise of kernel methods. Similarities became a focal point of fusion research with the pioneering work of Pavlidis et al., which, utilizing the essence of kernel methods, established three categories of combination schemes: early, late and intermediate integration [3]. Early integration corresponds to the data-level fusion of vectorial representations by simple concatenation. Late integration is performed at the decision level, i.e. the fusion takes place after the inference is completed and the results are combined, e.g. by voting methods. The most involved of the three is intermediate integration, where some intermediate representation of the data is used for combination. Usually, this means a weighted average of similarity (kernel) matrices or network integration. The methods in Chapters 2 and 3 apply the intermediate integration strategy to combine multiple representations of drug molecules.
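To make the intermediate integration strategy concrete, the sketch below (a minimal illustration with made-up data and weights, not the workflow developed in later chapters) forms a convex weighted average of several precomputed kernel matrices, which can then be passed to any kernel method:

```python
import numpy as np

def combine_kernels(kernels, weights):
    """Convex combination of precomputed kernel matrices (intermediate integration)."""
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0), "kernel weights are assumed non-negative"
    weights = weights / weights.sum()          # normalize to a convex combination
    return sum(w * K for w, K in zip(weights, kernels))

# Toy example: three "views" of the same 4 samples (e.g. chemical, target and side-effect kernels)
rng = np.random.default_rng(0)
views = []
for _ in range(3):
    X = rng.normal(size=(4, 5))
    views.append(X @ X.T)                      # one linear kernel per representation
K = combine_kernels(views, weights=[0.5, 0.3, 0.2])
print(K.shape)                                 # (4, 4): a single fused kernel matrix
```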

1.2 Machine learning

Here we overview the concepts and theory necessary to follow the rest of the dissertation. Theorems and propositions are given without proofs but with references to the original work. A more comprehensive introduction to machine learning can be found in [4].

1.2.1 Supervised learning

In this Section, we introduce the framework and notation which will be used throughout this work; in particular, we overview a formal treatment of inductive reasoning. Although there are multiple axiomatic systems capable of grasping the inherent uncertainty of inductive reasoning, most notably Dempster–Shafer theory, the fuzzy approach and Bayesian statistics, it is the latter which has become somewhat of a standard language for machine learning. Let us assume we are given a set $\{\mathbf{x}_i\}_{i=1}^P$ which encodes our observations in some way. In practice, the goal is to recognize "patterns" among these observations, something we can later utilize to make predictions about further events. The only way for this task to make sense is to agree on some input space $\mathcal{X}$, from which all the observations come, and exploit its structure during pattern recognition. $\mathcal{X}$ takes many forms throughout the machine learning literature, but by far the most popular choice is that of a real vector space endowed with an inner product, which allows doing geometry on $\mathcal{X}$ (as almost all machine learning algorithms do). Whenever the full power of Euclidean geometry is not required, Banach spaces and (pseudo)metric spaces suffice (e.g. for the majority of clustering algorithms). However, this choice is more a matter of practical necessity than of theoretical considerations. The real vector spaces used in this thesis are easy to work with, even though conceptually it seldom makes sense to "add" or "scale" vectorial representations of observations. However, more exotic input spaces have already appeared and can play a vital role in grasping human analogical reasoning and creative thinking.

Example 1.2.1. Topological spaces. Using only topology in data analysis is a very exciting and rapidly developing field in machine learning. Coined persistent homology, these methods utilize the language of simplicial homology, and aim to give a general overview of the "shape" of the underlying reality by extracting stable topological features from the data, e.g. connected components, the number of "holes" and their behavior [5].

Example 1.2.2. Riemannian manifolds have a much more flexible notion of geometry, which one gets by switching from the Euclidean framework to smooth manifolds and replacing the global inner product with a metric tensor. The idea is that the underlying system generating the data usually has far fewer degrees of freedom than the dimension of the Euclidean space traditionally used for representation. Manifold learning methods aim to find and exploit this smooth manifold [6]. However, the real power of Riemannian geometry comes into play when it is applied to the space of models, in particular, probability distributions. A family of probability distributions can be given the structure of a smooth manifold (called a statistical manifold) which has a non-Euclidean geometry with the Fisher information metric playing the part of the metric tensor. The collective term for these methods is information geometry [7].

Example 1.2.3. Groups. Some machine learning methods, especially kernel techniques, can operate on these much more general objects with non-trivial structure. The common ground between the machine learning side and the group theoretic side is group Fourier analysis, which is a generalization of classical Fourier analysis using representation theory and character theory, and also forms the basis of the tools of regularization theory in kernel methods. In particular, group kernels can be utilized to perform learning on non-Abelian groups, such as the symmetric group, which can be applied in learning-to-rank tasks [8].

So far, we have been talking about machine learning in general. In classification, which is a form of supervised learning, every observation $\mathbf{x}_i$ has a corresponding label $y_i \in \mathcal{Y}$. Informally, the goal is to build a predictive model which is capable of inferring the label of any unknown observation $\mathbf{x} \in \mathcal{X}$. The collection $\{(\mathbf{x}_i, y_i)\}_{i=1}^P$, which forms the basis for induction, is called the training set. Based on $\mathcal{Y}$, we can distinguish between prototypical supervised machine learning problems, most notably:

• $\mathcal{Y} = \{-1, +1\}$ (binary classification), e.g. diagnosing a disease from laboratory results. We are given a database of sick ($y_i = -1$) and healthy ($y_i = +1$) patients with corresponding lab results ($\mathbf{x}_i \in \mathcal{X}$). The goal is to build a model which can predict whether a new patient is healthy, given his lab results $\mathbf{x} \in \mathcal{X}$. In this setting, $\mathcal{X}$ might be chosen as $\mathbb{R}^D$, i.e. each lab report consists of $D$ real numbers.

• $\mathcal{Y} = \mathbb{R}$ (regression), e.g. estimating the price of a used car. The training set contains data about previously sold cars, including their attributes $\mathbf{x}_i$ and sale price $y_i$. The goal is to build a model to establish a fair price for a car, given its attributes $\mathbf{x}$.

Remark. $\mathcal{Y}$ can also vary greatly with the application domain, including vector-valued outputs, probability distributions, sequences, etc. In Chapter 3, we present an algorithm which learns a metric subject to a priori constraints imposed on pairwise distances.

Now what remains is to describe what kind of models are employed in supervised learning. Based on their fundamental strategy, models come in three classes:

• Generative models. The joint distribution $p(\mathbf{X}, \mathbf{Y})$ is modeled, where $\mathbf{X}$ is the collection of observations $\{\mathbf{x}_i\}_{i=1}^P$ and $\mathbf{Y}$ is the collection of labels $\{y_i\}_{i=1}^P$. Generative models are usually parametric, i.e. there are explicit assumptions on the form and parameters of the joint distribution. To compute the predictive distribution $p(y \mid \mathbf{x}, \mathbf{X}, \mathbf{Y})$ for a new observation $\mathbf{x}$, Bayes' rule is used. In an observational setting without interventions, $p(\mathbf{X}, \mathbf{Y})$ encodes everything there is to know about the reality producing the data, hence the name; given the "correct" joint distribution, one could generate further observations. Generative models can also act as causal models which predict how the world would change if there was an intervention.

• Discriminative models do not have such generative qualities. Only the conditional distribution $p(\mathbf{Y} \mid \mathbf{X})$ is modeled and used to compute a predictive distribution. In many applications, this is perfectly fine, as long as the intricate inner workings of the underlying system generating the input observations do not influence its overall behavior in the predictive setting. In these cases, when explicitly modeling the inner dependencies would be a waste of resources, discriminative models can achieve superior predictive and computational performance.

• Decision functions. The model takes the form of a function $f \in \mathcal{F}$, $f : \mathcal{X} \to \mathcal{Y}$, for which $f(\mathbf{x}_i) \sim y_i$ for all elements of the training set. These methods are usually nonparametric, i.e. there is no fixed number of parameters to tune. Although they do not even yield probabilistic estimates on the label, many of the best-performing machine learning methods fall into this category. In the next Section, we overview a framework with a strong theoretical background.

1.2.1.1 Kernel methods

The choice of the function class $\mathcal{F}$ in the decision function approach is not trivial. Chosen too large, functions might just "memorize" the training samples and perform no real learning at all, e.g. polynomials with a sufficiently large degree can be tuned to exactly match all training samples, but do not generalize to other samples. Chosen too small, the class may lack the capacity to learn the information embedded in the data. Kernel methods use the space
$$\mathcal{F} = \left\{ f \;\middle|\; f = \sum_{i=1}^P \alpha_i k(\mathbf{x}_i, \cdot) \right\},$$
where $\alpha_i \in \mathbb{R}$ and $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is the kernel function, which is chosen such that the matrix $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ is symmetric and positive definite. The following proposition establishes an inner product and makes $\mathcal{F}$ into a Hilbert space, which is essential for kernel methods.

Proposition 1.2.1. Let $f, g \in \mathcal{F}$, $f = \sum_{i=1}^P \alpha_i k(\mathbf{x}_i, \cdot)$ and $g = \sum_{j=1}^P \beta_j k(\mathbf{x}_j, \cdot)$. Then
$$\langle f, g \rangle_{\mathcal{F}} = \sum_{i=1}^P \sum_{j=1}^P \alpha_i \beta_j k(\mathbf{x}_i, \mathbf{x}_j) \tag{1.1}$$
is an inner product on $\mathcal{F}$, i.e. it is symmetric, bilinear and positive definite. Moreover, $\mathcal{F}$ can be completed to form a Hilbert space together with the inner product $\langle \cdot, \cdot \rangle_{\mathcal{F}}$.

Definition 1.2.1 (Reproducing property). The reproducing property is an immediate consequence of (1.1):
$$\langle f, k(\mathbf{x}, \cdot) \rangle_{\mathcal{F}} = f(\mathbf{x}). \tag{1.2}$$
Due to this property, $\mathcal{F}$ is called the Reproducing Kernel Hilbert Space (RKHS) corresponding to $k$.

Remark. An immediate consequence of the reproducing property is
$$\langle k(\mathbf{x}_i, \cdot), k(\mathbf{x}_j, \cdot) \rangle_{\mathcal{F}} = k(\mathbf{x}_i, \mathbf{x}_j).$$


This gives rise to the kernel trick. Whenever an algorithm is formulated solely in terms of inner products, one can replace this inner product with any kernel function $k$ to get an alternative algorithm. Equivalently, one can switch from samples in the input space $\mathbf{x} \in \mathcal{X}$ to $\phi(\mathbf{x}) = k(\mathbf{x}, \cdot) \in \mathcal{F}$ just by defining a suitable kernel function, without changing anything else in the algorithm. In particular, as the RKHS can be infinite dimensional, one can implicitly work with infinite-dimensional vectors without paying an infinite computational cost. For example, the Gaussian Radial Basis Function
$$k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left\{ -\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2 \right\}$$
corresponds to such an RKHS, i.e. $\phi(\mathbf{x})$ is now infinite dimensional. This trick can be utilized to "non-linearize" machine learning algorithms by mapping the samples to a higher-dimensional space via $\phi$, where linear models correspond to more complex non-linear models in the original space $\mathcal{X}$.

Another important feature of the inner product $\langle \cdot, \cdot \rangle_{\mathcal{F}}$ is that it allows the computation of a norm for any $f \in \mathcal{F}$ as
$$\|f\|_{\mathcal{F}} = \sqrt{\langle f, f \rangle_{\mathcal{F}}},$$
which can be utilized in regularized risk minimization problems.
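As an illustration of the kernel trick and the RKHS norm above, the following sketch (made-up data; not code from the dissertation) computes a Gaussian RBF kernel matrix and the norm of a function $f = \sum_i \alpha_i k(\mathbf{x}_i, \cdot)$, which by (1.1) equals $\sqrt{\boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha}}$:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian RBF kernel matrix K[i, j] = exp(-gamma * ||x_i - y_j||^2)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-gamma * sq_dists)

# Toy data: 5 samples with 3 features each
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K = rbf_kernel(X, X, gamma=0.5)

# RKHS norm of f = sum_i alpha_i k(x_i, .) via (1.1): ||f||_F^2 = alpha^T K alpha
alpha = rng.normal(size=5)
rkhs_norm = np.sqrt(alpha @ K @ alpha)
print(K.shape, rkhs_norm)
```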

1.2.1.2 Regularized risk minimization

Definition 1.2.2. A loss function $\ell : \mathcal{X} \times \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$ maps a sample, an expected label and a predicted label to a non-negative real number, with the property $\ell(\mathbf{x}, y, y') = 0$ whenever $y = y'$. $\ell$ essentially measures the disagreement between the expected and predicted label for any sample.

Definition 1.2.3. The expected risk for a function $f \in \mathcal{F}$ is given by
$$R[f] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(\mathbf{x}, y, f(\mathbf{x}))\, dP(\mathbf{X}, \mathbf{Y}).$$

Since we do not know the "true" distribution $p(\mathbf{X}, \mathbf{Y})$, this definition is of little use. However, we can aim for the minimization of a similar quantity using an empirical estimate of $p$, based on the training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^P$.

Definition 1.2.4. The empirical risk for a function $f \in \mathcal{F}$ is given by
$$R_{\mathrm{emp}}[f] = \frac{1}{P} \sum_{i=1}^P \ell(\mathbf{x}_i, y_i, f(\mathbf{x}_i)).$$

Example 1.2.4 (Least squares). Here we point out a connection between the discriminative model and decision function approaches. Let us consider the case of regression, i.e. $\mathcal{Y} = \mathbb{R}$. We specify a discriminative model as
$$y_i = f(\mathbf{x}_i) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(\varepsilon \mid 0, \beta^{-1}),$$
where $\mathcal{N}$ denotes the Normal distribution. That is, we use the function $f \in \mathcal{F}$ and give a probabilistic treatment of the relation $f(\mathbf{x}_i) \sim y_i$ by adding Gaussian noise $\varepsilon$. Now the conditional reads
$$p(y_i \mid \mathbf{x}_i, f, \beta) = \mathcal{N}(y_i \mid f(\mathbf{x}_i), \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}} \exp\left\{ -\frac{\beta}{2} \left( y_i - f(\mathbf{x}_i) \right)^2 \right\}.$$


We specify the loss function as
$$\ell(\mathbf{x}, y, f(\mathbf{x})) = -\ln p(y \mid \mathbf{x}, f, \beta).$$
This quantity is called the negative log-likelihood. Computing the logarithm of the Normal distribution and collecting the constant terms gives rise to the empirical risk
$$R_{\mathrm{emp}}[f] = \frac{\beta}{2P} \sum_{i=1}^P (y_i - f(\mathbf{x}_i))^2 + \mathrm{const.}$$

A couple of remarks are in order. Essentially, by minimizing the empirical risk, we search for a function $f$ which maximizes the log-likelihood, or, in other words, a function which is most likely to generate the observed data. For this reason, methods utilizing the negative log-likelihood in the loss function are also called Maximum Likelihood (ML) methods. We also recognize the mean squared error in the empirical risk, which comes from the probability density function of the Normal distribution.
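As a quick numerical illustration of this ML/least-squares correspondence (a toy sketch with synthetic data and a linear-in-features model; not from the dissertation), minimizing the mean squared error recovers the maximizer of the Gaussian log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
P, D = 50, 3
X = rng.normal(size=(P, D))
w_true = np.array([1.5, -2.0, 0.5])
beta = 10.0                                   # noise precision
y = X @ w_true + rng.normal(scale=beta**-0.5, size=P)

# Minimizing the empirical risk (mean squared error) = maximizing the Gaussian log-likelihood
w_ml, *_ = np.linalg.lstsq(X, y, rcond=None)

def neg_log_likelihood(w):
    resid = y - X @ w
    return 0.5 * beta * resid @ resid - 0.5 * P * np.log(beta / (2 * np.pi))

print(w_ml)                                                     # close to w_true
print(neg_log_likelihood(w_ml) <= neg_log_likelihood(w_true))   # the ML estimate attains the lower value
```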

Definition 1.2.5. The regularized risk for a function $f \in \mathcal{F}$ is defined as
$$R_{\mathrm{reg}}[f] = R_{\mathrm{emp}}[f] + \lambda \Omega[f],$$
where $\Omega[f]$ is called the regularization term and $\lambda$ is a real number controlling the trade-off between empirical risk minimization and regularization.

Example 1.2.5 (Regularized least squares). Let us consider the previous example with $f(\mathbf{x}_i) = \mathbf{w}^T \phi(\mathbf{x}_i)$, where we used the reproducing property (1.2) with $\phi(\mathbf{x}_i) = k(\mathbf{x}_i, \cdot)$. At this point, it is not yet clear why we have these finite representations; this fact will shortly be ascertained in Theorem 1.2.1. Taking a further step, let us specify a prior distribution on $\mathbf{w}$ as
$$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}).$$
Now we want to find $f$ (or equivalently, $\mathbf{w}$) which maximizes the posterior probability
$$p(\mathbf{w} \mid \mathbf{Y}, \mathbf{X}, \alpha, \beta) \propto p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha) = \prod_{i=1}^P p(y_i \mid \mathbf{x}_i, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha),$$
called a Maximum A Posteriori (MAP) solution. The negative log-posterior turns out to be
$$-\ln p(\mathbf{w} \mid \mathbf{Y}, \mathbf{X}, \alpha, \beta) = \frac{\beta}{2} \sum_{i=1}^P \left( y_i - \mathbf{w}^T \phi(\mathbf{x}_i) \right)^2 + \frac{\alpha}{2} \|\mathbf{w}\|^2 + \mathrm{const.} = \frac{\beta}{2} \sum_{i=1}^P \left( y_i - f(\mathbf{x}_i) \right)^2 + \frac{\alpha}{2} \|f\|_{\mathcal{F}}^2 + \mathrm{const.} \propto R_{\mathrm{emp}}[f] + \lambda \|f\|_{\mathcal{F}}^2,$$
which is the same as in Example 1.2.4 with an extra regularization term $\Omega[f] = \|f\|_{\mathcal{F}}^2$. Just as the ML solution corresponded to an empirical risk minimization problem, the MAP solution corresponds to a regularized risk minimization problem.


Intuitively, the point of regularized risk minimization is to penalize "complex" members of $\mathcal{F}$, which can be regarded informally as a mathematical implementation of Occam's principle. As in the previous example, the regularizer often takes the form of $\Omega(\|f\|_{\mathcal{F}})$. There is actually a fair amount of theory behind this choice, leading us back to the theory of Reproducing Kernel Hilbert Spaces. First formalized in the Kimeldorf–Wahba representer theorem, a more general result is due to Schölkopf and Smola [9]:

Theorem 1.2.1 (Representer theorem). Let $\Omega : [0, \infty] \to \mathbb{R}$ be a strictly monotonically increasing function and $\ell : (\mathcal{X} \times \mathbb{R} \times \mathbb{R})^P \to \mathbb{R}$ a loss function. Then each minimizer $f \in \mathcal{F}$ of
$$\ell\left( (\mathbf{x}_1, y_1, f(\mathbf{x}_1)), \ldots, (\mathbf{x}_P, y_P, f(\mathbf{x}_P)) \right) + \Omega\left( \|f\|_{\mathcal{F}} \right)$$
admits a representation of the form
$$f(\mathbf{x}) = \sum_{i=1}^P \alpha_i k(\mathbf{x}_i, \mathbf{x}),$$
where $k$ is the kernel function corresponding to $\mathcal{F}$.

The theorem ensures that even if the RKHS is infinite dimensional, the minimizer lies in the span of the kernel functions centered on the training samples. Most kernel methods are formulated in such a way that some form of the Representer Theorem holds. In particular, the Regularized Least Squares (RLS) problem in Example 1.2.5 admits a solution of this form. In Chapter 3, an analogous theorem will be presented in the context of metric learning.
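Putting the representer theorem and Example 1.2.5 together, the following sketch (illustrative only; the data, kernel and regularization weight are made up) fits the RLS solution in kernel form: by the representer theorem $f(\mathbf{x}) = \sum_i \alpha_i k(\mathbf{x}_i, \mathbf{x})$, and the coefficients solve the linear system $(\mathbf{K} + \lambda\mathbf{I})\boldsymbol{\alpha} = \mathbf{y}$.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

# Toy regression problem
rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(30, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=30)

lam = 0.1                                     # regularization weight (lambda)
K = rbf_kernel(X_train, X_train, gamma=0.5)

# Representer theorem: f(x) = sum_i alpha_i k(x_i, x); the RLS coefficients solve (K + lam*I) alpha = y
alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train)

X_test = np.linspace(-3, 3, 7)[:, None]
y_pred = rbf_kernel(X_test, X_train, gamma=0.5) @ alpha
print(np.round(y_pred, 2))
```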

1.2.1.3 Support Vector Machines

In their original formulation, Support Vector Machines are binary classifiers, i.e. $\mathcal{Y} = \{-1, +1\}^P$. The loss function is taken to be the hinge loss
$$\ell(\mathbf{x}, y, f(\mathbf{x})) = \max(0, 1 - y_i f(\mathbf{x}_i)),$$
which penalizes misclassifications, i.e. we incur a loss whenever $y_i = +1$ and $f(\mathbf{x}_i) = -1$ or vice versa. With the regularizer $\Omega[f] = \frac{1}{2}\|f\|^2$, the regularized risk is
$$R_{\mathrm{reg}}[f] = \frac{1}{P} \sum_{i=1}^P \max(0, 1 - y_i f(\mathbf{x}_i)) + \frac{\lambda}{2} \|f\|^2. \tag{1.3}$$
Although equivalent, (1.3) is not the form usually given in the machine learning literature. Utilizing the Representer Theorem and the reproducing property, we can cast the regularized risk minimization problem into the more familiar primal form
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^P \xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^T \phi(\mathbf{x}_i) + b) \ge 1 - \xi_i, \quad \boldsymbol{\xi} \ge \mathbf{0}.$$

It turns out that this formulation also has a very intuitive geometric interpretation, illustrated in Figure 1.2.

Figure 1.2. Geometric interpretation of the two-class Support Vector Machine. The goal is to learn a hyperplane $\mathbf{w}^T\phi(\mathbf{x}) + b = 0$ which separates the two classes with maximum margin, denoted by $\gamma$.

The optimization problem is equivalent to finding a hyperplane with $\mathbf{w}^T\phi(\mathbf{x}) + b = 0$ which separates the images of the training samples $\phi(\mathbf{x}_i)$ based on their respective classes $y_i$. In particular, we want $\mathbf{w}^T\phi(\mathbf{x}_i) + b \ge 1 - \xi_i$ for samples with $y_i = +1$ and $\mathbf{w}^T\phi(\mathbf{x}_i) + b \le -1 + \xi_i$ for samples with $y_i = -1$. $C$ controls the model complexity through the slack variables $\boldsymbol{\xi}$ corresponding to the soft-margin property, i.e. how tolerant the algorithm should be of samples on the "wrong side" of the hyperplane. Although the separation is linear, note that it takes place in the RKHS and can be nonlinear in $\mathcal{X}$. The dual problem is

$$\max_{\boldsymbol{\alpha}} \ \mathbf{1}^T\boldsymbol{\alpha} - \frac{1}{2}\boldsymbol{\alpha}^T\mathbf{Q}\boldsymbol{\alpha} \quad \text{s.t.} \quad \mathbf{0} \le \boldsymbol{\alpha} \le C, \quad \mathbf{y}^T\boldsymbol{\alpha} = 0,$$
where $Q_{ij} = y_i y_j K_{ij}$. The class of an unknown sample can be predicted as
$$y = \operatorname{sgn}\left( \sum_{i=1}^P \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) \right).$$
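For illustration (assuming scikit-learn is available; this is a generic sketch, not the prioritization method developed in Chapter 2), the SVM above can be trained with any precomputed positive definite kernel matrix, e.g. a chemical similarity matrix or a weighted combination of several such matrices:

```python
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(X, Y, gamma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

# Precomputed-kernel interface: any symmetric positive definite similarity matrix can be plugged in
# in place of the RBF kernel used here.
K_train = rbf_kernel(X, X, gamma=0.5)
clf = SVC(C=1.0, kernel="precomputed").fit(K_train, y)

X_new = rng.normal(0, 1, (5, 2))
K_new = rbf_kernel(X_new, X, gamma=0.5)      # kernel between new samples and training samples
print(clf.predict(K_new))                    # signs of sum_i alpha_i y_i k(x_i, x) + b
```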

1.2.2 Inference in Bayesian models

In the previous Section, we covered supervised learning, focusing on the discriminative setting. Now we give a very brief overview of fully Bayesian models. Let all observed variables be absorbed into a variable $\mathbf{X}$ and similarly, all hidden variables (including parameters) into $\mathbf{Z}$. Fully Bayesian models are, in a sense, always generative as they model the joint distribution $p(\mathbf{X}, \mathbf{Z})$ explicitly. Given the observations, obtaining probabilistic predictions of any hidden variable is simple in theory; one has to apply Bayes' rule:
$$p(\mathbf{Z} \mid \mathbf{X}) = \frac{p(\mathbf{X} \mid \mathbf{Z})\, p(\mathbf{Z})}{\int p(\mathbf{X} \mid \mathbf{Z})\, p(\mathbf{Z})\, d\mathbf{Z}},$$
where $p(\mathbf{X} \mid \mathbf{Z})$ is the data likelihood and $p(\mathbf{Z})$ is the prior on the hidden variables. Most of the time, this is intractable and one has to resort to approximate techniques.

In the next Section, we overview two techniques to perform inference in intractable models, which will be applied in Chapter 4.

• Variational Bayesian (VB) methods [10] maintain a family of probabilistic approximations $q(\mathbf{Z}) \in \mathcal{Q}$ and aim to select a member of $\mathcal{Q}$ which minimizes some divergence measure to $p(\mathbf{Z} \mid \mathbf{X})$. In general, these methods are not guaranteed to converge to the correct posterior, as $p(\mathbf{Z} \mid \mathbf{X})$ is usually not part of the variational family; in fact, $q(\mathbf{Z})$ is chosen to be sufficiently "simple" for effective optimization. In turn, they tend to be faster and more scalable than sampling methods.

• Sampling methods, such as Markov Chain Monte Carlo (MCMC) methods, approximate the posterior by drawing a large number of samples using random walk methods. In general, such methods require a large number of random samples, thus are more computationally intensive but asymptotically exact, i.e. they converge to the true posterior.

1.2.2.1 Variational Bayesian inference

Let $\mathcal{Q}$ denote a family of probabilistic approximations to $p(\mathbf{Z} \mid \mathbf{X})$. VB methods aim to find $q(\mathbf{Z}) \in \mathcal{Q}$ which minimizes the Kullback–Leibler divergence
$$q(\mathbf{Z}) = \operatorname*{arg\,min}_{q(\mathbf{Z}) \in \mathcal{Q}} \ \mathrm{KL}(q(\mathbf{Z}) \,\|\, p(\mathbf{Z} \mid \mathbf{X})).$$

Unfortunately, this quantity still depends on $p(\mathbf{X})$, which contains an intractable integral and therefore cannot be computed directly. Instead, VB aims to maximize a lower bound $\mathcal{L}(q(\mathbf{Z}))$ on the log model evidence $\ln p(\mathbf{X})$ with respect to $q(\mathbf{Z})$. In particular, $\ln p(\mathbf{X})$ can be decomposed as
$$\ln p(\mathbf{X}) = \ln \int p(\mathbf{X} \mid \mathbf{Z})\, p(\mathbf{Z})\, d\mathbf{Z} = \mathcal{L}(q(\mathbf{Z})) + \mathrm{KL}(q(\mathbf{Z}) \,\|\, p(\mathbf{Z} \mid \mathbf{X})), \tag{1.4}$$
where
$$\mathcal{L}(q(\mathbf{Z})) = \int q(\mathbf{Z}) \ln \left\{ \frac{p(\mathbf{X}, \mathbf{Z})}{q(\mathbf{Z})} \right\} d\mathbf{Z}, \qquad \mathrm{KL}(q(\mathbf{Z}) \,\|\, p(\mathbf{Z} \mid \mathbf{X})) = -\int q(\mathbf{Z}) \ln \left\{ \frac{p(\mathbf{Z} \mid \mathbf{X})}{q(\mathbf{Z})} \right\} d\mathbf{Z},$$
and $\mathcal{L}(q(\mathbf{Z}))$ is called the evidence lower bound (ELBO). Since $\ln p(\mathbf{X})$ does not depend on $\mathbf{Z}$ and $\mathrm{KL}(q \| p) \ge 0$, maximizing the ELBO is equivalent to minimizing the Kullback–Leibler divergence.

Remark. Although the Kullback–Leibler divergence is by far the most commonly used, there are other alternatives, such as the $\alpha$-divergence family, which contains the former as a special case [11]. In fact, information geometry has a very general differential geometric treatment of divergence functions, which includes $\alpha$-divergences, $f$-divergences, Bregman divergences and more [7].

In practice, one either restricts the family $\mathcal{Q}$ explicitly to have a "simple" parametric form and optimizes the parameters, or takes the mean field approach and factors $q(\mathbf{Z})$ as
$$q(\mathbf{Z}) = \prod_i q_i(z_i).$$

Figure 1.3. Mean field variational Bayesian inference. The update for $q_i(z_i)$ is computed by maximizing the ELBO, or equivalently, minimizing the KL divergence to $p(\mathbf{Z} \mid \mathbf{X})$.

A particularly nice property of this approach is that one does not have to assume the parametric form of the individual factors $q_i$. In many cases, it turns out that their optimal form is automatically determined by the model. Another convenient property is that the factors can approximate the marginal distributions of the latent variables $z_i$ directly, although VB cannot capture correlations between them.

Isolating the $i$th factor in the ELBO gives
$$\mathcal{L}(q_i(z_i)) = \int q_i(z_i)\, \mathbb{E}_{\mathbf{Z} \setminus z_i}[\ln p(\mathbf{X}, \mathbf{Z})]\, dz_i - \int q_i(z_i) \ln q_i(z_i)\, dz_i + \mathrm{const.} = -\mathrm{KL}(q_i(z_i) \,\|\, q_i^*(z_i)) + \mathrm{const.},$$
where
$$\mathbb{E}_{\mathbf{Z} \setminus z_i}[\ln p(\mathbf{X}, \mathbf{Z})] = \int \cdots \int \ln p(\mathbf{X}, \mathbf{Z}) \prod_{j \ne i} q_j(z_j)\, dz_j$$
and $q_i^*(z_i)$ is defined by
$$\ln q_i^*(z_i) = \mathbb{E}_{\mathbf{Z} \setminus z_i}[\ln p(\mathbf{X}, \mathbf{Z})] + \mathrm{const.} \tag{1.5}$$
or equivalently,
$$q_i^*(z_i) \propto \exp\left\{ \mathbb{E}_{\mathbf{Z} \setminus z_i}[\ln p(\mathbf{X}, \mathbf{Z})] \right\}. \tag{1.6}$$

Since the ELBO is the negative Kullback–Leibler divergence between $q_i$ and $q_i^*$, its maximum value is zero, which is attained whenever $q_i = q_i^*$, i.e. $q_i^*$ indeed is the optimal distribution. This gives rise to an iterative algorithm which cycles through the factors $q_i(z_i)$ and updates them using (1.6) while holding the other factors fixed. Convergence is guaranteed since the ELBO increases in every iteration.
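As a concrete instance of this coordinate ascent scheme, the sketch below (a standard textbook example under assumed conjugate priors, not a model from the dissertation) applies updates of the form (1.6) to infer the mean and precision of a univariate Gaussian under a Normal–Gamma prior, where both factors have closed-form updates:

```python
import numpy as np

# Data: x_i ~ N(mu, tau^-1); priors mu ~ N(mu0, (lambda0*tau)^-1), tau ~ Gamma(a0, b0)
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=200)
N, xbar = len(x), x.mean()
mu0, lambda0, a0, b0 = 0.0, 1.0, 1.0, 1.0

# Mean field factors: q(mu) = N(mu_N, lambda_N^-1), q(tau) = Gamma(a_N, b_N)
E_tau = a0 / b0                                    # initial guess for E[tau]
a_N = a0 + (N + 1) / 2.0                           # shape parameter is fixed across iterations
for _ in range(50):
    # Update q(mu) given the current E[tau]
    mu_N = (lambda0 * mu0 + N * xbar) / (lambda0 + N)
    lambda_N = (lambda0 + N) * E_tau
    E_mu, E_mu2 = mu_N, mu_N**2 + 1.0 / lambda_N
    # Update q(tau) given the moments of q(mu)
    b_N = b0 + 0.5 * (np.sum(x**2) - 2.0 * E_mu * np.sum(x) + N * E_mu2
                      + lambda0 * (E_mu2 - 2.0 * mu0 * E_mu + mu0**2))
    E_tau = a_N / b_N

print("approximate posterior mean of mu:", mu_N)   # close to 2.0
print("approximate posterior mean of tau:", E_tau) # close to 1 / 0.25 = 4
```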


Remark. Equation (1.6) has a counterpart among sampling methods, namely, Gibbs sampling utilizes the same iterative scheme using the complete conditional to perform the updates. Gibbs sampling will be addressed in detail in the next Section.

Definition 1.2.6 (Conditionally conjugate distributions). Let $\mathcal{S}$ be a family of distributions $p(y \mid \phi, \theta)$ and $\mathcal{P}$ a family of priors $p(\phi \mid \theta)$. Then $\mathcal{P}$ is conditionally conjugate to $\mathcal{S}$ if $p(\phi \mid y, \theta) \in \mathcal{P}$, i.e. the posterior has the same form as the prior.

Conjugacy is very important both in mean field variational approximation and Gibbs sampling. In conjugate models, the update (1.6) is available in closed form; therefore, it is very easy to derive coordinate ascent or sampling algorithms. However, there is a large class of models which are non-conjugate.

Example 1.2.6 (Bayesian logistic regression). Despite the name, Bayesian logistic regression is a binary classification model. It gives the probability of a binary label $y$ as
$$p(y_i \mid \mathbf{x}_i, \mathbf{w}) = \mathrm{Bern}\left( y_i \mid \sigma(\mathbf{w}^T \phi(\mathbf{x}_i)) \right),$$
where $\sigma$ is called the sigmoid function and is defined as $\sigma(x) = \frac{1}{1 + \exp\{-x\}} \in (0, 1)$. The prior is specified as
$$p(\mathbf{w} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}).$$
Putting the two together and assuming independence between samples, the log posterior
$$\ln p(\mathbf{w} \mid \mathbf{X}, \mathbf{Y}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \ln p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w}) + \ln p(\mathbf{w} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) + \mathrm{const.}$$
can be written as
$$-\frac{1}{2}(\mathbf{w} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{w} - \boldsymbol{\mu}) + \sum_i y_i \ln \sigma(\mathbf{w}^T \phi(\mathbf{x}_i)) + (1 - y_i) \ln\left( 1 - \sigma(\mathbf{w}^T \phi(\mathbf{x}_i)) \right).$$
It is clear that this model is not conjugate: collecting the terms in $\mathbf{w}$ certainly does not result in a Gaussian distribution. In practice, there are two methods to circumvent this problem. Laplace approximation to the non-conjugate part gives a term quadratic in $\mathbf{w}$ which can be absorbed into the normal part, and the resulting conditional can be computed easily, i.e. the likelihood is approximated with a Normal distribution. A more accurate approach is variational approximation with an alternate lower bound due to Jaakkola et al. [12]. The latter will be utilized in Chapter 4.
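For reference, the bound of Jaakkola et al. is quoted here in its standard textbook form (an addition for context, not taken from the dissertation itself); it replaces each sigmoid factor with a lower bound that is Gaussian in $\mathbf{w}$, governed by a per-sample variational parameter $\xi_i$:

```latex
% Jaakkola--Jordan lower bound on the logistic sigmoid (standard form); \xi is a variational
% parameter, one per training sample, optimized alongside q(\mathbf{w}):
\sigma(z) \;\ge\; \sigma(\xi)\,
  \exp\!\left\{ \tfrac{1}{2}(z - \xi) - \lambda(\xi)\,\bigl(z^{2} - \xi^{2}\bigr) \right\},
\qquad
\lambda(\xi) \;=\; \frac{1}{2\xi}\left( \sigma(\xi) - \tfrac{1}{2} \right).
```

Because the bound is quadratic in $z = \mathbf{w}^T\phi(\mathbf{x}_i)$, it combines with the Gaussian prior into a closed-form Gaussian update for the approximate posterior over $\mathbf{w}$, with the $\xi_i$ optimized in a separate step.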

Remark. In recent years, there has been a significant amount of work on general black-box variational inference for non-conjugate models, with the aim of automating the tedious manual computations usually involved in deriving variational optimization strategies. Two of these packages are Automatic Differentiation Variational Inference (ADVI [13]) and Edward [14].

1.2.2.2 Sampling methods

Sampling methods approximate $p(\mathbf{Z} \mid \mathbf{X})$ by drawing a large number of samples from the distribution independently; ideally, this approximate posterior converges to the true posterior as the sample size increases. A very useful property of sampling methods is that they often require the ability to evaluate the unnormalized distribution only, which is fairly easy to compute even with highly complex models.

For notational convenience, we absorb all variables into $\mathbf{Z}$ in this Section.

Figure 1.4. Slice sampling alternates between drawing $u_t$ uniformly from $[0, \hat{p}(z_t)]$ and drawing $z_{t+1}$ from the slice $\{ z_{t+1} : \hat{p}(z_{t+1}) > u_t \}$, where $\hat{p}(z)$ is the unnormalized distribution.

Definition 1.2.7. A first-order Markov chain is a sequence of random variables $\mathbf{Z}_t$ such that
$$p(\mathbf{Z}_{t+1} \mid \mathbf{Z}_1, \ldots, \mathbf{Z}_t) = p(\mathbf{Z}_{t+1} \mid \mathbf{Z}_t).$$
This quantity is called the transition probability, and the Markov chain is said to be homogeneous if the transition probabilities are the same for all $t$. An invariant distribution $p$ of a homogeneous Markov chain satisfies
$$p(\mathbf{Z}) = \sum_{\mathbf{Z}'} p(\mathbf{Z} \mid \mathbf{Z}')\, p(\mathbf{Z}'),$$
i.e. it does not change as the chain progresses. The key idea now is to construct a Markov chain which has the target distribution as an invariant distribution and ensure that $p(\mathbf{Z}_t)$ converges to it as $t \to \infty$. If the chain has this property, it is called ergodic and the invariant distribution is its equilibrium distribution. With the target distribution being its equilibrium distribution, one can utilize samples $\mathbf{Z}_t$ from the Markov chain to perform inference. The collective term for this family of methods is Markov Chain Monte Carlo (MCMC) [15].

A particularly simple and powerful MCMC technique is Gibbs sampling. As indicated in the previous Section, Gibbs sampling updates the variables $z_i$ of
$$p(\mathbf{Z}) = p(z_1, z_2, \ldots)$$
in an iterative manner. In particular, it chooses a variable $z_i$ in each turn and draws its new value from $p(z_i \mid \mathbf{Z} \setminus z_i)$. Therefore, the invariance property is ensured by definition. Moreover, it can be shown that if the conditionals are not zero anywhere, ergodicity also holds.
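A minimal illustration of this scheme (a toy example, not from the dissertation) is a two-variable Gibbs sampler for a bivariate Gaussian, where each full conditional is a univariate Gaussian:

```python
import numpy as np

# Target: (z1, z2) ~ N(0, [[1, rho], [rho, 1]]); the full conditionals are
#   z1 | z2 ~ N(rho * z2, 1 - rho^2)   and   z2 | z1 ~ N(rho * z1, 1 - rho^2)
rho = 0.8
rng = np.random.default_rng(0)

def gibbs(n_samples, burn_in=500):
    z1, z2 = 0.0, 0.0
    out = np.empty((n_samples, 2))
    for t in range(burn_in + n_samples):
        z1 = rng.normal(rho * z2, np.sqrt(1 - rho**2))   # draw from p(z1 | z2)
        z2 = rng.normal(rho * z1, np.sqrt(1 - rho**2))   # draw from p(z2 | z1)
        if t >= burn_in:
            out[t - burn_in] = (z1, z2)
    return out

samples = gibbs(20000)
print(np.corrcoef(samples.T)[0, 1])   # close to rho = 0.8
```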

The main difficulty in Gibbs sampling is computing the conditionals $p(z_i \mid \mathbf{Z} \setminus z_i)$. Whenever the model is conditionally conjugate, this is mostly straightforward, if somewhat tedious, using earlier results. In fact, there are compendiums with detailed update rules for a large number of conjugate priors [16]. Non-conjugate models, however, can lead to intractable integrals, and in general, complicated distributions with many modes. In Chapter 4, we apply a slice sampling step within the Gibbs sampling algorithm to circumvent this difficulty.

Slice sampling can sample from more complicated distributions as it only requires the evaluation of the unnormalized distribution. We present it in the univariate case. The idea is to introduce an auxiliary variable $u$ and sample the area under the unnormalized distribution $\hat{p}(z)$ by alternating between drawing $z$ and $u$:

0. Choose a starting $z_0 \in \mathrm{supp}(\hat{p})$.
1. Given $z_t$, evaluate $\hat{p}(z_t)$ and sample $u_t$ uniformly from $[0, \hat{p}(z_t)]$.
2. Given $u_t$, sample $z_{t+1}$ from the "slice" $\{ z_{t+1} : \hat{p}(z_{t+1}) > u_t \}$.
3. Discard $u_t$ and repeat from 1 until the desired number of samples have been drawn.

In practice, determining the slice is not straightforward as it may not be possible to directly solve for its boundaries. A number of approaches have been suggested, including the dynamic extension and shrinking of the interval and using numerical optimization to get the boundaries.
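The sketch below (an illustration, not the dissertation's implementation) uses the common stepping-out and shrinkage procedure to approximate the slice, assuming only that the log of the unnormalized density can be evaluated:

```python
import numpy as np

rng = np.random.default_rng(0)

def slice_sample(log_p_hat, z0, n_samples, w=1.0):
    """Univariate slice sampling with stepping-out and shrinkage.
    log_p_hat: log of the unnormalized target density."""
    z = z0
    samples = np.empty(n_samples)
    for t in range(n_samples):
        # Auxiliary level: log u_t = log p_hat(z) + log(Uniform(0, 1)) lies under the log-density
        log_u = log_p_hat(z) - rng.exponential(1.0)
        # Step out an interval [left, right] of width w until it brackets the slice
        left = z - w * rng.uniform()
        right = left + w
        while log_p_hat(left) > log_u:
            left -= w
        while log_p_hat(right) > log_u:
            right += w
        # Sample uniformly from the interval, shrinking it towards z on rejections
        while True:
            z_new = rng.uniform(left, right)
            if log_p_hat(z_new) > log_u:
                z = z_new
                break
            if z_new < z:
                left = z_new
            else:
                right = z_new
        samples[t] = z
    return samples

# Toy target: unnormalized mixture of two Gaussians
log_p = lambda z: np.log(np.exp(-0.5 * (z + 2) ** 2) + 0.5 * np.exp(-0.5 * (z - 2) ** 2))
draws = slice_sample(log_p, z0=0.0, n_samples=5000)
print(draws.mean(), draws.std())
```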

1.3 Computational drug discovery

The term "drug discovery" refers to the process of finding candidate compounds which have the potential to become approved drugs and which therefore merit further experimental study of their properties. Traditionally, drug discovery was guided by serendipitous observations, and it was not until the late 90s that High Throughput Screening (HTS) methods were integrated into the drug discovery workflow. This experimental process utilizes large molecular libraries, which are screened against biologically relevant targets (e.g. receptors, enzymes), and the selection of hits which exhibit sufficient activity. The most promising hits are re-evaluated, become leads, and subsequently undergo lead optimization, in which their structure is modified to enhance desirable properties. The process results in candidate compounds, ready for the preclinical phase of drug development.

Similarly to other sub-fields of bioinformatics, these novel measurement technologies have produced a staggering amount of experimental data. Chemoinformatics concerns itself with managing, integrating, analyzing and building predictive models of these data. Although the technology advances at a rapid pace, HTS methods are still costly and call for complementary in silico methods which utilize the toolkit of chemoinformatics and machine learning to fulfill very similar goals.

As the costs of drug development are steadily rising and the number of approved New Molecular Entities (NMEs) is stagnating or decreasing every year, the pharmaceutical industry has begun to explore alternative strategies [17]. A remarkable joint effort of leading companies in the pharma industry and the academic world was the Open PHACTS project [1], which aimed to build a publicly available drug discovery platform. This consists of a comprehensive semantic database integrating major biomedical databases, open source tools and software, and even proprietary data from pharmaceutical companies, with the goal of removing bottlenecks in the drug discovery process.

Machine learning methods are especially well-suited to cope with the data deluge now present in the chemoinformatics world as well. Predictive models are routinely used in various contexts, e.g. predicting physical properties, interactions or affinities. The general drug–target interaction prediction problem was already attacked in the 90s in single-target scenarios, e.g. by using neural networks [18] and kernel methods [19, 20]. Similarity-based techniques were also developed for virtual screening [21–23]; in the early 2000s, molecular docking became widespread [24, 25]; from the late 2000s matrix factorization methods were developed [26–28]. As the importance of data and knowledge integration was further emphasized [29–32], the incorporation of multiple sources of prior knowledge has become mainstream and indeed improved predictive performance [28, 33–35]. In the following Sections, we overview computational aspects of drug discovery which are directly relevant to our work.


Table 1.1. Availability of information sources in computational drug repositioning compared to HTS [36].

Ligand profile     HTS        Repositioning    Dimension
Chemical           ✓          ✓                10^2–10^10
Target profile     limited    ✓                10^4–10^5
Taxonomy           ×          ✓                3–5 (depth)
Side-effect        ×          ✓                10^4
Literature         ×          ✓                10^4–10^7
Expression         limited    ✓                10^4–10^5
Off-label use      ×          ✓                10^4

1.3.1 Representation of compounds

Ligand-based computational drug discovery strategies require a formal representation of compounds (in this context, also known as ligands), which can subsequently enter into, e.g., Virtual Screening (VS) methods, where compounds similar to known actives are selected, or Quantitative Structure Activity Relationship (QSAR) models, which predict biochemical properties directly from the representations. These representations are called fingerprints or descriptors. New molecular fingerprints are constantly being developed, engineered in such a way that they reflect relevant physical, chemical, even biological properties, such as

• Physical and chemical features, e.g. solubility, surface area, weight, atom types, electronegativity, etc.,

• Structural and topological features, e.g. paths, rings, symmetry, bond distances, geometry, etc., including 2D and 3D descriptors,

• Functional and pharmacophore groups, user-defined substructures,

• Non-chemical properties, e.g. side-effect profiles, gene expression profiles, text mining.

In almost all cases, fingerprints take the form of real or binary vectors. As an example, one of the most widely used is the Molecular ACCess System keys (MACCS) fingerprint, which contains 166 binary variables, each corresponding to the presence of a structural feature, such as an S-S bond (14), a C=CN bond (45), an O-heterocycle (57), etc. Another widely used one is the Extended Connectivity Fingerprint (ECFP), which hashes substructure features of circular neighborhoods of non-hydrogen atoms onto a variable-length sparse binary string, which is subsequently folded into a 1024-bit representation.
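As an illustration of how such fingerprints are typically computed in practice (assuming the open-source RDKit toolkit is available; the aspirin SMILES string is an arbitrary example, and this is not code from the dissertation):

```python
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys

# Aspirin as an arbitrary example molecule (SMILES string)
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# MACCS structural keys (the 166 public keys; RDKit stores them in a 167-bit vector with bit 0 unused)
maccs = MACCSkeys.GenMACCSKeys(mol)

# Morgan/circular fingerprint with radius 2 folded to 1024 bits (an ECFP4-like descriptor)
ecfp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)

print(maccs.GetNumOnBits(), ecfp.GetNumOnBits())   # number of set bits in each fingerprint
```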

Non-chemical properties are usually not available in early screening studies. However, ligand-based methods can also be applied in drug repositioning, where the goal is to find new indications for already approved drugs, and significantly more information is available. Given its improved rate of success and generally lower costs [17], drug repositioning can present a viable alternative to de novo drug discovery. The information sources available in later phases are listed in Table 1.1.

1.3.2 Similarity and virtual screening

The "similar property principle" refers to the observation that structurally similar molecules tend to have similar properties, including physicochemical features and biological effects [37]. This simple and fairly intuitive concept lies at the heart of ligand-based virtual screening methods. The key idea is to develop a quantitative measure of structural similarity using molecular fingerprints, and utilize this notion to find compounds with desirable properties. An extended version of the similar property principle can involve non-structural aspects as well. For example, one might be interested in defining similarities over side effect profiles, which could directly measure the similarity of the effects of the molecules in living organisms, indicating the involvement of a common pathway or cellular mechanism, and could be exploited in developing synergistic drug combinations.

Chemical similarity measures are direct analogues of kernel functions in machine learning. The most widely applied measure, the Tanimoto similarity, indeed turns out to be symmetric and positive definite, satisfying the requirements of a kernel function:
$$T(\mathbf{x}_i, \mathbf{x}_j) = \frac{\langle \mathbf{x}_i, \mathbf{x}_j \rangle}{\|\mathbf{x}_i\|^2 + \|\mathbf{x}_j\|^2 - \langle \mathbf{x}_i, \mathbf{x}_j \rangle}.$$
For the MACCS fingerprint and other binary descriptions, this is just the Jaccard index between the sets $\mathbf{x}_i$ and $\mathbf{x}_j$ containing various structural features.
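A minimal sketch of the Tanimoto similarity on binary fingerprints (random bit vectors stand in for real fingerprints; not code from the dissertation), assembled into a precomputed kernel matrix:

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto similarity of two binary fingerprint vectors (Jaccard index on the 'on' bits)."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    intersection = np.sum(a & b)
    union = np.sum(a) + np.sum(b) - intersection
    return intersection / union if union > 0 else 0.0

def tanimoto_kernel(fps):
    """Pairwise Tanimoto kernel matrix for a list of binary fingerprints."""
    n = len(fps)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = tanimoto(fps[i], fps[j])
    return K

rng = np.random.default_rng(0)
fps = rng.integers(0, 2, size=(4, 166))      # four random 166-bit "MACCS-like" fingerprints
print(np.round(tanimoto_kernel(fps), 2))
```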

Now we are in a position to define virtual screening (VS). Exploiting the similar property principle, VS seeks to find ligands with a high similarity to an already known active compound. The concept has been extended to multiple descriptions, multiple similarities [21, 22, 38, 39] and multiple known actives involving many-to-one similarities ("group fusion") [40]. The Similarity Ensemble Approach (SEA) by Keiser et al. utilizes sets of ligands to characterize given targets [41] and, by this approach, extends the notion of similarity to entities other than compounds. Target–target similarities can also be defined directly using sequence similarities, common motifs and domains, phylogenetic relations or shared binding sites and pockets [42]. In the case of indirect drug–target interactions, a wider set of target–target similarities can be defined on the basis of relatedness in pathways, protein–protein interaction networks and functional annotations, e.g. from Gene Ontology [43]. Despite the practical validations of using multiple similarities in numerous virtual screening settings, theoretical foundations and methodological guidelines on virtual screening methods are still scarce [38, 44].

Beyond the similar property principle, the drug/lead-likeness of a compound [45, 46] and the druggability of proteins [47] are also essential concepts in the drug discovery context, together with molecular docking [24, 25] and binding site and pocket predictors [42]. However, their utilization as priors is still mostly unexplored. The use of molecular interaction and regulatory networks alongside protein–protein similarities is another open issue, e.g. when the goal is to discover indirect drug–target interactions, possibly involving multiple pathways, which are especially relevant in polypharmacology [48].

The automated weighting of the information sources, representations and similarities (in general, the more complex combination of multiple representations) has also remained an open question; in particular, statistical guarantees were missing. Machine learning techniques are ideal candidates to handle data fusion in a statistically optimal way. They have already been applied in many de novo virtual screening and prediction tasks [20, 49–51]; in particular, the kernel approach has quickly become a method of choice in these scenarios.

1.3.3 Drug–target interaction prediction

Due to their public availability and relative objectivity, drug–target interaction (DTI) data have become a fundamental resource in pharmaceutical research [32, 52–57]. This objectivity rendered a unique status to comprehensive DTI data, even compared to media and e-commerce data [58], despite the questions of quality [59, 60], issues with parallel commercial and public repositories [61–63] and selection bias related to the lack of negative samples [64] and promiscuity [65]. Their rapid growth
