
Linear methods for quantitative prediction

In the regression setting, our goal is to build a quantitative model of one or more outcome variables using features, also called independent or explanatory variables or covariates. In chemoinformatics, the main application of regression models is the field of quantitative structure-activity relationship (QSAR) modelling. For the discussion of this setting, let us assume that the features are organized in an N-by-F matrix X, where a row corresponds to a sample – here a compound – and a column corresponds to a feature. Furthermore, let us organize all outcome variables into an N-by-M matrix Y, where a row corresponds to a sample and a column to an outcome variable (see the illustration in Figure 10). In the following discussion, unless otherwise specified, we will work with a single outcome (univariate regression). In this case Y is a column vector.
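As a minimal sketch of this data layout (with hypothetical sizes N, F, M and random placeholder values, not data from the thesis), the matrices could be held as NumPy arrays:

```python
import numpy as np

# Hypothetical sizes: N compounds, F features, M outcome variables.
N, F, M = 100, 20, 3
rng = np.random.default_rng(0)

X = rng.normal(size=(N, F))   # rows: samples (compounds), columns: features
Y = rng.normal(size=(N, M))   # rows: samples, columns: outcome variables

# In the univariate case a single outcome is used and Y reduces to a column vector.
y = Y[:, [0]]                 # shape (N, 1)
```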

Ordinary Least Squares (OLS) is the simplest of the regression methods that can be used to predict compound activities. The name comes from the fact that the method minimizes the squared error between the prediction and the known outcome:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} \left( y_i - x_i \beta \right)^2$$

where β is a vector of model parameters, interpreted as weights on the elements of the feature set, the vector x_i is the row of the matrix X corresponding to sample i, and y_i is the value of the outcome variable for sample i.

It can be shown that the β for which the above error term is minimal can be calculated as:

$$\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y$$

where the expression multiplying y is called the Moore-Penrose pseudoinverse of X. If two features are nearly linearly dependent – they differ only by a linear transformation plus a small deviation term – exchanging and transforming their two β values will result in similar predictions. Seen the other way around, a small change in y can result in a huge change in some β values. In mathematical terms, the condition number of X will be large.
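A minimal NumPy sketch of this closed-form solution, using a synthetic X and y (beta_true is a hypothetical weight vector used only to simulate the outcome); the pseudoinverse and the condition number are computed directly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, F = 100, 5
X = rng.normal(size=(N, F))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])   # hypothetical ground-truth weights
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Closed-form OLS: beta = (X^T X)^{-1} X^T y, computed here via the
# Moore-Penrose pseudoinverse.
beta_hat = np.linalg.pinv(X) @ y

# A large condition number signals that small changes in y can produce
# large changes in some elements of beta.
print("condition number of X:", np.linalg.cond(X))
print("estimated weights:", beta_hat)
```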

Figure 10 - The structure of a linear regression problem in its general multivariate form: X is a sample-by-feature matrix containing the samples of the covariates, Y is the outcome matrix, and β is an outcome-by-feature weight matrix containing the model parameters. As a convention we add a feature which is always one, and the β corresponding to that feature is the bias of the model.

Even if the problem is numerically stable, models with a high-dimensional feature set trained on a small number of training samples can have suboptimal performance (see the topic of over-fitting discussed in chapter 3.16). We can ameliorate these problems if we introduce a constraint to restrict the space of possible models, often called regularization.

One possible way is to reduce the actual dimensionality of the feature set by Principal Component Analysis (PCA). PCA finds new, derived features which are uncorrelated. It can be interpreted as finding a transformation of the coordinate system that minimizes correlation between the new variables (see Figure 11). In this example, features 1 and 2 are nearly linearly dependent, which would cause numerical instabilities during the computation of the Moore-Penrose pseudoinverse. The principal component corresponding to the largest variance (PC1) and the second principal component (PC2), on the other hand, are uncorrelated, and PC2, which corresponds to the deviation from the linear dependence, has small variance. If the two features were perfectly dependent, PC2 would have zero variance. More formally, PCA finds two matrices U and V satisfying

$$X = U V^{\top}$$

where U is a sample-by-principal-component matrix, called the score matrix, and V is a feature-by-principal-component matrix, called the loading matrix. The rows of U, or simply the scores, describe the samples in the new space, as plotted in Figure 11. The columns of V, the loadings, define the transformation from the original feature space to the new one.
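The decomposition can be sketched with NumPy's singular value decomposition; the two nearly dependent features mimic the Figure 11 setting, and the variable names scores and loadings are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
f1 = rng.normal(size=N)
f2 = 2.0 * f1 + 0.05 * rng.normal(size=N)   # nearly linearly dependent second feature
X = np.column_stack([f1, f2])
Xc = X - X.mean(axis=0)                     # PCA operates on centered data

# SVD yields the factorization Xc = scores @ loadings.T, i.e. X = U V^T in the text.
U_svd, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U_svd * s      # sample-by-component score matrix (U)
loadings = Vt.T         # feature-by-component loading matrix (V)

print("variance of PC1 and PC2:", scores.var(axis=0))
print("loadings:\n", loadings)
```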

The technique called Principal Component Regression (PCR) is the sequential composition of a PCA step on the features followed by an OLS regression. In this case we use only the principal components with the highest variance to make predictions, by using the truncated scores as features in OLS. In some cases, however, a principal component with lower variance can have equal or even higher importance for the prediction. This problem arises from the fact that the creation and selection of the principal components do not depend on the outcome variable y; they are selected in an unsupervised way.
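A compact sketch of PCR under the above description (the helper pcr_fit_predict and the random data are hypothetical, not taken from the thesis):

```python
import numpy as np

def pcr_fit_predict(X_train, y_train, X_test, n_components):
    """Principal Component Regression sketch: PCA on the features,
    then OLS on the leading principal-component scores."""
    mean = X_train.mean(axis=0)
    Xc = X_train - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T                    # truncated loading matrix
    T_train = Xc @ V                           # truncated scores used as OLS features
    beta = np.linalg.pinv(T_train) @ y_train   # OLS in the reduced space
    return (X_test - mean) @ V @ beta

# Hypothetical usage with random data:
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(80, 10)), rng.normal(size=(20, 10))
y_train = rng.normal(size=80)
y_pred = pcr_fit_predict(X_train, y_train, X_test, n_components=3)
```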

The most popular method applied in chemometrics is Partial Least Squares (PLS) regression. In PLS the selection of the latent variables is a supervised procedure. The method projects the features and the prediction target or targets to new spaces with constrained dimensionality [66]:

$$X = T P^{\top} + E, \qquad Y = U Q^{\top} + F$$

where T and U are the score matrices of X and Y, P and Q are the corresponding loading matrices, and E and F are residual terms. The optimization criterion is to maximize the covariance between the derived variables:

$$\operatorname{cov}(T_i, U_i) \rightarrow \max \quad \text{for each component } i$$

Having these representations, the method finds a regression model between the two spaces:

$$U = T D + H$$

where H is again a residual term. Because T_i and U_i are corresponding latent variables with the highest covariance, this regression problem reduces to independent univariate problems: the matrix D we search for is diagonal.

While OLS can be interpreted as maximizing the correlation between the fitted linear combination of features and the outcome, and PCR selects latent variables according to a maximal-variance criterion, PLS is a trade-off between these two cases: covariance is the product of the correlation and the standard deviations, so maximizing it balances both objectives [67].
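A short sketch with scikit-learn's PLSRegression (the synthetic X and y are placeholders; the estimated covariances between paired score columns illustrate the criterion above):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = (X[:, :3] @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)).reshape(-1, 1)

pls = PLSRegression(n_components=3).fit(X, y)

# T and U in the text correspond to the X- and Y-scores of the fitted model;
# paired columns are built to have maximal covariance, so the inner regression
# between them reduces to independent univariate fits (a diagonal D).
T, U = pls.transform(X, y)
for i in range(T.shape[1]):
    print(f"component {i}: cov(T_i, U_i) = {np.cov(T[:, i], U[:, i])[0, 1]:.3f}")

y_pred = pls.predict(X)
```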

Figure 11 - Illustration of Principal Component Analysis (PCA) in the case of two strongly dependent features.

The more general case of PLS briefly discussed above is called PLS2, which can regress several outcome variables together. This can improve predictions compared to building separate regression models, a principle called multi-task learning in the machine learning literature [68]. We will use the same effect in matrix factorization models (see Section 3.17). If the prediction of only one outcome is needed, the PLS algorithm simplifies to a variant called PLS1 [69].
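To make the PLS1/PLS2 distinction concrete, a hedged sketch with a hypothetical two-outcome Y (again using scikit-learn's PLSRegression):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 15))
# Two related outcomes sharing part of their structure (synthetic placeholder data).
Y = np.column_stack([
    X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=100),
    X[:, 0] - X[:, 2] + 0.1 * rng.normal(size=100),
])

# PLS2: a single model regressing both outcomes through shared latent variables.
pls2 = PLSRegression(n_components=4).fit(X, Y)
Y_pred_pls2 = pls2.predict(X)

# PLS1: one separate model per outcome column.
pls1_models = [PLSRegression(n_components=4).fit(X, Y[:, [j]]) for j in range(Y.shape[1])]
Y_pred_pls1 = np.column_stack([model.predict(X) for model in pls1_models])
```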