Advanced methods of statistical machine learning and applications / Armin Fanzott



© Alpen-Adria-Universität Klagenfurttemplate_title_page_thesis.docx (External Design of University Publications | ÖNORM A 2662)version 2014-08-28

Armin Fanzott

Advanced Methods of Statistical Machine

Learning and Applications


submitted in fulfilment of the requirements for the degree of Diplom-Ingenieur

Programme: Master's programme Technical Mathematics

Alpen-Adria-Universität Klagenfurt


O.Univ.-Prof. Dr. Jürgen Pilz Alpen-Adria-Universität Klagenfurt Institut für Statistik



Firstly, I would like to thank my supervisor O.Univ.-Prof. Dr. J¨urgen Pilz for his support, dedication and useful critiques of the present work.

Finally, I have to express my deep gratitude to my girlfriend and my parents for pro-viding me with ongoing support and continuous encouragement throughout my years of study and through the process of writing this thesis. This accomplishment would not have been possible without them. Thank you.



I hereby declare in lieu of an oath that

- the submitted academic thesis is entirely my own work and that no auxiliary materials have been used other than those indicated,

- I have fully disclosed all assistance received from third parties during the process of writing the thesis, including any significant advice from supervisors,

- any contents taken from the works of third parties or my own works that have been included either literally or in spirit have been appropriately marked and the respective source of the information has been clearly identified with precise bibliographical references (e.g. in footnotes),

- to date, I have not submitted this thesis to an examining authority either in Austria or abroad and that

- when passing on copies of the academic thesis (e.g. in bound, printed or digital form), I will ensure that each copy is fully consistent with the submitted digital version.

I understand that the digital version of the academic thesis submitted will be used for the purpose of conducting a plagiarism assessment.

I am aware that a declaration contrary to the facts will have legal consequences.

Armin Fanzott, BSc e.h. Klagenfurt, January 2nd 2019

(Signature) (Place, date)



Statistical machine learning algorithms have risen in popularity in recent years due to improved computational power and its various applications troughout almost all industries. Whether it is marketing and sales, where the aim is prediction of costumer churn probabilities, engineering and maintenance, where it is used to predict equipment reliability, or the pharmaceutical industry, all use machine learning to reduce costs, im-prove efficiency and increase personalization of their products. Recent advancements also provide solutions for highly difficult tasks in image (e.g. face recognition)-, audio (e.g. chatbots)-, and video processing (e.g. autonomous driving). Though machine learning can be expected to be one of the most disruptive technologies in the upcom-ing years.

The present master thesis provides the necessary concepts and theories of statistical machine learning and also focuses on the implementation of the discussed algorithms in the open source programming language R ([R Core Team, 2016]).

The first chapter offers an extensive introduction into machine learning and statistical decision theory in general.

Ensemble learning methods, like AdaBoost or Gradient Boosting, are introduced in the second chapter with a broad emphasis on the implementation of the discussed al-gorithms by using the R-packages GBM ([Ridgeway, 2007]) and XGBoost

([Chen and Guestrin, 2016]).

Nonlinear dimensionality reduction techniques to reduce collinearity and improve com-putational time consumption are the main focus of the third and final chapter. There-fore the so called manifold learning problem is presented and algorithms like Isomap, Locally Linear Embedding and Laplacian Eigenmaps are used to solve it. Their re-spective R implementations are shown on the famous Swiss-roll dataset.



List of Tables . . . 3

List of Figures . . . 3

1 Introduction to Machine Learning and Statistical Decision Theory . . . . 6

1.1 Types of Learning Algorithms . . . 6

1.1.1 Supervised Learning . . . 6

1.1.2 Unsupervised Learning . . . 9

1.2 Introduction to Statistical Decision Theory . . . 10

1.2.1 Bayes Classifier . . . 10

1.2.2 Loss functions and Expected Loss . . . 12

2 Ensemble Learning . . . 21

2.1 Tree-based Methods . . . 21

2.1.1 Regression Trees . . . 22

2.1.2 Classification Trees . . . 25

2.2 Boosting . . . 27

2.2.1 Strong- and Weak Learnability . . . 27

2.2.2 Numerical Optimization . . . 28

2.2.3 Boosting as Additive Modelling . . . 31

2.2.4 Gradient Boosting . . . 35

2.3 Implementation . . . 43

2.3.1 GBM package . . . 43

2.3.2 XGBoost package . . . 55

3 Nonlinear Dimensionality Reduction . . . 65

3.1 Manifold Learning . . . 66

3.1.1 Isomap . . . 66

3.1.2 Locally Linear Embedding . . . 68

3.1.3 Laplacian Eigenmaps . . . 70

3.2 Implementation . . . 71


Chapter 0: CONTENTS 2


List of Tables

1.1 Notation for predictors and response. . . 7 2.1 Advantages and disadvantages of decision trees. . . 27 2.2 Comparison of different statistical learning algorithms (• • • = Good,

•• = Average, • = Bad), [Hastie et al., 2017]. . . 27

2.3 XGBoost compared to other implementations of gradient boosting [Chen and Guestrin, 2016]. 44 2.4 Optimal number of iterations with respect to CV-error and time

consump-tion in R (without CV) with gaussian as the loss funcconsump-tion for different settings of the learning rate and n using the GBM-package. . . 52 2.5 Optimal number of iterations with respect to CV-error, CV-error and

time consumption in R (without CV) with laplace as the loss function for different settings of the learning rate and n using the GBM-package. . . 52 2.6 Mean Test MAEs of the XGBoost model with a learning rate of 0.3,

62 trees and different parameter settings. The optimal parameters are highlighted. . . 63 2.7 Mean Test MAEs of the XGBoost model with a learning rate of 0.05, 470

trees and different parameter settings. The optimal parameters are again highlighted. . . 63


List of Figures

1.1 Handwritten digits from U.S. postal envelopes [Hastie et al., 2017]. . . . 8

1.2 Types of Clustering algorithms [Gollapudi, 2018]. . . 9

1.3 Scatter plots of simulated data predicted with 15-nn classifier (left) and Bayes classifier (right) . . . 12

1.4 Squared-Error Loss (solid) and Absolute-Error Loss (dashed). . . 17

1.5 Huber Loss for parameter ξ = 0.5, 1, 1.5. . . 18

1.6 Log-Loss & Zero-One Loss for true label y = 0. . . 19

2.1 Partition of the predictor space (X1, X2) into separate regions/rectangles Rm(left) and recursive splitting through if-then statements(right) . . . . 23

2.2 Flowchart of the AdaBoost algorithm. . . 33

2.3 Schematic of 5-fold CV [Gareth et al., 2013]. . . 42

2.4 Kernel-density estimation of the loss response from the Allstate data. . . 47

2.5 Scatter plot of the continuous variables cont11 and cont12. . . 48

2.6 Training (black),- validation (red) and cross-validation(green) squared-error loss for shrinkage = 1. . . 50

2.7 Training- (black),and cross-validation(green) absolute-error loss for no shrinkage η = 1. . . 53

2.8 True density (black), predicted densities by using absolute-error loss (red) and squared-error loss (blue) with the best computed parameter settings. 53 2.9 Training- and test mean-absolute error for up to 100 iterations of the default XGBoost model predicting the loss of the Allstate data. . . 60

3.1 Illustration of the Isomap algorithm [Tenenbaum et al., 2000]. . . 68

3.2 Swiss roll dataset generated in R with N = 1000 observations. . . 72

3.3 (1 − ρ2 DX,DX0) for k = 1 to k = 50. . . 74

3.4 Visualization of the smaller 2-dimensional dataset X0 derived from the swiss roll dataset by using Isomap with k = 5 (left) and k = 50 (right). The red line shows the constructed neighbourhood graph G. . . 74

3.5 Visualization of the smaller 2-dimensional dataset X0 derived from the swiss roll dataset for k = 50 using Locally Linear Embedding. . . 77


Chapter 0: LIST OF FIGURES 5

3.6 Visualization of the smaller 2-dimensional dataset X0 derived from the swiss roll dataset for k = 50 using Laplacian Eigenmaps. . . 78


Chapter 1

Introduction to Machine Learning

and Statistical Decision Theory


Types of Learning Algorithms

Basically there exist four different types of learning algorithms, supervised-, unsupervised-, reinforcement-unsupervised-, and semi-supervised learning. In this chapter supervised- and unsu-pervised learning will be introduced. For more information on the other types consider [Camastra and Vinciarelli, 2015].


Supervised Learning

In the supervised learning approach a set of n observations or instances {x1, x2, ..., xn} is given, where every observation xi = (xi1, ..., xip), i = 1, ..., n, is a realisation of a

p-dimensional random vector X = (X1, ..., Xp). The random variables Xj, j = 1, ..., p

are called predictors, covariates or features and X : Ω → Rp. Further a vector of n realisations y = (y1, ..., yn)T of a random variable Y , called response or output, is given

where every yi belongs to a corresponding xi, i = 1, ..., n. The random vector X and

Y are connected via a joint probability distribution P (X, Y ). The set of observations can also be interpreted as an (n × p)-matrix

X =       x1 x2 .. . xn       =     x11 x12 . . . x1p .. . . .. ... ... xn1 xn2 . . . xnp    

where every row represents an observation and every column represents a predictor. The matrix (X, y) is called dataset or data matrix. The predictors Xj, j = 1, ..., p can

either be continuous (i.e. with realisations xj ∈ R) like costs or areas, or categorical


Chapter 1: Introduction to Machine Learning and Statistical Decision Theory 7

like gender or eye color.

Now, in supervised learning a function F : X → Y mapping xi 7→ yi is desired to

pre-dict y for a new set of observations {xn+1, ..., xm} where y = (yn+1, ..., ym)T is unknown.

If the response Y is a continuous random variable the modelling is called regression, if it is categorical it is called classification.

The notations for observations and predictors are quite different throughout the lit-erature. To prevent misunderstandings the notation in Table 1.1 will be used in this master thesis.

Xj, j = 1, ..., p predictor

X = (X1, ..., Xp) predictor vector

Y response

xi, i = 1, ..., n observation, i.e. realisation of X yi, i = 1, ..., n realisation of Y corresponding to xi

Table 1.1: Notation for predictors and response.

Remark 1 (Design Matrix). If X is combined with a vector consisting only of 1s such that X =     1 x11 x12 . . . x1p .. . ... . .. ... ... 1 xn1 xn2 . . . xnp    

then it is called design matrix. The design matrix plays an important role in a lot of statistical machine learning algorithms (e.g. Multiple Linear Regression).

Remark 2 (Dummy Variables). Some statistical methods like Linear Regression cannot naturally deal with categorical predictors. Thus, those predictors must be transformed to a numerical range. Consider a two-class categorical predictor Xj with classes C1

and C2. Xj can be easily transformed to a binary range with

Xj =

 

1 if i-th observation belongs to class C1

0 if i-th observation belongs to class C2


for i = 1, ..., n. Now, assume that Xj is a categorical predictor with m classes C1, ..., Cm,


Chapter 1: Introduction to Machine Learning and Statistical Decision Theory 8 dummy variables Xj1 =   

1 if i-th observation belongs to class C1

0 if i-th observation belongs not to class C1

(1.2) .. . Xj(m−1) =   

1 if i-th observation belongs to class Cm−1

0 if i-th observation belongs not to class Cm−1


have to be created and added to the model.

Example 1 (Handwritten digit recognition). A classic example of a classification task is the problem of handwritten digit recognition. A dataset is given where every predictor Xj gives information for a certain pixel and every observation xi has a label yi giving

the actual handwritten digit. Hence every yi belongs to a certain class. The goal is now

to predict the digit for a given vector xi where yi is unknown.

The MNIST1 database is used in many publications for the comparison of the

per-formance of different algorithms. It consists of 60000 examples of handwritten digits ranging from 0 to 9 (so 10 classes in total) and is freely available. The error rates are updated regularly on the publishers homepage. Figure 1.1 shows an example of handwritten digits from U.S. postal envelopes.

Figure 1.1: Handwritten digits from U.S. postal envelopes [Hastie et al., 2017].


Chapter 1: Introduction to Machine Learning and Statistical Decision Theory 9


Unsupervised Learning

Unlike supervised learning, in unsupervised learning there exists no response Y . This means there is no variable which should be modelled and thus the aim of unsupervised learning algorithms is to identify structures in the data. The most popular algorithms in this area are the so called clustering algorithms which try to assign observations xi to homogeneous subgroups (clusters) of all observations. The differences between objects in the same cluster should be as marginal as possible, whereas the differences between objects in different clusters should be as large as possible.

Clustering methods can be mainly separated into hierarchical and non-hierarchical or partitional algorithms. Hierarchical algorithms can again be separated into agglom-erative and divisive algorithms. Figure 1.2 illustrates the different types of clustering algorithms.

Hierarchical Start with one observation (agglomerative) or all observations (divisive) and build, respectively split, clusters to get more appropriate clusters. Observa-tions cannot be switched between built clusters.

Partitional Start with a predefined partition into clusters. Interchange observations between the clusters till a defined objective function reaches its optimum. In contrast to hierarchical clustering new clusters cannot be built.

Figure 1.2: Types of Clustering algorithms [Gollapudi, 2018].

Example 2 (Marketing Analytics). In marketing analytics companies try to better understand and categorize their costumer base. Hence they use clustering algorithms to separate their costumers into groups to better address their individual needs. Common predictors in marketing analytics are for example age, income, education, etc.


Chapter 1: Introduction to Machine Learning and Statistical Decision Theory 10

The algorithms discussed in this master thesis will be part of the group of supervised learning algorithms, thus practical considerations will be based on them in the next chapters.


Introduction to Statistical Decision Theory

As described in section 1.1.1 a function F : X → Y is desired which maps a given input x to the target y. X and Y are connected via a joint probability distribution P (X, Y ). Hence supervised learning can be viewed as the task of function estimation. Thus the goal is to obtain an estimation of the function F which minimizes the expected value of some specified loss function L(y, F (x)) over the joint probability distribution [Friedman, 2001]

F = arg min


Y,XL(Y, F (X)) = arg min



EY L Y, F (X)|X = x. (1.4)

The loss function can be interpreted as a function which penalizes prediction errors according to [Hastie et al., 2017]. To better understand loss functions and provide a framework for developing those this chapter offers a brief introduction to statistical decision theory.


Bayes Classifier

Consider a classification task where the response Y is categorical and therefore can take on values of a set of classes {0, 1, ..., M }. A classifier is a function F : X → {0, 1, ..., M } which can be interpreted as a guess of y given x which makes an error if F (x) 6= y. In this case the best classifier is [Camastra and Vinciarelli, 2015]

F∗ = arg min


P F (X) 6= Y. (1.5)

This simply means that an observation x is assigned to y ∈ {0, 1, ..., M } if

P (Y = y|X = x) (1.6)

is largest over all classes {0, 1, ..., M }. The classifier F∗ is called Bayes classifier and can be calculated if the joint distribution of (X, Y ) is known, which unfortunately is not the case in most real world applications. Nonetheless, the Bayes classifier has the elegant property that it is the optimal classifier. This means that


Chapter 1: Introduction to Machine Learning and Statistical Decision Theory 11


For the sake of simplicity only the binary case with F : Rn→ {0, 1} will be considered. Define the conditional probability p(x) := P (Y = 1|X = x) and let P (F (X) 6= Y |X = x) be the conditional error probability for X = x of any classifier F . Then the Bayes classifier is defined as F∗(x) =    1 if p(x) > 1 2 0 else. (1.8)

Therefore the conditional error probability

P (F (X) 6= Y |X = x) = 1 − P (F (X) = Y |X = x) (1.9) = 1 −P (Y = 1, F (X) = 1|X = x) + P (Y = 0, F (X) = 0|X = x) = 1 −1F (x)=1P (Y = 1|X = x) + 1F (x)=0P (Y = 0|X = x)  = 1 −1F (x)=1p(x) + 1F (x)=0(1 − p(x)) 

where 1 corresponds to the indicator function [Camastra and Vinciarelli, 2015]. Now,

P (F (X) 6= Y |X = x) − P (F∗(X) 6= Y |X = x) = (1.10)

= p(x)[1F∗(x)=1− 1F (x)=1] + (1 − p(x))[1F(x)=0− 1F (x)=0]

= p(x)[1F∗(x)=1− 1F (x)=1] + (1 − p(x))[1F (x)=0− 1F(x)=0]

= (2p(x) − 1)[1F∗(x)=1− 1F (x)=1].

The last statement in equation (1.10) is nonnegative because if p(x) > 12 then (2p(x) − 1) > 0 and [1F∗(x)=1 − 1F (x)=1] ≥ 0 because 1F(x)=1 = 1. Hence the product of the

two factors is nonnegative. On the other hand if p(x) ≤ 12 then (2p(x) − 1) ≤ 0 and [1F∗(x)=1 − 1F (x)=1] ≤ 0 because 1F(x)=1 = 0. Again the product of both factors is

nonnegative. Thus, the Bayes classifier is optimal among all classifiers.

As mentioned earlier, in reality it can be infeasible to calculate the actual Bayes classifier because the joint probability distribution P (X, Y ) is in general unknown. Nonetheless it can still be used to assess the predictive power of other classifiers. To do that, classes have to be constructed with a known probability distribution and data has to be sampled. Then the Bayes error rate [Gareth et al., 2013]

P F∗(X) 6= Y = 1 − E max

y P (Y = y|X)

(1.11) can be computed for this simulated data and the calculated error rates of the other classifiers can then be compared to it.


Chapter 1: Introduction to Machine Learning and Statistical Decision Theory 12

Example 3 (k-nearest neighbours and the Bayes classifier). The k-nearest neighbours algorithm is an easy to implement classification algorithm where [Hastie et al., 2017]

F (x) = 1 k



yi. (1.12)

Nk(x) denotes the set of k nearest neighbours of x, where nearest is calculated by means

of the euclidean distance. Thus, a new observation x is predicted by using the responses of the k nearest neighbours and calculating the average of those (dummy coding has to be used for categorical variables). For more information on k-nearest neighbours consider [Kuhn and Johnson, 2016] or [Hastie et al., 2017]. Figure 1.3 shows a scatter plot of two predictors (X1, X2) where a binary response Y ∈ {blue = 0,orange = 1} is predicted

using a 15-nearest neighbours approach and the Bayes classifier. The orange and blue backgrounds indicate the predicted class, whereas the color of the points show their real class. The boundaries in the plots show the nearest neighbour decision rule respectively the Bayes decision rule, which is the best possible decision rule. The Bayes classifier can be calculated exactly because the data is simulated from a known distribution.

Figure 1.3: Scatter plots of simulated data predicted with 15-nn classifier (left) and the Bayes classifier (right) respectively [Hastie et al., 2017].


Loss functions and Expected Loss

This section provides a thorough introduction in the theory of loss functions, which is necessary for designing own loss functions as well as for the understanding of all kinds of ensemble learning algorithms.


Chapter 1: Introduction to Machine Learning and Statistical Decision Theory 13

of the set of all possible actions A. Further let θ ∈ Θ be a parameter belonging to some parameter space Θ.

(i) The parameter θ is called state of nature.

(ii) A function L : Θ × A → R with (θ, a) 7→ L(θ, a) is called loss function.

In the supervised learning setting this would mean that y is the true state of nature θ and the predicted value F (x) corresponds to the chosen action a. Therefore L(y, F (x)) measures the loss if the action F (x) is chosen and y is the true state of nature. Thus the loss function can be constructed in a way to penalize specific prediction errors more than others.

[Camastra and Vinciarelli, 2015] show an example in medical image recognition where the task is to predict if a tumor is malignant or benign. In this case it is more important to avoid predicting malignant tumors as benign than vice versa, be-cause the patient won’t get a therapy in the first case. A suitable loss function has to be constructed carefully.

Now consider the multi-class classification task where the state of nature y ∈ Θ = {0, 1, ..., M } and the classifier F can choose an action out of A = {0, 1, ..., M }. Then using Bayes’ formula the posterior probability p(j|x) for class j given the observation x can be expressed as

p(j|x) = p(x|j)p(j)

p(x) (1.13)

where the evidence or marginal density is given as

p(x) =




p(x|j) p(j). (1.14)

p(j) is called the prior probability or simply prior for the class j. Assume that the loss for taking action i if the observation x belongs to class j is L(y = j, F (x) = i). Thus the expected loss or sometimes called conditional risk for taking action i is [Camastra and Vinciarelli, 2015]

R(i|x) =




L(j, i) p(j|x). (1.15)

The total risk for a classifier F can then be calculated by Z

R(F (x)|x) p(x) dx. (1.16)


Chapter 1: Introduction to Machine Learning and Statistical Decision Theory 14

overall risk is as small as possible for every x. The classifier F is often called decision function in statistical decision theory and is denoted by δ.

Given x the loss can be minimized by choosing those action for which the expected loss R(i|x) is minimal. So, formally, a classifier F has to be calculated which chooses the action with minimal R(i|x), ∀i = 1, ..., M given x. This is called the Bayes decision rule and the minimum resulting from applying the Bayes decision rule is called Bayes risk [Camastra and Vinciarelli, 2015].

To finalize the section about decision theory and loss functions this chapter concludes with the introduction of commonly used loss functions. Furthermore, their connection with well known classifiers is shown.

Zero-One Loss

Consider again the multi-class classification task, i.e. y ∈ {0, 1, ..., M }. In this case the zero-one loss function or sometimes called symmetrical loss

L(y, F (x)) =    0 if F (x) = y 1 else (1.17)

is used in general. In terms of statistical decision theory this would mean that if the chosen action F (x) is equal to the true state of nature y the loss is zero, L(y, F (x)) = 0. Otherwise the loss is one, L(y, F (x)) = 1. Correct predictions are assigned with no penalty, whereas false predictions are assigned with penalty 1, regardless of the kind of false prediction.

The conditional risk in equation (1.15) with zero-one loss can be written as [Hastie et al., 2017]

R(i|x) = M X j=1 L(j, i)p(j|x) (1.18) = M X j6=i p(j|x) = 1 − p(i|x).

Thus, finding the optimal decision with respect to the Bayes decision criterion yields to finding the minimal conditional risk R(i|x) and therefore the maximal posterior probability p(i|x). This maximum posterior probability criterion corresponds again to the well known Bayes classifier.


Chapter 1: Introduction to Machine Learning and Statistical Decision Theory 15

Squared-Error Loss

A common loss function for regression (i.e. y ∈ R) is the so called squared-error loss

L(y, F (x)) = (y − F (x))2. (1.19)

In this case the residuals i = yi − F (xi), i = 1, ..., n are measured quadratically.

Remembering equation (1.4) the expected prediction error (EPE), by using the squared-error loss, can be calculated as [Hastie et al., 2017]

EP E(F ) = EXEY |X([Y − F (X)]2|X). (1.20)

Using pointwise minimization this leads to F (x) = arg min

c EY |X

([Y − c)]2|X = x) (1.21)

where the solution is

F (x) = E(Y |X = x). (1.22)

F (x) can be estimated by using the conditional mean

mean(Y |X = x). (1.23)

So, using the squared-error loss the best possible prediction of Y is the conditional expected value given X = x. Equation (1.22) is well known from linear regression where the linearity assumption E(Y |X = x) = xTβ, β = (β

0, ..., βp), x = (1, x1, ..., xp)

is made on the conditional expected value. The squared-error loss is sometimes also called L2 loss.

Absolute-Error Loss

Instead of measuring the residuals quadratically they can also be measured absolutely. This leads toward the absolute-error loss

L(y, F (x)) = |y − F (x)|. (1.24)

Evaluating equation (1.4) with the absolute-error loss results in the conditional median [Hastie et al., 2017]

median(Y |X = x) (1.25)

as an estimator for F . The conditional median is a more robust measure because observations xi with large absolute residuals |yi − F (xi)|, i = 1, ..., n are


Chapter 1: Introduction to Machine Learning and Statistical Decision Theory 16

squared-error loss for long-tailored error-distributions is prone to perform worse than the absolute-error loss, but on the other hand the squared-error loss performs better on Gaussian errors. Choosing the appropriate loss function depending on the data situation at hand is key to achieve above average results in statistical learning appli-cations. Using the squared-error loss leads to least-squares (LS) regression, whereas the absolute-error loss leads to least-absolute deviation (LAD) regression. Figure 1.4 visualizes the squared-error,- and absolute-error loss for a true state of nature y = 0.

There exist quite a lot of different loss functions in the statistical literature which try to be robust to long-tailored error distributions while maintaining nearly the same performance as the squared-error loss on Gaussian error distributions. One on them is the famous Huber loss introduced in [Huber, 1964].

Huber Loss

For a specified parameter ξ the Huber loss is defined as

L(y, F (x)) =    [y − F (x)]2 if |y − F (x)| ≤ ξ 2ξ|y − F (x)| − ξ2 else. (1.26)

The parameter ξ defines in which interval around the optimal prediction (F (x) = y) the loss functions maintains similar properties to the squared-error loss. In the interval where the absolute residual y − F (x) > ξ the Huber loss keeps similar properties to the absolute-error. Hence for ξ ≈ 1.5 the Huber loss approximates the squared-error loss in a local vicinity around the origin, for ξ ≈ 1 it provides a trade-off between both and for ξ ≈ 0.5 it approximates the absolute-error loss in a local vicinity around the origin. Figure 1.5 shows the Huber loss for ξ ∈ {0.5, 1, 1.5} for a true state of nature y = 0.

Negative Binomial Log-Likelihood Loss

Consider the case of a binary response y ∈ {0, 1}. This response can be interpreted as the realization of a Bernoulli distributed random variable Y ∼ B(p) where the expected value is modelled as E(Y |X = x) = p = g−1(xTβ) with β = (β0, ..., βp) and

x = (1, x1, ..., xp). g−1 is called inverse link function and is in general chosen as a


Chapter 1: Introduction to Machine Learning and Statistical Decision Theory 17

Figure 1.4: Squared-Error Loss (solid) and Absolute-Error Loss (dashed).

is defined as g(p) = log p 1 − p = x Tβ (1.27) g−1(xTβ) = exp x Tβ 1 + exp xTβ = p. (1.28)

Estimating the parameters β = (β0, ..., βp) results in a so-called logistic regression.

Remembering the function approximation problem of equation (1.4) and using the negative binomial log-likelihood loss results in a logistic regression model where F can be interpreted as the log-odds in equation (1.27). Therefore let

p(x) = P (Y = 1|X = x) (1.29)

= exp(F (x))

exp(−F (x)) + exp(F (x))

= 1

1 + exp(−2F (x))

The negative binomial log likelihood loss function (log-loss) is then defined as [Buja et al., 2005] L(y, p(x)) = −y log(p(x)) − (1 − y) log(1 − p(x)) (1.30)


Chapter 1: Introduction to Machine Learning and Statistical Decision Theory 18

Figure 1.5: Huber Loss for parameter ξ = 0.5, 1, 1.5.

or transformed as [Friedman, 2001]

L(y, F (x)) = log(1 + exp(−2 y F (x))), for y ∈ {−1, 1}. (1.31) Equation (1.30) is easier to interpret because p(x) represents a probability, whereas F (x) represents log-odds which have to be transformed into a probability first.

If y = 1 is the true label then Equation (1.30) simplifies to − log(p(x)), which converges against zero when p(x) → 1. Hence a larger prediction of p(x) is considered as better than a smaller one, where p(x) = 1 would be the best. Otherwise, if y = 0 is the true label, then a smaller prediction is considered as better because Equation (1.30) can be expressed as − log(1 − p(x)), which converges against zero for p(x) → 0. One considerable advantage of the log-loss over the zero-one loss is the fact that it quantifies errors in a sense that some errors are ”less bad” than others.

For example, assume that y = 1 is the true label and p(x) = 0.4 is the predicted prob-ability. Hence, the log-loss is approximately 0.398, which is considerably better than the log-loss for p(x) = 0.2, which is approximately 0.698. Nevertheless, both would be given a zero-one loss of 1, though not discriminating between the two errors. Figure (1.6) visualizes the log-loss and zero-one loss for y = 0 as the true label. A smoother discrimination between false and correct predictions can be seen.

Remark 3 (Log-Loss & Cross-Entropy). Log-loss and cross-entropy are two related terms often used interchangeably in the machine learning literature, because when


cal-Chapter 1: Introduction to Machine Learning and Statistical Decision Theory 19

Figure 1.6: Log-Loss & Zero-One Loss for true label y = 0.

culating error rates between 0 and 1 they yield the same results. Nevertheless their definitions are quite different in general. For now the definition of the log-loss is suffi-cient and the interested reader can consider [Hastie et al., 2017] for more informations on cross-entropy.

The discussed loss functions in this chapter can be implemented with the statistical computing software R. The following R-code provides one possible implementation: squared_error_loss = function(y,y_pred){ return((y-y_pred)^2) } absolute_error_loss = function(y,y_pred){ return(abs(y-y_pred)) } huber_loss = function(delta,y,y_pred){ loss = rep(0,length(y_pred)) for(i in c(1:length(y_pred))){ if(abs(y-y_pred[i]) <= delta){ loss[i] = (y-y_pred[i])^2 } else{ loss[i] = 2*delta*abs(y-y_pred[i])-delta^2 } } return(loss)


Chapter 1: Introduction to Machine Learning and Statistical Decision Theory 20 } log_loss = function(y,p_pred){ loss = rep(0,length(p_pred)) for(i in c(1:length(p_pred))){ if(y==1){ loss[i] = -log(p_pred[i]) } if(y==0){ loss[i] = -log(1-p_pred[i]) } } return(loss) }


Chapter 2

Ensemble Learning

After a thorough introduction into machine learning and statistical decision theory in the previous chapter, this chapter will emphasize on the theory and implementation of so-called ensemble learning algorithms. One main class of ensemble learning algorithms are the Boosting algorithms, which will be the main focus in this thesis. Basically those algorithms try to combine several models together to one overall model, thus reducing variance, yielding better predictive performance and a more accurate fit to highly nonlinear data than one model alone. Before going deeper into the detailed theory of boosting, an introduction to regression and classification trees, which are the heart for nearly all kinds of ensemble learning methods, is provided.


Tree-based Methods

Tree-based methods were originally introduced in [Breiman et al., 1984]. Roughly spo-ken, these methods try to combine if-then statements to partition the predictor space into several separate rectangles and predicting each observation by using a simple model into one of those rectangles. These simple models are often just constants. So, given an observation x = (x1, x2) for a two-dimensional predictor space (X1, X2), i.e. p = 2,


Chapter 2: Ensemble Learning 22

a simple tree for predicting a response y ∈ R could look like If x1 ≤ t1 then If x2 ≤ t2 then F (x) = c1 Else F (x) = c2 If x1 > t1 then If x1 ≤ t3 then F (x) = c3 If x1 > t3 then If x2 ≤ t4 then F (x) = c4 Else F (x) = c5

where tk, k = 1, ..., 4 are constants defining the separate regions Rm which are

con-nected to formulas cm, m = 1, ..., 5 yielding F (x) (as mentioned those are also

some-times constants). Commonly in literature, like [Kuhn and Johnson, 2016] or

[Hastie et al., 2017], the constants tk are called splits and the formulas cm are called

terminal nodes or leaves.

Interpreting the example above from the function approximation perspective yields [Hastie et al., 2017] F (x) = 5 X m=1 cm1{(x1,x2)∈Rm} (2.1)

Figure 2.1 shows the partitioning of the predictor space into rectangles and the recursive splitting through if-then statements from the example above.


Regression Trees

In general equation (2.1) can be written as

F (x) =




cm1{x=(x1,...,xp)∈Rm}. (2.2)

Assume that the m different regions Rm are already defined. Then, remembering

the concept of loss functions in chapter 1, the sum of the squared residuals i =

P(yi− F (xi))2, i = 1, ..., n can be used as a minimization criterion, which yields


cm = mean(yi|xi ∈ Rm) (2.3)

as the optimal estimation [Hastie et al., 2017]. Since the regions Rm are already known

the parameters cm, m = 1, ..., M are the only parameters which have to be estimated


Chapter 2: Ensemble Learning 23

Figure 2.1: Partition of the predictor space (X1, X2) into separate

regions/rectan-gles Rm (left) and recursive splitting through if-then statements (right) respectively

[Hastie et al., 2017].

the data that is used for parameter estimation and model generation. Then a separate set of pairs is used to perform model validation and performance assessment which is called test data.

For computing the optimal split point t for the predictor variable j if the regions Rm are unknown, [Hastie et al., 2017] propose a greedy optimization approach which

uses two half-planes

R1(j, t) = {xi|xij ≤ t} and R2(j, t) = {xi|xij > t}. (2.4)

Then the optimal j and t can be calculated through

min j,t  min c1 X xi∈R1(j,t) (yi− c1)2+ min c2 X xi∈R2(j,t) (yi− c2)2  (2.5)

where the minimization tasks in the brackets can be solved for any given pair (j, t) through equation (2.3) with R1 = R1(j, t) and R2 = R2(j, t), respectively.

Iterat-ing over all predictors Xj, j = 1, ..., p and then calculating the optimal split point t

separately for each, yields the overall best pair (j, t). Thus the first two regions are calculated and repeating the whole process for both regions again leads to four different regions.


Chapter 2: Ensemble Learning 24

Variance-Bias Trade-Off

Going on like this and computing more distinct regions increases the tree size which is a tuning parameter itself that determines the complexity of the generated model. To better understand model complexity and avoid common pitfalls two terms which arise naturally in studying model complexity are introduced: the bias-variance trade-off and overfitting/underfitting. Suppose that the predictors and the response are connected via Y = f (X) +  where  is a random noise with E() = 0 and Var() = σ2. Applying

squared-error loss the EPE of predicting y given x by using F (x) can be written as [Hastie et al., 2017]

EP E(F (x)) = E Y − F (x)2|X = x

(2.6) = σ2 +EF (x) − f (x)2+ EF (x) − EF (x)2

= σ2 + Bias2 F (x) + Var F (x) = Irreducible Error + Bias2+ Variance.

Thus the absolute error can be decomposed into an irreducible error which can’t be lowered regardless of the goodness of fit, the (squared) bias and the variance. The (squared) bias measures the deviation of the expected value of F (x) from the true value f (x) and the variance gives the expected overall deviation of the estimated func-tion at x. Reducing one of those will generally result in an increase of the other which is called bias-variance dilemma or trade-off.

Generally, a high bias arises when trying to model complex data with a too simple model, e.g. fitting a linear regression to data where a nonlinear relationship between X and Y is inherent. A high variance appears when small changes in the training data result in large changes in F [Gareth et al., 2013]. Hence a highly complex model will typically yield a low bias, but a high variance. Only reducing bias can furthermore re-sult in a too close fit to the training data which sometimes leads to a bad generalization on unknown test data. This phenomenon is known as overfitting and has to be consid-ered in every machine learning task. Underfitting is the opposite of overfitting. Later chapters will introduce methods to avoid overfitting/underfitting like cross-validation.

Calculating the optimal tree size is crucial to avoid overfitting. A too large tree will result in a high variance, whereas a too small one will yield a high bias. The straight-forward approach would be to split a node if it decreases the sum of squares by a predefined amount. Unfortunately this approach can lead to a too early pruning be-cause some split can be more or less useless but though result in a bigger decrease in sum of squares later on in this tree path. Instead [Breiman et al., 1984] propose a


Chapter 2: Ensemble Learning 25

technique called cost-complexity pruning, where one large tree T0 of a predefined size

is grown initially and

Nm := #{xi ∈ Rm} ˆ cm := 1 Nm X xi∈Rm yi Qm(T ) := 1 Nm X xi∈Rm (yi− ˆcm)2.

Then the cost-complexity criterion is defined as

Cα(T ) = |T |



NmQm(T ) + α |T | (2.7)

where |T | gives the number of terminal nodes in a subtree T ⊂ T0. The goal is to

find a subtree Tα which minimizes criterion (2.7); α itself is a tuning parameter which

defines the tree size. A large value of α will yield a smaller tree, whereas small values lead to a larger tree. [Breiman et al., 1984] recommend using ten-fold cross validation for the estimation of α. The estimated α, denoted as ˆα, is the one which minimizes the cross-validated sum of squares which is then used to find the subtree Tαˆ which

minimizes (2.7).


Classification Trees

Predicting a categorical outcome y by using tree-based methods doesn’t differ a lot from regression trees but the measure Qm(T ) has to be altered because the squared

error isn’t appropriate in this case. One suitable measure is the misclassification error or zero-one loss introduced in chapter 1 which in terms of classification trees can be written as 1 Nm X xi∈Rm 1(yi 6= k(m)) (2.8) where k(m) := arg max k ˆ pmk (2.9) ˆ pmk := 1 Nm X xi∈Rm 1(yi = k).


Chapter 2: Ensemble Learning 26

Here k denotes the k-th class. Using (2.9), equation (2.8) can be simplified to 1 − ˆpmk.

Alternatively, [Hastie et al., 2017] suggest using the Gini-Index defined as

X k6=k0 ˆ pmkpˆmk0 = K X k=1 ˆ pmk(1 − ˆpmk). (2.10)

In contrast to the misclassification error the Gini-Index is differentiable and thus it can be used for numerical optimization. [Hastie et al., 2017] recommend using the Gini-Index for tree growing and the misclassification error for the cost-complexity pruning. The discussed tree-based approach in this subsection is called CART (Classification And Regression Trees). Alternative techniques are for example ID3, which was further developed and is now more popular under the name C4.5 or C5.0 which was originally introduced in [Quinlan, 1993]. For this master thesis understanding CART is sufficient. Advantages and Disadvantages of decision trees

In general, trees are commonly used to understand and interpret the underlying data and explain it to non-professionals. Further advantages are that they can automatically handle categorical data without the need to create dummy variables and missing data which has to be imputed in models like linear regression through mean imputation or a linear regression in itself.

One disadvantage when working with decision trees is their high variance. As men-tioned in the subsection about the variance-bias tradeoff high variance means little changes in the data can yield significant changes in the model. This is inherent in decision trees because the effect of a random error in one of the first splits will be carried trough all splits along the tree.

Another disadvantage is that the approximated function F lacks smoothness like the linear regression function. The so called MARS (Multivariate Adaptive Regression Splines) algorithm, which is a modification of the CART algorithm, mitigates this lack of smoothness. For more informations on MARS consider reading the original journal article [Friedman, 1991] or [Hastie et al., 2017]. The MARS algorithm also gives up the binary structure of the tree (splitting every node into two distinct regions) which reduces difficulties when capturing additive structures of the underlying data.

The biggest disadvantage and reason why simple decision trees are not used for difficult predictive tasks is their weak predictive performance. They generally fall behind other common algorithms in accuracy which can be prevented by the use of Boosting which will be discussed in the next section.

Table 2.1 summarizes the most important advantages and disadvantages of decision trees. Table 2.2 compares several statistical learning algorithms with respect to


com-Chapter 2: Ensemble Learning 27

monly used characteristics.

Advantages Disadvantages

Easy to interpret Instability which means in general high variance Can handle categorical data without

the creation of dummy variables Lack of smoothness

Can be visualized easily Problems in capturing additive structure Can also use observations with

missing values without imputation In general weak predictive power Table 2.1: Advantages and disadvantages of decision trees.

Characteristic Deep Learning SVM Trees MARS k-NN

Naturally handle

mixed data types • • • • • • • • •

Naturally handle

missing data • • • • • • • • • • •

Robust to outliers

in the predictor space • • • • • • • • •

Resistant to monotone

transformations • • • • • • •

Automatically detect

irrelevant inputs • • • • • • • • •

Extract linear combinations

of predictors • • • • • • • • ••

Interpretable • • •• • • • •

Predictive performance • • • • • • • •• •

Table 2.2: Comparison of different statistical learning algorithms (• • • = Good, •• = Average, • = Bad), [Hastie et al., 2017].



Basically, the idea behind boosting is to combine a set of ”weak learners” to one ”strong learner” which has a superior predictive power over the single ones. [Schapire, 1990] first showed that the combination of weak learners to improve predictive performance is feasible. Therefore he introduced the concept of strong- and weak learnability.


Strong- and Weak Learnability

Given a concept c, which is a boolean function for the observations, and a concept class C, which is a set of concepts, weak learnability is defined as:

”A concept class C is weakly learnable if the learner can produce an hypoth-esis that performs only slightly better than random guessing.” [Schapire, 1990]


Chapter 2: Ensemble Learning 28

Strong learnability is defined as:

”A concept class is learnable (or strongly learnable) if, given access to a source of examples of the unknown concept, the learner is able to output an hypothesis that is correct with high probability on all but an arbitrarily small fraction of the instances.” [Schapire, 1990]

Then the main result of the paper was the theorem [Schapire, 1990]

Theorem 1 (Equivalence of strong and weak learnability). A concept class C is weakly learnable if and only if it is strongly learnable.

So, essentially this means that for example decision trees with a predictive power slightly above random guessing can be combined to one learning algorithm with a su-perior performance.

This concept was then further developed in [Freund and Schapire, 1996a] with a strong emphasis on statistical decision theory and the so called Adaptive Boosting (Ad-aBoost). [Friedman et al., 2000] also developed their own boosting algorithm which lead to the current version of Boosting, which is used regularly in different machine learning challenges and industrial applications and is called Gradient Boosting. Gra-dient Boosting was originally proposed in [Friedman, 2001]. Though existing quite a long time boosting owes the current rise in popularity to a highly efficient implemen-tation by Tianqi Chen and Carlos Guestrin, which uses far less resources and is highly scalable, called XGBoost [Chen and Guestrin, 2016]. The R package XGboost will be used later in the implementation part of this chapter.


Numerical Optimization

Numerical Optimization in Parameter Space

Remembering the function estimation problem in equation (1.4) where F = arg min


Y,XL(Y, F (X)) = arg min



EY L Y, F (X)|X = x

, it is common to limit F (x) to a class of functions F (x, Θ) , where Θ = (θ1, θ2, ..., θk) is a

parameter vector, to simplify estimation and optimization tasks. Every Boosting algo-rithm, e.g. AdaBoost ([Freund and Schapire, 1996a]), LogitBoost ([Friedman et al., 2000]) or Gradient Boosting ([Friedman, 2001]), is then based on the additive expansion

F (x, {βm, γm}M1 ) = M




Chapter 2: Ensemble Learning 29

where each function h(x, γ

m) is an aforementioned weak learner. Those weak learners

can for example be Generalized Linear Models (GLMs) or the previously introduced re-gression trees. The parameters γ

m for regression trees are the split variables j, the split

points tk and the mean ˆcm of the terminal node of every tree. The additive expansion

in equation (2.11) is also the key concept of other common statistical machine learning algorithms like wavelets ([Donoho, 1993]), neural networks [Rumelhart et al., 1986], support vector machines ([Vapnik, 1995]), MARS ([Friedman, 1991]) or radial basis functions ([Powell, 1987]).

Using the parametrized approach F (x, Θ) the function approximation problem trans-forms into a parameter optimization problem, where [Friedman, 2001]

Φ(Θ) := EY,XL Y, F (X, Θ)

(2.12) is the objective function. Thus, the optimal set of parameters

Θ∗ = arg min


Φ(Θ) (2.13)

leads to the optimal function

F∗ = F (x, Θ∗). (2.14)

Nevertheless, computing Θ∗ often requires numerical methods, which are generally based on the additive assumption

Θ∗ =




θm (2.15)

to ease computation. Here θ0 is an initial guess and the following increments {θm}M 1

are the so called ”boosts”, which are calculated iteratively depending on the previous iteration step and the optimization algorithm itself [Friedman, 2001].

One regularly used optimization algorithm to solve equation (2.13) is steepest-descent. It is a greedy algorithm in the sense that it calculates the steepest-descent of the objective function Φ(Θ) in every iteration, thus it can get stuck in local minima. Due to its simple and fast computation it is still used in general. Starting with the calculation of the gradient [Friedman, 2001]

{gjm} =  dΦ(Θ) dΘj  Θ=Θm−1  (2.16)


Chapter 2: Ensemble Learning 30

where Θm−1 =Pm−1

i=0 θi yields the step or boost

θm = −ρmgm. (2.17)


m = (g1m, ..., gkm)

T and ρ

m is called the line search along the steepest direction and

calculated through

ρm = arg min ρ

Φ(Θm−1− ρg

m). (2.18)

Remembering the definition of Φ(Θ) in equation (2.12) it can be seen that the loss function has to be differentiable, otherwise steepest descent cannot be used.

Numerical Optimization in Function Space

Trying to optimize the objective function Φ directly without a parametrization in Θ is only possible if the function is parametrized in F (x) itself, thus yielding

Φ(F ) = EY,XL(Y, F (X)) = EX

EY L(Y, F (X))|X = x. (2.19)

Φ(F ) can also be written as

Φ(F (x)) = EYL(Y, F (x))|x. (2.20)

Minimizing equation (2.20) is infeasible in function space because the number of pa-rameters is infinite. Nonetheless, in a real application the number of data points in the training set {xi, yi}ni=1 is finite, hence different optimization methods like steepest

descent can be used. Therefore as in equation (2.15), the optimal F∗(x) is written additively in the form

F∗(x) =




fm(x) (2.21)

where f0(x) is again an initial guess and {fm(x)}M1 are the boosts [Friedman, 2001].

Considering again the steepest descent approach, the gradient

gm(x) =  dΦ(F (x)) dF (x)  F (x)=Fm−1(x) = d EY L(Y, F (x))|x  dF (x)  F (x)=Fm−1(x) (2.22)


Chapter 2: Ensemble Learning 31

has to be calculated first, which can be written as (assuming that differentiation and integration can be exchanged)

gm(x) = EY  d L(y, F (x)) d F (x) |x  F (x)=Fm−1(x) . (2.23) Then fm(x) = −ρmgm(x) (2.24)

where ρm is again calculated through the line search

ρm = arg min

ρ E

Y,XL(y, Fm−1(x) − ρgm(x)). (2.25)


Boosting as Additive Modelling

Reconsidering the parametrized approach F (x, {βm, γm}M1 ) the boosting parameters

{βm, γm}M1 are typically estimated by using the data at hand and minimizing the

estimated expected loss

{βm, γm}M1 = arg min {β0 m,γ0m} M 1 n X i=1 L  yi, M X m=1 βm0 h(xi, γ0 m)  (2.26)

Calculating all parameters of equation (2.26) simultaneously can be infeasible, thus it is common to simply the fit with a single weak learner, also called basis function in every step (βm, γm) = min β,γ n X i=1 L  yi, β h(xi, γ)  . (2.27)

This can be done through forward stagewise additive modelling as in Algorithm 1 [Friedman et al., 2000]. Basically, the optimal weak learner h(x, γ

m) with coefficient

βm is computed at every iteration and added sequentially to the current optimal overall

function Fm−1(x) which leads to Fm(x). At each step m the previous estimated

param-eters {βm−1, γm−1} are not changed or readjusted. The connection between boosting

and additive models was originally shown in [Friedman et al., 2000]. AdaBoost and LogitBoost

Using the so called exponential-loss function


Chapter 2: Ensemble Learning 32

Algorithm 1: Forward Stagewise Additive Modelling

1 Initialize F0(x) = 0. 2 for m = 1 to M do 3 (a) Calculate (βm, γ m) = minβ,γ Pn i=1L yi, Fm−1(xi) + β h(xi, γ)  4 (b) Set Fm(x) = Fm−1(x) + βmh(x, γ m). 5 end

in Algorithm 1 results in the AdaBoost algorithm proposed by [Freund and Schapire, 1996a] or adapted as AdaBoost.M1 (Algorithm 2) by [Freund and Schapire, 1996b] (can be extended to multiclass classification, too).

AdaBoost starts with an initial weak learner h1 and weights every training observation

xi with wi, where wi = n1 at the first iteration. A weighted error e1 is calculated with

a coefficient β1 which determines the contribution of the current classifier h1 to the

overall function F (x). Then in the next iteration step the weights are increased for those observations which were hard to classify previously. Hence, emphasis is put on difficult to classify observations. After the last iteration the sign of the sequentially computed function indicates the final classification. So, in every iteration the training data is weighted and the classifier is fit on this weighted sample. Figure 2.2 shows a flowchart of the AdaBoost algorithm.

No loss function is used in Algorithm 2, because remember that those algorithm was developed without knowing the connection to additive models with exponential loss, which was later shown by [Friedman et al., 2000]. For the exact proof of the equiva-lence of forward stagewise additive models with exponential loss and AdaBoost consider [Hastie et al., 2017] or [Friedman et al., 2000].

Algorithm 2: AdaBoost.M1

1 Initialize weights wi = n1, i = 1, ..., n and F0(x) = 0. 2 for m = 1 to M do

3 (a) Fit the binary weak learner hm(x) ∈ {−1, 1} using the weights wi. 4 (b) Compute the error em =


i=1wi1 yi6=hm(xi)


i=1wi and βm = log

1−em em . 5 (c) Set wi = wi· expβm1 yi 6= hm(xi), i = 1, ..., n. 6 (d) Set Fm(x) = Fm−1(x) + βmhm(x). 7 end 8 Output: F (x) = sign[Fm(x)].

Applying the negative binomial log-likelihood loss

L(y, F (x)) = log 1 + exp(−2F (x)), y ∈ {0, 1} (2.29) to Algorithm 1 yields the so called two-class LogitBoost algorithm (Algorithm 3) in-troduced by [Friedman et al., 2000]. The LogitBoost algorithm fits an additive logistic


Chapter 2: Ensemble Learning 33 Training data h1(x) Weighted training data h2(x) Weighted training data h3(x) Weighted training data hm(x) F (x) = sign PMm=1βmhm(x) 

Figure 2.2: Flowchart of the AdaBoost algorithm.

regression model trough an adaptive Newton algorithm, because the expected Bernoulli log-likelihood for every iteration update can be written as [Friedman et al., 2000]

E l (F (x) + h(x)) = E h

2y (F (x) + h(x)) − log1 + exp 2(F (x) + h(x))i. (2.30) Then the first order derivative at h(x) = 0 is

s(x) = d E l(F (x) + h(x))

d h(x) |h(x)=0 (2.31)

= 2 E(y − p(x)|x) and the second order derivative at h(x) = 0 is

H(x) = d 2 E l (F (x) + h(x)) d h(x)2 | h(x)=0 (2.32) = −4 E(p(x) · (1 − p(x))|x).


Chapter 2: Ensemble Learning 34

Algorithm 3: LogitBoost (binary)

1 Initialize weights wi = n1, F0(x) = 0 and probability estimates

p(xi) = 1

2, i = 1, ..., n. 2 for m = 1 to M do

3 (a) Compute the working response and weights zi = yi −p(xi) p(xi) (1−p(xi)),

wi = p(xi) (1 − p(xi)).

4 (b) Fit hm(x) by a weighted least-squares regression of zi to xi using

weights wi.

5 (c) Set Fm(x) = Fm−1(x) + 12hm(x) and update p(x) = e

F (x)

eF (x)+e−F (x).

6 end

7 Output: F (x) = sign[Fm(x)].

where p(x) is the probability of y = 1 defined as in Algorithm 3. Thus the Newton step in every iteration is

F (x) ← F (x) − H(x)−1s(x) (2.33) = F (x) + E(y − p(x)|x) 2 E p(x)(1 − p(x))|x = F (x) +1 2Ew  y − p(x) p(x)(1 − p(x))|x 

with w(x) = p(x)(1 − p(x)) [Friedman et al., 2000]. Boosting Classification and Regression Trees

Boosting models with regression and classification trees as the basic functions h(x; γ) are considered as one of the best algorithms to use for predictive modelling because they maintain a lot of the advantages inherent to decision trees, while usually increasing their predictive power dramatically. Remembering equation (2.2) from the Regression Tree section yields the weak learner

h(x; γ) =




cj1{x∈Rj} (2.34)

where γ = {c1, c2, ..., cJ, R1, R2, ..., RJ}. The parameters {cj}J1 are coefficients and

{Rj}J1 are again the disjoint regions that cover the whole predictor space Ω. Using

Forward Stagewise Additive Modelling (Algorithm 1) leads to


Chapter 2: Ensemble Learning 35

in line 4, where the optimization problem

γ m = minγ0 m n X i=1 L yi, Fm−1(xi) + h(xi, γ 0 m)  (2.36)

has to be solved in every iteration. As mentioned in the decision tree section calculating the coefficients {cjm}J1m at the m-th iteration when the regions {Rjm}J1m are known is

generally simple. Computing the regions can again be done trough solving the greedy optimization procedure in equation (2.5). The difficulty and computational complexity of this optimization task depends heavily on the used loss function. For example using squared-error loss leads to the optimal tree (in the m-th iteration) which has the best prediction of the residuals i = yi− Fm−1(x), i = 1, ..., n and ˆcjm is simply the mean

of the residuals i in every region [Hastie et al., 2017].

Boosted trees have in general a predictive power which is among the best currently known statistical learning algorithms. Furthermore they maintain the useful advan-tages of trees like handling missing data and mixed data types automatically, robustness to outliers in the predictors and resistance to monotone transformations. Unfortu-nately, they lose the simple interpretability of basic decision trees, hence methods like partial dependence- or relative importance plots have to be used to better understand the data situation at hand.


Gradient Boosting

After the connection between Forward Stagewise Additive models and AdaBoost/Log-itBoost was shown in [Friedman et al., 2000] two main problems when using these mod-els arised. First, the solution to (βm, γm) = minβ,γPni=1L yi, Fm−1(xi) + β h(xi, γ) in

the forward stagewise additive approach in Algorithm 1 can become very difficult or simply not possible to obtain.

Second, consider the function βmh(x, γm) in line 4 of Algorithm 1 and the

steepest-descent optimization discussed earlier. βmh(x, γm) can basically be interpreted as the

best step towards the optimal function F (x) (equation (1.4)) in the sense of steepest-descent given the previous approximative function Fm−1(x). This leads to the

uncon-strained negative parts of the gradient from equation (2.23)

−gm(xi) = −  d L(yi, F (xi)) d F (xi)  F (xi)=Fm−1(xi) . (2.37)

given real data {xi, yi}n1, [Friedman, 2001]. Then −gm = −gm(x1), −gm(x2), ..., −gm(xn)

gives the negative gradient, which has the disadvantage of being only defined on the training data, thus generalization to unknown data is not assured. [Friedman, 2001] in-troduced Gradient Boosting as an alternative which mitigates both problems. Gradient


Chapter 2: Ensemble Learning 36

Boosting chooses those h(x, γ

m) in every iteration where hm = h(x1, γm), h(x2, γm), ...

, h(xn, γ

m) is ”most parallel” to −gm. Most parallel is defined as the h(x, γ) which has

the highest correlation to −gm(x) over the distribution of the given data [Friedman, 2001].

Solving γ m = arg min γ,β n X i=1  − gm(xi) − β h(xi, γ) 2 (2.38)

yields the optimal parameters, in the sense of parallelism, for the basic function h(x, γ

m). Thus, the difficult optimization problem from line 4 in Algorithm 1 was

replaced by a simple least-squares optimization and then the coefficient for the func-tion update in every iterafunc-tion is calculated through a line search

ρm = arg min ρ n X i=1 L yi, Fm−1(x) + ρh(xi, γm). (2.39)

The function update is then

Fm(x) = Fm−1(x) + ρmh(x, γm). (2.40)

Hence, Gradient Boosting solves both problems of the basic forward stagewise additive approach and can be used with any differentiable loss function. The values ˜yi :=

−gm(xi), i = 1, ..., n are called pseudo-responses. Algorithm 4 describes the Gradient

Boosting algorithm proposed by [Friedman, 2001] in detail. Algorithm 4: Gradient Boosting

1 F0(x) = arg minρPni=1L(yi, ρ). 2 for m = 1 to M do 3 (a) ˜yi = −d L(yd F (xi,F (xi)) i)  F (x)=Fm−1(x) i = 1, ..., n. 4 (b) γ m = arg minγ,β Pn i=1[˜yi− βh(xi, γ)]2. 5 (c) ρm = arg minρ Pn i=1L(yi, Fm−1(xi) + ρh(xi, γm)). 6 (d) Fm(x) = Fm−1(x) + ρmh(x, γ m). 7 end

Remark 4. The use of least-squares minimization in line 4 of Algorithm 4 is arbitrary. Any other method for estimating the gradient

gm(x) = EY

 d L(Y, F (x)) d F (x) |x

F (x)=Fm−1(x)

can be used instead. Nonetheless, least-squares is a common choice due to its compu-tational advantages.


Chapter 2: Ensemble Learning 37

The gradient boosting strategy can also be deployed on classification and regression trees [Breiman et al., 1984]. A J -terminal node tree

h(x, γ m) = J X j=1 cjm1{x∈Rjm} (2.41)

to predict the pseudo-responses ˜yi, i = 1, ..., n is built in the usual manner in every

iteration. Thus, the function update becomes

Fm(x) = Fm−1(x) + ρmh(x, γm) (2.42)

which can be rewritten as

Fm(x) = Fm−1(x) + J



τjm1(x ∈ Rjm) (2.43)

with τjm = ρmcjm. The best coefficients τm = (τ1m, ..., τJ m) in every iteration can then

be computed through [Friedman, 2001]

τm = arg min (τ1,...,τJ) n X i=1 L yi, Fm−1(xi) + J X j=1 τj1(x∈ Rjm). (2.44) Equation (2.44) simplifies to τjm = arg min τ X xi∈Rjm L yi, Fm−1(xi) + τ. (2.45)

due to the disjointness of the regions. Algorithm 5 describes gradient tree boosting in detail.

Algorithm 5: Gradient Tree Boosting

1 F0(x) = arg minτ Pn i=1L(yi, τ ). 2 for m = 1 to M do 3 (a) ˜yi = − d L(yi,F (xi)) d F (xi)  F (x)=Fm−1(x) i = 1, ..., n.

4 (b) {Rjm}J1 = J -terminal node tree{˜yi, xi}n1. 5 (c) τjm = arg minτPx

i∈RjmL(yi, Fm−1(xi) + τ ).

6 (d) Fm(x) = Fm−1(x) + τjm1(x ∈ Rjm). 7 end

Stochastic gradient boosting

Jerome H. Friedman further developed his famous gradient boosting algorithm in the article [Friedman, 2002], where he introduced stochastic gradient boosting. Stochastic


Chapter 2: Ensemble Learning 38

gradient boosting provides only a marginal modification to gradient boosting, but can improve performance dramatically when overfitting is a concern.

The idea is to fit the basic functions h(x, γ) in every iteration only on a subsample of the whole training data, thus inducing randomness and therefore preventing the model to overfit. Stochastic gradient boosting is commonly also combined with decision trees as the basic functions, thus stochastic gradient tree boosting is described in Algorithm 6 in detail [Friedman, 2002].

Another advantage of stochastic gradient boosting is that the time needed for the com-putation is reduced by a factor of c = nn˜, where ˜n is the subsample size. Nonetheless drawing too small samples will inevitably result in a high variance between the basic functions and thus the overall function F (x) will not converge against the optimal func-tion of equafunc-tion (1.4). [Friedman, 2002] observed good results when choosing values for c in the interval [0.5, 0.8]; 0.5 is a common choice because then the random samples are nearly equivalent to bootstrap samples [Efron, 1979].

Algorithm 6: Stochastic Gradient Tree Boosting

1 F0(x) = arg minτPni=1L(yi, τ ). 2 for m = 1 to M do

3 (a) Calculate a random subsample {(xi, yi)}, i = 1, ..., ˜n where ˜n < n. 4 (b) ˜yi = −d L(yd F (xi,F (xi))


F (x)=Fm−1(x) i = 1, ..., ˜n.

5 (c) {Rjm}J1 = J -terminal node tree{˜yi, xi}n1˜. 6 (d) τjm = arg minτ


xi∈RjmL(yi, Fm−1(xi) + τ ).

7 (e) Fm(x) = Fm−1(x) + τjm1(x ∈ Rjm). 8 end

Common loss functions for Gradient Boosting

The most common loss function used for Boosting in general is squared-error loss L(y, F (x)) = (y − F (x))2. The derivation of the squared-error loss with factor 1

2 results

in the pseudo-response ˜yi = yi − Fm−1(xi), which means that equation (2.38) only

fits the present residual. Hence, the line search is redundant and gradient boosting on squared-error loss is equivalent to the known forward stagewise additive procedure [Friedman, 2001] (see Algorithm 7).

Measuring the residuals absolutely leads to the absolute-error loss L(y, F (x)) = |y − F (x)|. Through derivation the pseudo-responses ˜yi = sign(yi − Fm−1(x)) can be

ob-tained, which yields the line search parameters

ρm = median  yi− Fm−1(xi) h(xi, γ m)  , i = 1, ..., n. (2.46)


Chapter 2: Ensemble Learning 39

Algorithm 8 shows the exact details of gradient boosting with absolute-error loss. Algorithm 7: Gradient Boosting on squared-error loss

1 F0(x) = ¯y. 2 for m = 1 to M do 3 (a) ˜yi = yi− Fm−1(xi) i = 1, ..., n. 4 (b) ρm, γ m = arg minρ,γ Pn i=1 ˜yi− ρh(xi, γ) 2 . 5 (c) Fm(x) = Fm−1(x) + ρmh(x, γm). 6 end

Algorithm 8: Gradient Boosting on absolute-error loss

1 F0(x) = median(y). 2 for m = 1 to M do 3 (a) ˜yi = sign yi − Fm−1(xi) i = 1, ..., n. 4 (b) γ m = arg minγ,β Pn i=1sign(yi− Fm−1(xi)) − βh(xi, γ) 2 . 5 (c) ρm = median yi −Fm−1(xi) h(xi,γm)  i = 1, ..., n. 6 (d) Fm(x) = Fm−1(x) + ρmh(x, γ m). 7 end

Another loss function discussed in chapter 1 is the Huber loss which is highly resistant to outliers and long-tailed error-distributions, though yielding better results on normally distributed errors than the absolute-error loss in general, [Huber, 1964]. It can therefore be interpreted as a trade-off between the squared-error loss and the absolute-error loss. Derivation leads to the pseudo responses

˜ yi =    [yi− Fm−1(xi)] if |yi− Fm−1(xi)| ≤ δm δmsign(yi− Fm−1(xi)) else. (2.47)

Generally, δmdetermines those data points which are considered as outliers with respect

to the error-distribution in the m-th iteration. Hence it is common to choose

δm = α − th quantile(|yi − Fm−1(xi)|), i = 1, ..., n. (2.48)

Combining all together and using for example regression trees as the basic functions results in Huber tree boosting (Algorithm 9), [Friedman, 2001].

The gradient boosting algorithm of [Friedman, 2001] can also be used in conjunction with the negative binomial log-likelihood loss


Chapter 2: Ensemble Learning 40

Algorithm 9: Gradient tree boosting on Huber loss.

1 F0(x) = median(y). 2 for m = 1 to M do 3 (a) δm = α − th quantile(|yi− Fm−1(xi)|), i = 1, ..., n. 4 (b) ˜yi = ( [yi− Fm−1(xi)] if |yi− Fm−1(xi)| ≤ δm δmsign(yi− Fm−1(xi)) else. 5 (c) {Rjm}J1 = J -terminal node tree{˜yi, xi}n1. 6 (d) ˜rjm = medianx i∈Rjm yi− Fm−1(xi)  j = 1, ..., J. 7 (e) τjm = ˜rjm+N1 jm P

xi∈Rjmsign(yi− Fm−1(xi) − ˜rjm) · min(δm, abs(yi−

Fm−1(xi) − ˜rjm)) j = 1, ..., J.

8 (f) Fm(x) = Fm−1(x) +PJj=1τjm1(x ∈ Rjm)). 9 end

,which leads to binary logistic regression gradient boosting. F (x) is defined as

F (x) = 1 2log  P (y = 1|x) P (y = −1|x)  (2.49)

, which are simply the well-known log-odds with factor12. Derivation of the loss function yields ˜ yi = 2 yi 1 + exp(2 yiFm−1(xi)) . (2.50)

Using again the tree based approach results in [Friedman, 2001] τjm = arg min




log(1 + exp(−2 yi(Fm−1(xi) + τ ))) (2.51)

which can only be solved numerically. [Friedman et al., 2000] recommends using the Newton-Raphson algorithm to approximate τjm:

τjm = P xi∈Rjmy˜i P xi∈Rjm|˜yi|(2 − |˜yi|) . (2.52)

Algorithm 10 again gives the exact details on logistic regression gradient tree boosting. The log-odds in the m-th iteration can be converted to probability estimates through

p1(x) = P (y = 1|x) = 1 1 + exp(−2Fm(x)) (2.53) p−1(x) = P (y = −1|x) = 1 1 + exp(2Fm(x)) . (2.54)

Algorithm 10 can also be extended to K-classes. For more informations on multi-class logistic regression gradient tree boosting consider [Friedman, 2001].