
2.3.3 Applications and Connections to Econometrics

For econometricians, both regression and classification trees potentially offer some advantages compared to traditional regression-based methods. For a start, they can be displayed graphically. This illustrates the decision-making process, and as such the results are relatively easy to interpret. The tree-based approach allows the user to explore the variables in order to gauge their partitioning ability, i.e. how well a given variable is able to classify observations correctly. This is related to variable importance, which is discussed in the section below. Trees can handle qualitative variables directly without the need to create dummy variables, and they are relatively robust to outliers. Given enough data, a tree can estimate nonlinear means and interaction effects without the econometrician having to specify these in advance.

Furthermore, non-constant variance (heteroskedasticity) is also accommodated by both classification and regression trees. Many easily accessible and tested packages can be used to build trees. Examples of such packages in R include rpart (Therneau, Atkinson and Ripley (2015)), caret (Kuhn (2008)) and tree (Ripley (2021)). For Python, refer to the scikit-learn library (Pedregosa et al. (2011)).
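As an illustration, the minimal sketch below fits a regression tree with scikit-learn on simulated data; the data-generating process and the tree depth are arbitrary choices made for exposition.

```python
# Fit a small regression tree and print its partitions (splits and leaf means).
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))                                # two covariates
y = np.sin(X[:, 0]) + 0.5 * (X[:, 1] > 5) + rng.normal(0, 0.1, 500)  # nonlinear mean plus a jump in x2

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))
```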

It is also important to consider some limitations of tree-based methods. Trees do not produce any regression coefficients, so it is difficult to quantify the relationship between the dependent variable and the covariates. Another consideration is the computational time needed to build trees, especially as the number of covariates increases. In addition, repeated sampling and tree building (bagging and random forests) are also computationally demanding, although parallel processing can help lower the computational time in such instances. As mentioned earlier, single trees also have a tendency to overfit and as such offer lower predictive accuracy relative to other methods; however, this too can be improved by the use of bagging or random forests.

There are two connections that can be made between trees and topics in econometrics: nonparametric regression and threshold regression. With regard to the first, trees can be seen as a form of nonparametric regression. Consider a simple nonparametric model:

$$y_i = m(x_i) + \varepsilon_i, \quad (2.46)$$

where $m(x_i) = E[y_i \mid x_i]$ is the conditional mean and the $\varepsilon_i$ are error terms. Here, $m$ does not have a parametric form and its estimation occurs at particular values of $x$ (local estimation). The most popular method of nonparametric regression is to use a local average of $y$: compute the average of all $y$ values within some window of $x$. By sliding this window along the $x$-axis, an estimate of the entire regression function can be obtained. To improve the estimation, a weighted local average is used, where the weights are applied to the $y$ values. This is achieved using a kernel function; the size of the estimation window, also known as the bandwidth, enters through the kernel function. Given this, a nonparametric regression estimator based on locally weighted averaging can be defined as:

$$\hat{m}_{NW}(x) = \sum_{i=1}^{n} w_i y_i \quad (2.47)$$

$$= \frac{\sum_{i=1}^{n} K(x - x_i)\, y_i}{\sum_{j=1}^{n} K(x - x_j)}, \quad (2.48)$$

where $K(x)$ denotes the scaled kernel function. The expression makes the weights explicit: $w_i = K(x - x_i) / \sum_{j=1}^{n} K(x - x_j)$. Note that the weights sum to 1 (the denominator is the normalising constant). The subscript NW credits the developers of this estimator: Nadaraya (1964) and Watson (1964).
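As a rough illustration of equations (2.47)-(2.48), the sketch below implements the Nadaraya-Watson estimator with a Gaussian kernel; the simulated data, the kernel choice and the bandwidth are illustrative assumptions rather than recommendations.

```python
# Minimal Nadaraya-Watson estimator (equations 2.47-2.48) with a Gaussian kernel.
import numpy as np

def nadaraya_watson(x0, x, y, bandwidth=1.0):
    """Locally weighted average of y at the evaluation point x0."""
    k = np.exp(-0.5 * ((x0 - x) / bandwidth) ** 2)  # kernel evaluated at (x0 - x_i)/h
    w = k / k.sum()                                  # weights sum to one
    return np.sum(w * y)

# Illustrative data and evaluation grid
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.2, 200)
grid = np.linspace(0, 10, 50)
m_hat = np.array([nadaraya_watson(g, x, y, bandwidth=0.5) for g in grid])
```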

With an appropriate choice of kernel function and some modifications to the bandwidth selection, this nonparametric framework can replicate CART. Begin with a fixed bandwidth and create a series of non-overlapping $x$ intervals. For each of these intervals, choose a kernel function that computes the sample mean of the $y$ values lying in that interval. This leads to an estimated regression function that is a step function with regular intervals. The final step is to combine adjacent intervals where the difference in the sample means of $y$ is small. The result is a step function with varying $x$ intervals. Each of the end-points of these intervals represents a split where the difference in the mean value of $y$ between adjacent intervals is significant. This is similar to the variable splits in CART.
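The binning-and-merging idea can be sketched as follows; the number of bins, the merging tolerance and the simulated data are illustrative choices and are not part of CART itself.

```python
# Sketch: fixed-width bins, bin means of y, then merge adjacent bins with
# similar means to obtain a CART-like step function with varying intervals.
import numpy as np

def step_function_fit(x, y, n_bins=20, tol=0.3):
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    bin_id = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    occupied = [b for b in range(n_bins) if np.any(bin_id == b)]
    means = [y[bin_id == b].mean() for b in occupied]
    left_edges = [edges[b] for b in occupied]

    merged_edges, merged_means = [left_edges[0]], [means[0]]
    for e, m in zip(left_edges[1:], means[1:]):
        if abs(m - merged_means[-1]) < tol:
            merged_means[-1] = (merged_means[-1] + m) / 2  # crude (unweighted) pooling
        else:
            merged_edges.append(e)
            merged_means.append(m)
    return merged_edges, merged_means

# Illustrative data with two genuine jumps in the mean of y
rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 500)
y = np.where(x < 4, 1.0, np.where(x < 7, 2.0, 0.5)) + rng.normal(0, 0.2, 500)
edges, means = step_function_fit(x, y)
```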

Replacing kernels with splines also leads to the CART framework. In equation (2.46), let $m(x_i) = s(x_i)$; then

$$y_i = s(x_i) + \varepsilon_i, \quad (2.49)$$

where $s(x)$ is a $p$th-order spline and $\varepsilon_i$ is the error term, which is assumed to have zero mean. The $p$th-order spline can be written as

$$s(x) = \beta_0 + \beta_1 x + \cdots + \beta_p x^p + \sum_{j=1}^{J} b_j (x - \kappa_j)_+^p, \quad (2.50)$$

where $\kappa_j$ denotes the $j$th threshold value and $(\alpha)_+ = \max(0, \alpha)$.
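For illustration, the sketch below constructs the truncated power basis in equation (2.50) and fits $(\beta, b)$ by ordinary least squares; the knot locations and the simulated data are assumptions made purely for exposition.

```python
# Truncated power basis for a p-th order spline and a least-squares fit.
import numpy as np

def spline_design(x, knots, p=3):
    poly = np.column_stack([x ** d for d in range(p + 1)])                 # 1, x, ..., x^p
    trunc = np.column_stack([np.maximum(x - k, 0.0) ** p for k in knots])  # (x - kappa_j)_+^p
    return np.hstack([poly, trunc])

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 300)
y = np.sin(x) + rng.normal(0, 0.2, x.size)

B = spline_design(x, knots=[2.5, 5.0, 7.5], p=3)   # assumed knot locations
coef, *_ = np.linalg.lstsq(B, y, rcond=None)       # OLS estimates of (beta, b)
```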

Based on the above, trees can be thought of as zero-order ($p = 0$) splines. These are essentially piecewise-constant functions (step functions). Using a sufficiently large number of split points, a step function can approximate most functions. Similar to trees, splines are prone to overfitting, which can result in a jagged and noisy fit. However, unlike trees, the covariates need to be processed before entering the nonparametric model; for example, categorical variables need to be transformed into a set of dummy variables.

It is worth noting that nonparametric regression can also be extended to accommodate $y$ as a categorical or count variable. This is achieved by applying a suitable link function to the left-hand side of Equation (2.49). Although the zero-order spline is flexible, it is still a discrete step function, that is, an approximation of a nonlinear continuous function. In order to obtain a smoother fit, shrinkage can be applied to reduce the magnitude of the $b_j$:

$$(\hat{\beta}, \hat{b}) = \arg\min_{\beta, b} \sum_{i} \{y_i - s(x_i)\}^2 + \alpha \sum_{j=1}^{J} b_j^2. \quad (2.51)$$

Another option would be to use the LASSO penalty, which allows certain $b_j$ to be shrunk exactly to zero. This is analogous to pruning a tree.
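The sketch below illustrates the penalised fit in equation (2.51), applying the ridge penalty only to the $b_j$ coefficients on the truncated power terms, and notes the LASSO variant; the knots, penalty value and simulated data are illustrative assumptions.

```python
# Penalised estimation of (2.51): shrink only the b_j on the truncated terms.
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 300)
y = np.sin(x) + rng.normal(0, 0.2, x.size)
knots, p, alpha = [2.0, 4.0, 6.0, 8.0], 3, 1.0

poly = np.column_stack([x ** d for d in range(p + 1)])                 # unpenalised block
trunc = np.column_stack([np.maximum(x - k, 0.0) ** p for k in knots])  # penalised block
B = np.hstack([poly, trunc])

# Penalty matrix: zero for the polynomial coefficients, alpha for each b_j
P = np.zeros((B.shape[1], B.shape[1]))
P[p + 1:, p + 1:] = alpha * np.eye(len(knots))

# Closed-form ridge solution (B'B + P)^{-1} B'y
coef_ridge = np.linalg.solve(B.T @ B + P, B.T @ y)

# A LASSO penalty, alpha * sum(|b_j|), would instead set some b_j exactly to
# zero, which is the analogue of pruning the corresponding splits from a tree.
```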

It is important to note that splines of order greater than zero can model non-constant relationships between threshold points. As such, they provide greater flexibility compared to trees. However, unlike trees, splines require the thresholds (also called knots) to be pre-specified, which does not accurately reflect the data-driven variable splitting in CART.

To accommodate this shortcoming, Multivariate Adaptive Regression Splines (MARS) were introduced. This method is based on linear splines, and its adaptive ability consists of selecting the thresholds so as to optimise the fit to the data (Hazelton, 2015). Given this, MARS can be regarded as a generalisation of trees. MARS can also handle continuous numerical variables better than trees: in the CART framework, midpoints of a numerical variable are assigned as potential thresholds, which is limiting.

Recall that trees are zero-order splines. This is similar to a simplified version of threshold regression, where only intercepts (between thresholds) are used to model the relationship between the response and the covariates. As an example, a single-threshold model is written as

$$Y = \beta_{01} I(X \le \kappa) + \beta_{02} I(X > \kappa) + \varepsilon, \quad (2.52)$$

where $I(\cdot)$ denotes an indicator function, $\kappa$ is the threshold value, and $\varepsilon$ has a mean of zero. Based on this simple model, if $X$ is less than or equal to the threshold value, the estimated value of $Y$ is equal to the constant $\beta_{01}$. This is similar to a split in a regression tree, where the predicted value is equal to the average value of $Y$ in that region. This is also evident in the conceptual example provided earlier; see Equation (2.43). As in the case of splines, threshold regression can easily accommodate non-constant relationships between thresholds. Similarly to the MARS method, this also represents a generalisation of trees.
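A simple way to estimate the single-threshold model in (2.52) is a grid search over candidate values of $\kappa$, fitting the two regime means at each candidate and keeping the value that minimises the residual sum of squares; the sketch below illustrates this, with the candidate grid and data simulated for exposition. This is essentially the same computation as a single split in a regression tree.

```python
# Grid-search estimation of the single-threshold model (2.52).
import numpy as np

def fit_threshold_model(x, y, n_grid=50):
    grid = np.quantile(x, np.linspace(0.1, 0.9, n_grid))   # candidate thresholds
    best = None
    for kappa in grid:
        left, right = x <= kappa, x > kappa
        b1, b2 = y[left].mean(), y[right].mean()            # regime-specific intercepts
        rss = np.sum((y[left] - b1) ** 2) + np.sum((y[right] - b2) ** 2)
        if best is None or rss < best[0]:
            best = (rss, kappa, b1, b2)
    return best  # (RSS, kappa_hat, beta_01_hat, beta_02_hat)

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 500)
y = np.where(x <= 6, 1.0, 3.0) + rng.normal(0, 0.3, 500)
rss, kappa_hat, b1_hat, b2_hat = fit_threshold_model(x, y)
```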

Given these two connections, econometricians who are accustomed to nonparametric and/or threshold regressions should be able to relate to the CART methods described here. The estimation methods for both nonparametric and threshold regression are well known in the econometrics literature. In addition, econometricians can take advantage of the rich theoretical developments in both of these areas, including concepts such as convergence rates of estimators and their asymptotic properties. These developments represent an advantage over the CART methods, which to date have seen little or no theoretical development (see the section below). This is especially important for econometric work concerned with inference and/or causality. However, if the primary purpose is illustrating the decision-making process, or exploratory analysis, then implementing CART may be advantageous. There are therefore trade-offs to consider when selecting a method of analysis. For more details on nonparametric and threshold regression, see Li and Racine (2006) and Hansen (2000) and the references therein. Based on this discussion, the section below covers the limited (and recent) developments for inference in trees.

Inference

For a single small tree, it may be possible to see visually the role that each variable plays in producing partitions. In a loose sense, a variable's partitioning ability corresponds to its importance/significance. However, as the number of variables increases, the visualisation becomes less appealing. Furthermore, when bagging and/or random forests are used, it is not possible to represent all the results in a single tree. This makes interpretability even more challenging, despite the improvement in predictions. Nevertheless, the importance/significance of a variable can still be measured by quantifying the increase in RSS (or Gini index) when the variable is excluded from the tree. Repeating this for $B$ trees, the variable importance score is then computed as the average increase in RSS (or Gini index).

This process is repeated for all variables and the variable importance scores are plotted. A large score indicates that removing the variable leads to large increases in RSS (Gini index) on average, and hence the variable is considered 'important'.
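For illustration, scikit-learn exposes two related measures for a random forest: an impurity-based importance and a permutation-based importance. Neither is exactly the exclusion-based measure described above, but both are commonly used in practice; the data in the sketch below are simulated.

```python
# Variable importance from a random forest: impurity-based and permutation-based.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))                  # five covariates, only two relevant
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 1000) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(rf.feature_importances_)                  # mean decrease in Gini impurity

perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(perm.importances_mean)                    # mean accuracy drop when a variable is permuted
```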

Figure 2.13 shows a sample variable importance plot reproduced using Kuhn (2008). Based on the importance score, variable V11 is considered the most important variable: removing this variable from the trees leads to the greatest increase in the Gini index, on average. The second most important variable is V12, and so on.

Fig. 2.13: Variable Importance Chart

Practitioners often use variable importance to select variables, keeping the ones with the largest scores and discarding those with lower ones. However, there appears to be no agreed cut-off score for this purpose. In fact, there are no known theoretical properties of variable importance, so the scores should be treated with caution. On this basis, variable importance is of limited use with regard to inference.

More recently, trees have found applications in causal inference settings. This is of interest to applied econometricians who wish to model heterogeneous treatment effects. Athey and Imbens (2016) introduce methods for constructing trees to study causal effects, and also provide valid inference procedures for the resulting estimates.

Given a single binary treatment, the conditional average treatment effect $y(\boldsymbol{x})$ is given by

$$y(\boldsymbol{x}) = \mathrm{E}[y \mid d = 1, \boldsymbol{x}] - \mathrm{E}[y \mid d = 0, \boldsymbol{x}], \quad (2.53)$$

which is the difference between the conditional expected response for the treated group ($d = 1$) and the control group ($d = 0$), given a set of covariates ($\boldsymbol{x}$). Athey and Imbens (2016) propose a causal tree framework to estimate $y(\boldsymbol{x})$. It is an extension of the classification and regression tree methods described above. Unlike trees, where the variable split is chosen by minimising the RSS or Gini index, causal trees choose the variable split (left and right) that maximises the squared difference between the estimated treatment effects; that is, they maximise

$$\sum_{\text{left}} (\bar{y}_1 - \bar{y}_0)^2 + \sum_{\text{right}} (\bar{y}_1 - \bar{y}_0)^2, \quad (2.54)$$

where $\bar{y}_d$ is the sample mean of observations with treatment $d$ in the relevant child node. Furthermore, Athey and Imbens (2016) use two samples to build a tree: the first sample determines the variable splits, and the second is used to re-estimate the treatment effects conditional on the splits. Using this setup, Athey and Imbens (2016) derive approximately normal sampling distributions for $y(\boldsymbol{x})$. An R package called causalTree is available for implementing causal trees.
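The split criterion in equation (2.54) can be sketched as below, under the reading that the sums weight each child node's squared treatment-effect estimate by its number of observations; the data are simulated, the candidate grid is arbitrary, and the sketch omits the honest sample-splitting step that Athey and Imbens (2016) use for estimation and inference.

```python
# Evaluating the causal-tree split criterion (2.54) for candidate thresholds.
import numpy as np

def causal_split_criterion(x, d, y, kappa):
    total = 0.0
    for child in (x <= kappa, x > kappa):
        treated, control = y[child & (d == 1)], y[child & (d == 0)]
        if treated.size == 0 or control.size == 0:
            return -np.inf                         # invalid split: a child lacks one group
        tau_hat = treated.mean() - control.mean()  # estimated effect in this child node
        total += child.sum() * tau_hat ** 2
    return total

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 1000)
d = rng.integers(0, 2, 1000)
y = d * (x > 5) + rng.normal(0, 0.5, 1000)         # treatment effect only when x > 5
grid = np.quantile(x, np.linspace(0.1, 0.9, 30))
best_kappa = max(grid, key=lambda k: causal_split_criterion(x, d, y, k))
```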

