
4.5 Nonlinear Models

4.5.1 Feedforward Neural Networks

In such cases it was suggested that one should include only a small, fixed number of predictors, $\tilde{q}$, such as five or ten. Nevertheless, the number of models is still very large: for example, with $p = 30$ and $q = 8$, there are 5,852,925 regressions to be estimated. An alternative solution is to follow Garcia et al. (2017) and Medeiros et al. (2021) and adopt a strategy similar to the one used for Bagging high-dimensional models. The idea is to start by fitting a regression of $Y_{t+h}$ on each of the candidate variables and saving the $t$-statistic of each variable. The $t$-statistics are ranked by absolute value, and we select the $\tilde{p}$ variables that are most relevant in the ranking. The CSR forecast is then calculated on these variables for different values of $q$. Another possibility is to pre-select the variables by the elastic net or another selection method; see Chapter 1 for details.
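As a rough illustration of this pre-selection step, the sketch below ranks candidate predictors by the absolute $t$-statistic from univariate regressions of the target on each candidate and keeps the top $\tilde{p}$ of them. It is a hypothetical sketch, not the implementation used by Garcia et al. (2017) or Medeiros et al. (2021); the simulated data are only there to make it self-contained.

```python
import numpy as np

def preselect_by_tstat(y, X, p_tilde):
    """Rank candidates by the absolute t-statistic of univariate regressions
    of y on each column of X (with intercept) and keep the top p_tilde."""
    n, p = X.shape
    tstats = np.empty(p)
    for j in range(p):
        Z = np.column_stack([np.ones(n), X[:, j]])
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ coef
        sigma2 = resid @ resid / (n - 2)
        se = np.sqrt(sigma2 * np.linalg.inv(Z.T @ Z)[1, 1])
        tstats[j] = coef[1] / se
    return np.argsort(-np.abs(tstats))[:p_tilde]   # indices of selected variables

# example: keep the 10 most relevant of 30 candidates
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = X[:, 3] + rng.normal(size=200)
print(preselect_by_tstat(y, X, 10))
```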

Figure 4.2 gives a graphical representation of a neural network with a single hidden layer and four input variables; the green circles indicate the inputs, while the blue and red circles indicate the hidden and output layers, respectively. In the example, there are five elements (called neurons in the NN jargon) in the hidden layer. The arrows from the green to the blue circles represent the linear combinations of inputs: $\boldsymbol{\gamma}_j'\boldsymbol{X}_t + \gamma_{0,j}$, $j = 1, \ldots, 5$. Finally, the arrows from the blue to the red circles represent the linear combination of the outputs from the hidden layer:

$$\beta_0 + \sum_{j=1}^{5} \beta_j S(\boldsymbol{\gamma}_j'\boldsymbol{X}_t + \gamma_{0,j}).$$

Fig. 4.2: Graphical representation of a single hidden layer neural network.
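As a minimal illustration, the forward pass of the network in Figure 4.2 can be coded directly. The logistic activation and the random weights below are placeholder assumptions used only to make the snippet self-contained; they are not estimates from any particular data set.

```python
import numpy as np

def logistic(x):
    """Logistic squashing function S(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, Gamma, gamma0, beta, beta0, S=logistic):
    """Single-hidden-layer forward pass:
    beta0 + sum_j beta_j * S(gamma_j' x + gamma0_j)."""
    hidden = S(Gamma @ x + gamma0)      # (J,) hidden-layer outputs
    return beta0 + beta @ hidden        # scalar output

# Example with p = 4 inputs and J = 5 neurons, as in Figure 4.2
rng = np.random.default_rng(0)
x = rng.normal(size=4)                  # X_t
Gamma = rng.normal(size=(5, 4))         # rows are gamma_j'
gamma0 = rng.normal(size=5)             # intercepts gamma_{0,j}
beta, beta0 = rng.normal(size=5), 0.1   # output-layer weights
print(forward(x, Gamma, gamma0, beta, beta0))
```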

There are several possible choices for the activation function. In the early days, $S(\cdot)$ was chosen among the class of squashing functions as per the definition below.

Definition (Squashing (sigmoid) function) A function $S: \mathbb{R} \longrightarrow [a, b]$, $a < b$, is a squashing (sigmoid) function if it is non-decreasing, $\lim_{x \to \infty} S(x) = b$, and $\lim_{x \to -\infty} S(x) = a$. □

Historically, the most popular choices are the logistic and hyperbolic tangent functions:

$$\text{Logistic: } S(x) = \frac{1}{1 + \exp(-x)}, \qquad \text{Hyperbolic tangent: } S(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}.$$

The popularity of such functions was partially due to theoretical results on function approximation. Funahashi (1989) establishes that NN models as in (4.31) with generic squashing functions are capable of approximating any continuous function from one finite-dimensional space to another to any desired degree of accuracy, provided that $J_T$ is sufficiently large. Cybenko (1989) and Hornik, Stinchcombe and White (1989) simultaneously proved the approximation capabilities of NN models for any Borel measurable function, and Hornik et al. (1989) extended the previous results by showing that NN models are also capable of approximating the derivatives of the unknown function. Barron (1993) relates the previous results to the number of terms in the model.

Stinchcombe and White (1989) and Park and Sandberg (1991) derived the same results as Cybenko (1989) and Hornik et al. (1989), but without requiring the activation function to be sigmoid. While the former considered a very general class of functions, the latter focused on radial-basis functions (RBF), defined as:

$$\text{Radial Basis: } S(x) = \exp(-x^2).$$

More recently, Yarotsky (2017) showed that rectified linear units (ReLU), defined as

$$\text{Rectified Linear Unit: } S(x) = \max(0, x),$$

are also universal approximators. The ReLU activation function is one of the most popular choices among practitioners due to the following advantages:

1. Estimating NN models with ReLU functions is computationally more efficient than with other typical choices, such as the logistic or hyperbolic tangent. One reason behind this improvement in performance is that the output of the ReLU function is zero whenever the inputs are negative. Thus, fewer units (neurons) are activated, leading to network sparsity.

2. In terms of mathematical operations, the ReLU function involves simpler operations than the hyperbolic tangent and logistic functions, which also improves computational efficiency.

3. Activation functions like the hyperbolic tangent and the logistic functions may suffer from the vanishing gradient problem, where gradients shrink drastically during optimization, such that the estimates are no longer improved. ReLU avoids this by preserving the gradient, since it is an unbounded function.

However, ReLU functions suffer from the dying activation problem: many ReLU units yield output values of zero, which happens when the ReLU inputs are negative.

While this characteristic gives ReLU its strengths (through network sparsity), it becomes a problem when most of the inputs to these ReLU units are in the negative range. The worst-case scenario is when the entire network dies, meaning that it becomes just a constant function. A solution to this problem is to use some modified versions of the ReLU function, such as the Leaky ReLU (LeReLU):

$$\text{Leaky ReLU: } S(x) = \max(\alpha x, x), \quad \text{where } 0 < \alpha < 1.$$
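The activation functions discussed above are straightforward to implement; the short NumPy sketch below collects them in one place. The value alpha = 0.01 for the Leaky ReLU is an illustrative choice, not one prescribed by the text.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))          # squashes into (0, 1)

def tanh(x):
    return np.tanh(x)                          # squashes into (-1, 1)

def radial_basis(x):
    return np.exp(-x**2)                       # RBF activation

def relu(x):
    return np.maximum(0.0, x)                  # zero for negative inputs

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)            # small slope for negative inputs

x = np.linspace(-3, 3, 7)
for f in (logistic, tanh, radial_basis, relu, leaky_relu):
    print(f.__name__, np.round(f(x), 3))
```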

For each estimation window, $t = \max(r, s, h) + 1, \ldots, T - h$, model (4.31) can be written in matrix notation. Let $\boldsymbol{\Gamma} = (\tilde{\boldsymbol{\gamma}}_1, \ldots, \tilde{\boldsymbol{\gamma}}_J)$ be a $(p+1) \times J$ matrix,

$$
\boldsymbol{X} = \underbrace{\begin{pmatrix}
1 & X_{1,t-R-h+1} & \cdots & X_{p,t-R-h+1} \\
1 & X_{1,t-R-h+2} & \cdots & X_{p,t-R-h+2} \\
\vdots & \vdots & & \vdots \\
1 & X_{1,t-h} & \cdots & X_{p,t-h}
\end{pmatrix}}_{R \times (p+1)}, \quad \text{and} \quad
\mathcal{O}(\boldsymbol{X}\boldsymbol{\Gamma}) = \underbrace{\begin{pmatrix}
1 & S(\tilde{\boldsymbol{\gamma}}_1'\tilde{\boldsymbol{X}}_{t-R-h+1}) & \cdots & S(\tilde{\boldsymbol{\gamma}}_J'\tilde{\boldsymbol{X}}_{t-R-h+1}) \\
1 & S(\tilde{\boldsymbol{\gamma}}_1'\tilde{\boldsymbol{X}}_{t-R-h+2}) & \cdots & S(\tilde{\boldsymbol{\gamma}}_J'\tilde{\boldsymbol{X}}_{t-R-h+2}) \\
\vdots & \vdots & & \vdots \\
1 & S(\tilde{\boldsymbol{\gamma}}_1'\tilde{\boldsymbol{X}}_{t-h}) & \cdots & S(\tilde{\boldsymbol{\gamma}}_J'\tilde{\boldsymbol{X}}_{t-h})
\end{pmatrix}}_{R \times (J+1)}.
$$

Therefore, by defining $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_J)'$, the output of a feedforward NN is given by:

$$
\boldsymbol{f}_D(\boldsymbol{X}, \boldsymbol{\theta}) = [f_D(\boldsymbol{X}_{t-R-h+1}; \boldsymbol{\theta}), \ldots, f_D(\boldsymbol{X}_{t-h}; \boldsymbol{\theta})]'
= \begin{pmatrix}
\beta_0 + \sum_{j=1}^{J} \beta_j S(\boldsymbol{\gamma}_j'\boldsymbol{X}_{t-R-h+1} + \gamma_{0,j}) \\
\vdots \\
\beta_0 + \sum_{j=1}^{J} \beta_j S(\boldsymbol{\gamma}_j'\boldsymbol{X}_{t-h} + \gamma_{0,j})
\end{pmatrix}
= \mathcal{O}(\boldsymbol{X}\boldsymbol{\Gamma})\boldsymbol{\beta}. \tag{4.32}
$$
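A compact way to see (4.32) is to build the $R \times (J+1)$ matrix $\mathcal{O}(\boldsymbol{X}\boldsymbol{\Gamma})$ explicitly and multiply it by $\boldsymbol{\beta}$. The NumPy sketch below does exactly that; the dimensions, the logistic activation, and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def hidden_design(X, Gamma, S=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """Build O(X Gamma): a column of ones followed by S(gamma~_j' X~_t), j = 1..J."""
    return np.column_stack([np.ones(X.shape[0]), S(X @ Gamma)])

R, p, J = 50, 4, 5
X = np.column_stack([np.ones(R), rng.normal(size=(R, p))])  # R x (p+1), first column of ones
Gamma = rng.normal(size=(p + 1, J))     # (p+1) x J; first row holds the intercepts gamma_{0,j}
beta = rng.normal(size=J + 1)           # (beta_0, beta_1, ..., beta_J)

f = hidden_design(X, Gamma) @ beta      # R-vector of fitted values, as in (4.32)
print(f.shape)                          # (50,)
```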

The number of hidden units (neurons), $J$, and the choice of activation function are known as the architecture of the NN model. Once the architecture is defined, the dimension of the parameter vector $\boldsymbol{\theta} = [\operatorname{vec}(\boldsymbol{\Gamma})', \boldsymbol{\beta}']'$ is $D = (p+1) \times J + (J+1)$, which can easily become very large, so that the unrestricted estimation problem defined as

$$\widehat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta} \in \mathbb{R}^D} \|\boldsymbol{Y} - \mathcal{O}(\boldsymbol{X}\boldsymbol{\Gamma})\boldsymbol{\beta}\|_2^2$$

is unfeasible. A solution is to use regularization as in the case of linear models and consider the minimization of the following function:

$$Q(\boldsymbol{\theta}) = \|\boldsymbol{Y} - \mathcal{O}(\boldsymbol{X}\boldsymbol{\Gamma})\boldsymbol{\beta}\|_2^2 + p(\boldsymbol{\theta}), \tag{4.33}$$

where usually $p(\boldsymbol{\theta}) = \lambda \boldsymbol{\theta}'\boldsymbol{\theta}$. Traditionally, the most common approach to minimizing (4.33) is to use Bayesian methods, as in MacKay (1992) and Foresee and Hagan (1997). See also Chapter 2 for more examples of regularization with nonlinear models.
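For concreteness, the penalized objective (4.33) with the ridge-type penalty $p(\boldsymbol{\theta}) = \lambda\boldsymbol{\theta}'\boldsymbol{\theta}$ can be evaluated as below. This is a hypothetical sketch of the loss only; the function name, the way $\boldsymbol{\theta}$ is packed, and the logistic activation are assumptions, and the actual minimization (e.g. by gradient-based methods) is left aside.

```python
import numpy as np

def penalized_loss(theta, Y, X, J, lam, S=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """Q(theta) = ||Y - O(X Gamma) beta||_2^2 + lam * theta'theta, as in (4.33)."""
    p1 = X.shape[1]                              # p + 1 (X carries a column of ones)
    Gamma = theta[:p1 * J].reshape(p1, J)        # first block of theta: vec(Gamma)
    beta = theta[p1 * J:]                        # remaining J + 1 output weights
    O = np.column_stack([np.ones(X.shape[0]), S(X @ Gamma)])
    resid = Y - O @ beta
    return resid @ resid + lam * theta @ theta

# hypothetical call, with X, Gamma and beta built as in the previous snippet:
# theta0 = np.concatenate([Gamma.ravel(), beta])
# Q0 = penalized_loss(theta0, Y, X, J, lam=0.1)
```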

A more modern approach is to use a technique known as Dropout (Srivastava, Hinton, Krizhevsky, Sutskever & Salakhutdinov, 2014). The key idea is to randomly drop neurons (along with their connections) from the neural network during estimation.

A NN with $J$ neurons in the hidden layer can generate $2^J$ possible thinned NNs just by removing some neurons. Dropout samples from these $2^J$ different thinned NNs and trains the sampled networks. To predict the target variable, we use a single unthinned network whose weights are adjusted by the probability law induced by the random drop.

This procedure significantly reduces overfitting and gives major improvements over other regularization methods.

We modify equation (4.31) as

$$\tilde{f}_D(\boldsymbol{X}_t) = \beta_0 + \sum_{j=1}^{J} s_j \beta_j S(\boldsymbol{\gamma}_j'[\boldsymbol{r} \odot \boldsymbol{X}_t] + v_j \gamma_{0,j}),$$

where $s_j$, $v_j$, and the entries of $\boldsymbol{r} = (r_1, \ldots, r_p)'$ are independent Bernoulli random variables, each equal to 1 with probability $q$. The NN model is thus estimated using $\tilde{f}_D(\boldsymbol{X}_t)$ instead of $f_D(\boldsymbol{X}_t)$, where, for each training example, the values of $s_j$, $v_j$, and the entries of $\boldsymbol{r}$ are drawn from the Bernoulli distribution. The final estimates of $\beta_j$, $\boldsymbol{\gamma}_j$, and $\gamma_{0,j}$ are multiplied by $q$.
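The sketch below illustrates how such Bernoulli masks could be applied during training and how the weights are multiplied by $q$ at prediction time. Function names and the choice $q = 0.8$ are hypothetical; the code only mirrors the masking and rescaling described above, not any particular library's dropout implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
q = 0.8                                         # probability that a unit is kept

def dropout_forward(x, Gamma, gamma0, beta, beta0, q, rng,
                    S=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """Training-time forward pass with Bernoulli masks s_j, v_j and r."""
    J, p = Gamma.shape                          # rows of Gamma are gamma_j'
    r = rng.binomial(1, q, size=p)              # input mask r
    s = rng.binomial(1, q, size=J)              # neuron mask s_j
    v = rng.binomial(1, q, size=J)              # intercept mask v_j
    hidden = S(Gamma @ (r * x) + v * gamma0)
    return beta0 + (s * beta) @ hidden

def dropout_predict(x, Gamma, gamma0, beta, beta0, q,
                    S=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """Prediction uses the unthinned network with weights multiplied by q."""
    hidden = S(q * Gamma @ x + q * gamma0)
    return beta0 + (q * beta) @ hidden
```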

4.5.1.2 Deep Neural Networks

A Deep Neural Network model is a straightforward generalization of specification (4.31), where more hidden layers are included in the model, as represented in Figure 4.3. In the figure, we represent a Deep NN with two hidden layers with the same number of hidden units in each. However, the number of hidden units (neurons) can vary across layers.

As pointed out in Mhaskar, Liao and Poggio (2017), while the universal approximation property holds for shallow NNs, deep networks can approximate the class of compositional functions as well as shallow networks can, but with an exponentially lower number of training parameters and sample complexity.

Let $J_\ell$ be the number of hidden units in layer $\ell \in \{1, \ldots, L\}$. For each hidden layer $\ell$ define $\boldsymbol{\Gamma}_\ell = (\tilde{\boldsymbol{\gamma}}_{\ell,1}, \ldots, \tilde{\boldsymbol{\gamma}}_{\ell,J_\ell})$. Then the output $\mathcal{O}_\ell$ of layer $\ell$ is given recursively by

$$
\mathcal{O}_\ell(\mathcal{O}_{\ell-1}(\cdot)\boldsymbol{\Gamma}_\ell) =
\underbrace{\begin{pmatrix}
1 & S(\tilde{\boldsymbol{\gamma}}_{\ell,1}'\mathcal{O}_{1}^{\ell-1}(\cdot)) & \cdots & S(\tilde{\boldsymbol{\gamma}}_{\ell,J_\ell}'\mathcal{O}_{1}^{\ell-1}(\cdot)) \\
1 & S(\tilde{\boldsymbol{\gamma}}_{\ell,1}'\mathcal{O}_{2}^{\ell-1}(\cdot)) & \cdots & S(\tilde{\boldsymbol{\gamma}}_{\ell,J_\ell}'\mathcal{O}_{2}^{\ell-1}(\cdot)) \\
\vdots & \vdots & & \vdots \\
1 & S(\tilde{\boldsymbol{\gamma}}_{\ell,1}'\mathcal{O}_{R}^{\ell-1}(\cdot)) & \cdots & S(\tilde{\boldsymbol{\gamma}}_{\ell,J_\ell}'\mathcal{O}_{R}^{\ell-1}(\cdot))
\end{pmatrix}}_{R \times (J_\ell + 1)},
$$

where $\mathcal{O}_i^{\ell-1}(\cdot)$ denotes the $i$-th row of the output of layer $\ell-1$ and $\mathcal{O}_0 := \boldsymbol{X}$. Therefore, the output of the Deep NN is the composition

$$\boldsymbol{h}_D(\boldsymbol{X}) = \mathcal{O}_L(\cdots \mathcal{O}_3(\mathcal{O}_2(\mathcal{O}_1(\boldsymbol{X}\boldsymbol{\Gamma}_1)\boldsymbol{\Gamma}_2)\boldsymbol{\Gamma}_3) \cdots \boldsymbol{\Gamma}_L)\boldsymbol{\beta}.$$
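A minimal sketch of this recursion is given below. Layer widths, the ReLU activation, and the random weights are illustrative assumptions; the point is only the repeated pattern "prepend a column of ones, multiply by $\boldsymbol{\Gamma}_\ell$, apply $S$", followed by the final linear output layer.

```python
import numpy as np

def add_ones(H):
    return np.column_stack([np.ones(H.shape[0]), H])

def deep_forward(X, Gammas, beta, S=lambda z: np.maximum(0.0, z)):
    """Compose hidden layers: O_l = [1, S(O_{l-1} Gamma_l)], then output O_L beta."""
    O = add_ones(X)                      # O_0 with intercept column
    for Gamma in Gammas:                 # Gamma_l has shape (previous width + 1, J_l)
        O = add_ones(S(O @ Gamma))
    return O @ beta                      # final linear output layer

rng = np.random.default_rng(3)
R, p, widths = 50, 4, [5, 5]             # two hidden layers, as in Figure 4.3
X = rng.normal(size=(R, p))
dims = [p] + widths
Gammas = [rng.normal(size=(d + 1, J)) for d, J in zip(dims, widths)]
beta = rng.normal(size=widths[-1] + 1)
print(deep_forward(X, Gammas, beta).shape)   # (R,)
```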

Fig. 4.3: Deep neural network architecture.

The estimation of the parameters is usually carried out by stochastic gradient descent methods, with dropout to control the complexity of the model.
