
In document ECONOMETRICS with MACHINE LEARNING (Pldal 67-73)

Nonlinear Econometric Models with Machine Learning

2.2 Regularization for Nonlinear Econometric Models

2.2.3 Estimation, Tuning Parameter and Asymptotic Properties

The log-likelihood function for the Poisson model is

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ -\exp(\mathbf{x}_i'\boldsymbol{\beta}) + y_i(\mathbf{x}_i'\boldsymbol{\beta}) - \log(y_i!) \right],$$

where the last term, log(y_i!), does not depend on β and is often omitted for purposes of estimation, as it does not affect the computation of the estimator. The shrinkage estimator with a particular regularizer for the Poisson model is defined as

$$\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} \sum_{i=1}^{n} \left[ -\exp(\mathbf{x}_i'\boldsymbol{\beta}) + y_i(\mathbf{x}_i'\boldsymbol{\beta}) \right] - \lambda\, p(\boldsymbol{\beta}; \boldsymbol{\alpha}). \quad (2.18)$$
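To make Equation (2.18) concrete, the sketch below fits a LASSO-penalised Poisson regression by proximal gradient ascent. This is an illustrative implementation, not code from the text; the 1/n scaling of the objective, the step size and the iteration count are choices made here for simplicity.

```python
import numpy as np

def fit_poisson_lasso(X, y, lam, lr=0.01, n_iter=5000):
    """Maximise (1/n) * sum_i [-exp(x_i'b) + y_i x_i'b] - lam * ||b||_1,
    an L1-penalised version of the Poisson objective in (2.18)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)                         # conditional mean exp(x_i'b)
        beta = beta + lr * X.T @ (y - mu) / len(y)    # gradient ascent step
        # proximal step: soft-thresholding implements the L1 penalty
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam, 0.0)
    return beta

# simulated example: two active and two inactive coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
beta_true = np.array([0.8, 0.0, -0.5, 0.0])
y = rng.poisson(np.exp(X @ beta_true))
beta_hat = fit_poisson_lasso(X, y, lam=0.05)
```

With a modest penalty the active coefficients are recovered up to the usual shrinkage bias, while the inactive ones are pushed towards (or exactly to) zero.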

Tuning Parameter and Cross-Validation

As in the linear case, the tuning parameter, λ, plays an important role in the practical performance of shrinkage estimators for nonlinear models. When the response variable is discrete, the cross-validation procedure introduced in Chapter 1 must be modified for shrinkage estimators with a likelihood objective. The main reason for this modification is that the calculation of 'residuals', which least squares seeks to minimise, is not obvious in the context of nonlinear models, particularly when the response variable is discrete. A different measurement of 'errors', in the form of the deviance, is required to identify the optimal tuning parameter. One such modification is as follows:

1. For each λ value in a sequence of values (λ₁ > λ₂ > ⋯ > λ_T), estimate the model for each fold, leaving one fold out at a time. This produces a vector of estimates for each λ value: β̂₁, …, β̂_T.

2. For each set of estimates, calculate the deviance based on the left-out fold/testing dataset. The deviance is defined as
$$e_{tk} = -2 \sum_{i=1}^{n_k} \log p(y_i \mid \mathbf{x}_i'\hat{\boldsymbol{\beta}}_{tk}), \quad (2.19)$$
where n_k is the number of observations in fold k. This quantity is computed for each fold and across all λ values.

3. Compute the average and standard deviation of the error/deviance resulting from the K folds:
$$\bar{e}_t = \frac{1}{K} \sum_{k} e_{kt}. \quad (2.20)$$
The above average represents the average error/deviance for each λ.
$$sd(\bar{e}_t) = \sqrt{\frac{1}{K-1} \sum_{k} \left( e_{kt} - \bar{e}_t \right)^2}. \quad (2.21)$$
This is the standard deviation of the average error associated with each λ.

4. Choose the best λ_t value based on the measures provided.
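The four steps above can be sketched in code. The example below uses the Poisson deviance of Equation (2.19), with a simple proximal-gradient lasso fit standing in for whichever shrinkage estimator is being tuned; the fold assignment, λ grid and fitting routine are illustrative choices, not prescriptions from the text.

```python
import numpy as np

def neg_loglik(beta, X, y):
    # Poisson negative log-likelihood, with the constant log(y!) dropped
    z = X @ beta
    return -np.sum(y * z - np.exp(z))

def fit(X, y, lam, lr=0.01, n_iter=2000):
    # minimal lasso-penalised Poisson fit via proximal gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta = beta + lr * X.T @ (y - np.exp(X @ beta)) / len(y)
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam, 0.0)
    return beta

def cv_deviance(X, y, lambdas, K=5, seed=1):
    """Steps 1-3: deviance e_tk on each held-out fold, for each lambda."""
    folds = np.random.default_rng(seed).integers(0, K, size=len(y))
    dev = np.empty((len(lambdas), K))
    for t, lam in enumerate(lambdas):
        for k in range(K):
            beta = fit(X[folds != k], y[folds != k], lam)
            dev[t, k] = 2.0 * neg_loglik(beta, X[folds == k], y[folds == k])
    # step 3: mean and standard deviation of the deviance across folds
    return dev.mean(axis=1), dev.std(axis=1, ddof=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = rng.poisson(np.exp(X @ np.array([0.8, 0.0, -0.5, 0.0])))
means, sds = cv_deviance(X, y, lambdas=[0.01, 0.1, 10.0])
best = int(np.argmin(means))   # step 4: smallest mean deviance
```

A very large λ forces an all-zero fit, so its held-out deviance should exceed that of the lightly penalised fits; the one-standard-deviation rule discussed below would instead pick the largest λ whose mean deviance lies within `sds[best]` of `means[best]`.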

Given the above procedure, it is often useful to plot the average error rates for each λ value. Figure 2.1 plots the average misclassification error corresponding to a sequence of log λ_t values.⁴ Based on this, we can identify the value of log λ that minimises the average error.⁵ In order to be conservative (apply a slightly higher penalty), a log λ value one standard deviation above the minimising value can also be selected. The vertical dotted lines in Figure 2.1 show both these values.

⁴ This figure is reproduced from Friedman, Hastie and Tibshirani (2010).

⁵ The process discussed here is known to be unstable for moderate sample sizes. In order to ensure robustness, the K-fold process can be repeated n₁ times and λ can be obtained as the average of these repeated K-fold cross-validations.

Fig. 2.1: LASSO CV Plot

Asymptotic Properties and Statistical Inference

The development of asymptotic properties, such as the Oracle properties defined in Chapter 1, for shrinkage estimators with nonlinear least squares or likelihood objectives is still in its infancy. To the best of the author's knowledge, the most general result to date is provided in Fan, Xue and Zou (2014), where the authors obtained Oracle properties for shrinkage estimators with convex objectives and concave regularizers.

This covers the case of LASSO, Ridge, SCAD and Adaptive LASSO with likelihood objectives for the Logit, Probit and Poisson models.

However, as discussed in Chapter 1, the practical usefulness of the Oracle properties has been scrutinised by Leeb and Pötscher (2005) and Leeb and Pötscher (2008).

The use of point-wise convergence in developing these results means the number of observations required for the Oracle properties to be relevant depends on the true parameters, which are unknown in practice. As such, it is unclear if valid inference can be obtained purely based on the Oracle properties.

For the linear model, Chapter 1 shows that it is possible to obtain valid inference for the Partially Penalised Estimator, at least for the subset of the parameter vector that is not subject to regularization. Shi, Song, Chen and Li (2019) show that the same idea can also be applied to shrinkage estimators with likelihood objectives for Generalised Linear Models with canonical links. Specifically, their results apply to response variables with a probability distribution function of the form


$$f(y_i; \mathbf{x}_i) = \exp\left( \frac{y_i \mathbf{x}_i'\boldsymbol{\beta} - \psi(\mathbf{x}_i'\boldsymbol{\beta})}{\phi_0} \right) \delta(y_i) \quad (2.22)$$

for some smooth functions ψ and δ. This specification includes the Logit and Poisson models as special cases, but not the Probit model.
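As a quick check of the Poisson special case: with ψ(z) = exp(z), φ₀ = 1 and δ(y) = 1/y! (choices assumed here for illustration, not stated in the text), Equation (2.22) reduces to the Poisson probability mass function:

```python
import math

def f_exp_family(y, z):
    # Equation (2.22) with psi(z) = exp(z), phi0 = 1, delta(y) = 1/y!
    return math.exp(y * z - math.exp(z)) / math.factorial(y)

def poisson_pmf(y, mu):
    # direct Poisson pmf with mean mu
    return mu ** y * math.exp(-mu) / math.factorial(y)

z = 0.7                  # the linear index x_i' beta
mu = math.exp(z)         # implied Poisson mean
# the two expressions agree for every count y
diffs = [abs(f_exp_family(y, z) - poisson_pmf(y, mu)) for y in range(8)]
```

The algebra behind the check: exp(yz − e^z)/y! = (e^z)^y e^{−e^z}/y!, the Poisson pmf with mean e^z.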

Given Equation (2.22), the corresponding log-likelihood function can be derived in the usual manner. Let β = (β₁′, β₂′)′, where β₁ and β₂ are p₁ × 1 and p₂ × 1 sub-vectors with p = p₁ + p₂. Assume one wishes to test the hypothesis that Bβ₁ = C, where B and C are an r × p₁ matrix and an r × 1 vector, respectively. Consider the following Partially Penalised Estimator

$$\left( \hat{\boldsymbol{\beta}}_{PR}, \hat{\boldsymbol{\gamma}}_{PR} \right) = \arg\max_{\boldsymbol{\beta}, \boldsymbol{\gamma}} \log L(\mathbf{X}\boldsymbol{\beta}, \boldsymbol{\gamma}; \mathbf{y}) - \lambda\, p(\boldsymbol{\beta}_2; \boldsymbol{\alpha}) \quad (2.23)$$

and the restricted Partially Penalised Estimator, where β₁ is assumed to satisfy the restriction Bβ₁ = C,

$$\left( \hat{\boldsymbol{\beta}}_{RPR}, \hat{\boldsymbol{\gamma}}_{RPR} \right) = \arg\max_{\boldsymbol{\beta}, \boldsymbol{\gamma}} \log L(\mathbf{X}\boldsymbol{\beta}, \boldsymbol{\gamma}; \mathbf{y}) - \lambda\, p(\boldsymbol{\beta}_2; \boldsymbol{\alpha}) \quad (2.24)$$
$$\text{s.t. } \mathbf{B}\boldsymbol{\beta}_1 = \mathbf{C}, \quad (2.25)$$

where p(β; α) is a folded concave regularizer, which includes Bridge with γ > 1 and SCAD as special cases. Then the likelihood ratio test statistic satisfies

$$2\left[ \log L(\hat{\boldsymbol{\beta}}_{PR}, \hat{\boldsymbol{\gamma}}_{PR}) - \log L(\hat{\boldsymbol{\beta}}_{RPR}, \hat{\boldsymbol{\gamma}}_{RPR}) \right] \xrightarrow{d} \chi^2(r). \quad (2.26)$$

The importance of the formulation above is that β₁, the subset of parameters subject to the restriction/hypothesis Bβ₁ = C, is not part of the regularization. This is similar to the linear case, where the subset of parameters to be tested is not part of the regularization. In this case, Shi et al. (2019) show that the likelihood ratio test defined in Equation (2.26) has an asymptotic χ² distribution.

The result stated in Equation (2.26), however, only applies when the response variable has a distribution function of the form of Equation (2.22) with a regularizer that belongs to the folded concave family. While SCAD belongs to this family, other popular regularizers, such as LASSO, Adaptive LASSO and Bridge with γ ≤ 1, do not. Thus, from the perspective of econometrics, the result above is quite limited, as it is relevant only to Logit and Poisson models with a SCAD regularizer. It does not cover the Probit model or other popular models and regularizers that are relevant in econometric analysis. Hypothesis testing using shrinkage estimators for the cases that are relevant in econometrics is therefore still an open problem.
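To see the logic of Equation (2.26) in the simplest possible setting, the stylized check below sets the penalty to zero (λ = 0) and uses an unpenalised Poisson likelihood, so that the statistic reduces to the classical likelihood ratio test of restricting one coefficient to zero; the data-generating design and the Newton solver are assumptions of this sketch, not part of the text.

```python
import numpy as np

def poisson_mle(X, y, n_iter=30):
    # Newton-Raphson for the unpenalised Poisson log-likelihood
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        beta = beta + np.linalg.solve(X.T @ (X * mu[:, None]), X.T @ (y - mu))
    return beta

def loglik(beta, X, y):
    z = X @ beta
    return np.sum(y * z - np.exp(z))  # log(y!) constant cancels in the ratio

def lr_stat(X, y, keep):
    """2[logL(unrestricted) - logL(restricted)]: the restricted model keeps
    only the columns in `keep`, i.e. restricts the others to zero."""
    full = loglik(poisson_mle(X, y), X, y)
    restricted = loglik(poisson_mle(X[:, keep], y), X[:, keep], y)
    return 2.0 * (full - restricted)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(1000), rng.normal(size=(1000, 2))])
y = rng.poisson(np.exp(X @ np.array([0.2, 0.0, 0.5])))  # column 1 irrelevant
stat_null = lr_stat(X, y, keep=[0, 2])  # true restriction: ~ chi2(1)
stat_alt = lr_stat(X, y, keep=[0, 1])   # false restriction: large statistic
```

Under the true restriction the statistic behaves like a χ²(1) draw, while restricting the genuinely relevant coefficient produces a statistic far out in the tail.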

However, combining the approach proposed in Chernozhukov, Hansen and Spindler (2015), as discussed in Chapter 1, with those considered in Shi et al. (2019) may prove useful in progressing this line of research. For example, it is possible to derive the asymptotic distribution of the Partially Penalised Estimator for the models considered in Shi et al. (2019) using the Immunization Condition approach of Chernozhukov et al. (2015), as shown in Propositions 2.1 and 2.2.

Proposition 2.1 Let y_i be a random variable with the conditional distribution defined in Equation (2.22) and consider the following Partially Penalised Estimator with likelihood objective,

$$\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} S(\boldsymbol{\beta}) = l(\boldsymbol{\beta}) - \lambda\, p(\boldsymbol{\beta}_2) \quad (2.27)$$
$$l(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \mathbf{x}_i'\boldsymbol{\beta} - \psi(\mathbf{x}_i'\boldsymbol{\beta}) \right], \quad (2.28)$$

where β = (β₁′, β₂′)′ and x_i = (x₁ᵢ′, x₂ᵢ′)′, such that x_i′β = x₁ᵢ′β₁ + x₂ᵢ′β₂, and p(β) denotes the regularizer, such that $\frac{\partial^2 p}{\partial \boldsymbol{\beta}_2 \partial \boldsymbol{\beta}_2'}$ exists.

Under the assumptions that x_i is stationary with finite mean and variance and there exists a well-defined μ for all β, such that

$$\boldsymbol{\mu} = -\left( \frac{\partial \psi}{\partial z_i} \mathbf{x}_{1i}\mathbf{x}_{2i}' \right) \left( \frac{\partial \psi}{\partial z_i} \mathbf{x}_{2i}\mathbf{x}_{2i}' + \lambda \frac{\partial^2 p}{\partial \boldsymbol{\beta}_2 \partial \boldsymbol{\beta}_2'} \right)^{-1}, \quad (2.29)$$

where z_i = x_i′β, then

$$\sqrt{n}\left( \hat{\boldsymbol{\beta}}_1 - \boldsymbol{\beta}_0 \right) \xrightarrow{d} N\left( 0, \boldsymbol{\Gamma}^{-1}\boldsymbol{\Omega}\boldsymbol{\Gamma}^{-1} \right), \quad (2.30)$$

where

$$\boldsymbol{\Gamma} = -\frac{\partial^2 \psi}{\partial z_i^2} \mathbf{x}_{1i}\mathbf{x}_{1i}' + \left( \frac{\partial^2 \psi}{\partial z_i^2} \right)^2 \mathbf{x}_{1i}\mathbf{x}_{2i}' \left( \frac{\partial^2 \psi}{\partial z_i^2} \mathbf{x}_{2i}\mathbf{x}_{2i}' - \lambda \frac{\partial^2 p}{\partial \boldsymbol{\beta}_2 \partial \boldsymbol{\beta}_2'} \right)^{-1} \mathbf{x}_{2i}\mathbf{x}_{1i}',$$
$$\boldsymbol{\Omega} = Var\left( \sqrt{n}\, S(\hat{\boldsymbol{\beta}}) \right).$$

Proof See Appendix. □

The result above is quite general and covers the Logit model, where ψ(x_i′β) = log[1 + exp(x_i′β)], and the Poisson model, where ψ(x_i′β) = exp(x_i′β), with various regularizers including Bridge and SCAD. However, Proposition 2.1 does not cover the Probit model. The same approach can be used to derive the asymptotic distribution of its shrinkage estimators with likelihood objective, as shown in Proposition 2.2.
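The ψ functions just listed can be checked numerically: in the family of Equation (2.22), ψ′(z) is the conditional mean, so the Logit choice should differentiate to the logistic function and the Poisson choice to the exponential. The finite-difference check below is an illustration, not part of the original text.

```python
import numpy as np

def psi_logit(z):
    return np.log1p(np.exp(z))      # psi for the Logit model

def psi_poisson(z):
    return np.exp(z)                # psi for the Poisson model

def dpsi(psi, z, h=1e-6):
    # central finite-difference approximation to psi'(z)
    return (psi(z + h) - psi(z - h)) / (2.0 * h)

z = 0.3
logit_mean = 1.0 / (1.0 + np.exp(-z))   # logistic function, E[y|x] for Logit
poisson_mean = np.exp(z)                # conditional mean for Poisson
```

At z = 0.3 the numerical derivatives of `psi_logit` and `psi_poisson` match `logit_mean` and `poisson_mean`, confirming that ψ′ generates the conditional mean in both special cases.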

Proposition 2.2 Consider the following Partially Penalised Estimator for a Probit model

$$\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} S(\boldsymbol{\beta}) = l(\boldsymbol{\beta}) - \lambda\, p(\boldsymbol{\beta}_2) \quad (2.31)$$
$$l(\boldsymbol{\beta}) = \sum_{i=1}^{n} y_i \log \Phi(\mathbf{x}_i'\boldsymbol{\beta}) + (1 - y_i) \log\left[ 1 - \Phi(\mathbf{x}_i'\boldsymbol{\beta}) \right], \quad (2.32)$$

where β = (β₁′, β₂′)′ and x_i = (x₁ᵢ′, x₂ᵢ′)′, such that x_i′β = x₁ᵢ′β₁ + x₂ᵢ′β₂, and p(β) denotes the regularizer, such that $\frac{\partial^2 p}{\partial \boldsymbol{\beta}_2 \partial \boldsymbol{\beta}_2'}$ exists.

Under the assumptions that x_i is stationary with finite mean and variance and there exists a well-defined μ for all β, such that

$$\boldsymbol{\mu} = -\boldsymbol{\Lambda}(\boldsymbol{\beta})\,\boldsymbol{\Theta}(\boldsymbol{\beta})^{-1}, \quad (2.33)$$

where

$$\boldsymbol{\Lambda}(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left\{ -y_i \left[ z_i + \phi(z_i) \right] \eta(z_i) + (1 - y_i) \left[ \phi(z_i) - z_i \right] \xi(z_i) \right\} \mathbf{x}_{1i}\mathbf{x}_{2i}',$$
$$\boldsymbol{\Theta}(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left\{ -y_i \left[ z_i + \phi(z_i) \right] \eta(z_i) + (1 - y_i) \left[ \phi(z_i) - z_i \right] \xi(z_i) \right\} \mathbf{x}_{2i}\mathbf{x}_{2i}' - \lambda \frac{\partial^2 p}{\partial \boldsymbol{\beta}_2 \partial \boldsymbol{\beta}_2'},$$
$$\eta(z_i) = \frac{\phi(z_i)}{\Phi(z_i)}, \qquad \xi(z_i) = \frac{\phi(z_i)}{1 - \Phi(z_i)},$$

where z_i = x_i′β, and φ(x) and Φ(x) denote the probability density and cumulative distribution functions of the standard normal distribution, respectively. Then

$$\sqrt{n}\left( \hat{\boldsymbol{\beta}}_1 - \boldsymbol{\beta}_0 \right) \xrightarrow{d} N\left( 0, \boldsymbol{\Gamma}^{-1}\boldsymbol{\Omega}\boldsymbol{\Gamma}^{-1} \right), \quad (2.34)$$

where

$$\boldsymbol{\Gamma} = \mathbf{A}_1 + \mathbf{A}_2 \mathbf{B}^{-1} \mathbf{A}_2',$$
$$\mathbf{A}_1 = \sum_{i=1}^{n} \left\{ -y_i \left[ z_i + \phi(z_i) \right] \eta(z_i) + (1 - y_i) \left[ \phi(z_i) - z_i \right] \xi(z_i) \right\} \mathbf{x}_{1i}\mathbf{x}_{1i}',$$
$$\mathbf{A}_2 = \sum_{i=1}^{n} \left\{ -y_i \left[ z_i + \phi(z_i) \right] \eta(z_i) + (1 - y_i) \left[ \phi(z_i) - z_i \right] \xi(z_i) \right\} \mathbf{x}_{1i}\mathbf{x}_{2i}',$$
$$\mathbf{B} = \sum_{i=1}^{n} \left\{ -y_i \left[ z_i + \phi(z_i) \right] \eta(z_i) + (1 - y_i) \left[ \phi(z_i) - z_i \right] \xi(z_i) \right\} \mathbf{x}_{2i}\mathbf{x}_{2i}' - \lambda \frac{\partial^2 p}{\partial \boldsymbol{\beta}_2 \partial \boldsymbol{\beta}_2'},$$
$$\boldsymbol{\Omega} = Var\left( \sqrt{n}\, S(\hat{\boldsymbol{\beta}}) \right).$$

Proof See Appendix. □
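The functions η and ξ appearing in Proposition 2.2 are the two inverse Mills ratios of the Probit model and are straightforward to evaluate. The scalar implementation below is a sketch; for extreme values of z a log-cdf formulation would be needed for numerical stability.

```python
import math

def phi(z):
    # standard normal probability density
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    # standard normal cumulative distribution via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def eta(z):
    # eta(z) = phi(z) / Phi(z), as in Proposition 2.2
    return phi(z) / Phi(z)

def xi(z):
    # xi(z) = phi(z) / (1 - Phi(z))
    return phi(z) / (1.0 - Phi(z))
```

By the symmetry of the normal distribution, η(−z) = ξ(z), and both equal 2φ(0) ≈ 0.798 at z = 0.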

Given the results in Propositions 2.1 and 2.2, it is now possible to obtain the asymptotic distribution of the Partially Penalised Estimators for the Logit, Probit and Poisson models under Bridge and SCAD. This should facilitate valid inference for the subset of parameters that is not part of the regularization for these models. These results should also lead to results similar to those derived in Shi et al. (2019), which would further facilitate inference on parameter restrictions using conventional techniques such as the likelihood ratio test. The formal proof of these results is left for future research.
