
In document ECONOMETRICS with MACHINE LEARNING (Pldal 67-73)

Nonlinear Econometric Models with Machine Learning

2.2 Regularization for Nonlinear Econometric Models

2.2.3 Estimation, Tuning Parameter and Asymptotic Properties

The log-likelihood function for the Poisson model is

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ -\exp(\mathbf{x}_i'\boldsymbol{\beta}) + y_i(\mathbf{x}_i'\boldsymbol{\beta}) - \log(y_i!) \right],$$

where the last term, log(y_i!), does not depend on β and is often omitted for purposes of estimation, as it does not affect the computation of the estimator. The shrinkage estimator with a particular regularizer for the Poisson model is defined as

$$\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} \sum_{i=1}^{n} \left[ -\exp(\mathbf{x}_i'\boldsymbol{\beta}) + y_i(\mathbf{x}_i'\boldsymbol{\beta}) \right] - \lambda\, p(\boldsymbol{\beta}; \boldsymbol{\alpha}). \quad (2.18)$$
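To make Equation (2.18) concrete, the sketch below fits a LASSO-penalised Poisson regression by proximal gradient ascent. This is an illustrative implementation, not code from the text; the 1/n scaling of the objective, the step size and the iteration count are choices made here for simplicity.

```python
import numpy as np

def fit_poisson_lasso(X, y, lam, lr=0.01, n_iter=5000):
    """Maximise (1/n) * sum_i [-exp(x_i'b) + y_i x_i'b] - lam * ||b||_1,
    an L1-penalised version of the Poisson objective in (2.18)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)                         # conditional mean exp(x_i'b)
        beta = beta + lr * X.T @ (y - mu) / len(y)    # gradient ascent step
        # proximal step: soft-thresholding implements the L1 penalty
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam, 0.0)
    return beta

# simulated example: two active and two inactive coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
beta_true = np.array([0.8, 0.0, -0.5, 0.0])
y = rng.poisson(np.exp(X @ beta_true))
beta_hat = fit_poisson_lasso(X, y, lam=0.05)
```

With a modest penalty the active coefficients are recovered up to the usual shrinkage bias, while the inactive ones are pushed towards (or exactly to) zero.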

Tuning Parameter and Cross-Validation

As in the linear case, the tuning parameter, λ, plays an important role in the practical performance of shrinkage estimators for nonlinear models. When the response variable is discrete, the cross-validation procedure introduced in Chapter 1 must be modified for shrinkage estimators with a likelihood objective. The main reason for this modification is that the calculation of 'residuals', which least squares seeks to minimise, is not obvious in the context of nonlinear models, particularly when the response variable is discrete. A different measurement of 'errors', in the form of the deviance, is required to identify the optimal tuning parameter. One such modification is as follows:

1. For each λ value in a sequence of values (λ₁ > λ₂ > ⋯ > λ_T), estimate the model for each fold, leaving one fold out at a time. This produces a vector of estimates for each λ value: β̂₁, …, β̂_T.

2. For each set of estimates, calculate the deviance based on the left-out fold/testing dataset. The deviance is defined as
$$e_{tk} = -2 \sum_{i=1}^{n_k} \log p(y_i \mid \mathbf{x}_i'\hat{\boldsymbol{\beta}}_{tk}), \quad (2.19)$$
where n_k is the number of observations in fold k. This quantity is computed for each fold and across all λ values.

3. Compute the average and standard deviation of the error/deviance resulting from the K folds:
$$\bar{e}_t = \frac{1}{K} \sum_{k} e_{kt}. \quad (2.20)$$
The above average represents the average error/deviance for each λ.
$$sd(\bar{e}_t) = \sqrt{\frac{1}{K-1} \sum_{k} \left( e_{kt} - \bar{e}_t \right)^2}. \quad (2.21)$$
This is the standard deviation of the average error associated with each λ.

4. Choose the best λ_t value based on the measures provided.
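The four steps above can be sketched in code. The example below uses the Poisson deviance of Equation (2.19), with a simple proximal-gradient lasso fit standing in for whichever shrinkage estimator is being tuned; the fold assignment, λ grid and fitting routine are illustrative choices, not prescriptions from the text.

```python
import numpy as np

def neg_loglik(beta, X, y):
    # Poisson negative log-likelihood, with the constant log(y!) dropped
    z = X @ beta
    return -np.sum(y * z - np.exp(z))

def fit(X, y, lam, lr=0.01, n_iter=2000):
    # minimal lasso-penalised Poisson fit via proximal gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta = beta + lr * X.T @ (y - np.exp(X @ beta)) / len(y)
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam, 0.0)
    return beta

def cv_deviance(X, y, lambdas, K=5, seed=1):
    """Steps 1-3: deviance e_tk on each held-out fold, for each lambda."""
    folds = np.random.default_rng(seed).integers(0, K, size=len(y))
    dev = np.empty((len(lambdas), K))
    for t, lam in enumerate(lambdas):
        for k in range(K):
            beta = fit(X[folds != k], y[folds != k], lam)
            dev[t, k] = 2.0 * neg_loglik(beta, X[folds == k], y[folds == k])
    # step 3: mean and standard deviation of the deviance across folds
    return dev.mean(axis=1), dev.std(axis=1, ddof=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = rng.poisson(np.exp(X @ np.array([0.8, 0.0, -0.5, 0.0])))
means, sds = cv_deviance(X, y, lambdas=[0.01, 0.1, 10.0])
best = int(np.argmin(means))   # step 4: smallest mean deviance
```

A very large λ forces an all-zero fit, so its held-out deviance should exceed that of the lightly penalised fits; the one-standard-deviation rule discussed below would instead pick the largest λ whose mean deviance lies within `sds[best]` of `means[best]`.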

Given the above procedure, it is often useful to plot the average error rates for each λ value. Figure 2.1 plots the average misclassification error corresponding to a sequence of log λ_t values.⁴ Based on this, we can identify the value of log λ that minimises the average error.⁵ In order to be conservative (apply a slightly higher penalty), a log λ value one standard deviation above the minimising value can also be selected. The vertical dotted lines in Figure 2.1 show both these values.

⁴ This figure is reproduced from Friedman, Hastie and Tibshirani (2010).

⁵ The process discussed here is known to be unstable for moderate sample sizes. In order to ensure robustness, the K-fold process can be repeated n₁ times and λ can be obtained as the average of these repeated K-fold cross-validations.

Fig. 2.1: LASSO CV Plot

Asymptotic Properties and Statistical Inference

The development of asymptotic properties, such as the Oracle properties defined in Chapter 1, for shrinkage estimators with nonlinear least squares or likelihood objectives is still in its infancy. To the best of the author's knowledge, the most general result to date is provided in Fan, Xue and Zou (2014), where the authors obtained Oracle properties for shrinkage estimators with convex objectives and concave regularizers.

This covers the case of LASSO, Ridge, SCAD and Adaptive LASSO with likelihood objectives for the Logit, Probit and Poisson models.

However, as discussed in Chapter 1, the practical usefulness of the Oracle properties has been scrutinised by Leeb and Pötscher (2005) and Leeb and Pötscher (2008).

The use of point-wise convergence in developing these results means the number of observations required for the Oracle properties to be relevant depends on the true parameters, which are unknown in practice. As such, it is unclear if valid inference can be obtained purely based on the Oracle properties.

For the linear model, Chapter 1 shows that it is possible to obtain valid inference for the Partially Penalised Estimator, at least for the subset of the parameter vector that is not subject to regularization. Shi, Song, Chen and Li (2019) show that the same idea can also be applied to shrinkage estimators with likelihood objectives for Generalised Linear Models with canonical links. Specifically, their results apply to response variables with a probability distribution function of the form


$$f(y_i; \mathbf{x}_i) = \exp\left( \frac{y_i \mathbf{x}_i'\boldsymbol{\beta} - \psi(\mathbf{x}_i'\boldsymbol{\beta})}{\phi_0} \right) \delta(y_i) \quad (2.22)$$

for some smooth functions ψ and δ. This specification includes the Logit and Poisson models as special cases, but not the Probit model.
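As a quick check of the Poisson special case: with ψ(z) = exp(z), φ₀ = 1 and δ(y) = 1/y! (choices assumed here for illustration, not stated in the text), Equation (2.22) reduces to the Poisson probability mass function:

```python
import math

def f_exp_family(y, z):
    # Equation (2.22) with psi(z) = exp(z), phi0 = 1, delta(y) = 1/y!
    return math.exp(y * z - math.exp(z)) / math.factorial(y)

def poisson_pmf(y, mu):
    # direct Poisson pmf with mean mu
    return mu ** y * math.exp(-mu) / math.factorial(y)

z = 0.7                  # the linear index x_i' beta
mu = math.exp(z)         # implied Poisson mean
# the two expressions agree for every count y
diffs = [abs(f_exp_family(y, z) - poisson_pmf(y, mu)) for y in range(8)]
```

The algebra behind the check: exp(yz − e^z)/y! = (e^z)^y e^{−e^z}/y!, the Poisson pmf with mean e^z.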

Given Equation (2.22), the corresponding log-likelihood function can be derived in the usual manner. Let β = (β₁′, β₂′)′, where β₁ and β₂ are p₁ × 1 and p₂ × 1 sub-vectors with p = p₁ + p₂. Assume one wishes to test the hypothesis that Bβ₁ = C, where B and C are an r × p₁ matrix and an r × 1 vector, respectively. Consider the following Partially Penalised Estimator

$$\left( \hat{\boldsymbol{\beta}}_{PR}, \hat{\boldsymbol{\gamma}}_{PR} \right) = \arg\max_{\boldsymbol{\beta}, \boldsymbol{\gamma}} \log L(\mathbf{X}\boldsymbol{\beta}, \boldsymbol{\gamma}; \mathbf{y}) - \lambda\, p(\boldsymbol{\beta}_2; \boldsymbol{\alpha}) \quad (2.23)$$

and the restricted Partially Penalised Estimator, where β₁ is assumed to satisfy the restriction Bβ₁ = C,

$$\left( \hat{\boldsymbol{\beta}}_{RPR}, \hat{\boldsymbol{\gamma}}_{RPR} \right) = \arg\max_{\boldsymbol{\beta}, \boldsymbol{\gamma}} \log L(\mathbf{X}\boldsymbol{\beta}, \boldsymbol{\gamma}; \mathbf{y}) - \lambda\, p(\boldsymbol{\beta}_2; \boldsymbol{\alpha}) \quad (2.24)$$
$$\text{s.t. } \mathbf{B}\boldsymbol{\beta}_1 = \mathbf{C}, \quad (2.25)$$

where p(β; α) is a folded concave regularizer, which includes Bridge with γ > 1 and SCAD as special cases. Then the likelihood ratio test statistic satisfies

$$2\left[ \log L(\hat{\boldsymbol{\beta}}_{PR}, \hat{\boldsymbol{\gamma}}_{PR}) - \log L(\hat{\boldsymbol{\beta}}_{RPR}, \hat{\boldsymbol{\gamma}}_{RPR}) \right] \xrightarrow{d} \chi^2(r). \quad (2.26)$$

The importance of the formulation above is that β₁, the subset of parameters subject to the restriction/hypothesis Bβ₁ = C, is not part of the regularization. This is similar to the linear case, where the subset of parameters to be tested is not part of the regularization. In this case, Shi et al. (2019) show that the likelihood ratio test defined in Equation (2.26) has an asymptotic χ² distribution.

The result stated in Equation (2.26), however, only applies when the response variable has a distribution function of the form of Equation (2.22) with a regularizer that belongs to the folded concave family. While SCAD belongs to this family, other popular regularizers, such as LASSO, Adaptive LASSO and Bridge with γ ≤ 1, do not. Thus, from the perspective of econometrics, the result above is quite limited, as it is relevant only to Logit and Poisson models with a SCAD regularizer. It does not cover the Probit model or other popular models and regularizers that are relevant in econometric analysis. Hypothesis testing using shrinkage estimators for the cases that are relevant in econometrics is therefore still an open problem.
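To see the logic of Equation (2.26) in the simplest possible setting, the stylized check below sets the penalty to zero (λ = 0) and uses an unpenalised Poisson likelihood, so that the statistic reduces to the classical likelihood ratio test of restricting one coefficient to zero; the data-generating design and the Newton solver are assumptions of this sketch, not part of the text.

```python
import numpy as np

def poisson_mle(X, y, n_iter=30):
    # Newton-Raphson for the unpenalised Poisson log-likelihood
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        beta = beta + np.linalg.solve(X.T @ (X * mu[:, None]), X.T @ (y - mu))
    return beta

def loglik(beta, X, y):
    z = X @ beta
    return np.sum(y * z - np.exp(z))  # log(y!) constant cancels in the ratio

def lr_stat(X, y, keep):
    """2[logL(unrestricted) - logL(restricted)]: the restricted model keeps
    only the columns in `keep`, i.e. restricts the others to zero."""
    full = loglik(poisson_mle(X, y), X, y)
    restricted = loglik(poisson_mle(X[:, keep], y), X[:, keep], y)
    return 2.0 * (full - restricted)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(1000), rng.normal(size=(1000, 2))])
y = rng.poisson(np.exp(X @ np.array([0.2, 0.0, 0.5])))  # column 1 irrelevant
stat_null = lr_stat(X, y, keep=[0, 2])  # true restriction: ~ chi2(1)
stat_alt = lr_stat(X, y, keep=[0, 1])   # false restriction: large statistic
```

Under the true restriction the statistic behaves like a χ²(1) draw, while restricting the genuinely relevant coefficient produces a statistic far out in the tail.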

However, combining the approach proposed in Chernozhukov, Hansen and Spindler (2015), as discussed in Chapter 1, with those considered in Shi et al. (2019) may prove useful in progressing this line of research. For example, it is possible to derive the asymptotic distribution of the Partially Penalised Estimator for the models considered in Shi et al. (2019) using the Immunization Condition approach of Chernozhukov et al. (2015), as shown in Propositions 2.1 and 2.2.

Proposition 2.1 Let y_i be a random variable with the conditional distribution defined in Equation (2.22) and consider the following Partially Penalised Estimator with likelihood objective,

$$\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} S(\boldsymbol{\beta}) = l(\boldsymbol{\beta}) - \lambda\, p(\boldsymbol{\beta}_2) \quad (2.27)$$
$$l(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \mathbf{x}_i'\boldsymbol{\beta} - \psi(\mathbf{x}_i'\boldsymbol{\beta}) \right], \quad (2.28)$$

where β = (β₁′, β₂′)′ and x_i = (x₁ᵢ′, x₂ᵢ′)′, such that x_i′β = x₁ᵢ′β₁ + x₂ᵢ′β₂, and p(β) denotes the regularizer, such that $\frac{\partial^2 p}{\partial \boldsymbol{\beta}_2 \partial \boldsymbol{\beta}_2'}$ exists.

Under the assumptions that x_i is stationary with finite mean and variance and there exists a well-defined μ for all β, such that

$$\boldsymbol{\mu} = -\left( \frac{\partial \psi}{\partial z_i} \mathbf{x}_{1i}\mathbf{x}_{2i}' \right) \left( \frac{\partial \psi}{\partial z_i} \mathbf{x}_{2i}\mathbf{x}_{2i}' + \lambda \frac{\partial^2 p}{\partial \boldsymbol{\beta}_2 \partial \boldsymbol{\beta}_2'} \right)^{-1}, \quad (2.29)$$

where z_i = x_i′β, then

$$\sqrt{n}\left( \hat{\boldsymbol{\beta}}_1 - \boldsymbol{\beta}_0 \right) \xrightarrow{d} N\left( 0, \boldsymbol{\Gamma}^{-1}\boldsymbol{\Omega}\boldsymbol{\Gamma}^{-1} \right), \quad (2.30)$$

where

$$\boldsymbol{\Gamma} = -\frac{\partial^2 \psi}{\partial z_i^2} \mathbf{x}_{1i}\mathbf{x}_{1i}' + \left( \frac{\partial^2 \psi}{\partial z_i^2} \right)^2 \mathbf{x}_{1i}\mathbf{x}_{2i}' \left( \frac{\partial^2 \psi}{\partial z_i^2} \mathbf{x}_{2i}\mathbf{x}_{2i}' - \lambda \frac{\partial^2 p}{\partial \boldsymbol{\beta}_2 \partial \boldsymbol{\beta}_2'} \right)^{-1} \mathbf{x}_{2i}\mathbf{x}_{1i}',$$
$$\boldsymbol{\Omega} = Var\left( \sqrt{n}\, S(\hat{\boldsymbol{\beta}}) \right).$$

Proof See Appendix. □

The result above is quite general and covers the Logit model, where ψ(x_i′β) = log[1 + exp(x_i′β)], and the Poisson model, where ψ(x_i′β) = exp(x_i′β), with various regularizers including Bridge and SCAD. However, Proposition 2.1 does not cover the Probit model. The same approach can be used to derive the asymptotic distribution of its shrinkage estimators with likelihood objective, as shown in Proposition 2.2.
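The ψ functions just listed can be checked numerically: in the family of Equation (2.22), ψ′(z) is the conditional mean, so the Logit choice should differentiate to the logistic function and the Poisson choice to the exponential. The finite-difference check below is an illustration, not part of the original text.

```python
import numpy as np

def psi_logit(z):
    return np.log1p(np.exp(z))      # psi for the Logit model

def psi_poisson(z):
    return np.exp(z)                # psi for the Poisson model

def dpsi(psi, z, h=1e-6):
    # central finite-difference approximation to psi'(z)
    return (psi(z + h) - psi(z - h)) / (2.0 * h)

z = 0.3
logit_mean = 1.0 / (1.0 + np.exp(-z))   # logistic function, E[y|x] for Logit
poisson_mean = np.exp(z)                # conditional mean for Poisson
```

At z = 0.3 the numerical derivatives of `psi_logit` and `psi_poisson` match `logit_mean` and `poisson_mean`, confirming that ψ′ generates the conditional mean in both special cases.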

Proposition 2.2 Consider the following Partially Penalised Estimator for a Probit model

$$\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} S(\boldsymbol{\beta}) = l(\boldsymbol{\beta}) - \lambda\, p(\boldsymbol{\beta}_2) \quad (2.31)$$
$$l(\boldsymbol{\beta}) = \sum_{i=1}^{n} y_i \log \Phi(\mathbf{x}_i'\boldsymbol{\beta}) + (1 - y_i) \log\left[ 1 - \Phi(\mathbf{x}_i'\boldsymbol{\beta}) \right], \quad (2.32)$$

where β = (β₁′, β₂′)′ and x_i = (x₁ᵢ′, x₂ᵢ′)′, such that x_i′β = x₁ᵢ′β₁ + x₂ᵢ′β₂, and p(β) denotes the regularizer, such that $\frac{\partial^2 p}{\partial \boldsymbol{\beta}_2 \partial \boldsymbol{\beta}_2'}$ exists.

Under the assumptions that x_i is stationary with finite mean and variance and there exists a well-defined μ for all β, such that

$$\boldsymbol{\mu} = -\boldsymbol{\Lambda}(\boldsymbol{\beta})\,\boldsymbol{\Theta}(\boldsymbol{\beta})^{-1}, \quad (2.33)$$

where

$$\boldsymbol{\Lambda}(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left\{ -y_i \left[ z_i + \phi(z_i) \right] \eta(z_i) + (1 - y_i) \left[ \phi(z_i) - z_i \right] \xi(z_i) \right\} \mathbf{x}_{1i}\mathbf{x}_{2i}',$$
$$\boldsymbol{\Theta}(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left\{ -y_i \left[ z_i + \phi(z_i) \right] \eta(z_i) + (1 - y_i) \left[ \phi(z_i) - z_i \right] \xi(z_i) \right\} \mathbf{x}_{2i}\mathbf{x}_{2i}' - \lambda \frac{\partial^2 p}{\partial \boldsymbol{\beta}_2 \partial \boldsymbol{\beta}_2'},$$
$$\eta(z_i) = \frac{\phi(z_i)}{\Phi(z_i)}, \qquad \xi(z_i) = \frac{\phi(z_i)}{1 - \Phi(z_i)},$$

where z_i = x_i′β, and φ(x) and Φ(x) denote the probability density and cumulative distribution functions of the standard normal distribution, respectively. Then

$$\sqrt{n}\left( \hat{\boldsymbol{\beta}}_1 - \boldsymbol{\beta}_0 \right) \xrightarrow{d} N\left( 0, \boldsymbol{\Gamma}^{-1}\boldsymbol{\Omega}\boldsymbol{\Gamma}^{-1} \right), \quad (2.34)$$

where

$$\boldsymbol{\Gamma} = \mathbf{A}_1 + \mathbf{A}_2 \mathbf{B}^{-1} \mathbf{A}_2',$$
$$\mathbf{A}_1 = \sum_{i=1}^{n} \left\{ -y_i \left[ z_i + \phi(z_i) \right] \eta(z_i) + (1 - y_i) \left[ \phi(z_i) - z_i \right] \xi(z_i) \right\} \mathbf{x}_{1i}\mathbf{x}_{1i}',$$
$$\mathbf{A}_2 = \sum_{i=1}^{n} \left\{ -y_i \left[ z_i + \phi(z_i) \right] \eta(z_i) + (1 - y_i) \left[ \phi(z_i) - z_i \right] \xi(z_i) \right\} \mathbf{x}_{1i}\mathbf{x}_{2i}',$$
$$\mathbf{B} = \sum_{i=1}^{n} \left\{ -y_i \left[ z_i + \phi(z_i) \right] \eta(z_i) + (1 - y_i) \left[ \phi(z_i) - z_i \right] \xi(z_i) \right\} \mathbf{x}_{2i}\mathbf{x}_{2i}' - \lambda \frac{\partial^2 p}{\partial \boldsymbol{\beta}_2 \partial \boldsymbol{\beta}_2'},$$
$$\boldsymbol{\Omega} = Var\left( \sqrt{n}\, S(\hat{\boldsymbol{\beta}}) \right).$$

Proof See Appendix. □
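The functions η and ξ appearing in Proposition 2.2 are the two inverse Mills ratios of the Probit model and are straightforward to evaluate. The scalar implementation below is a sketch; for extreme values of z a log-cdf formulation would be needed for numerical stability.

```python
import math

def phi(z):
    # standard normal probability density
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    # standard normal cumulative distribution via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def eta(z):
    # eta(z) = phi(z) / Phi(z), as in Proposition 2.2
    return phi(z) / Phi(z)

def xi(z):
    # xi(z) = phi(z) / (1 - Phi(z))
    return phi(z) / (1.0 - Phi(z))
```

By the symmetry of the normal distribution, η(−z) = ξ(z), and both equal 2φ(0) ≈ 0.798 at z = 0.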

Given the results in Propositions 2.1 and 2.2, it is now possible to obtain the asymptotic distribution of the Partially Penalised Estimators for the Logit, Probit and Poisson models under Bridge and SCAD. This should facilitate valid inference for the subset of parameters that is not part of the regularization for these models. These results should also lead to results similar to those derived in Shi et al. (2019), which would further facilitate inference on parameter restrictions using conventional techniques such as the likelihood ratio test. The formal proof of these results is left for future research.
