
Asymptotic Properties of Shrinkage Estimators

In document ECONOMETRICS with MACHINE LEARNING (Pages 32-39)

Valid statistical inference often relies on the asymptotic properties of estimators.

This section provides a brief overview of the asymptotic properties of the shrinkage estimators presented in Section 1.2 and discusses their implications for statistical inference. The literature in this area is highly technical; rather than dwelling on those technicalities, the focus here is on the extent to which these properties can facilitate the kind of valid statistical inference typically employed in econometrics, with an emphasis on the qualitative aspects of the results (see the references for technical details).

The asymptotic properties in the shrinkage estimators literature for linear models can broadly be classified into three focus areas, namely:

1. Oracle Properties,

2. Asymptotic distribution of shrinkage estimators, and

3. Asymptotic properties of estimators for parameters that are not part of the shrinkage.

1.4.1 Oracle Properties

The asymptotic properties of the shrinkage estimators presented above are often investigated through the so-called Oracle Properties. Although the origin of the term Oracle Properties can be traced back to Donoho and Johnstone (1994), Fan and Li (2001) are often credited as the first to formalize its definition mathematically.

Subsequent presentations can be found in Zou (2006) and Fan, Xue and Zou (2014), with the latter having possibly the most concise definition to date. The term Oracle is used to highlight the feature that the Oracle estimator shares the same properties as an estimator fitted with the correct set of covariates. In other words, the Oracle estimator can 'foresee' the correct set of covariates. While presentations can differ slightly, the fundamental idea is very similar. To aid the presentation, let us rearrange the true parameter vector, $\boldsymbol{\beta}_0$, so that all parameters with non-zero values are grouped into one sub-vector and all parameters with zero values are grouped into another. It is also helpful to create a set containing the indexes of all non-zero coefficients, as well as another containing the indexes of all zero coefficients.

Formally, let $\mathcal{A} = \{ j : \beta_{0j} \neq 0 \}$ and $\hat{\mathcal{A}} = \{ j : \hat{\beta}_j \neq 0 \}$, and without loss of generality, partition $\boldsymbol{\beta}_0 = \big(\boldsymbol{\beta}_{0\mathcal{A}}', \boldsymbol{\beta}_{0\mathcal{A}^c}'\big)'$ and $\hat{\boldsymbol{\beta}} = \big(\hat{\boldsymbol{\beta}}_{\mathcal{A}}', \hat{\boldsymbol{\beta}}_{\mathcal{A}^c}'\big)'$, where $\boldsymbol{\beta}_{0\mathcal{A}}$ denotes the sub-vector of $\boldsymbol{\beta}_0$ containing all the non-zero elements of $\boldsymbol{\beta}_0$, i.e., those with indexes that belong to $\mathcal{A}$, while $\boldsymbol{\beta}_{0\mathcal{A}^c}$ is the sub-vector of $\boldsymbol{\beta}_0$ containing all the zero elements, i.e., those with indexes that do not belong to $\mathcal{A}$. Similar definitions apply to $\hat{\boldsymbol{\beta}}_{\mathcal{A}}$ and $\hat{\boldsymbol{\beta}}_{\mathcal{A}^c}$. Then the estimator $\hat{\boldsymbol{\beta}}$ is said to have the Oracle Properties if it has

1. Selection Consistency: $\lim_{N\to\infty} \Pr\big(\hat{\mathcal{A}} = \mathcal{A}\big) = 1$, and

2. Asymptotic Normality: $\sqrt{N}\big(\hat{\boldsymbol{\beta}}_{\mathcal{A}} - \boldsymbol{\beta}_{0\mathcal{A}}\big) \xrightarrow{d} N(\mathbf{0}, \boldsymbol{\Sigma})$,

where $\boldsymbol{\Sigma}$ is the variance-covariance matrix of the following estimator:

$$\hat{\boldsymbol{\beta}}_{oracle} = \operatorname*{arg\,min}_{\boldsymbol{\beta}\,:\,\boldsymbol{\beta}_{\mathcal{A}^c} = \mathbf{0}} g(\boldsymbol{\beta}). \tag{1.20}$$

Equation (1.20) is called the Oracle estimator by Fan et al. (2014) because it shares the same properties as the estimator that contains only the variables with non-zero coefficients. A shrinkage estimator is said to have the Oracle Properties if it is selection consistent, i.e., able to distinguish variables with zero coefficients from those with non-zero coefficients, and has the same asymptotic distribution as a consistent estimator that contains only the correct set of variables. Note that selection consistency is a weaker condition than consistency in the traditional sense. Selection consistency requires discriminating between coefficients with zero and non-zero values, but it does not require the estimates of the non-zero coefficients to be consistent.
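To make Equation (1.20) concrete, the following sketch (an illustration of my own, not from the text; the simulated design and variable names are arbitrary) computes the Oracle least squares estimator by running OLS on the true support only:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 6
beta0 = np.array([2.0, -1.5, 0.0, 0.0, 0.5, 0.0])  # true coefficients; A = {0, 1, 4}
X = rng.normal(size=(N, p))
y = X @ beta0 + rng.normal(size=N)

A = np.flatnonzero(beta0 != 0)   # indexes of the non-zero coefficients
beta_oracle = np.zeros(p)
# Oracle estimator: minimize the least squares loss subject to beta_{A^c} = 0,
# i.e., run OLS using only the covariates in A.
beta_oracle[A], *_ = np.linalg.lstsq(X[:, A], y, rcond=None)

print(beta_oracle.round(2))
```

In practice, of course, $\mathcal{A}$ is unknown; the Oracle estimator is a theoretical benchmark, not a feasible procedure.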

It should be clear from the previous discussions that neither LASSO nor Ridge have Oracle Properties in general since LASSO is typically inconsistent and Ridge does not usually have selection consistency. However, adaLASSO, SCAD and group LASSO have been shown to possess Oracle Properties (for technical details, see Zou, 2006, Fan & Li, 2001 and Yuan & Lin, 2006, respectively). While these shrinkage estimators possess Oracle Properties, their proofs usually rely on the following three assumptions:

Assumption 1. $\mathbf{u}$ is a vector of independent, identically distributed random variables with finite variance.

Assumption 2. There exists a matrix $\mathbf{C}$ with finite elements such that $N^{-1}\sum_{i=1}^{N} \mathbf{x}_i\mathbf{x}_i' - \mathbf{C} = o_p(1)$.

Assumption 3. $\lambda \equiv \lambda_N = O(N^q)$ for some $q \in (0, 1]$.

While Assumption 2 is fairly standard in the econometric literature, Assumption 1 appears to be restrictive as it does not include common issues in econometrics such as serial correlation or heteroskedasticity. However, several recent studies, such as Wang et al. (2007) and Medeiros and Mendes (2016), have attempted to relax this assumption to non-Gaussian, serially correlated or conditionally heteroskedastic errors.

Assumption 3 is perhaps the most interesting. First, it once again highlights the importance of the tuning parameter, not just for the performance of the estimators, but also for their asymptotic properties. The assumption requires that $N^{-q}\lambda_N - \lambda_0 = o_p(1)$ for some $\lambda_0 \geq 0$. This condition trivially holds when $\lambda_N = \lambda$ stays constant regardless of the sample size. However, this is unlikely to be the case in practice, where $\lambda_N$ is typically chosen by cross validation as described in Section 1.3.2.

Indeed, when $\lambda_N$ remains constant, $\lambda_0 = 0$, which implies that the constraint imposes no penalty on the loss asymptotically; in the least squares case, the estimator then collapses to the familiar OLS estimator asymptotically. When $\lambda_N$ is a non-decreasing function of $N$, Assumption 3 asserts an upper bound on its growth rate.

This should not be surprising, since $\lambda_N$ is an inverse function of $c$: if $\lambda_N$ increases, the total length of $\boldsymbol{\beta}$ decreases, and that may increase the amount of bias in the estimator.
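As a quick numerical check of the rates in Assumption 3 (my own illustration; the constants are arbitrary), $N^{-q}\lambda_N$ converges to a non-zero $\lambda_0$ when $\lambda_N$ grows at exactly the rate $N^q$, and to $\lambda_0 = 0$ when it grows more slowly:

```python
import math

q = 0.5
for N in [100, 10_000, 1_000_000]:
    lam_exact = 3.0 * math.sqrt(N)  # lambda_N = 3 N^{1/2}: N^{-q} lambda_N -> lambda_0 = 3
    lam_slow = math.log(N)          # lambda_N = log N:     N^{-q} lambda_N -> lambda_0 = 0
    print(N, N ** (-q) * lam_exact, N ** (-q) * lam_slow)
```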

Another perspective on this assumption is its relation to the uniform signal strength condition discussed by C.-H. Zhang and Zhang (2014). The condition asserts that all non-zero coefficients must be greater in magnitude than an inflated level of noise, which can be expressed as

$$C\sigma\sqrt{\frac{2\log p}{N}}, \tag{1.21}$$

where $C$ is the inflation factor (see C.-H. Zhang and Zhang (2014)). Essentially, the coefficients are required to be 'large' enough relative to the noise to ensure selection consistency, and this assumption has also been widely used in the literature (for examples, see the references within C.-H. Zhang & Zhang, 2014). While there

seems to be an intuitive link between Assumption 3 and the uniform signal strength condition, there does not appear to be any formal discussion of their connection, and this could be an interesting area for future research. Perhaps more importantly, the relation between choosing $\lambda$ via cross validation, Assumption 3, and the uniform signal strength condition is still somewhat unclear. This also means that if the true values of some coefficients are sufficiently small, there is a good chance that shrinkage estimators will be unable to identify them, and it is unclear how to statistically verify this situation in finite samples. Theoretically, however, if the shrinkage estimator is selection consistent, then it should be able to discriminate between coefficients with small non-zero magnitude and coefficients with zero value.
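The role of signal strength can be illustrated with a small simulation (my own construction, not from the text). Under an orthonormal design, the LASSO solution reduces to soft-thresholding the OLS estimates, so whether a coefficient survives depends directly on its magnitude relative to a threshold of the form in Equation (1.21):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 400, 20
Q, _ = np.linalg.qr(rng.normal(size=(N, p)))
X = np.sqrt(N) * Q                   # columns scaled so that X'X = N * I exactly

beta0 = np.zeros(p)
beta0[0] = 1.0                       # 'strong' coefficient, well above the noise level
beta0[1] = 0.02                      # 'weak' coefficient, below the inflated noise level
y = X @ beta0 + rng.normal(size=N)   # sigma = 1

beta_ols = X.T @ y / N               # OLS under this orthonormal design
lam = 1.5 * np.sqrt(2 * np.log(p) / N)   # threshold C*sigma*sqrt(2 log p / N), C = 1.5
# LASSO under an orthonormal design is soft-thresholding of the OLS estimates:
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

print(beta_lasso[0], beta_lasso[1])  # the weak coefficient typically falls below the threshold
```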

The Oracle Properties are appealing because they seem to suggest that one can select the right set of covariates via selection consistency and, simultaneously, estimate the non-zero parameters consistently with an asymptotically normal distribution. It is therefore tempting to

1. conduct statistical inference directly on the shrinkage estimates in the usual way or

2. use a shrinkage estimator as a variable selector, then estimate the coefficients by OLS on the model with only the selected covariates, and conduct statistical inference on the OLS estimates in the usual way.

The second approach is particularly tempting for shrinkage estimators that satisfy selection consistency but not necessarily asymptotic normality, such as the original LASSO. This approach is often called Post-Selection OLS or, in the case of LASSO, Post-LASSO OLS. Both approaches turn out to be overly optimistic in practice, as argued by Leeb and Pötscher (2005) and Leeb and Pötscher (2008). The main issue has to do with the mode of convergence in proving the Oracle Properties.

In most cases, the convergence is pointwise rather than uniform. An implication of pointwise convergence is that a different $\boldsymbol{\beta}_0$ may require a different sample size before the asymptotic distribution becomes a reasonable approximation. Since the sample size is typically fixed in practice with $\boldsymbol{\beta}_0$ unknown, blindly applying the asymptotic result for inference may lead to misleading conclusions, especially if the sample size is not large enough for the asymptotics to 'kick in'. Worse still, it is typically not possible to examine whether the sample size is large enough, since this depends on the unknown $\boldsymbol{\beta}_0$. Thus, the main message is that while the Oracle Properties are interesting theoretical properties and provide an excellent framework to aid the understanding of different shrinkage estimators, they do not necessarily provide the assurance one wishes for in practice for statistical inference.
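The practical danger can be seen in a small Monte Carlo sketch (my own illustration, loosely in the spirit of Leeb and Pötscher's argument; the design, threshold, and coefficient values are arbitrary choices). With a single coefficient, an orthonormal design, and known error variance, the naive 95% confidence interval after hard selection behaves well when the true coefficient is large, but its coverage collapses when the true coefficient is small relative to the selection threshold:

```python
import numpy as np

rng = np.random.default_rng(2)

def naive_post_selection_coverage(beta_true, lam=2.5, reps=2000):
    """Coverage of a naive 95% CI for one coefficient after hard selection.

    Orthonormal design with unit error variance, so the OLS estimate is
    N(beta_true, 1). The coefficient is 'selected' when its OLS estimate
    exceeds lam in absolute value; otherwise it is treated as exactly zero
    (so the implicit 'interval' {0} misses any beta_true != 0).
    """
    covered = 0
    for _ in range(reps):
        b_hat = beta_true + rng.normal()   # OLS estimate ~ N(beta_true, 1)
        if abs(b_hat) > lam:               # selected: usual CI around the estimate
            covered += abs(b_hat - beta_true) < 1.96
        else:                              # not selected: coefficient set to zero
            covered += (beta_true == 0)
    return covered / reps

cov_strong = naive_post_selection_coverage(beta_true=8.0)  # far above the threshold
cov_weak = naive_post_selection_coverage(beta_true=1.0)    # near the threshold
print(cov_strong, cov_weak)
```

The same nominal procedure delivers very different actual coverage depending on the unknown true coefficient, which is exactly the pointwise-versus-uniform problem described above.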

1.4.2 Asymptotic Distributions

Next, let us examine the distributional properties of shrinkage estimators directly.

The seminal work of Knight and Fu (2000) derived the asymptotic distribution of the Bridge estimator as defined in Equations (1.7) and (1.8). Under Assumptions 1 - 3 with $g(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$, $q = 0.5$ and $\gamma \geq 1$, Knight and Fu (2000) showed that

$$\sqrt{N}\big(\hat{\boldsymbol{\beta}}_{Bridge} - \boldsymbol{\beta}_0\big) \xrightarrow{d} \operatorname*{arg\,min} V(\boldsymbol{\omega}), \tag{1.22}$$

where

$$V(\boldsymbol{\omega}) = -2\boldsymbol{\omega}'\mathbf{W} + \boldsymbol{\omega}'\mathbf{C}\boldsymbol{\omega} + \lambda_0 \sum_{j=1}^{p} \omega_j \operatorname{sgn}(\beta_{0j})\,|\beta_{0j}|^{\gamma-1}, \tag{1.23}$$

with $\mathbf{W}$ having a $N(\mathbf{0}, \sigma^2\mathbf{C})$ distribution. For $\gamma < 1$, Assumption 3 needs to be adjusted such that $q = \gamma/2$, with $V(\boldsymbol{\omega})$ changing to

$$V(\boldsymbol{\omega}) = -2\boldsymbol{\omega}'\mathbf{W} + \boldsymbol{\omega}'\mathbf{C}\boldsymbol{\omega} + \lambda_0 \sum_{j=1}^{p} |\omega_j|^{\gamma}\, I(\beta_{0j} = 0). \tag{1.24}$$

Recall that the Bridge regularizer is convex for $\gamma \geq 1$, but not for $\gamma < 1$. This is reflected by the difference in the asymptotic distributions implied by Equations (1.23) and (1.24). Interestingly, but perhaps not surprisingly, the main difference between the two expressions is the term related to the regularizer. Moreover, the growth rate of $\lambda_N$ is also required to be much slower for the $\gamma < 1$ case than for the $\gamma \geq 1$ one.

Let $\boldsymbol{\omega}^* = \operatorname*{arg\,min} V(\boldsymbol{\omega})$; then for $\gamma \geq 1$ it is straightforward to show that

$$\boldsymbol{\omega}^* = \mathbf{C}^{-1}\big(\mathbf{W} - \lambda_0 \operatorname{sgn}(\boldsymbol{\beta}_0)\,|\boldsymbol{\beta}_0|^{\gamma-1}\big), \tag{1.25}$$

where $\operatorname{sgn}(\boldsymbol{\beta})$, $|\boldsymbol{\beta}|^{\gamma-1}$ and the product of the two terms are understood to be taken element-wise. This means

$$\sqrt{N}\big(\hat{\boldsymbol{\beta}}_{Bridge} - \boldsymbol{\beta}_0\big) \xrightarrow{d} \mathbf{C}^{-1}\big(\mathbf{W} - \lambda_0 \operatorname{sgn}(\boldsymbol{\beta}_0)\,|\boldsymbol{\beta}_0|^{\gamma-1}\big). \tag{1.26}$$

Note that setting $\gamma = 1$ and $\gamma = 2$ yields the asymptotic distributions of LASSO and Ridge, respectively. However, as indicated in Equation (1.25), the distribution depends on $\boldsymbol{\beta}_0$, the true parameter vector. The result has two important implications. First, the Bridge estimator is generally asymptotically biased whenever $\lambda_0 > 0$; second, the asymptotic distribution depends on the true parameter vector, which means it is subject to the criticism of Leeb and Pötscher (2005). The latter means that the sample size required for the asymptotic distribution to be a 'reasonable' approximation to the finite sample distribution depends on the true parameter vector. Since the true parameter vector is typically unknown, the asymptotic distribution may not be particularly helpful in practice, at least not in the context of hypothesis testing.
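For $\gamma = 2$ (Ridge), Equation (1.26) can be checked directly by simulation, since Ridge has the closed form $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}'\mathbf{y}$. The sketch below (my own illustration; the design is constructed so that $\mathbf{C} = \mathbf{I}$) sets $\lambda_N = \lambda_0\sqrt{N}$ and checks that the mean of $\sqrt{N}(\hat{\boldsymbol{\beta}}_{Ridge} - \boldsymbol{\beta}_0)$ is approximately $-\mathbf{C}^{-1}\lambda_0\boldsymbol{\beta}_0$, the asymptotic bias implied by Equation (1.26) with $\operatorname{sgn}(\boldsymbol{\beta}_0)|\boldsymbol{\beta}_0| = \boldsymbol{\beta}_0$:

```python
import numpy as np

rng = np.random.default_rng(3)
N, reps = 2000, 500
beta0 = np.array([1.0, -2.0])
lam0 = 1.0
lam = lam0 * np.sqrt(N)        # lambda_N = lambda_0 * sqrt(N), i.e., q = 0.5

# Design with exactly orthonormal (scaled) columns, so X'X = N*I and C = I.
Q, _ = np.linalg.qr(rng.normal(size=(N, 2)))
X = np.sqrt(N) * Q

draws = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta0 + rng.normal(size=N)   # sigma = 1
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
    draws[r] = np.sqrt(N) * (beta_ridge - beta0)

print(draws.mean(axis=0))  # close to -lam0 * beta0 = (-1, 2): non-zero asymptotic bias
```

The non-zero mean of the limiting distribution, and its dependence on $\boldsymbol{\beta}_0$, is exactly what makes naive Ridge-based inference problematic.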

While direct statistical inference on the shrinkage estimates appears to be difficult, other studies take slightly different approaches. Two studies worth mentioning are Lockhart, Taylor, Tibshirani and Tibshirani (2014) and Lee, Sun, Sun and Taylor (2016). The former developed a test statistic for coefficients as they enter the model during the LARS process, while the latter derived the asymptotic distribution of the least squares estimator conditional on model selection by a shrinkage estimator, such as the LASSO.

In addition to the two studies above, it is also worth noting that in some cases, such as the LASSO, the estimates can be bias-corrected, and the asymptotic distribution of the bias-corrected estimator can then be derived (for examples, see C.-H. Zhang & Zhang, 2014 and Fan et al., 2020). Despite the challenges raised by Leeb and Pötscher (2005) and Leeb and Pötscher (2008), the asymptotic properties of shrinkage estimators, particularly for purposes of valid statistical inference, remain an active area of research.

Overall, the current knowledge regarding the asymptotic properties of various shrinkage estimators can be summarized as follows:

1. Some shrinkage estimators have been shown to possess the Oracle Properties, which means that asymptotically they can select the right covariates, i.e., correctly assign 0 to the coefficients whose true value is 0. It also means that these estimators have an asymptotically normal distribution.

2. Despite the Oracle Properties and other asymptotic results, such as those of Knight and Fu (2000), the practical usefulness of these results is still somewhat limited. The sample size required for the asymptotic results to 'kick in' depends on the true parameter vector, which is unknown in practice. Thus, one can never be sure about the validity of using the asymptotic distribution for a given sample size.

1.4.3 Partially Penalized (Regularized) Estimator

While valid statistical inference on shrinkage estimators appears to be challenging, there are situations where the parameters of interest are not part of the shrinkage. In such cases the regularizer does not have to be applied to the entire parameter vector.

Specifically, let us rewrite Equation (1.2) as

$$\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \mathbf{u}, \tag{1.27}$$

where $\mathbf{X}_1$ and $\mathbf{X}_2$ are $N \times p_1$ and $N \times p_2$ matrices such that $p = p_1 + p_2$ with $\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2]$, and $\boldsymbol{\beta} = \big(\boldsymbol{\beta}_1', \boldsymbol{\beta}_2'\big)'$ such that $\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$ are $p_1 \times 1$ and $p_2 \times 1$ parameter vectors. Assume that only $\boldsymbol{\beta}_2$ is sparse, i.e., contains elements with zero value, and consider the following shrinkage estimator:

$$\big(\hat{\boldsymbol{\beta}}_1, \hat{\boldsymbol{\beta}}_2\big) = \operatorname*{arg\,min}_{\boldsymbol{\beta}_1, \boldsymbol{\beta}_2} (\mathbf{y} - \mathbf{X}_1\boldsymbol{\beta}_1 - \mathbf{X}_2\boldsymbol{\beta}_2)'(\mathbf{y} - \mathbf{X}_1\boldsymbol{\beta}_1 - \mathbf{X}_2\boldsymbol{\beta}_2) + \lambda\, p(\boldsymbol{\beta}_2). \tag{1.28}$$

Note that the penalty function (regularizer) applies only to $\boldsymbol{\beta}_2$ but not to $\boldsymbol{\beta}_1$. A natural question in this case is whether the asymptotic properties of $\hat{\boldsymbol{\beta}}_1$ could facilitate valid statistical inference in the usual manner.

In the case of the Bridge estimator, it is possible to show that $\hat{\boldsymbol{\beta}}_1$ has an asymptotic normal distribution similar to that of the OLS estimator. This is formalized in Proposition 1.1.

Proposition 1.1 Consider the linear model as defined in Equation (1.27) and the estimator as defined in Equation (1.28) with $p(\boldsymbol{\beta}_2) = \sum_{j=p_1+1}^{p} |\beta_j|^{\gamma}$ for some $\gamma > 0$. Under Assumptions 1 and 2, along with $\lambda_N/\sqrt{N} \to \lambda_0 \geq 0$ for $\gamma \geq 1$ and $\lambda_N/N^{\gamma/2} \to \lambda_0 \geq 0$ for $\gamma < 1$, then

$$\sqrt{N}\big(\hat{\boldsymbol{\beta}}_1 - \boldsymbol{\beta}_{01}\big) \xrightarrow{d} \boldsymbol{\omega}_1,$$

where

$$\boldsymbol{\omega}_1 = \mathbf{C}_{11}^{-1}\mathbf{W}_1,$$

with $\mathbf{W}_1 \sim N(\mathbf{0}, \sigma_u^2\mathbf{I})$ denoting a $p_1 \times 1$ random vector and $\mathbf{C}_{11}$ the $p_1 \times p_1$ matrix consisting of the first $p_1$ rows and columns of $\mathbf{C}$.

Proof See the Appendix. □
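For $\gamma = 2$, the partially penalized problem in Equation (1.28) has a closed form, and a Frisch-Waugh-Lovell-style argument shows that it can be computed by residualizing on $\mathbf{X}_1$ first: concentrating $\boldsymbol{\beta}_1$ out of (1.28) leaves a ridge problem in the $\mathbf{X}_1$-residualized data. The sketch below (my own construction, not the chapter's Monte Carlo design) computes $\hat{\boldsymbol{\beta}}_1$ and $\hat{\boldsymbol{\beta}}_2$ both ways and confirms they coincide:

```python
import numpy as np

rng = np.random.default_rng(4)
N, p1, p2 = 300, 2, 5
X1 = rng.normal(size=(N, p1))
X2 = rng.normal(size=(N, p2))
beta1 = np.array([1.0, -1.0])
beta2 = np.array([0.5, 0.0, 0.0, -0.5, 0.0])   # beta_2 is the sparse block
y = X1 @ beta1 + X2 @ beta2 + rng.normal(size=N)
lam = 10.0

# Method 1: solve the full penalized normal equations jointly;
# the ridge penalty lam*I applies only to the beta_2 block.
X = np.hstack([X1, X2])
P = np.zeros((p1 + p2, p1 + p2))
P[p1:, p1:] = lam * np.eye(p2)
b_joint = np.linalg.solve(X.T @ X + P, X.T @ y)

# Method 2: Frisch-Waugh-Lovell style. Residualize X2 and y on X1,
# run ridge on the residuals, then recover beta_1 by OLS.
M1 = np.eye(N) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)   # annihilator of X_1
X2t, yt = M1 @ X2, M1 @ y
b2 = np.linalg.solve(X2t.T @ X2t + lam * np.eye(p2), X2t.T @ yt)
b1 = np.linalg.solve(X1.T @ X1, X1.T @ (y - X2 @ b2))

print(np.max(np.abs(b_joint - np.concatenate([b1, b2]))))  # agreement to machine precision
```

The residualized form makes the intuition behind Proposition 1.1 visible: $\hat{\boldsymbol{\beta}}_1$ is just an OLS fit once the penalized block has been dealt with.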

The implication of Proposition 1.1 is that valid inference in the usual manner should be possible for parameters that are not part of the shrinkage process, at least for the Bridge estimator. This is verified by Monte Carlo experiments in Section 1.5.1. It leads to the question of whether this idea can be generalized further. For example, consider

$$\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{u}, \tag{1.29}$$

where $\mathbf{X}_1$ is an $N \times p_1$ matrix containing (some) endogenous variables. Now assume there are $p_2$ potential instrumental variables, where $p_2$ can be very large. In a Two Stage Least Squares setting, one would typically construct instrumental variables by first estimating

$$\mathbf{X}_1 = \mathbf{X}_2\boldsymbol{\Pi} + \mathbf{v}$$

and setting the instrumental variables $\mathbf{Z} = \mathbf{X}_2\hat{\boldsymbol{\Pi}}$. The estimation of $\boldsymbol{\Pi}$ is separate from that of $\boldsymbol{\beta}_1$ and, perhaps more importantly, the main target is $\mathbf{Z}$, which, in a sense, should be the best approximation of $\mathbf{X}_1$ given $\mathbf{X}_2$. As shown by Belloni et al. (2012), when $p_2$ is large, it is possible to

1. leverage shrinkage estimators to produce the best instruments, i.e., the best approximation of $\mathbf{X}_1$ given $\mathbf{X}_2$, and

2. reduce the number of instruments given the sparse nature of shrinkage estimators, and thus alleviate the issue of having too many instrumental variables.

The main message is that it is possible to obtain a consistent and asymptotically normal estimator for $\boldsymbol{\beta}_1$ by constructing optimal instruments from shrinkage estimators using a large number of potential instrumental variables. There are two main ingredients which make this approach feasible. The first is that it is possible to obtain 'optimal' instruments $\mathbf{Z}$ based on Post-Selection OLS, that is, OLS after a shrinkage procedure, as shown by Belloni and Chernozhukov (2013). Given $\mathbf{Z}$, Belloni et al. (2012) show that the usual IV-type estimators, such as $\hat{\boldsymbol{\beta}}_{IV} = (\mathbf{Z}'\mathbf{X}_1)^{-1}\mathbf{Z}'\mathbf{y}$, follow standard asymptotic results. This makes intuitive sense, as the main target in this case is the best approximation of $\mathbf{X}_1$ rather than the quality of the estimator for $\boldsymbol{\Pi}$. Thus, in a sense, this approach leverages the intended usage of shrinkage estimators, namely producing an optimal approximation, and uses this approximation as instruments to resolve endogeneity.
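The two ingredients can be sketched end to end (an illustration under assumed data-generating values, not Belloni et al.'s actual implementation: a simple coordinate-descent LASSO stands in for their estimator, and the penalty level is picked by hand rather than by their plug-in rule):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for 0.5*||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]          # partial residual excluding j
            rho = X[:, j] @ r
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return b

rng = np.random.default_rng(5)
N, p2 = 500, 30
X2 = rng.normal(size=(N, p2))                        # many potential instruments
Pi = np.zeros(p2); Pi[:3] = [1.0, 0.8, -0.6]         # only 3 instruments are relevant
v = rng.normal(size=N)
x1 = X2 @ Pi + v                                     # first stage
u = 0.8 * v + rng.normal(size=N)                     # endogeneity: u correlated with v
beta1 = 2.0
y = beta1 * x1 + u

# Step 1: first-stage LASSO to build the instrument Z = X2 @ Pi_hat.
Pi_hat = lasso_cd(X2, x1, lam=60.0)
Z = X2 @ Pi_hat

# Step 2: the usual IV estimator (Z'X1)^{-1} Z'y using the constructed instrument.
beta_iv = (Z @ y) / (Z @ x1)
beta_ols = (x1 @ y) / (x1 @ x1)                      # biased benchmark
print(beta_iv, beta_ols)
```

With the sparse first stage, the IV estimate recovers the structural coefficient, while OLS inherits the endogeneity bias.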

This idea can be expanded further to analyze a much wider class of econometric problems. Chernozhukov, Hansen and Spindler (2015) generalized this approach by considering estimation problems where the parameters satisfy the following system of equations:

$$M(\boldsymbol{\beta}_1, \boldsymbol{\beta}_2) = \mathbf{0}. \tag{1.30}$$

Note that least squares, maximum likelihood and the Generalized Method of Moments can all be captured in this framework. Specifically, the system of equations denoted by $M$ can be viewed as the first-order derivative of the objective function, and thus Equation (1.30) represents the First Order Necessary Condition for a wide class of M-estimators. Along with some relatively mild assumptions, Chernozhukov et al. (2015) show that the following condition:

$$\left. \frac{\partial}{\partial \boldsymbol{\beta}_2} M(\boldsymbol{\beta}_1, \boldsymbol{\beta}_2) \right|_{\boldsymbol{\beta}_2 = \hat{\boldsymbol{\beta}}_2} = \mathbf{0}, \tag{1.31}$$

where $\hat{\boldsymbol{\beta}}_2$ denotes a good quality shrinkage estimator of $\boldsymbol{\beta}_2$, is sufficient to ensure valid statistical inference on $\hat{\boldsymbol{\beta}}_1$. Equation (1.31) is often called the immunization condition. Roughly speaking, it asserts that if the system of equations as defined in Equation (1.30) is not sensitive, and therefore immune, to small changes of $\boldsymbol{\beta}_2$ for a given estimator of $\boldsymbol{\beta}_2$, then statistical inference based on $\hat{\boldsymbol{\beta}}_1$ is possible.⁶

Using the IV example above, $\boldsymbol{\Pi}$, the coefficient matrix for constructing optimal instruments, can be interpreted as $\boldsymbol{\beta}_2$ in Equation (1.30). Equation (1.31) therefore requires the estimation of $\boldsymbol{\beta}_1$ not to be sensitive to small changes in the shrinkage estimators used to estimate $\boldsymbol{\Pi}$. In other words, the condition requires that a small change in $\hat{\boldsymbol{\Pi}}$ does not affect the estimation of $\boldsymbol{\beta}_1$. This is indeed the case as long as small changes in $\hat{\boldsymbol{\Pi}}$ do not significantly affect the estimation of the instruments $\mathbf{Z}$.
