
Asymptotic Properties of Shrinkage Estimators

In document ECONOMETRICS with MACHINE LEARNING (Pages 32-39)

Valid statistical inference often relies on the asymptotic properties of estimators.

This section provides a brief overview of the asymptotic properties of the shrinkage estimators presented in Section 1.2 and discusses their implications for statistical inference. The literature in this area is highly technical; rather than dwelling on those technicalities, the focus here is on the extent to which these properties can facilitate the kind of valid statistical inference typically employed in econometrics, with an emphasis on the qualitative aspects of the results (see the references for technical details).

The asymptotic properties in the shrinkage estimators literature for linear models can broadly be classified into three focus areas, namely:

1. Oracle Properties,

2. Asymptotic distribution of shrinkage estimators, and

3. Asymptotic properties of estimators for parameters that are not part of the shrinkage.

1.4.1 Oracle Properties

The asymptotic properties of the shrinkage estimators presented above are often investigated through the so-called Oracle Properties. Although the origin of the term Oracle Properties can be traced back to Donoho and Johnstone (1994), Fan and Li (2001) are often credited as the first to formalize its definition mathematically.

Subsequent presentations can be found in Zou (2006) and Fan, Xue and Zou (2014), with the latter having possibly the most concise definition to date. The term Oracle is used to highlight the feature that the Oracle estimator shares the same properties as an estimator fitted with the correct set of covariates. In other words, the Oracle estimator can 'foresee' the correct set of covariates. While presentations can differ slightly, the fundamental idea is very similar. To aid the presentation, let us rearrange the true parameter vector, $\boldsymbol{\beta}_0$, so that all parameters with non-zero values are grouped into one sub-vector and all parameters with zero values are grouped into another. It is also helpful to create a set containing the indexes of all non-zero coefficients, as well as another containing the indexes of all zero coefficients.

Formally, let $\mathcal{A} = \{ j : \beta_{0j} \neq 0 \}$ and $\hat{\mathcal{A}} = \{ j : \hat{\beta}_j \neq 0 \}$, and without loss of generality, partition $\boldsymbol{\beta}_0 = \big(\boldsymbol{\beta}_{0\mathcal{A}}', \boldsymbol{\beta}_{0\mathcal{A}^c}'\big)'$ and $\hat{\boldsymbol{\beta}} = \big(\hat{\boldsymbol{\beta}}_{\mathcal{A}}', \hat{\boldsymbol{\beta}}_{\mathcal{A}^c}'\big)'$, where $\boldsymbol{\beta}_{0\mathcal{A}}$ denotes the sub-vector of $\boldsymbol{\beta}_0$ containing all the non-zero elements of $\boldsymbol{\beta}_0$, i.e., those with indexes that belong to $\mathcal{A}$, while $\boldsymbol{\beta}_{0\mathcal{A}^c}$ is the sub-vector of $\boldsymbol{\beta}_0$ containing all the zero elements, i.e., those with indexes that do not belong to $\mathcal{A}$. Similar definitions apply to $\hat{\boldsymbol{\beta}}_{\mathcal{A}}$ and $\hat{\boldsymbol{\beta}}_{\mathcal{A}^c}$. Then the estimator $\hat{\boldsymbol{\beta}}$ is said to have the Oracle Properties if it has

1. Selection Consistency: $\lim_{N\to\infty} \Pr\big(\hat{\mathcal{A}} = \mathcal{A}\big) = 1$, and

2. Asymptotic Normality: $\sqrt{N}\big(\hat{\boldsymbol{\beta}}_{\mathcal{A}} - \boldsymbol{\beta}_{0\mathcal{A}}\big) \xrightarrow{d} N(\mathbf{0}, \boldsymbol{\Sigma})$,

where $\boldsymbol{\Sigma}$ is the variance-covariance matrix of the following estimator:

$$\hat{\boldsymbol{\beta}}_{oracle} = \operatorname*{arg\,min}_{\boldsymbol{\beta}\,:\,\boldsymbol{\beta}_{\mathcal{A}^c} = \mathbf{0}} g(\boldsymbol{\beta}). \tag{1.20}$$

Equation (1.20) is called the Oracle estimator by Fan et al. (2014) because it shares the same properties as the estimator that contains only the variables with non-zero coefficients. A shrinkage estimator is said to have the Oracle Properties if it is selection consistent, i.e., able to distinguish variables with zero coefficients from those with non-zero coefficients, and has the same asymptotic distribution as a consistent estimator that contains only the correct set of variables. Note that selection consistency is a weaker condition than consistency in the traditional sense. Selection consistency requires discriminating between coefficients with zero and non-zero values, but it does not require the estimates of the non-zero coefficients to be consistent.
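To make Equation (1.20) concrete, the following sketch (an illustration of my own, not from the text; the simulated design and variable names are arbitrary) computes the Oracle least squares estimator by running OLS on the true support only:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 6
beta0 = np.array([2.0, -1.5, 0.0, 0.0, 0.5, 0.0])  # true coefficients; A = {0, 1, 4}
X = rng.normal(size=(N, p))
y = X @ beta0 + rng.normal(size=N)

A = np.flatnonzero(beta0 != 0)   # indexes of the non-zero coefficients
beta_oracle = np.zeros(p)
# Oracle estimator: minimize the least squares loss subject to beta_{A^c} = 0,
# i.e., run OLS using only the covariates in A.
beta_oracle[A], *_ = np.linalg.lstsq(X[:, A], y, rcond=None)

print(beta_oracle.round(2))
```

In practice, of course, $\mathcal{A}$ is unknown; the Oracle estimator is a theoretical benchmark, not a feasible procedure.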

It should be clear from the previous discussions that neither LASSO nor Ridge have Oracle Properties in general since LASSO is typically inconsistent and Ridge does not usually have selection consistency. However, adaLASSO, SCAD and group LASSO have been shown to possess Oracle Properties (for technical details, see Zou, 2006, Fan & Li, 2001 and Yuan & Lin, 2006, respectively). While these shrinkage estimators possess Oracle Properties, their proofs usually rely on the following three assumptions:

Assumption 1. $\mathbf{u}$ is a vector of independent, identically distributed random variables with finite variance.

Assumption 2. There exists a matrix $\mathbf{C}$ with finite elements such that $N^{-1}\sum_{i=1}^{N} \mathbf{x}_i\mathbf{x}_i' - \mathbf{C} = o_p(1)$.

Assumption 3. $\lambda \equiv \lambda_N = O(N^q)$ for some $q \in (0, 1]$.

While Assumption 2 is fairly standard in the econometric literature, Assumption 1 appears to be restrictive as it does not include common issues in econometrics such as serial correlation or heteroskedasticity. However, several recent studies, such as Wang et al. (2007) and Medeiros and Mendes (2016), have attempted to relax this assumption to non-Gaussian, serially correlated or conditionally heteroskedastic errors.

Assumption 3 is perhaps the most interesting. First, it once again highlights the importance of the tuning parameter, not just for the performance of the estimators, but also for their asymptotic properties. The assumption requires that $N^{-q}\lambda_N - \lambda_0 = o_p(1)$ for some $\lambda_0 \geq 0$. This condition trivially holds when $\lambda_N = \lambda$ stays constant regardless of the sample size. However, this is unlikely to be the case in practice, where $\lambda_N$ is typically chosen by cross validation as described in Section 1.3.2.

Indeed, when $\lambda_N$ remains constant, $\lambda_0 = 0$, which implies that the constraint imposes no penalty on the loss asymptotically; in the least squares case, the estimator then collapses to the familiar OLS estimator asymptotically. When $\lambda_N$ is a non-decreasing function of $N$, Assumption 3 asserts an upper bound on its growth rate.

This should not be surprising, since $\lambda_N$ is an inverse function of $c$: if $\lambda_N$ increases, the total length of $\boldsymbol{\beta}$ decreases, and that may increase the amount of bias in the estimator.
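As a quick numerical check of the rates in Assumption 3 (my own illustration; the constants are arbitrary), $N^{-q}\lambda_N$ converges to a non-zero $\lambda_0$ when $\lambda_N$ grows at exactly the rate $N^q$, and to $\lambda_0 = 0$ when it grows more slowly:

```python
import math

q = 0.5
for N in [100, 10_000, 1_000_000]:
    lam_exact = 3.0 * math.sqrt(N)  # lambda_N = 3 N^{1/2}: N^{-q} lambda_N -> lambda_0 = 3
    lam_slow = math.log(N)          # lambda_N = log N:     N^{-q} lambda_N -> lambda_0 = 0
    print(N, N ** (-q) * lam_exact, N ** (-q) * lam_slow)
```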

Another perspective on this assumption is its relation to the uniform signal strength condition discussed by C.-H. Zhang and Zhang (2014). The condition asserts that all non-zero coefficients must be greater in magnitude than an inflated level of noise, which can be expressed as

$$C\sigma\sqrt{\frac{2\log p}{N}}, \tag{1.21}$$

where $C$ is the inflation factor (see C.-H. Zhang and Zhang (2014)). Essentially, the coefficients are required to be 'large' enough relative to the noise to ensure selection consistency, and this assumption has also been widely used in the literature (for examples, see the references within C.-H. Zhang & Zhang, 2014). While there

seems to be an intuitive link between Assumption 3 and the uniform signal strength condition, there does not appear to be any formal discussion of their connection, and this could be an interesting area for future research. Perhaps more importantly, the relation between choosing $\lambda$ via cross validation, Assumption 3, and the uniform signal strength condition is still somewhat unclear. This also means that if the true values of some coefficients are sufficiently small, there is a good chance that shrinkage estimators will be unable to identify them, and it is unclear how to statistically verify this situation in finite samples. Theoretically, however, if the shrinkage estimator is selection consistent, then it should be able to discriminate between coefficients with small non-zero magnitude and coefficients with zero value.
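The role of signal strength can be illustrated with a small simulation (my own construction, not from the text). Under an orthonormal design, the LASSO solution reduces to soft-thresholding the OLS estimates, so whether a coefficient survives depends directly on its magnitude relative to a threshold of the form in Equation (1.21):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 400, 20
Q, _ = np.linalg.qr(rng.normal(size=(N, p)))
X = np.sqrt(N) * Q                   # columns scaled so that X'X = N * I exactly

beta0 = np.zeros(p)
beta0[0] = 1.0                       # 'strong' coefficient, well above the noise level
beta0[1] = 0.02                      # 'weak' coefficient, below the inflated noise level
y = X @ beta0 + rng.normal(size=N)   # sigma = 1

beta_ols = X.T @ y / N               # OLS under this orthonormal design
lam = 1.5 * np.sqrt(2 * np.log(p) / N)   # threshold C*sigma*sqrt(2 log p / N), C = 1.5
# LASSO under an orthonormal design is soft-thresholding of the OLS estimates:
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

print(beta_lasso[0], beta_lasso[1])  # the weak coefficient typically falls below the threshold
```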

The Oracle Properties are appealing because they seem to suggest that one can select the right set of covariates via selection consistency and, simultaneously, estimate the non-zero parameters consistently with an asymptotically normal distribution. It is therefore tempting to

1. conduct statistical inference directly on the shrinkage estimates in the usual way or

2. use a shrinkage estimator as a variable selector, then estimate the coefficients by OLS on the model with only the selected covariates, and conduct statistical inference on the OLS estimates in the usual way.

The second approach is particularly tempting for shrinkage estimators that satisfy selection consistency but not necessarily asymptotic normality, such as the original LASSO. This approach is often called Post-Selection OLS or, in the case of LASSO, Post-LASSO OLS. Both approaches turn out to be overly optimistic in practice, as argued by Leeb and Pötscher (2005) and Leeb and Pötscher (2008). The main issue has to do with the mode of convergence in proving the Oracle Properties.

In most cases, the convergence is pointwise rather than uniform. An implication of pointwise convergence is that a different $\boldsymbol{\beta}_0$ may require a different sample size before the asymptotic distribution becomes a reasonable approximation. Since the sample size is typically fixed in practice with $\boldsymbol{\beta}_0$ unknown, blindly applying the asymptotic result for inference may lead to misleading conclusions, especially if the sample size is not large enough for the asymptotics to 'kick in'. Worse still, it is typically not possible to examine whether the sample size is large enough, since this depends on the unknown $\boldsymbol{\beta}_0$. Thus, the main message is that while the Oracle Properties are interesting theoretical properties and provide an excellent framework to aid the understanding of different shrinkage estimators, they do not necessarily provide the assurance one wishes for in practice for statistical inference.
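The practical danger can be seen in a small Monte Carlo sketch (my own illustration, loosely in the spirit of Leeb and Pötscher's argument; the design, threshold, and coefficient values are arbitrary choices). With a single coefficient, an orthonormal design, and known error variance, the naive 95% confidence interval after hard selection behaves well when the true coefficient is large, but its coverage collapses when the true coefficient is small relative to the selection threshold:

```python
import numpy as np

rng = np.random.default_rng(2)

def naive_post_selection_coverage(beta_true, lam=2.5, reps=2000):
    """Coverage of a naive 95% CI for one coefficient after hard selection.

    Orthonormal design with unit error variance, so the OLS estimate is
    N(beta_true, 1). The coefficient is 'selected' when its OLS estimate
    exceeds lam in absolute value; otherwise it is treated as exactly zero
    (so the implicit 'interval' {0} misses any beta_true != 0).
    """
    covered = 0
    for _ in range(reps):
        b_hat = beta_true + rng.normal()   # OLS estimate ~ N(beta_true, 1)
        if abs(b_hat) > lam:               # selected: usual CI around the estimate
            covered += abs(b_hat - beta_true) < 1.96
        else:                              # not selected: coefficient set to zero
            covered += (beta_true == 0)
    return covered / reps

cov_strong = naive_post_selection_coverage(beta_true=8.0)  # far above the threshold
cov_weak = naive_post_selection_coverage(beta_true=1.0)    # near the threshold
print(cov_strong, cov_weak)
```

The same nominal procedure delivers very different actual coverage depending on the unknown true coefficient, which is exactly the pointwise-versus-uniform problem described above.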

1.4.2 Asymptotic Distributions

Next, let us examine the distributional properties of shrinkage estimators directly.

The seminal work of Knight and Fu (2000) derived the asymptotic distribution of the Bridge estimator as defined in Equations (1.7) and (1.8). Under Assumptions 1 - 3 with $g(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$, $q = 0.5$ and $\gamma \geq 1$, Knight and Fu (2000) showed that

$$\sqrt{N}\big(\hat{\boldsymbol{\beta}}_{Bridge} - \boldsymbol{\beta}_0\big) \xrightarrow{d} \operatorname*{arg\,min} V(\boldsymbol{\omega}), \tag{1.22}$$

where

$$V(\boldsymbol{\omega}) = -2\boldsymbol{\omega}'\mathbf{W} + \boldsymbol{\omega}'\mathbf{C}\boldsymbol{\omega} + \lambda_0 \sum_{j=1}^{p} \omega_j \operatorname{sgn}(\beta_{0j})\,|\beta_{0j}|^{\gamma-1}, \tag{1.23}$$

with $\mathbf{W}$ having a $N(\mathbf{0}, \sigma^2\mathbf{C})$ distribution. For $\gamma < 1$, Assumption 3 needs to be adjusted such that $q = \gamma/2$, with $V(\boldsymbol{\omega})$ changing to

$$V(\boldsymbol{\omega}) = -2\boldsymbol{\omega}'\mathbf{W} + \boldsymbol{\omega}'\mathbf{C}\boldsymbol{\omega} + \lambda_0 \sum_{j=1}^{p} |\omega_j|^{\gamma}\, I(\beta_{0j} = 0). \tag{1.24}$$

Recall that the Bridge regularizer is convex for $\gamma \geq 1$, but not for $\gamma < 1$. This is reflected by the difference in the asymptotic distributions implied by Equations (1.23) and (1.24). Interestingly, but perhaps not surprisingly, the main difference between the two expressions is the term related to the regularizer. Moreover, the growth rate of $\lambda_N$ is also required to be much slower for the $\gamma < 1$ case than for the $\gamma \geq 1$ one.

Let $\boldsymbol{\omega}^* = \operatorname*{arg\,min} V(\boldsymbol{\omega})$; then for $\gamma \geq 1$ it is straightforward to show that

$$\boldsymbol{\omega}^* = \mathbf{C}^{-1}\big(\mathbf{W} - \lambda_0 \operatorname{sgn}(\boldsymbol{\beta}_0)\,|\boldsymbol{\beta}_0|^{\gamma-1}\big), \tag{1.25}$$

where $\operatorname{sgn}(\boldsymbol{\beta})$, $|\boldsymbol{\beta}|^{\gamma-1}$ and the product of the two terms are understood to be taken element-wise. This means

$$\sqrt{N}\big(\hat{\boldsymbol{\beta}}_{Bridge} - \boldsymbol{\beta}_0\big) \xrightarrow{d} \mathbf{C}^{-1}\big(\mathbf{W} - \lambda_0 \operatorname{sgn}(\boldsymbol{\beta}_0)\,|\boldsymbol{\beta}_0|^{\gamma-1}\big). \tag{1.26}$$

Note that setting $\gamma = 1$ and $\gamma = 2$ yields the asymptotic distributions of LASSO and Ridge, respectively. However, as indicated in Equation (1.25), the distribution depends on $\boldsymbol{\beta}_0$, the true parameter vector. The result has two important implications. First, the Bridge estimator is generally asymptotically biased whenever $\lambda_0 > 0$; second, the asymptotic distribution depends on the true parameter vector, which means it is subject to the criticism of Leeb and Pötscher (2005). The latter means that the sample size required for the asymptotic distribution to be a 'reasonable' approximation to the finite sample distribution depends on the true parameter vector. Since the true parameter vector is typically unknown, the asymptotic distribution may not be particularly helpful in practice, at least not in the context of hypothesis testing.
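For $\gamma = 2$ (Ridge), Equation (1.26) can be checked directly by simulation, since Ridge has the closed form $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}'\mathbf{y}$. The sketch below (my own illustration; the design is constructed so that $\mathbf{C} = \mathbf{I}$) sets $\lambda_N = \lambda_0\sqrt{N}$ and checks that the mean of $\sqrt{N}(\hat{\boldsymbol{\beta}}_{Ridge} - \boldsymbol{\beta}_0)$ is approximately $-\mathbf{C}^{-1}\lambda_0\boldsymbol{\beta}_0$, the asymptotic bias implied by Equation (1.26) with $\operatorname{sgn}(\boldsymbol{\beta}_0)|\boldsymbol{\beta}_0| = \boldsymbol{\beta}_0$:

```python
import numpy as np

rng = np.random.default_rng(3)
N, reps = 2000, 500
beta0 = np.array([1.0, -2.0])
lam0 = 1.0
lam = lam0 * np.sqrt(N)        # lambda_N = lambda_0 * sqrt(N), i.e., q = 0.5

# Design with exactly orthonormal (scaled) columns, so X'X = N*I and C = I.
Q, _ = np.linalg.qr(rng.normal(size=(N, 2)))
X = np.sqrt(N) * Q

draws = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta0 + rng.normal(size=N)   # sigma = 1
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
    draws[r] = np.sqrt(N) * (beta_ridge - beta0)

print(draws.mean(axis=0))  # close to -lam0 * beta0 = (-1, 2): non-zero asymptotic bias
```

The non-zero mean of the limiting distribution, and its dependence on $\boldsymbol{\beta}_0$, is exactly what makes naive Ridge-based inference problematic.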

While direct statistical inference on the shrinkage estimates appears to be difficult, other studies take slightly different approaches. Two studies worth mentioning are Lockhart, Taylor, Tibshirani and Tibshirani (2014) and Lee, Sun, Sun and Taylor (2016). The former developed a test statistic for coefficients as they enter the model during the LARS process, while the latter derived the asymptotic distribution of the least squares estimator conditional on model selection by a shrinkage estimator, such as the LASSO.

In addition to the two studies above, it is also worth noting that in some cases, such as the LASSO, the estimates can be bias-corrected, and the asymptotic distribution of the bias-corrected estimator can then be derived (for examples, see C.-H. Zhang & Zhang, 2014 and Fan et al., 2020). Despite the challenges raised by Leeb and Pötscher (2005) and Leeb and Pötscher (2008), the asymptotic properties of shrinkage estimators, particularly for purposes of valid statistical inference, remain an active area of research.

Overall, the current knowledge regarding the asymptotic properties of various shrinkage estimators can be summarized as follows:

1. Some shrinkage estimators have been shown to possess the Oracle Properties, which means that asymptotically they can select the right covariates, i.e., correctly assign 0 to the coefficients whose true value is 0. It also means that these estimators have an asymptotically normal distribution.

2. Despite the Oracle Properties and other asymptotic results, such as those of Knight and Fu (2000), the practical usefulness of these results is still somewhat limited. The sample size required for the asymptotic results to 'kick in' depends on the true parameter vector, which is unknown in practice. Thus, one can never be sure about the validity of using the asymptotic distribution for a given sample size.

1.4.3 Partially Penalized (Regularized) Estimator

While valid statistical inference on shrinkage estimators appears to be challenging, there are situations where the parameters of interest are not part of the shrinkage. In such cases the regularizer does not have to be applied to the entire parameter vector.

Specifically, let us rewrite Equation (1.2) as

$$\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \mathbf{u}, \tag{1.27}$$

where $\mathbf{X}_1$ and $\mathbf{X}_2$ are $N \times p_1$ and $N \times p_2$ matrices such that $p = p_1 + p_2$ with $\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2]$, and $\boldsymbol{\beta} = \big(\boldsymbol{\beta}_1', \boldsymbol{\beta}_2'\big)'$ such that $\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$ are $p_1 \times 1$ and $p_2 \times 1$ parameter vectors. Assume that only $\boldsymbol{\beta}_2$ is sparse, i.e., contains elements with zero value, and consider the following shrinkage estimator:

$$\big(\hat{\boldsymbol{\beta}}_1, \hat{\boldsymbol{\beta}}_2\big) = \operatorname*{arg\,min}_{\boldsymbol{\beta}_1, \boldsymbol{\beta}_2} (\mathbf{y} - \mathbf{X}_1\boldsymbol{\beta}_1 - \mathbf{X}_2\boldsymbol{\beta}_2)'(\mathbf{y} - \mathbf{X}_1\boldsymbol{\beta}_1 - \mathbf{X}_2\boldsymbol{\beta}_2) + \lambda\, p(\boldsymbol{\beta}_2). \tag{1.28}$$

Note that the penalty function (regularizer) applies only to $\boldsymbol{\beta}_2$ but not to $\boldsymbol{\beta}_1$. A natural question in this case is whether the asymptotic properties of $\hat{\boldsymbol{\beta}}_1$ could facilitate valid statistical inference in the usual manner.

In the case of the Bridge estimator, it is possible to show that $\hat{\boldsymbol{\beta}}_1$ has an asymptotic normal distribution similar to that of the OLS estimator. This is formalized in Proposition 1.1.

Proposition 1.1 Consider the linear model as defined in Equation (1.27) and the estimator as defined in Equation (1.28) with $p(\boldsymbol{\beta}_2) = \sum_{j=p_1+1}^{p} |\beta_j|^{\gamma}$ for some $\gamma > 0$. Under Assumptions 1 and 2, along with $\lambda_N/\sqrt{N} \to \lambda_0 \geq 0$ for $\gamma \geq 1$ and $\lambda_N/N^{\gamma/2} \to \lambda_0 \geq 0$ for $\gamma < 1$, then

$$\sqrt{N}\big(\hat{\boldsymbol{\beta}}_1 - \boldsymbol{\beta}_{01}\big) \xrightarrow{d} \boldsymbol{\omega}_1,$$

where

$$\boldsymbol{\omega}_1 = \mathbf{C}_{11}^{-1}\mathbf{W}_1,$$

with $\mathbf{W}_1 \sim N(\mathbf{0}, \sigma_u^2\mathbf{I})$ denoting a $p_1 \times 1$ random vector and $\mathbf{C}_{11}$ the $p_1 \times p_1$ matrix consisting of the first $p_1$ rows and columns of $\mathbf{C}$.

Proof See the Appendix. □
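For $\gamma = 2$, the partially penalized problem in Equation (1.28) has a closed form, and a Frisch-Waugh-Lovell-style argument shows that it can be computed by residualizing on $\mathbf{X}_1$ first: concentrating $\boldsymbol{\beta}_1$ out of (1.28) leaves a ridge problem in the $\mathbf{X}_1$-residualized data. The sketch below (my own construction, not the chapter's Monte Carlo design) computes $\hat{\boldsymbol{\beta}}_1$ and $\hat{\boldsymbol{\beta}}_2$ both ways and confirms they coincide:

```python
import numpy as np

rng = np.random.default_rng(4)
N, p1, p2 = 300, 2, 5
X1 = rng.normal(size=(N, p1))
X2 = rng.normal(size=(N, p2))
beta1 = np.array([1.0, -1.0])
beta2 = np.array([0.5, 0.0, 0.0, -0.5, 0.0])   # beta_2 is the sparse block
y = X1 @ beta1 + X2 @ beta2 + rng.normal(size=N)
lam = 10.0

# Method 1: solve the full penalized normal equations jointly;
# the ridge penalty lam*I applies only to the beta_2 block.
X = np.hstack([X1, X2])
P = np.zeros((p1 + p2, p1 + p2))
P[p1:, p1:] = lam * np.eye(p2)
b_joint = np.linalg.solve(X.T @ X + P, X.T @ y)

# Method 2: Frisch-Waugh-Lovell style. Residualize X2 and y on X1,
# run ridge on the residuals, then recover beta_1 by OLS.
M1 = np.eye(N) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)   # annihilator of X_1
X2t, yt = M1 @ X2, M1 @ y
b2 = np.linalg.solve(X2t.T @ X2t + lam * np.eye(p2), X2t.T @ yt)
b1 = np.linalg.solve(X1.T @ X1, X1.T @ (y - X2 @ b2))

print(np.max(np.abs(b_joint - np.concatenate([b1, b2]))))  # agreement to machine precision
```

The residualized form makes the intuition behind Proposition 1.1 visible: $\hat{\boldsymbol{\beta}}_1$ is just an OLS fit once the penalized block has been dealt with.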

The implication of Proposition 1.1 is that valid inference in the usual manner should be possible for parameters that are not part of the shrinkage process, at least for the Bridge estimator. This is verified by Monte Carlo experiments in Section 1.5.1. It leads to the question of whether this idea can be generalized further. For example, consider

$$\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{u}, \tag{1.29}$$

where $\mathbf{X}_1$ is an $N \times p_1$ matrix containing (some) endogenous variables. Now assume there are $p_2$ potential instrumental variables, where $p_2$ can be very large. In a Two Stage Least Squares setting, one would typically construct instrumental variables by first estimating

$$\mathbf{X}_1 = \mathbf{X}_2\boldsymbol{\Pi} + \mathbf{v}$$

and setting the instrumental variables $\mathbf{Z} = \mathbf{X}_2\hat{\boldsymbol{\Pi}}$. The estimation of $\boldsymbol{\Pi}$ is separate from that of $\boldsymbol{\beta}_1$ and, perhaps more importantly, the main target is $\mathbf{Z}$, which, in a sense, should be the best approximation of $\mathbf{X}_1$ given $\mathbf{X}_2$. As shown by Belloni et al. (2012), when $p_2$ is large, it is possible to

1. leverage shrinkage estimators to produce the best instruments, i.e., the best approximation of $\mathbf{X}_1$ given $\mathbf{X}_2$, and

2. reduce the number of instruments given the sparse nature of shrinkage estimators, and thus alleviate the issue of having too many instrumental variables.

The main message is that it is possible to obtain a consistent and asymptotically normal estimator for $\boldsymbol{\beta}_1$ by constructing optimal instruments from shrinkage estimators using a large number of potential instrumental variables. There are two main ingredients which make this approach feasible. The first is that it is possible to obtain 'optimal' instruments $\mathbf{Z}$ based on Post-Selection OLS, that is, OLS after a shrinkage procedure, as shown by Belloni and Chernozhukov (2013). Given $\mathbf{Z}$, Belloni et al. (2012) show that the usual IV-type estimators, such as $\hat{\boldsymbol{\beta}}_{IV} = (\mathbf{Z}'\mathbf{X}_1)^{-1}\mathbf{Z}'\mathbf{y}$, follow standard asymptotic results. This makes intuitive sense, as the main target in this case is the best approximation of $\mathbf{X}_1$ rather than the quality of the estimator for $\boldsymbol{\Pi}$. Thus, in a sense, this approach leverages the intended usage of shrinkage estimators, namely producing an optimal approximation, and uses this approximation as instruments to resolve endogeneity.
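The two ingredients can be sketched end to end (an illustration under assumed data-generating values, not Belloni et al.'s actual implementation: a simple coordinate-descent LASSO stands in for their estimator, and the penalty level is picked by hand rather than by their plug-in rule):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for 0.5*||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]          # partial residual excluding j
            rho = X[:, j] @ r
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return b

rng = np.random.default_rng(5)
N, p2 = 500, 30
X2 = rng.normal(size=(N, p2))                        # many potential instruments
Pi = np.zeros(p2); Pi[:3] = [1.0, 0.8, -0.6]         # only 3 instruments are relevant
v = rng.normal(size=N)
x1 = X2 @ Pi + v                                     # first stage
u = 0.8 * v + rng.normal(size=N)                     # endogeneity: u correlated with v
beta1 = 2.0
y = beta1 * x1 + u

# Step 1: first-stage LASSO to build the instrument Z = X2 @ Pi_hat.
Pi_hat = lasso_cd(X2, x1, lam=60.0)
Z = X2 @ Pi_hat

# Step 2: the usual IV estimator (Z'X1)^{-1} Z'y using the constructed instrument.
beta_iv = (Z @ y) / (Z @ x1)
beta_ols = (x1 @ y) / (x1 @ x1)                      # biased benchmark
print(beta_iv, beta_ols)
```

With the sparse first stage, the IV estimate recovers the structural coefficient, while OLS inherits the endogeneity bias.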

This idea can be expanded further to analyze a much wider class of econometric problems. Chernozhukov, Hansen and Spindler (2015) generalized this approach by considering estimation problems where the parameters satisfy the following system of equations:

$$M(\boldsymbol{\beta}_1, \boldsymbol{\beta}_2) = \mathbf{0}. \tag{1.30}$$

Note that least squares, maximum likelihood and the Generalized Method of Moments can all be captured in this framework. Specifically, the system of equations denoted by $M$ can be viewed as the first-order derivative of the objective function, and thus Equation (1.30) represents the First Order Necessary Condition for a wide class of M-estimators. Along with some relatively mild assumptions, Chernozhukov et al. (2015) show that the following condition:

$$\left. \frac{\partial}{\partial \boldsymbol{\beta}_2} M(\boldsymbol{\beta}_1, \boldsymbol{\beta}_2) \right|_{\boldsymbol{\beta}_2 = \hat{\boldsymbol{\beta}}_2} = \mathbf{0}, \tag{1.31}$$

where $\hat{\boldsymbol{\beta}}_2$ denotes a good quality shrinkage estimator of $\boldsymbol{\beta}_2$, is sufficient to ensure valid statistical inference on $\hat{\boldsymbol{\beta}}_1$. Equation (1.31) is often called the immunization condition. Roughly speaking, it asserts that if the system of equations as defined in Equation (1.30) is not sensitive, and therefore immune, to small changes of $\boldsymbol{\beta}_2$ for a given estimator of $\boldsymbol{\beta}_2$, then statistical inference based on $\hat{\boldsymbol{\beta}}_1$ is possible.⁶

Using the IV example above, $\boldsymbol{\Pi}$, the coefficient matrix for constructing optimal instruments, can be interpreted as $\boldsymbol{\beta}_2$ in Equation (1.30). Equation (1.31) therefore requires the estimation of $\boldsymbol{\beta}_1$ not to be sensitive to small changes in the shrinkage estimators used to estimate $\boldsymbol{\Pi}$. In other words, the condition requires that a small change in $\hat{\boldsymbol{\Pi}}$ does not affect the estimation of $\boldsymbol{\beta}_1$. This is indeed the case as long as small changes in $\hat{\boldsymbol{\Pi}}$ do not significantly affect the estimation of the instruments $\mathbf{Z}$.
