
3.4 Using Machine Learning to Discover Treatment Effect Heterogeneity

3.4.2 The Causal Tree Approach

Regression tree basics. Regression trees are algorithmically constructed step functions used to approximate conditional expectations such as $E(Y \mid X)$. More specifically, let $\Pi = \{\ell_1, \ell_2, \ldots, \ell_{\#\Pi}\}$ be a partition of $\mathcal{X} = \mathrm{support}(X)$ and define

$$\bar{Y}_j = \frac{\sum_{i=1}^{n} Y_i \mathbf{1}_{\ell_j}(X_i)}{\sum_{i=1}^{n} \mathbf{1}_{\ell_j}(X_i)}, \qquad j = 1, \ldots, \#\Pi,$$

to be the average outcome for those observations $i$ for which $X_i \in \ell_j$. A regression tree estimates $E(Y \mid X = x)$ using a step function

$$\hat{\mu}(x; \Pi) = \sum_{j=1}^{\#\Pi} \bar{Y}_j \, \mathbf{1}_{\ell_j}(x). \tag{3.14}$$

The regression tree algorithm considers partitions $\Pi$ that are constructed based on recursive splits of the support of the components of $X$. Thus, the subsets $\ell_j$, which are called the leaves of the tree, are given by intersections of sets of the form $\{X_k \le c\}$ or $\{X_k > c\}$, where $X_k$ denotes the $k$th component of $X$. In building the regression tree, candidate partitions are evaluated through a mean squared error (MSE) criterion with an added term that penalizes the number of splits to avoid overfitting. Chapter 2 of this volume provides a more in-depth look at classification and regression trees.
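To make the step-function representation concrete, the sketch below fits a CART regression tree and verifies that its prediction coincides with the leafwise averages $\bar{Y}_j$ in (3.14). The simulated data, the scikit-learn implementation, and all tuning parameters (`min_samples_leaf`, `ccp_alpha`) are illustrative assumptions rather than anything prescribed in the text.

```python
# A minimal sketch of the regression-tree step function in (3.14).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(size=(n, 3))                      # three covariates
y = np.sin(4 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

# ccp_alpha plays the role of the penalty on the number of splits.
tree = DecisionTreeRegressor(min_samples_leaf=50, ccp_alpha=1e-3).fit(X, y)

# Each leaf j defines a cell ell_j; the prediction for x in ell_j is the
# within-leaf average outcome Y-bar_j, i.e., exactly the step function (3.14).
leaf_id = tree.apply(X)                           # leaf membership of each observation
leaf_means = {j: y[leaf_id == j].mean() for j in np.unique(leaf_id)}
step_fn_pred = np.array([leaf_means[j] for j in leaf_id])
assert np.allclose(step_fn_pred, tree.predict(X)) # the tree is the step function
```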

From a regression tree to a causal tree. Under the unconfoundedness assumption, one way to estimate the ATE consistently is to use the inverse probability weighted estimator proposed by Hirano, Imbens and Ridder (2003):

$$\hat{\tau} = \frac{1}{\#\mathcal{S}} \sum_{i \in \mathcal{S}} \left[ \frac{Y_i D_i}{m(X_i)} - \frac{Y_i (1 - D_i)}{1 - m(X_i)} \right], \tag{3.15}$$

where $\mathcal{S}$ is the sample of observations on $(Y_i, D_i, X_i)$, $\#\mathcal{S}$ is the sample size, and $m(\cdot)$ is the propensity score function, i.e., the conditional probability $m(X) = P(D = 1 \mid X)$.
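As a quick illustration, the following minimal sketch computes the estimator (3.15) on simulated data with a known propensity score; the data-generating design (a logistic propensity score and a constant treatment effect of 1) is purely an assumption made for the example.

```python
# A minimal sketch of the IPW estimator (3.15) with a known propensity score.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
X = rng.normal(size=(n, 2))
m = 1.0 / (1.0 + np.exp(-X[:, 0]))                # known propensity score m(X)
D = rng.binomial(1, m)                            # treatment assignment
Y = 1.0 * D + X @ np.array([0.5, -0.3]) + rng.normal(size=n)

def ipw_ate(y, d, m):
    """Estimator (3.15): sample average of Y*D/m(X) - Y*(1-D)/(1-m(X))."""
    return np.mean(y * d / m - y * (1 - d) / (1 - m))

print(ipw_ate(Y, D, m))                           # close to the true ATE of 1
```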

To simplify the exposition, we will assume that the function $m(X)$ is known.¹⁰ Given a subset $\ell \subset \mathcal{X}$, one can also implement the estimator (3.15) in the subsample of observations for which $X_i \in \ell$, yielding an estimate of the conditional average treatment effect $\tau(\ell) = E[Y(1) - Y(0) \mid X_i \in \ell]$. More specifically, we define

$$\hat{\tau}_{\mathcal{S}}(\ell) = \frac{1}{\#\ell} \sum_{i \in \mathcal{S},\, X_i \in \ell} \left[ \frac{Y_i D_i}{m(X_i)} - \frac{Y_i (1 - D_i)}{1 - m(X_i)} \right],$$

where $\#\ell$ is the number of observations that fall in $\ell$.
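The leaf-level estimator is thus the same IPW average restricted to the observations falling in $\ell$. A minimal sketch, again under the assumption of a known propensity score and with an illustrative subgroup (the first covariate being positive):

```python
# A minimal sketch of the leaf-level estimator tau-hat_S(ell): the IPW
# transformation of (3.15), averaged only over observations with X_i in ell.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
X = rng.normal(size=(n, 2))
m = 1.0 / (1.0 + np.exp(-X[:, 0]))                # known propensity score m(X)
D = rng.binomial(1, m)
tau_x = np.where(X[:, 0] > 0, 2.0, 0.5)           # heterogeneous treatment effect
Y = tau_x * D + X[:, 1] + rng.normal(size=n)

def ipw_cate(y, d, m, in_leaf):
    """tau-hat_S(ell): IPW average over the #ell observations falling in ell."""
    y, d, m = y[in_leaf], d[in_leaf], m[in_leaf]
    return np.mean(y * d / m - y * (1 - d) / (1 - m))

print(ipw_cate(Y, D, m, X[:, 0] > 0))             # roughly 2.0
print(ipw_cate(Y, D, m, X[:, 0] <= 0))            # roughly 0.5
```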

Computing $\hat{\tau}_{\mathcal{S}}(\ell)$ for various choices of $\ell$ is called subgroup analysis. Based on subject matter theory or policy considerations, a researcher may pre-specify some subgroups of interest. For example, theory may predict that the effect changes as a function of age or income. Nevertheless, theory is rarely detailed enough to say exactly how to specify the relevant age or income groups and may not be able to say whether the two variables interact with each other or with other variables. There may be situations, especially if the dimension of $X$ is high, where the relevant subsets need to be discovered completely empirically by conducting as detailed a search as possible. However, it is now well understood that mining the data 'by hand' for relevant subgroups is problematic for two reasons. First, some of these groups may be complex and hard to discover, e.g., they may involve interactions between several variables. Moreover, if $X$ is high-dimensional there are simply too many possibilities to consider.

Second, it is not clear how to conduct inference for groups uncovered by data mining. The (asymptotic) distribution of $\hat{\tau}(\ell)$ is well understood for fixed $\ell$. But if the search procedure picks a group $\hat{\ell}$ because, say, $\hat{\tau}(\hat{\ell}) - \hat{\tau}$ is large, then the distribution of $\hat{\tau}(\hat{\ell})$ will of course differ from the fixed-$\ell$ case.

In their influential paper, Athey and Imbens (2016) propose the use of the regression tree algorithm to search for treatment effect heterogeneity, i.e., to discover the relevant subsets $\hat{\ell}$ from the data itself. The resulting partition $\hat{\Pi}$ and the estimates $\hat{\tau}(\hat{\ell})$, $\hat{\ell} \in \hat{\Pi}$, are called a causal tree. Their key contribution is to modify the standard regression tree algorithm in a way that accommodates treatment effect estimation (as opposed to prediction) and addresses the statistical inference problem discussed above.

The first proposed modification is what Athey and Imbens (2016) call an 'honest' approach. This consists of partitioning the available data $\mathcal{S} = \{(Y_i, D_i, X_i)\}_{i=1}^{n}$ into an estimation sample $\mathcal{S}^{est}$ and a training sample $\mathcal{S}^{tr}$. The search for heterogeneity, i.e., the partitioning of $\mathcal{X}$ into relevant subsets $\hat{\ell}$, is conducted entirely over the training sample $\mathcal{S}^{tr}$. Once a suitable partition of $\mathcal{X}$ is identified, it is taken as given and the group-specific treatment effects are re-estimated over the independent estimation sample that has been completely set aside up to that point. More formally, the eventual conditional average treatment effect estimates are of the form $\hat{\tau}_{\mathcal{S}^{est}}(\hat{\ell}_{\mathcal{S}^{tr}})$, where the notation emphasizes the use of the two samples for different purposes. Inference based on $\hat{\tau}_{\mathcal{S}^{est}}(\hat{\ell}_{\mathcal{S}^{tr}})$ can then proceed as if $\hat{\ell}_{\mathcal{S}^{tr}}$ were fixed. While in the DML literature sample splitting is an enhancement, it is absolutely crucial in this setting.

¹⁰ It is not hard to extend the following discussions to the more realistic case in which $m(X)$ needs to be estimated. We also note that even when $m(X)$ is known, it is more efficient to work with an estimated counterpart; see Hirano et al. (2003).
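The following sketch illustrates the honest two-sample workflow under simplifying assumptions: the propensity score is known and constant (a randomized design), and the partition search on $\mathcal{S}^{tr}$ is delegated to an off-the-shelf regression tree fit to the IPW-transformed outcome, which is only a stand-in for the modified splitting criterion described below. The leaf-specific effects are then re-estimated on the held-out $\mathcal{S}^{est}$.

```python
# A minimal sketch of 'honest' estimation: search for the partition on S_tr,
# re-estimate the leaf-specific treatment effects on S_est.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
n = 10_000
X = rng.normal(size=(n, 3))
m = np.full(n, 0.5)                                # known propensity score (randomized design)
D = rng.binomial(1, m)
Y = np.where(X[:, 0] > 0, 2.0, 0.5) * D + X[:, 1] + rng.normal(size=n)

psi = Y * D / m - Y * (1 - D) / (1 - m)            # IPW transform; E[psi | X] = tau(X)

# Sample splitting: the partition is chosen on S_tr only.
idx_tr, idx_est = train_test_split(np.arange(n), test_size=0.5, random_state=0)
tree = DecisionTreeRegressor(min_samples_leaf=200).fit(X[idx_tr], psi[idx_tr])

# Honest re-estimation: leaf-specific effects computed on the held-out S_est.
leaf_est = tree.apply(X[idx_est])
for j in np.unique(leaf_est):
    in_leaf = leaf_est == j
    print(f"leaf {j}: tau-hat = {psi[idx_est][in_leaf].mean():.2f}, n = {in_leaf.sum()}")
```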

The second modification concerns the MSE criterion used in the tree-building algorithm. The criterion function is used to compare candidate partitions, i.e., it is used to decide whether it is worth imposing additional splits on the data to estimate the CATE function in more detail. The proposed changes account for the fact that (i) instead of approximating a conditional expectation function the goal is to estimate treatment effects; and (ii) for any partition $\hat{\Pi}$ constructed from $\mathcal{S}^{tr}$, the corresponding conditional average treatment effects will be re-estimated using $\mathcal{S}^{est}$.

Technical discussion of the modified criterion. We now formally describe the proposed criterion function. Given a partition $\Pi = \{\ell_1, \ldots, \ell_{\#\Pi}\}$ of $\mathcal{X}$ and a sample $\mathcal{S}$, let

$$\hat{\tau}_{\mathcal{S}}(x; \Pi) = \sum_{j=1}^{\#\Pi} \hat{\tau}_{\mathcal{S}}(\ell_j) \, \mathbf{1}_{\ell_j}(x)$$

be the corresponding step function estimator of the CATE function $\tau(x)$, where the value of $\hat{\tau}_{\mathcal{S}}(x; \Pi)$ is the constant $\hat{\tau}_{\mathcal{S}}(\ell_j)$ for $x \in \ell_j$. For a given $x \in \mathcal{X}$, the MSE of the CATE estimator is $E[(\tau(x) - \hat{\tau}_{\mathcal{S}}(x; \Pi))^2]$; the proposed criterion function is based on the expected (average) MSE

$$\mathrm{EMSE}(\Pi) = E_{X_t,\, \mathcal{S}^{est}}\!\left[ \left( \tau(X_t) - \hat{\tau}_{\mathcal{S}^{est}}(X_t; \Pi) \right)^2 \right],$$

where $X_t$ is a new, independently drawn 'test' observation. Thus, the goal is to choose the partition $\Pi$ in a way so that $\hat{\tau}_{\mathcal{S}^{est}}(x; \Pi)$ provides a good approximation to $\tau(x)$ on average, where the averaging is with respect to the marginal distribution of $X$.

While $\mathrm{EMSE}(\Pi)$ cannot be evaluated analytically, it can still be estimated. To this end, one can rewrite $\mathrm{EMSE}(\Pi)$ as¹¹

$$\mathrm{EMSE}(\Pi) = E_{X_t}\!\left\{ V_{\mathcal{S}^{est}}\!\left[ \hat{\tau}_{\mathcal{S}^{est}}(X_t; \Pi) \right] \right\} - E\!\left[ \tau(X_t; \Pi)^2 \right] + E\!\left[ \tau(X_t)^2 \right], \tag{3.16}$$

where $V_{\mathcal{S}^{est}}(\cdot)$ denotes the variance operator with respect to the distribution of the sample $\mathcal{S}^{est}$ and

$$\tau(x; \Pi) = \sum_{j=1}^{\#\Pi} \tau(\ell_j) \, \mathbf{1}_{\ell_j}(x).$$
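To see where (3.16) comes from, here is a brief sketch of the argument (the full derivation is in the Electronic Online Supplement, Section 3.1); it uses only the unbiasedness condition $E_{\mathcal{S}^{est}}[\hat{\tau}_{\mathcal{S}^{est}}(x;\Pi)] = \tau(x;\Pi)$ and the independence of $X_t$ and $\mathcal{S}^{est}$. Adding and subtracting $\tau(X_t;\Pi)$ inside the square, the cross term vanishes, and

$$\mathrm{EMSE}(\Pi) = E\!\left[ \big( \tau(X_t) - \tau(X_t;\Pi) \big)^2 \right] + E_{X_t}\!\left\{ V_{\mathcal{S}^{est}}\!\left[ \hat{\tau}_{\mathcal{S}^{est}}(X_t;\Pi) \right] \right\}.$$

Since $\tau(X_t;\Pi)$ is the average of $\tau(X_t)$ within the leaf containing $X_t$, iterated expectations give $E[\tau(X_t)\,\tau(X_t;\Pi)] = E[\tau(X_t;\Pi)^2]$, so the first term equals $E[\tau(X_t)^2] - E[\tau(X_t;\Pi)^2]$, which yields (3.16).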

As the last term in (3.16) does not depend on $\Pi$, it does not affect the choice of the optimal partition. We will henceforth drop this term from (3.16) and denote the remaining two terms as EMSE, a convenient and inconsequential abuse of notation.

Recall that the key idea is to complete the tree-building process (i.e., the choice of the partition $\Pi$) on the basis of the training sample alone. Therefore, $\mathrm{EMSE}(\Pi)$ will be estimated using $\mathcal{S}^{tr}$; the only information used from $\mathcal{S}^{est}$ is the sample size, denoted as $\#\mathcal{S}^{est}$.

¹¹ Equation (3.16) is derived in the Electronic Online Supplement, Section 3.1. The derivations assume that $E(\hat{\tau}_{\mathcal{S}}(\ell_j)) = \tau(\ell_j)$, i.e., that the leaf-specific average treatment effect estimator is unbiased. This is true if the propensity score function is known but only approximately true otherwise.

We start with the expected variance term. It is given by

$$E_{X_t}\!\left\{ V_{\mathcal{S}^{est}}\!\left[ \hat{\tau}_{\mathcal{S}^{est}}(X_t; \Pi) \right] \right\} = \sum_{j=1}^{\#\Pi} V_{\mathcal{S}^{est}}\!\left[ \hat{\tau}_{\mathcal{S}^{est}}(\ell_j) \right] P(X_t \in \ell_j).$$

The variance of $\hat{\tau}_{\mathcal{S}^{est}}(\ell_j)$ is of the form $\sigma_j^2 / \#\ell_j^{est}$, where $\#\ell_j^{est}$ is the number of observations in the estimation sample falling in leaf $\ell_j$ and

$$\sigma_j^2 = V\!\left[ \frac{Y_i D_i}{m(X_i)} - \frac{Y_i (1 - D_i)}{1 - m(X_i)} \,\middle|\, X_i \in \ell_j \right].$$

Substituting $V_{\mathcal{S}^{est}}[\hat{\tau}_{\mathcal{S}^{est}}(\ell_j)] = \sigma_j^2 / \#\ell_j^{est}$ into the expected variance equation yields

$$E_{X_t}\!\left\{ V_{\mathcal{S}^{est}}\!\left[ \hat{\tau}_{\mathcal{S}^{est}}(X_t; \Pi) \right] \right\} = \sum_{j=1}^{\#\Pi} \frac{\sigma_j^2}{\#\ell_j^{est}} \, P(X_t \in \ell_j) = \frac{1}{\#\mathcal{S}^{est}} \sum_{j=1}^{\#\Pi} \sigma_j^2 \, P(X_t \in \ell_j) \, \frac{\#\mathcal{S}^{est}}{\#\ell_j^{est}}.$$

As $\#\mathcal{S}^{est} / \#\ell_j^{est} \approx 1 / P(X_t \in \ell_j)$, we can simply estimate the expected variance term by

$$\hat{E}_{X_t}\!\left\{ V_{\mathcal{S}^{est}}\!\left[ \hat{\tau}_{\mathcal{S}^{est}}(X_t; \Pi) \right] \right\} = \frac{1}{\#\mathcal{S}^{est}} \sum_{j=1}^{\#\Pi} \hat{\sigma}^2_{j, \mathcal{S}^{tr}}, \tag{3.17}$$

where $\hat{\sigma}^2_{j, \mathcal{S}^{tr}}$ is a suitable (approximately unbiased) estimator of $\sigma_j^2$ over the training sample.

Turning to the second moment term in (3.16), note that for any sample $\mathcal{S}$ and an independent observation $X_t$,

$$E_X\!\left[ \tau(X_t; \Pi)^2 \right] = E_X E_{\mathcal{S}}\!\left[ \hat{\tau}_{\mathcal{S}}(X_t; \Pi)^2 \right] - E_X\!\left\{ V_{\mathcal{S}}\!\left[ \hat{\tau}_{\mathcal{S}}(X_t; \Pi) \right] \right\}$$

because $E_{\mathcal{S}}[\hat{\tau}_{\mathcal{S}}(x; \Pi)] = \tau(x; \Pi)$ for any fixed point $x$. Thus, an unbiased estimator of $E_X[\tau(X_t; \Pi)^2]$ can be constructed from the training sample $\mathcal{S}^{tr}$ as

$$\hat{E}_X\!\left[ \tau(X_t; \Pi)^2 \right] = \frac{1}{\#\mathcal{S}^{tr}} \sum_{i \in \mathcal{S}^{tr}} \hat{\tau}_{\mathcal{S}^{tr}, -i}(X_i; \Pi)^2 - \frac{1}{\#\mathcal{S}^{tr}} \sum_{j=1}^{\#\Pi} \hat{\sigma}^2_{j, \mathcal{S}^{tr}}, \tag{3.18}$$

where $\hat{\tau}_{\mathcal{S}^{tr}, -i}$ is the leave-one-out version of $\hat{\tau}_{\mathcal{S}^{tr}}$ and we use the analog of (3.17) to estimate the expected variance of $\hat{\tau}_{\mathcal{S}^{tr}}(X_t; \Pi)$. Combining the estimators (3.17) and (3.18) with the decomposition (3.16) gives the estimated EMSE criterion function

$$\widehat{\mathrm{EMSE}}(\Pi) = \left( \frac{1}{\#\mathcal{S}^{est}} + \frac{1}{\#\mathcal{S}^{tr}} \right) \sum_{j=1}^{\#\Pi} \hat{\sigma}^2_{j, \mathcal{S}^{tr}} - \frac{1}{\#\mathcal{S}^{tr}} \sum_{i \in \mathcal{S}^{tr}} \hat{\tau}_{\mathcal{S}^{tr}, -i}(X_i; \Pi)^2. \tag{3.19}$$

The criterion function (3.19) has almost exactly the same form as in Athey and Imbens (2016), except that they do not use a leave-one-out estimator in the second term.

The intuition about how the criterion (3.19) works is straightforward. Say that $\Pi$ and $\Pi'$ are two partitions, where $\Pi'$ is finer in the sense that there is an additional split along a given $X$ coordinate. If the two estimated treatment effects are not equal across this extra split, then the second term of the criterion, which enters with a negative sign, will increase in absolute value, making $\widehat{\mathrm{EMSE}}(\Pi')$ ceteris paribus lower. However, the first term of the criterion also takes into account the fact that an additional split will result in leaves with fewer observations, increasing the variance of the ultimate CATE estimate. In other words, the first term will generally increase with an additional split, and it is the net effect that determines whether $\Pi$ or $\Pi'$ is deemed a better fit to the data.
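A minimal sketch of the estimated criterion (3.19) for a given candidate partition may help fix ideas. It assumes, as above, that `psi_tr` holds the IPW-transformed outcomes of the training observations, that `leaf_tr` holds their leaf labels under the candidate partition $\Pi$, that each leaf contains at least two training observations, and that the leaf-level estimator is the within-leaf mean of $\psi$, so its leave-one-out version has a simple closed form.

```python
# A minimal sketch of the estimated criterion (3.19) for a candidate partition.
import numpy as np

def emse_hat(psi_tr, leaf_tr, n_est):
    """(1/#S_est + 1/#S_tr) * sum_j sigma-hat^2_j - (1/#S_tr) * sum_i tau-hat_{-i}(X_i)^2."""
    n_tr = len(psi_tr)
    variance_term, loo_sq_sum = 0.0, 0.0
    for j in np.unique(leaf_tr):
        p = psi_tr[leaf_tr == j]
        variance_term += p.var(ddof=1)             # sigma-hat^2_{j, S_tr}
        loo = (p.sum() - p) / (len(p) - 1)         # leave-one-out within-leaf mean
        loo_sq_sum += np.sum(loo ** 2)
    return (1.0 / n_est + 1.0 / n_tr) * variance_term - loo_sq_sum / n_tr
```

Comparing `emse_hat` across candidate splits (lower is better) then reproduces the trade-off described above: extra splits that capture genuine heterogeneity lower the second term, while splits that merely shrink the leaves inflate the first term.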

The criterion function (3.19) is generalizable to other estimators. In fact, the precise structure of $\hat{\tau}_{\mathcal{S}}(\ell)$ played almost no role in deriving (3.19); the only properties we made use of were unbiasedness ($E_{\mathcal{S}}(\hat{\tau}_{\mathcal{S}}(\ell_j)) = \tau(\ell_j)$) and that the variance of $\hat{\tau}_{\mathcal{S}}(\ell_j)$ is of the form $\sigma_j^2 / \#\ell_j$. Hence, other types of estimators could be implemented in each leaf; for example, Reguly (2021) uses this insight to extend the framework to the parametric sharp regression discontinuity design.

3.4.3 Extensions and Technical Variations on the Causal Tree Approach

Wager and Athey (2018) extend the causal tree approach of Athey and Imbens (2016) to causal forests, which are composed of causal trees with a conditional average treatment effect estimate in each leaf. To build a causal forest estimator, one first generates a large number of random subsamples and grows a causal tree on each subsample using a given procedure. The causal forest estimate of the CATE function is then obtained by averaging the estimates produced by the individual trees. Practically, a forest approach can reduce variance and smooth the estimate.

Wager and Athey (2018) propose two honest causal forest algorithms. One is based on double-sample trees and the other is based on propensity score trees. The construction of a double-sample tree involves a procedure similar to the one described in Section 3.4.2. First one draws without replacement $B$ subsamples of size $s = O(n^{\rho})$ for some $0 < \rho < 1$ from the original data. Let these artificial samples be denoted as $\mathcal{S}_b$, $b = 1, \ldots, B$. Then one splits each $\mathcal{S}_b$ into two parts with size $\#\mathcal{S}_b^{tr} = \#\mathcal{S}_b^{est} = s/2$.¹² The tree is grown using the $\mathcal{S}_b^{tr}$ data and the leafwise treatment effects are estimated using the $\mathcal{S}_b^{est}$ data, generating an individual estimator $\hat{\tau}_b(x)$. The splits of the tree are chosen by minimizing an expected MSE criterion analogous to (3.19), where $\mathcal{S}_b^{tr}$ takes the role of $\mathcal{S}^{tr}$ and $\mathcal{S}_b^{est}$ takes the role of $\mathcal{S}^{est}$. For any fixed value $x$, the causal forest CATE estimator $\hat{\tau}(x)$ is obtained by averaging the individual estimates $\hat{\tau}_b(x)$, i.e., $\hat{\tau}(x) = B^{-1} \sum_{b=1}^{B} \hat{\tau}_b(x)$. The propensity score tree procedure is similar except that one grows the tree using the whole subsample $\mathcal{S}_b$ and uses the treatment assignment as the outcome variable. Wager and Athey (2018) show that the random forest estimates are asymptotically normally distributed and the asymptotic variance can be consistently estimated by the infinitesimal jackknife, so valid statistical inference is available.

¹² We assume $s$ is an even number.
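The subsample-split-average logic of the double-sample tree can be sketched as follows. For brevity, the tree grown in each subsample is an off-the-shelf regression tree fit to the IPW-transformed outcome rather than a tree grown with the honest splitting criterion of Wager and Athey (2018), so the snippet is only a stand-in that illustrates the subsampling (with $s \le n$ and $s$ even), the honest re-estimation, and the averaging steps; all tuning parameters are illustrative.

```python
# A minimal sketch of the double-sample (subsample-and-average) causal forest idea.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def causal_forest_sketch(X, psi, x_new, B=200, s=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(psi)
    preds = np.zeros((B, len(x_new)))
    for b in range(B):
        sub = rng.choice(n, size=s, replace=False)        # subsample S_b
        tr, est = sub[: s // 2], sub[s // 2 :]             # S_b^tr and S_b^est
        tree = DecisionTreeRegressor(min_samples_leaf=25).fit(X[tr], psi[tr])
        # honest leaf estimates from S_b^est, evaluated at the query points
        leaf_est, leaf_new = tree.apply(X[est]), tree.apply(x_new)
        leaf_means = {j: psi[est][leaf_est == j].mean() for j in np.unique(leaf_est)}
        grand_mean = psi[est].mean()                       # fallback for empty leaves
        preds[b] = [leaf_means.get(j, grand_mean) for j in leaf_new]
    return preds.mean(axis=0)                              # tau-hat(x) = B^-1 sum_b tau-hat_b(x)
```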

Athey et al. (2019) propose a generalized random forest method that transforms the traditional random forest into a flexible procedure for estimating any unknown parameter identified via local moment conditions. The main idea is to use forest-based algorithms to learn problem-specific weights so as to be able to solve for the parameter of interest via a weighted local M-estimator. For each tree, the splitting decisions are based on a gradient tree algorithm; see Athey et al. (2019) for further details.

There are other methods proposed in the literature for the estimation of treatment effect heterogeneity, such as the neural-network-based approaches of Yao et al. (2018) and Alaa, Weisz and van der Schaar (2017). These papers do not provide asymptotic theory, and valid statistical inference is not yet available, so this is an interesting direction for future research.
