
3.4 Using Machine Learning to Discover Treatment Effect Heterogeneity

3.4.2 The Causal Tree Approach

Regression tree basics. Regression trees are algorithmically constructed step functions used to approximate conditional expectations such as $E(Y \mid X)$. More specifically, let $\Pi = \{\ell_1, \ell_2, \ldots, \ell_{\#\Pi}\}$ be a partition of $\mathcal{X} = \mathrm{support}(X)$ and define

$$\bar{Y}_j = \frac{\sum_{i=1}^{n} Y_i \mathbf{1}_{\ell_j}(X_i)}{\sum_{i=1}^{n} \mathbf{1}_{\ell_j}(X_i)}, \qquad j = 1, \ldots, \#\Pi,$$

to be the average outcome for those observations $i$ for which $X_i \in \ell_j$. A regression tree estimates $E(Y \mid X = x)$ using a step function

$$\hat{\mu}(x; \Pi) = \sum_{j=1}^{\#\Pi} \bar{Y}_j \, \mathbf{1}_{\ell_j}(x). \tag{3.14}$$

The regression tree algorithm considers partitions $\Pi$ that are constructed based on recursive splits of the support of the components of $X$. Thus, the subsets $\ell_j$, which are called the leaves of the tree, are given by intersections of sets of the form $\{X_k \le c\}$ or $\{X_k > c\}$, where $X_k$ denotes the $k$th component of $X$. In building the regression tree, candidate partitions are evaluated through a mean squared error (MSE) criterion with an added term that penalizes the number of splits to avoid overfitting. Chapter 2 of this volume provides a more in-depth look at classification and regression trees.
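To make the step-function representation concrete, the sketch below fits a CART regression tree and verifies that its prediction coincides with the leafwise averages $\bar{Y}_j$ in (3.14). The simulated data, the scikit-learn implementation, and all tuning parameters (`min_samples_leaf`, `ccp_alpha`) are illustrative assumptions rather than anything prescribed in the text.

```python
# A minimal sketch of the regression-tree step function in (3.14).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(size=(n, 3))                      # three covariates
y = np.sin(4 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

# ccp_alpha plays the role of the penalty on the number of splits.
tree = DecisionTreeRegressor(min_samples_leaf=50, ccp_alpha=1e-3).fit(X, y)

# Each leaf j defines a cell ell_j; the prediction for x in ell_j is the
# within-leaf average outcome Y-bar_j, i.e., exactly the step function (3.14).
leaf_id = tree.apply(X)                           # leaf membership of each observation
leaf_means = {j: y[leaf_id == j].mean() for j in np.unique(leaf_id)}
step_fn_pred = np.array([leaf_means[j] for j in leaf_id])
assert np.allclose(step_fn_pred, tree.predict(X)) # the tree is the step function
```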

From a regression tree to a causal tree. Under the unconfoundedness assumption, one way to estimate the ATE consistently is to use the inverse probability weighted estimator proposed by Hirano, Imbens and Ridder (2003):

$$\hat{\tau} = \frac{1}{\#\mathcal{S}} \sum_{i \in \mathcal{S}} \left[ \frac{Y_i D_i}{m(X_i)} - \frac{Y_i (1 - D_i)}{1 - m(X_i)} \right], \tag{3.15}$$

where $\mathcal{S}$ is the sample of observations on $(Y_i, D_i, X_i)$, $\#\mathcal{S}$ is the sample size, and $m(\cdot)$ is the propensity score function, i.e., the conditional probability $m(X) = P(D = 1 \mid X)$.
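As a quick illustration, the following minimal sketch computes the estimator (3.15) on simulated data with a known propensity score; the data-generating design (a logistic propensity score and a constant treatment effect of 1) is purely an assumption made for the example.

```python
# A minimal sketch of the IPW estimator (3.15) with a known propensity score.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
X = rng.normal(size=(n, 2))
m = 1.0 / (1.0 + np.exp(-X[:, 0]))                # known propensity score m(X)
D = rng.binomial(1, m)                            # treatment assignment
Y = 1.0 * D + X @ np.array([0.5, -0.3]) + rng.normal(size=n)

def ipw_ate(y, d, m):
    """Estimator (3.15): sample average of Y*D/m(X) - Y*(1-D)/(1-m(X))."""
    return np.mean(y * d / m - y * (1 - d) / (1 - m))

print(ipw_ate(Y, D, m))                           # close to the true ATE of 1
```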

To simplify the exposition, we will assume that the function $m(X)$ is known.¹⁰ Given a subset $\ell \subset \mathcal{X}$, one can also implement the estimator (3.15) in the subsample of observations for which $X_i \in \ell$, yielding an estimate of the conditional average treatment effect $\tau(\ell) = E[Y(1) - Y(0) \mid X_i \in \ell]$. More specifically, we define

$$\hat{\tau}_{\mathcal{S}}(\ell) = \frac{1}{\#\ell} \sum_{i \in \mathcal{S},\, X_i \in \ell} \left[ \frac{Y_i D_i}{m(X_i)} - \frac{Y_i (1 - D_i)}{1 - m(X_i)} \right],$$

where $\#\ell$ is the number of observations that fall in $\ell$.
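The leaf-level estimator is thus the same IPW average restricted to the observations falling in $\ell$. A minimal sketch, again under the assumption of a known propensity score and with an illustrative subgroup (the first covariate being positive):

```python
# A minimal sketch of the leaf-level estimator tau-hat_S(ell): the IPW
# transformation of (3.15), averaged only over observations with X_i in ell.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
X = rng.normal(size=(n, 2))
m = 1.0 / (1.0 + np.exp(-X[:, 0]))                # known propensity score m(X)
D = rng.binomial(1, m)
tau_x = np.where(X[:, 0] > 0, 2.0, 0.5)           # heterogeneous treatment effect
Y = tau_x * D + X[:, 1] + rng.normal(size=n)

def ipw_cate(y, d, m, in_leaf):
    """tau-hat_S(ell): IPW average over the #ell observations falling in ell."""
    y, d, m = y[in_leaf], d[in_leaf], m[in_leaf]
    return np.mean(y * d / m - y * (1 - d) / (1 - m))

print(ipw_cate(Y, D, m, X[:, 0] > 0))             # roughly 2.0
print(ipw_cate(Y, D, m, X[:, 0] <= 0))            # roughly 0.5
```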

Computing $\hat{\tau}_{\mathcal{S}}(\ell)$ for various choices of $\ell$ is called subgroup analysis. Based on subject matter theory or policy considerations, a researcher may pre-specify some subgroups of interest. For example, theory may predict that the effect changes as a function of age or income. Nevertheless, theory is rarely detailed enough to say exactly how to specify the relevant age or income groups and may not be able to say whether the two variables interact with each other or with other variables. There may be situations, especially if the dimension of $X$ is high, where the relevant subsets need to be discovered completely empirically by conducting as detailed a search as possible. However, it is now well understood that mining the data 'by hand' for relevant subgroups is problematic for two reasons. First, some of these groups may be complex and hard to discover, e.g., they may involve interactions between several variables. Moreover, if $X$ is high-dimensional there are simply too many possibilities to consider.

Second, it is not clear how to conduct inference for groups uncovered by data mining. The (asymptotic) distribution of $\hat{\tau}(\ell)$ is well understood for fixed $\ell$. But if the search procedure picks a group $\hat{\ell}$ because, say, $\hat{\tau}(\hat{\ell}) - \hat{\tau}$ is large, then the distribution of $\hat{\tau}(\hat{\ell})$ will of course differ from the fixed-$\ell$ case.

In their influential paper, Athey and Imbens (2016) propose the use of the regression tree algorithm to search for treatment effect heterogeneity, i.e., to discover the relevant subsets $\hat{\ell}$ from the data itself. The resulting partition $\hat{\Pi}$ and the estimates $\hat{\tau}(\hat{\ell})$, $\hat{\ell} \in \hat{\Pi}$, are called a causal tree. Their key contribution is to modify the standard regression tree algorithm in a way that accommodates treatment effect estimation (as opposed to prediction) and addresses the statistical inference problem discussed above.

The first proposed modification is what Athey and Imbens (2016) call an 'honest' approach. This consists of partitioning the available data $\mathcal{S} = \{(Y_i, D_i, X_i)\}_{i=1}^{n}$ into an estimation sample $\mathcal{S}^{est}$ and a training sample $\mathcal{S}^{tr}$. The search for heterogeneity, i.e., the partitioning of $\mathcal{X}$ into relevant subsets $\hat{\ell}$, is conducted entirely over the training sample $\mathcal{S}^{tr}$. Once a suitable partition of $\mathcal{X}$ is identified, it is taken as given and the group-specific treatment effects are re-estimated over the independent estimation sample that has been completely set aside up to that point. More formally, the eventual conditional average treatment effect estimates are of the form $\hat{\tau}_{\mathcal{S}^{est}}(\hat{\ell}_{\mathcal{S}^{tr}})$, where the notation emphasizes the use of the two samples for different purposes. Inference based on $\hat{\tau}_{\mathcal{S}^{est}}(\hat{\ell}_{\mathcal{S}^{tr}})$ can then proceed as if $\hat{\ell}_{\mathcal{S}^{tr}}$ were fixed. While in the DML literature sample splitting is an enhancement, it is absolutely crucial in this setting.

¹⁰ It is not hard to extend the following discussions to the more realistic case in which $m(X)$ needs to be estimated. We also note that even when $m(X)$ is known, it is more efficient to work with an estimated counterpart; see Hirano et al. (2003).
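The following sketch illustrates the honest two-sample workflow under simplifying assumptions: the propensity score is known and constant (a randomized design), and the partition search on $\mathcal{S}^{tr}$ is delegated to an off-the-shelf regression tree fit to the IPW-transformed outcome, which is only a stand-in for the modified splitting criterion described below. The leaf-specific effects are then re-estimated on the held-out $\mathcal{S}^{est}$.

```python
# A minimal sketch of 'honest' estimation: search for the partition on S_tr,
# re-estimate the leaf-specific treatment effects on S_est.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
n = 10_000
X = rng.normal(size=(n, 3))
m = np.full(n, 0.5)                                # known propensity score (randomized design)
D = rng.binomial(1, m)
Y = np.where(X[:, 0] > 0, 2.0, 0.5) * D + X[:, 1] + rng.normal(size=n)

psi = Y * D / m - Y * (1 - D) / (1 - m)            # IPW transform; E[psi | X] = tau(X)

# Sample splitting: the partition is chosen on S_tr only.
idx_tr, idx_est = train_test_split(np.arange(n), test_size=0.5, random_state=0)
tree = DecisionTreeRegressor(min_samples_leaf=200).fit(X[idx_tr], psi[idx_tr])

# Honest re-estimation: leaf-specific effects computed on the held-out S_est.
leaf_est = tree.apply(X[idx_est])
for j in np.unique(leaf_est):
    in_leaf = leaf_est == j
    print(f"leaf {j}: tau-hat = {psi[idx_est][in_leaf].mean():.2f}, n = {in_leaf.sum()}")
```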

The second modification concerns the MSE criterion used in the tree-building algorithm. The criterion function is used to compare candidate partitions, i.e., it is used to decide whether it is worth imposing additional splits on the data to estimate the CATE function in more detail. The proposed changes account for the fact that (i) instead of approximating a conditional expectation function the goal is to estimate treatment effects; and (ii) for any partition $\hat{\Pi}$ constructed from $\mathcal{S}^{tr}$, the corresponding conditional average treatment effects will be re-estimated using $\mathcal{S}^{est}$.

Technical discussion of the modified criterion. We now formally describe the proposed criterion function. Given a partition $\Pi = \{\ell_1, \ldots, \ell_{\#\Pi}\}$ of $\mathcal{X}$ and a sample $\mathcal{S}$, let

$$\hat{\tau}_{\mathcal{S}}(x; \Pi) = \sum_{j=1}^{\#\Pi} \hat{\tau}_{\mathcal{S}}(\ell_j) \, \mathbf{1}_{\ell_j}(x)$$

be the corresponding step function estimator of the CATE function $\tau(x)$, where the value of $\hat{\tau}_{\mathcal{S}}(x; \Pi)$ is the constant $\hat{\tau}_{\mathcal{S}}(\ell_j)$ for $x \in \ell_j$. For a given $x \in \mathcal{X}$, the MSE of the CATE estimator is $E[(\tau(x) - \hat{\tau}_{\mathcal{S}}(x; \Pi))^2]$; the proposed criterion function is based on the expected (average) MSE

$$\mathrm{EMSE}(\Pi) = E_{X_t,\, \mathcal{S}^{est}}\!\left[ \left( \tau(X_t) - \hat{\tau}_{\mathcal{S}^{est}}(X_t; \Pi) \right)^2 \right],$$

where $X_t$ is a new, independently drawn 'test' observation. Thus, the goal is to choose the partition $\Pi$ in a way so that $\hat{\tau}_{\mathcal{S}^{est}}(x; \Pi)$ provides a good approximation to $\tau(x)$ on average, where the averaging is with respect to the marginal distribution of $X$.

While $\mathrm{EMSE}(\Pi)$ cannot be evaluated analytically, it can still be estimated. To this end, one can rewrite $\mathrm{EMSE}(\Pi)$ as¹¹

$$\mathrm{EMSE}(\Pi) = E_{X_t}\!\left\{ V_{\mathcal{S}^{est}}\!\left[ \hat{\tau}_{\mathcal{S}^{est}}(X_t; \Pi) \right] \right\} - E\!\left[ \tau(X_t; \Pi)^2 \right] + E\!\left[ \tau(X_t)^2 \right], \tag{3.16}$$

where $V_{\mathcal{S}^{est}}(\cdot)$ denotes the variance operator with respect to the distribution of the sample $\mathcal{S}^{est}$ and

$$\tau(x; \Pi) = \sum_{j=1}^{\#\Pi} \tau(\ell_j) \, \mathbf{1}_{\ell_j}(x).$$
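To see where (3.16) comes from, here is a brief sketch of the argument (the full derivation is in the Electronic Online Supplement, Section 3.1); it uses only the unbiasedness condition $E_{\mathcal{S}^{est}}[\hat{\tau}_{\mathcal{S}^{est}}(x;\Pi)] = \tau(x;\Pi)$ and the independence of $X_t$ and $\mathcal{S}^{est}$. Adding and subtracting $\tau(X_t;\Pi)$ inside the square, the cross term vanishes, and

$$\mathrm{EMSE}(\Pi) = E\!\left[ \big( \tau(X_t) - \tau(X_t;\Pi) \big)^2 \right] + E_{X_t}\!\left\{ V_{\mathcal{S}^{est}}\!\left[ \hat{\tau}_{\mathcal{S}^{est}}(X_t;\Pi) \right] \right\}.$$

Since $\tau(X_t;\Pi)$ is the average of $\tau(X_t)$ within the leaf containing $X_t$, iterated expectations give $E[\tau(X_t)\,\tau(X_t;\Pi)] = E[\tau(X_t;\Pi)^2]$, so the first term equals $E[\tau(X_t)^2] - E[\tau(X_t;\Pi)^2]$, which yields (3.16).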

As the last term in (3.16) does not depend on $\Pi$, it does not affect the choice of the optimal partition. We will henceforth drop this term from (3.16) and denote the remaining two terms as EMSE, a convenient and inconsequential abuse of notation.

Recall that the key idea is to complete the tree-building process (i.e., the choice of the partition $\Pi$) on the basis of the training sample alone. Therefore, $\mathrm{EMSE}(\Pi)$ will be estimated using $\mathcal{S}^{tr}$; the only information used from $\mathcal{S}^{est}$ is the sample size, denoted as $\#\mathcal{S}^{est}$.

¹¹ Equation (3.16) is derived in the Electronic Online Supplement, Section 3.1. The derivations assume that $E(\hat{\tau}_{\mathcal{S}}(\ell_j)) = \tau(\ell_j)$, i.e., that the leaf-specific average treatment effect estimator is unbiased. This is true if the propensity score function is known but only approximately true otherwise.

We start with the expected variance term. It is given by

$$E_{X_t}\!\left\{ V_{\mathcal{S}^{est}}\!\left[ \hat{\tau}_{\mathcal{S}^{est}}(X_t; \Pi) \right] \right\} = \sum_{j=1}^{\#\Pi} V_{\mathcal{S}^{est}}\!\left[ \hat{\tau}_{\mathcal{S}^{est}}(\ell_j) \right] P(X_t \in \ell_j).$$

The variance of $\hat{\tau}_{\mathcal{S}^{est}}(\ell_j)$ is of the form $\sigma_j^2 / \#\ell_j^{est}$, where $\#\ell_j^{est}$ is the number of observations in the estimation sample falling in leaf $\ell_j$ and

$$\sigma_j^2 = V\!\left[ \frac{Y_i D_i}{m(X_i)} - \frac{Y_i (1 - D_i)}{1 - m(X_i)} \,\middle|\, X_i \in \ell_j \right].$$

Substituting $V_{\mathcal{S}^{est}}[\hat{\tau}_{\mathcal{S}^{est}}(\ell_j)] = \sigma_j^2 / \#\ell_j^{est}$ into the expected variance equation yields

$$E_{X_t}\!\left\{ V_{\mathcal{S}^{est}}\!\left[ \hat{\tau}_{\mathcal{S}^{est}}(X_t; \Pi) \right] \right\} = \sum_{j=1}^{\#\Pi} \frac{\sigma_j^2}{\#\ell_j^{est}} \, P(X_t \in \ell_j) = \frac{1}{\#\mathcal{S}^{est}} \sum_{j=1}^{\#\Pi} \sigma_j^2 \, P(X_t \in \ell_j) \, \frac{\#\mathcal{S}^{est}}{\#\ell_j^{est}}.$$

As $\#\mathcal{S}^{est} / \#\ell_j^{est} \approx 1 / P(X_t \in \ell_j)$, we can simply estimate the expected variance term by

$$\hat{E}_{X_t}\!\left\{ V_{\mathcal{S}^{est}}\!\left[ \hat{\tau}_{\mathcal{S}^{est}}(X_t; \Pi) \right] \right\} = \frac{1}{\#\mathcal{S}^{est}} \sum_{j=1}^{\#\Pi} \hat{\sigma}^2_{j, \mathcal{S}^{tr}}, \tag{3.17}$$

where $\hat{\sigma}^2_{j, \mathcal{S}^{tr}}$ is a suitable (approximately unbiased) estimator of $\sigma_j^2$ over the training sample.

Turning to the second moment term in (3.16), note that for any sample $\mathcal{S}$ and an independent observation $X_t$,

$$E_X\!\left[ \tau(X_t; \Pi)^2 \right] = E_X E_{\mathcal{S}}\!\left[ \hat{\tau}_{\mathcal{S}}(X_t; \Pi)^2 \right] - E_X\!\left\{ V_{\mathcal{S}}\!\left[ \hat{\tau}_{\mathcal{S}}(X_t; \Pi) \right] \right\}$$

because $E_{\mathcal{S}}[\hat{\tau}_{\mathcal{S}}(x; \Pi)] = \tau(x; \Pi)$ for any fixed point $x$. Thus, an unbiased estimator of $E_X[\tau(X_t; \Pi)^2]$ can be constructed from the training sample $\mathcal{S}^{tr}$ as

$$\hat{E}_X\!\left[ \tau(X_t; \Pi)^2 \right] = \frac{1}{\#\mathcal{S}^{tr}} \sum_{i \in \mathcal{S}^{tr}} \hat{\tau}_{\mathcal{S}^{tr}, -i}(X_i; \Pi)^2 - \frac{1}{\#\mathcal{S}^{tr}} \sum_{j=1}^{\#\Pi} \hat{\sigma}^2_{j, \mathcal{S}^{tr}}, \tag{3.18}$$

where $\hat{\tau}_{\mathcal{S}^{tr}, -i}$ is the leave-one-out version of $\hat{\tau}_{\mathcal{S}^{tr}}$ and we use the analog of (3.17) to estimate the expected variance of $\hat{\tau}_{\mathcal{S}^{tr}}(X_t; \Pi)$. Combining the estimators (3.17) and (3.18) with the decomposition (3.16) gives the estimated EMSE criterion function

$$\widehat{\mathrm{EMSE}}(\Pi) = \left( \frac{1}{\#\mathcal{S}^{est}} + \frac{1}{\#\mathcal{S}^{tr}} \right) \sum_{j=1}^{\#\Pi} \hat{\sigma}^2_{j, \mathcal{S}^{tr}} - \frac{1}{\#\mathcal{S}^{tr}} \sum_{i \in \mathcal{S}^{tr}} \hat{\tau}_{\mathcal{S}^{tr}, -i}(X_i; \Pi)^2. \tag{3.19}$$

The criterion function (3.19) has almost exactly the same form as in Athey and Imbens (2016), except that they do not use a leave-one-out estimator in the second term.

The intuition about how the criterion (3.19) works is straightforward. Say that $\Pi$ and $\Pi'$ are two partitions, where $\Pi'$ is finer in the sense that there is an additional split along a given $X$ coordinate. If the two estimated treatment effects are not equal across this extra split, then the second term of the criterion, which enters with a negative sign, will increase in absolute value, making $\widehat{\mathrm{EMSE}}(\Pi')$ ceteris paribus lower. However, the first term of the criterion also takes into account the fact that an additional split will result in leaves with fewer observations, increasing the variance of the ultimate CATE estimate. In other words, the first term will generally increase with an additional split, and it is the net effect that determines whether $\Pi$ or $\Pi'$ is deemed a better fit to the data.
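A minimal sketch of the estimated criterion (3.19) for a given candidate partition may help fix ideas. It assumes, as above, that `psi_tr` holds the IPW-transformed outcomes of the training observations, that `leaf_tr` holds their leaf labels under the candidate partition $\Pi$, that each leaf contains at least two training observations, and that the leaf-level estimator is the within-leaf mean of $\psi$, so its leave-one-out version has a simple closed form.

```python
# A minimal sketch of the estimated criterion (3.19) for a candidate partition.
import numpy as np

def emse_hat(psi_tr, leaf_tr, n_est):
    """(1/#S_est + 1/#S_tr) * sum_j sigma-hat^2_j - (1/#S_tr) * sum_i tau-hat_{-i}(X_i)^2."""
    n_tr = len(psi_tr)
    variance_term, loo_sq_sum = 0.0, 0.0
    for j in np.unique(leaf_tr):
        p = psi_tr[leaf_tr == j]
        variance_term += p.var(ddof=1)             # sigma-hat^2_{j, S_tr}
        loo = (p.sum() - p) / (len(p) - 1)         # leave-one-out within-leaf mean
        loo_sq_sum += np.sum(loo ** 2)
    return (1.0 / n_est + 1.0 / n_tr) * variance_term - loo_sq_sum / n_tr
```

Comparing `emse_hat` across candidate splits (lower is better) then reproduces the trade-off described above: extra splits that capture genuine heterogeneity lower the second term, while splits that merely shrink the leaves inflate the first term.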

The criterion function (3.19) is generalizable to other estimators. In fact, the precise structure of $\hat{\tau}_{\mathcal{S}}(\ell)$ played almost no role in deriving (3.19); the only properties we made use of were unbiasedness ($E_{\mathcal{S}}(\hat{\tau}_{\mathcal{S}}(\ell_j)) = \tau(\ell_j)$) and that the variance of $\hat{\tau}_{\mathcal{S}}(\ell_j)$ is of the form $\sigma_j^2 / \#\ell_j$. Hence, other types of estimators could be implemented in each leaf; for example, Reguly (2021) uses this insight to extend the framework to the parametric sharp regression discontinuity design.

3.4.3 Extensions and Technical Variations on the Causal Tree Approach

Wager and Athey (2018) extend the causal tree approach of Athey and Imbens (2016) to causal forests, which are composed of causal trees with a conditional average treatment effect estimate in each leaf. To build a causal forest estimator, one first generates a large number of random subsamples and grows a causal tree on each subsample using a given procedure. The causal forest estimate of the CATE function is then obtained by averaging the estimates produced by the individual trees. Practically, a forest approach can reduce variance and smooth the estimate.

Wager and Athey (2018) propose two honest causal forest algorithms. One is based on double-sample trees and the other is based on propensity score trees. The construction of a double-sample tree involves a procedure similar to the one described in Section 3.4.2. First one draws without replacement $B$ subsamples of size $s = O(n^{\rho})$ for some $0 < \rho < 1$ from the original data. Let these artificial samples be denoted as $\mathcal{S}_b$, $b = 1, \ldots, B$. Then one splits each $\mathcal{S}_b$ into two parts with size $\#\mathcal{S}_b^{tr} = \#\mathcal{S}_b^{est} = s/2$.¹² The tree is grown using the $\mathcal{S}_b^{tr}$ data and the leafwise treatment effects are estimated using the $\mathcal{S}_b^{est}$ data, generating an individual estimator $\hat{\tau}_b(x)$. The splits of the tree are chosen by minimizing an expected MSE criterion analogous to (3.19), where $\mathcal{S}_b^{tr}$ takes the role of $\mathcal{S}^{tr}$ and $\mathcal{S}_b^{est}$ takes the role of $\mathcal{S}^{est}$. For any fixed value $x$, the causal forest CATE estimator $\hat{\tau}(x)$ is obtained by averaging the individual estimates $\hat{\tau}_b(x)$, i.e., $\hat{\tau}(x) = B^{-1} \sum_{b=1}^{B} \hat{\tau}_b(x)$. The propensity score tree procedure is similar except that one grows the tree using the whole subsample $\mathcal{S}_b$ and uses the treatment assignment as the outcome variable. Wager and Athey (2018) show that the random forest estimates are asymptotically normally distributed and the asymptotic variance can be consistently estimated by the infinitesimal jackknife, so valid statistical inference is available.

¹² We assume $s$ is an even number.
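The subsample-split-average logic of the double-sample tree can be sketched as follows. For brevity, the tree grown in each subsample is an off-the-shelf regression tree fit to the IPW-transformed outcome rather than a tree grown with the honest splitting criterion of Wager and Athey (2018), so the snippet is only a stand-in that illustrates the subsampling (with $s \le n$ and $s$ even), the honest re-estimation, and the averaging steps; all tuning parameters are illustrative.

```python
# A minimal sketch of the double-sample (subsample-and-average) causal forest idea.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def causal_forest_sketch(X, psi, x_new, B=200, s=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(psi)
    preds = np.zeros((B, len(x_new)))
    for b in range(B):
        sub = rng.choice(n, size=s, replace=False)        # subsample S_b
        tr, est = sub[: s // 2], sub[s // 2 :]             # S_b^tr and S_b^est
        tree = DecisionTreeRegressor(min_samples_leaf=25).fit(X[tr], psi[tr])
        # honest leaf estimates from S_b^est, evaluated at the query points
        leaf_est, leaf_new = tree.apply(X[est]), tree.apply(x_new)
        leaf_means = {j: psi[est][leaf_est == j].mean() for j in np.unique(leaf_est)}
        grand_mean = psi[est].mean()                       # fallback for empty leaves
        preds[b] = [leaf_means.get(j, grand_mean) for j in leaf_new]
    return preds.mean(axis=0)                              # tau-hat(x) = B^-1 sum_b tau-hat_b(x)
```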

Athey et al. (2019) propose a generalized random forest method that transforms the traditional random forest into a flexible procedure for estimating any unknown parameter identified via local moment conditions. The main idea is to use forest-based algorithms to learn problem-specific weights so as to be able to solve for the parameter of interest via a weighted local M-estimator. For each tree, the splitting decisions are based on a gradient tree algorithm; see Athey et al. (2019) for further details.

There are other methods proposed in the literature for the estimation of treatment effect heterogeneity, such as the neural-network-based approaches of Yao et al. (2018) and Alaa, Weisz and van der Schaar (2017). These papers do not provide asymptotic theory, and valid statistical inference is not yet available, so this is an interesting direction for future research.
