
3.5 Empirical Illustration

We revisit the application in Abrevaya et al. (2015) and Fan et al. (2020) and study the effect of maternal smoking during pregnancy on the baby’s birthweight. The data set comes from vital statistics records in North Carolina between 1988 and 2002 and is large both in terms of the number of observations and the available covariates.

We restrict the sample to first time mothers and, following the literature, analyze the black and Caucasian (white) subsamples separately.16 The two sample sizes are 157,989 and 433,558, respectively.

The data set contains a rich set of individual-level covariates describing the mother's socioeconomic status and medical history, including the progress of the pregnancy. This is supplemented by zip-code level information corresponding to the mother's residence (e.g., per capita income, population density, etc.). Table 3.1 summarizes the definition of the dependent variable (𝑌), the treatment variable (𝐷) and the control variables used in the analysis. We divide the latter into a primary set 𝑋1, which includes the more important covariates and some of their powers and interactions, and a secondary set 𝑋2, which includes auxiliary controls and their transformations.17 Altogether, 𝑋1 includes 26 variables, while the union of 𝑋1 and 𝑋2 has 679 components for the black subsample and 743 for the Caucasian one.

16 The reason for focusing on first time mothers is that, in the case of previous deliveries, we cannot identify the data points corresponding to them. Birth outcomes for the same mother at different points in time are also more likely to be affected by unobserved heterogeneity (see Abrevaya et al., 2015 for a more detailed discussion).

17 As 𝑋1 and 𝑋2 already include transformations of the raw covariates, these vectors correspond to the dictionary 𝑏(𝑋) in the notation of Section 3.3.

Table 3.1: The variables used in the empirical exercise

𝑌   bweight           birth weight of the baby (in grams)
𝐷   smoke             if mother smoked during pregnancy (1 if yes)

𝑋1  mage              mother's age (in years)
    meduc             mother's education (in years)
    prenatal          month of first prenatal visit
    prenatal_visits   number of prenatal visits
    male              baby's gender (1 if male)
    married           if mother is married (1 if yes)
    drink             if mother used alcohol during pregnancy (1 if yes)
    diabetes          if mother has diabetes (1 if yes)
    hyperpr           if mother has high blood pressure (1 if yes)
    amnio             if amniocentesis test (1 if yes)
    ultra             if ultrasound during pregnancy (1 if yes)
    dterms            previous terminated pregnancies (1 if yes)
    fagemiss          if father's age is missing (1 if yes)
    polynomials:      mage²
    interactions:     mage × meduc, prenatal, prenatal_visits, male, married, drink,
                      diabetes, hyperpr, amnio, ultra, dterms, fagemiss

𝑋2  mom_zip           zip code of mother's residence (as a series of dummies)
    byear             birth year 1988-2002 (as a series of dummies)
    anemia            if mother had anemia (1 if yes)
    med_inc           median income in mother's zip code
    pc_inc            per capita income in mother's zip code
    popdens           population density in mother's zip code
    fage              father's age (in years)
    feduc             father's education (in years)
    feducmiss         if father's education is missing (1 if yes)
    polynomials:      fage², fage³, mage³
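As a minimal illustration of how such a dictionary can be assembled, the sketch below builds 𝑋1 and 𝑋2 from a raw data set, assuming the variables sit in a pandas DataFrame named df with the column names of Table 3.1; it is illustrative only and is not the code underlying our data preparation.

```python
# Illustrative sketch: building the control dictionaries X1 and X2 of Table 3.1.
# The DataFrame `df` and the exact column names are assumptions for this example.
import pandas as pd

base_x1 = ["mage", "meduc", "prenatal", "prenatal_visits", "male", "married",
           "drink", "diabetes", "hyperpr", "amnio", "ultra", "dterms", "fagemiss"]

def build_x1(df: pd.DataFrame) -> pd.DataFrame:
    X1 = df[base_x1].copy()
    X1["mage2"] = df["mage"] ** 2                    # polynomial term
    for col in base_x1[1:]:                          # mage x (all other X1 covariates)
        X1[f"mage_x_{col}"] = df["mage"] * df[col]
    return X1                                        # 13 + 1 + 12 = 26 columns

def build_x2(df: pd.DataFrame) -> pd.DataFrame:
    X2 = df[["anemia", "med_inc", "pc_inc", "popdens",
             "fage", "feduc", "feducmiss"]].copy()
    X2["fage2"], X2["fage3"] = df["fage"] ** 2, df["fage"] ** 3
    X2["mage3"] = df["mage"] ** 3
    # zip code and birth year enter as series of dummies
    dummies = pd.get_dummies(df[["mom_zip", "byear"]].astype("category"),
                             drop_first=True)
    return pd.concat([X2, dummies], axis=1)
```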

We present two exercises. First, we use the DML approach to estimate 𝜏, i.e., the average treatment effect (ATE) of smoking. This is a well-studied problem with several estimates available in the literature based on various data sets and methods (see, e.g., Abrevaya, 2006, Da Veiga & Wilder, 2008 and Walker, Tekin & Wallace, 2009). The point estimates range from about −120 to −250 grams, with the magnitude of the effect being smaller for blacks than for whites. Second, we use the causal tree approach to search for covariates that drive heterogeneity in the treatment effect and to explore the full-dimensional CATE function 𝜏(𝑋). To our knowledge, this is a new exercise; both Abrevaya et al. (2015) and Fan et al. (2020) focus on the reduced-dimensional CATE function with mother's age as the pre-specified variable of interest.

The findings from the first exercise for black mothers are presented in Table 3.2; the corresponding results for white mothers can be found in the Electronic Online Supplement, Table 3.1.

Table 3.2: Estimates of 𝜏 for black mothers

                          Basic setup               Extended setup
                          Point estimate    SE      Point estimate    SE
OLS                       -132.3635      6.3348     -130.2478      6.3603
Naive Lasso (𝜆)           -131.6438         -       -132.3003         -
Naive Lasso (0.5𝜆)        -131.6225         -       -130.7610         -
Naive Lasso (2𝜆)          -131.4569         -       -134.6669         -
Post-naive-Lasso (𝜆)      -132.3635      6.3348     -128.6592      6.2784
Post-naive-Lasso (0.5𝜆)   -132.3635      6.3348     -130.7227      6.2686
Post-naive-Lasso (2𝜆)     -132.3635      6.3348     -130.2032      6.3288
DML (𝜆)                   -132.0897      6.3345     -129.9787      6.3439
DML (0.5𝜆)                -132.1474      6.3352     -128.8866      6.3413
DML (2𝜆)                  -132.1361      6.3344     -131.5680      6.3436
DML-package               -132.0311      6.3348     -131.0801      6.3421

Notes: Sample size = 157,989. 𝜆 denotes the Lasso penalty obtained by 5-fold cross-validation. The DML estimators are implemented with 2-fold cross-fitting. The row titled 'DML-package' contains the estimate obtained by using the 'official' DML code (dml2) available at https://docs.doubleml.org/r/stable/. All other estimators are programmed directly by the authors.

The left panel of Table 3.2 shows a simple setup in which we use 𝑋1 as the vector of controls, whereas the right panel works with the extended set 𝑋1 ∪ 𝑋2. In addition to the DML estimates, we report several benchmarks: (i) the OLS estimate of the ATE from a regression of 𝑌 on 𝐷 and the set of controls; (ii) the corresponding direct/naive Lasso estimates of the ATE (with the treatment dummy excluded from the penalty); and (iii) post-Lasso estimates, where the model selected by the Lasso is re-estimated by OLS. The symbol 𝜆 denotes the Lasso penalty level chosen by 5-fold cross-validation; we also report results for the levels 0.5𝜆 and 2𝜆. The DML estimators are implemented using 2-fold cross-fitting.
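To make the mechanics concrete, the following is a minimal sketch of a partialling-out DML estimator for the partially linear specification, with Lasso nuisance regressions (penalties chosen by cross-validation) and 2-fold cross-fitting, in the spirit of the DML rows of Table 3.2. The function, the input arrays y, d and X, and the use of scikit-learn's LassoCV are illustrative assumptions rather than the code used for the table.

```python
# Minimal sketch: partialling-out DML with Lasso nuisances and cross-fitting.
# y, d, X are assumed NumPy arrays holding the outcome, the treatment dummy
# and the control dictionary; this is an illustration, not the authors' code.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

def dml_plr(y, d, X, n_folds=2, cv=5, seed=0):
    y_res = np.zeros(len(y))
    d_res = np.zeros(len(d))
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        # Nuisance regressions E[Y|X] and E[D|X], penalties chosen by cv-fold CV.
        m_y = LassoCV(cv=cv, random_state=seed).fit(X[train], y[train])
        m_d = LassoCV(cv=cv, random_state=seed).fit(X[train], d[train])
        # Out-of-fold residuals (cross-fitting).
        y_res[test] = y[test] - m_y.predict(X[test])
        d_res[test] = d[test] - m_d.predict(X[test])
    # Residual-on-residual regression delivers the DML estimate of tau.
    tau = (d_res @ y_res) / (d_res @ d_res)
    # Heteroskedasticity-robust standard error for the partialling-out moment.
    u = y_res - tau * d_res
    se = np.sqrt(np.sum((d_res * u) ** 2)) / (d_res @ d_res)
    return tau, se
```

The 0.5𝜆 and 2𝜆 rows of Table 3.2 correspond to rescaling the cross-validated penalty before refitting the nuisance (or naive) Lasso regressions.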

The two main takeaways from Table 3.2 are that the estimated 'smoking effect' is, on average, about −130 grams for first time black mothers (which is consistent with previous estimates in the literature), and that this estimate is remarkably stable across the various methods. This includes the OLS benchmark with the basic and extended set of controls as well as the corresponding naive Lasso and post-Lasso estimators.

We find that even at the penalty level 2𝜆 the naive Lasso keeps most of the covariates, and so do the first-stage Lasso regressions in the DML procedure. This is because even small coefficients are precisely estimated in such large samples, so setting them to zero does not achieve a meaningful reduction in variance. As the addition of virtually any covariate improves the cross-validated MSE by a small amount, the optimal penalty 𝜆 is small. With close to the full set of covariates utilized in either setup, and a limited amount of shrinkage due to 𝜆 being small, there is little difference across the methods.18 In addition, using 𝑋1 ∪ 𝑋2 versus 𝑋1 alone affects the point estimates only to a small degree: the post-Lasso and DML estimates based on the extended setup are 1 to 3 grams smaller in magnitude. Physically this is a very small difference, though still a considerable fraction of the standard error.

18 Another factor that contributes to this result is that no individual covariate is strongly correlated with the treatment dummy. Hence, even if one were mistakenly dropped from the naive Lasso regression, it would not create substantial bias. Indeed, in exercises run on small subsamples (so that the dim(𝑋)/𝑛 ratio is much larger) we still find little difference between the naive (post-)Lasso and the DML method in this application.
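The claim that little is gained from shrinkage here is easy to verify directly: refit the Lasso with the cross-validated penalty rescaled and count the surviving coefficients. The sketch below is again only illustrative, with assumed inputs X and y, and it ignores the fact that the naive Lasso leaves the treatment dummy unpenalized.

```python
# Illustrative check: how many covariates does the Lasso retain when the
# cross-validated penalty is halved or doubled? X and y are assumed NumPy
# arrays containing the (standardized) controls and the outcome.
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

def lasso_penalty_sweep(X, y, factors=(0.5, 1.0, 2.0), seed=0):
    lam = LassoCV(cv=5, random_state=seed).fit(X, y).alpha_   # 5-fold CV penalty
    for c in factors:
        fit = Lasso(alpha=c * lam, max_iter=10_000).fit(X, y)
        kept = int(np.sum(fit.coef_ != 0))
        print(f"{c:>3} x lambda: {kept} of {X.shape[1]} covariates retained")
    return lam
```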

It is also noteworthy that the standard errors provided by the post-Lasso estimator are very similar to the DML standard errors. Hence, in the present situation, naively conducting inference using the post-Lasso estimator would lead to the proper conclusions. Of course, relying on this estimator for inference is still bad practice, as there are no a priori theoretical guarantees for it to be unbiased or asymptotically normal.

The results for white mothers (see the Electronic Online Supplement, Table 3.1) follow precisely the same patterns as discussed above: the various methods are in agreement, and the basic and extended model setups deliver the same results. The only difference is that the point estimate of the average smoking effect is about −208 grams.

The output from the second exercise is displayed in Figure 3.1, which shows the estimated CATE function for black mothers represented as a tree (the corresponding figure for white mothers is in the Electronic Online Supplement, Figure 3.1). The two most important leaves are in the bottom right corner, containing about 92% of the observations in total. These leaves originate from a parent node that splits the sample by mother's age falling below or above 19 years; the average smoking effect is then significantly larger in absolute value for the older group (−221.7 grams) than for the younger group (−120.1 grams).

This result is fully consistent with Abrevaya et al. (2015) and Fan et al. (2020), who estimate the reduced-dimensional CATE function for mother's age and find that the smoking effect becomes more detrimental for older mothers. The conceptual difference between these two papers and the tree in Figure 3.1 is twofold. First, in building the tree, age is not designated a priori as a variable of interest; it is the tree-building algorithm that recognizes its relevance for treatment effect heterogeneity. Second, not all other control variables are averaged out in the two leaves highlighted above; in fact, the preceding two splits condition on normal blood pressure as well as the mother being older than 15. However, given the small size of the complementary leaves, we would caution against over-interpreting the rest of the tree. Establishing whether blood pressure is related to the smoking effect in a stable way would require further robustness checks. We nevertheless note that even within the high blood pressure group the age-related pattern is qualitatively similar: smoking has a larger negative effect for older mothers and is in fact statistically insignificant below age 22.
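For completeness, the leaf-level numbers reported in Figure 3.1 are simple difference-in-means estimates: once the tree's partition is fixed, the smoking effect in each leaf is the mean birth weight gap between smokers and non-smokers within that leaf, typically computed on an estimation sample held out from tree building (the 'honest' approach). The sketch below illustrates this final step only, with the leaf-defining rules read off Figure 3.1; the DataFrame df and its column names are assumptions.

```python
# Illustrative computation of leaf-level smoking effects: difference in mean
# birth weight between smokers and non-smokers within each leaf of Figure 3.1.
# The DataFrame `df` and its column names (bweight, smoke, hyperpr, mage) are
# assumptions for this sketch.
import numpy as np
import pandas as pd

LEAVES = {
    "hyperpr=1, mage<22":     lambda d: (d.hyperpr == 1) & (d.mage < 22),
    "hyperpr=1, mage>=22":    lambda d: (d.hyperpr == 1) & (d.mage >= 22),
    "hyperpr=0, mage<15":     lambda d: (d.hyperpr == 0) & (d.mage < 15),
    "hyperpr=0, 15<=mage<19": lambda d: (d.hyperpr == 0) & (d.mage >= 15) & (d.mage < 19),
    "hyperpr=0, mage>=19":    lambda d: (d.hyperpr == 0) & (d.mage >= 19),
}

def leaf_cates(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for name, rule in LEAVES.items():
        g = df[rule(df)]
        y1 = g.loc[g.smoke == 1, "bweight"]
        y0 = g.loc[g.smoke == 0, "bweight"]
        cate = y1.mean() - y0.mean()                 # leaf-level smoking effect
        se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
        rows.append({"leaf": name, "cate": cate, "se": se, "share": len(g) / len(df)})
    return pd.DataFrame(rows)
```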



Estimated tree leaves (smoking effect in grams; standard errors in brackets; group shares in parentheses):

    hyperpr = 1, mage < 22:          85.2  [60.1]   (3%)
    hyperpr = 1, mage ≥ 22:        −163.0  [54.5]   (3%)
    hyperpr = 0, mage < 15:           6.4  [101.1]  (2%)
    hyperpr = 0, 15 ≤ mage < 19:   −120.1  [19.2]  (27%)
    hyperpr = 0, mage ≥ 19:        −221.7  [9.94]  (64%)

Fig. 3.1: A causal tree for the effect of smoking on birth weight (first time black mothers).

Notes: Standard errors are in brackets and the percentages in parentheses denote the share of each group in the sample. The total number of observations is 𝑁 = 157,989 and the covariates used are those in 𝑋1, except for the polynomial and interaction terms. To obtain a simpler model, we choose the pruning parameter using the 1SE rule rather than the minimum cross-validated MSE.

The results for white mothers (see the Electronic Online Supplement, Figure 3.1) are qualitatively similar in that they confirm that the smoking effect becomes more negative with age (compare the two largest leaves on the mage ≥ 28 and mage < 28 branches). Hypertension appears again as a potentially relevant variable, but the share of young women affected by this problem is small. We also note that these results are obtained on a subsample of 𝑁 = 150,000, and robustness checks show some sensitivity to the choice of the subsample. Nonetheless, the age pattern is qualitatively stable.

The preceding results illustrate both the strengths and the weaknesses of using a causal tree for heterogeneity analysis. On the one hand, letting the data speak for themselves is philosophically attractive and can certainly be useful. On the other hand, the estimation results may appear too complex or arbitrary and can be challenging to interpret. This problem is exacerbated by the fact that trees grown on different subsamples can show different patterns, suggesting that in practice it is prudent to construct a causal forest, i.e., to average over multiple trees.
