18 Estimating Average Treatment Effects
18.1 Introduction
In this chapter we explicitly study the problem of estimating an average treatment effect (ATE). An average treatment effect is a special case of an average partial effect: an ATE is an average partial effect for a binary explanatory variable.
Estimating ATEs has become important in the program evaluation literature, such as in the evaluation of job training programs. Originally, the binary indicators represented medical treatment or program participation, but the methods are applicable when the explanatory variable of interest is any binary variable.
We begin by introducing a counterfactual framework pioneered by Rubin (1974) and since adopted by many in both statistics and econometrics, including Rosenbaum and Rubin (1983), Heckman (1992, 1997), Imbens and Angrist (1994), Angrist, Imbens, and Rubin (1996), Manski (1996), Heckman, Ichimura, and Todd (1997), and Angrist (1998). The counterfactual framework allows us to define various treatment effects that may be of interest. Once we define the different treatment effects, we can study ways to consistently estimate these effects. We will not provide a comprehensive treatment of this rapidly growing literature, but we will show that, under certain assumptions, estimators that we are already familiar with consistently estimate average treatment effects. We will also study some extensions that consistently estimate ATEs under weaker assumptions.
Broadly, most estimators of ATEs fit into one of two categories. The first set exploits assumptions concerning ignorability of the treatment conditional on a set of covariates. As we will see in Section 18.3, this approach is analogous to the proxy variable solution to the omitted variables problem that we discussed in Chapter 4, and in some cases reduces exactly to an OLS regression with many controls. A second set of estimators relies on the availability of one or more instrumental variables that are redundant in the response equations but help determine participation. Different IV estimators are available depending on functional form assumptions concerning how unobserved heterogeneity affects the responses. We study IV estimators in Section 18.4.
In Section 18.5 we briefly discuss some further topics, including special considerations for binary and corner solution responses, using panel data to estimate treatment effects, and nonbinary treatments.
18.2 A Counterfactual Setting and the Self-Selection Problem
The modern literature on treatment effects begins with a counterfactual, where each individual (or other agent) has an outcome with and without treatment (where "treatment" is interpreted very broadly). This section draws heavily on Heckman (1992, 1997), Imbens and Angrist (1994), and Angrist, Imbens, and Rubin (1996) (hereafter AIR). Let $y_1$ denote the outcome with treatment and $y_0$ the outcome without treatment. Because an individual cannot be in both states, we cannot observe both $y_0$ and $y_1$; in effect, the problem we face is one of missing data.
It is important to see that we have made no assumptions about the distributions of $y_0$ and $y_1$. In many cases these may be roughly continuously distributed (such as salary), but often $y_0$ and $y_1$ are binary outcomes (such as a welfare participation indicator), or even corner solution outcomes (such as married women's labor supply). However, some of the assumptions we make will be less plausible for discontinuous random variables, something we discuss after introducing the assumptions.
The following discussion assumes that we have an independent, identically distributed sample from the population. This assumption rules out cases where the treatment of one unit affects another's outcome (possibly through general equilibrium effects, as in Heckman, Lochner, and Taber, 1998). The assumption that treatment of unit $i$ affects only the outcome of unit $i$ is called the stable unit treatment value assumption (SUTVA) in the treatment literature (see, for example, AIR). We are making a stronger assumption because random sampling implies SUTVA.
Let the variable $w$ be a binary treatment indicator, where $w = 1$ denotes treatment and $w = 0$ otherwise. The triple $(y_0, y_1, w)$ represents a random vector from the underlying population of interest. For a random draw $i$ from the population, we write $(y_{i0}, y_{i1}, w_i)$. However, as we have throughout, we state assumptions in terms of the population.
To measure the effect of treatment, we are interested in the difference in the outcomes with and without treatment, $y_1 - y_0$. Because this is a random variable (that is, it is individual specific), we must be clear about what feature of its distribution we want to estimate. Several possibilities have been suggested in the literature. In Rosenbaum and Rubin (1983), the quantity of interest is the average treatment effect (ATE),
$$ATE \equiv E(y_1 - y_0) \tag{18.1}$$
ATE is the expected effect of treatment on a randomly drawn person from the population. Some have criticized this measure as not being especially relevant for policy purposes: because it averages across the entire population, it includes in the average units who would never be eligible for treatment. Heckman (1997) gives the example of a job training program, where we would not want to include millionaires in computing the average effect of a job training program. This criticism is somewhat misleading, as we can, and would, exclude people from the population who would never be eligible. For example, in evaluating a job training program, we might restrict attention to people whose pretraining income is below a certain threshold; wealthy people would be excluded precisely because we have no interest in how job training affects the wealthy. In evaluating the benefits of a program such as Head Start, we could restrict the population to those who are actually eligible for the program or are likely to be eligible in the future. In evaluating the effectiveness of enterprise zones, we could restrict our analysis to block groups whose unemployment rates are above a certain threshold or whose per capita incomes are below a certain level.
A second quantity of interest, and one that has received much recent attention, is the average treatment effect on the treated, which we denote $ATE_1$:
$$ATE_1 \equiv E(y_1 - y_0 \mid w = 1) \tag{18.2}$$
That is, $ATE_1$ is the mean effect for those who actually participated in the program. As we will see, in some special cases equations (18.1) and (18.2) are equivalent, but generally they differ.
Imbens and Angrist (1994) define another treatment effect, which they call a local average treatment effect (LATE). LATE has the advantage of being estimable using instrumental variables under very weak conditions. It has two potential drawbacks: (1) it measures the effect of treatment on a generally unidentifiable subpopulation; and (2) the definition of LATE depends on the particular instrumental variable that we have available. We will discuss LATE in the simplest setting in Section 18.4.2.
We can expand the definition of both treatment effects by conditioning on covariates. If $x$ is an observed covariate, the ATE conditional on $x$ is simply $E(y_1 - y_0 \mid x)$; similarly, equation (18.2) becomes $E(y_1 - y_0 \mid x, w = 1)$. By choosing $x$ appropriately, we can define ATEs for various subsets of the population. For example, $x$ can be pretraining income or a binary variable indicating poverty status, race, or gender. For the most part, we will focus on ATE and $ATE_1$ without conditioning on covariates.
As noted previously, the difficulty in estimating equation (18.1) or (18.2) is that we observe only $y_0$ or $y_1$, not both, for each person. More precisely, along with $w$, the observed outcome is
$$y = (1 - w)y_0 + w y_1 = y_0 + w(y_1 - y_0) \tag{18.3}$$
Therefore, the question is, How can we estimate equation (18.1) or (18.2) with a random sample on $y$ and $w$ (and usually some observed covariates)?
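The missing-data structure in equation (18.3) can be made concrete with a small simulation. The following is a minimal sketch in Python with NumPy; the distributions, sample size, and variable names are illustrative assumptions, not part of the text. It builds the observed outcome both ways and shows that exactly one potential outcome is visible for each unit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # tiny sample so the arrays are easy to inspect

# Hypothetical potential outcomes (illustrative distributions only).
y0 = rng.normal(loc=10.0, scale=2.0, size=n)       # outcome without treatment
y1 = y0 + rng.normal(loc=1.0, scale=1.0, size=n)   # outcome with treatment
w = rng.integers(0, 2, size=n)                     # binary treatment indicator

# Equation (18.3): the observed outcome, written both ways.
y = (1 - w) * y0 + w * y1
y_alt = y0 + w * (y1 - y0)

# What the analyst actually sees: one potential outcome per unit,
# with the counterfactual recorded as missing (NaN).
y1_seen = np.where(w == 1, y1, np.nan)
y0_seen = np.where(w == 0, y0, np.nan)

print(np.column_stack([w, y0_seen, y1_seen, y]))
```

In real data only `w` and `y` are available; `y0` and `y1` are jointly known here only because the data are simulated.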
First, suppose that the treatment indicator $w$ is statistically independent of $(y_0, y_1)$, as would occur when treatment is randomized across agents. One implication of independence between treatment status and the potential outcomes is that ATE and $ATE_1$ are identical: $E(y_1 - y_0 \mid w = 1) = E(y_1 - y_0)$. Furthermore, estimation of ATE is simple. Using equation (18.3), we have
$$E(y \mid w = 1) = E(y_1 \mid w = 1) = E(y_1)$$
where the last equality follows because $y_1$ and $w$ are independent. Similarly,
$$E(y \mid w = 0) = E(y_0 \mid w = 0) = E(y_0)$$
It follows that
$$ATE = ATE_1 = E(y \mid w = 1) - E(y \mid w = 0) \tag{18.4}$$
The right-hand side is easily estimated by a difference in sample means: the sample average of $y$ for the treated units minus the sample average of $y$ for the untreated units. Thus, randomized treatment guarantees that the difference-in-means estimator from basic statistics is unbiased, consistent, and asymptotically normal. In fact, these properties are preserved under the weaker assumption of mean independence: $E(y_0 \mid w) = E(y_0)$ and $E(y_1 \mid w) = E(y_1)$.
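A short simulation illustrates equation (18.4). This is a sketch under assumed distributions: the helper `difference_in_means`, the normal potential outcomes, and the 50 percent assignment probability are all hypothetical choices made for illustration. Treatment is assigned by a coin flip, so $w$ is independent of $(y_0, y_1)$ by construction.

```python
import numpy as np

def difference_in_means(y, w):
    """Sample analogue of E(y | w = 1) - E(y | w = 0), as in equation (18.4)."""
    return y[w == 1].mean() - y[w == 0].mean()

rng = np.random.default_rng(42)
n = 200_000

# Hypothetical potential outcomes with a heterogeneous gain y1 - y0.
y0 = rng.normal(10.0, 2.0, size=n)
y1 = y0 + rng.normal(1.5, 1.0, size=n)

# Randomized treatment: w is drawn independently of (y0, y1).
w = rng.binomial(1, 0.5, size=n)
y = y0 + w * (y1 - y0)              # observed outcome, equation (18.3)

ate_sample = (y1 - y0).mean()             # infeasible: needs both outcomes
ate1_sample = (y1 - y0)[w == 1].mean()    # infeasible: needs both outcomes
estimate = difference_in_means(y, w)      # feasible: uses only y and w

print(f"ATE: {ate_sample:.3f}  ATE1: {ate1_sample:.3f}  diff in means: {estimate:.3f}")
```

With randomization, the three printed numbers should be close to one another, differing only by sampling error.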
Randomization of treatment is often infeasible in program evaluation (although randomization of eligibility often is feasible; more on this topic later). In most cases, individuals at least partly determine whether they receive treatment, and their decisions may be related to the benefits of treatment, $y_1 - y_0$. In other words, there is self-selection into treatment.
It turns out that $ATE_1$ can be consistently estimated as a difference in means under the weaker assumption that $w$ is independent of $y_0$, without placing any restriction on the relationship between $w$ and $y_1$. To see this point, note that we can always write
$$\begin{aligned} E(y \mid w = 1) - E(y \mid w = 0) &= E(y_0 \mid w = 1) - E(y_0 \mid w = 0) + E(y_1 - y_0 \mid w = 1) \\ &= [E(y_0 \mid w = 1) - E(y_0 \mid w = 0)] + ATE_1 \end{aligned} \tag{18.5}$$
If $y_0$ is mean independent of $w$, that is,
$$E(y_0 \mid w) = E(y_0) \tag{18.6}$$
then the first term in equation (18.5) disappears, and so the difference-in-means estimator is an unbiased estimator of $ATE_1$. Unfortunately, condition (18.6) is a strong assumption. For example, suppose that people are randomly made eligible for a voluntary job training program. Condition (18.6) effectively implies that the participation decision is unrelated to what people would earn in the absence of the program.
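To see how a failure of condition (18.6) distorts the difference in means, the sketch below simulates a case in which people with low no-program outcomes are more likely to participate. The participation rule, distributions, and variable names are assumptions chosen only for illustration. The code computes each piece of the decomposition in equation (18.5) from the simulated data.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

# Hypothetical potential outcomes; the gain y1 - y0 is unrelated to y0.
y0 = rng.normal(10.0, 2.0, size=n)
y1 = y0 + rng.normal(1.0, 1.0, size=n)

# Assumed self-selection rule: units with low y0 (plus noise) participate,
# so y0 is NOT mean independent of w and condition (18.6) fails.
w = (y0 + rng.normal(0.0, 1.0, size=n) < 10.0).astype(int)
y = y0 + w * (y1 - y0)              # observed outcome, equation (18.3)

diff_means = y[w == 1].mean() - y[w == 0].mean()
selection_term = y0[w == 1].mean() - y0[w == 0].mean()   # bracketed term in (18.5)
ate1_sample = (y1 - y0)[w == 1].mean()                   # ATE1 in (18.5)

print(f"difference in means: {diff_means:.3f}")
print(f"selection term:      {selection_term:.3f}")
print(f"ATE1:                {ate1_sample:.3f}")
```

In this design the selection term is negative, so the raw difference in means understates the effect on the treated even though every simulated participant gains on average. The selection term cannot be computed in practice because $y_0$ is never observed for participants.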
A useful expression relating $ATE_1$ and ATE is obtained by writing $y_0 = \mu_0 + v_0$ and $y_1 = \mu_1 + v_1$, where $\mu_g = E(y_g)$, $g = 0, 1$. Then
$$y_1 - y_0 = (\mu_1 - \mu_0) + (v_1 - v_0) = ATE + (v_1 - v_0)$$
Taking the expectation of this equation conditional on $w = 1$ gives
$$ATE_1 = ATE + E(v_1 - v_0 \mid w = 1)$$
We can think of $v_1 - v_0$ as the person-specific gain from participation, and so $ATE_1$ differs from ATE by the expected person-specific gain for those who participated. If $y_1 - y_0$ is not mean independent of $w$, $ATE_1$ and ATE generally differ.
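The relationship between $ATE_1$ and ATE can be checked numerically. In the sketch below, which rests on assumed distributions and an assumed participation rule, people participate when their person-specific gain (plus noise) is positive. Participation is independent of $y_0$ but not of $y_1 - y_0$, so the difference in means should track $ATE_1$ rather than ATE.

```python
import numpy as np

rng = np.random.default_rng(123)
n = 200_000
mu0, mu1 = 10.0, 11.0               # assumed population means, so ATE = 1

v0 = rng.normal(0.0, 2.0, size=n)
gain_dev = rng.normal(0.0, 1.0, size=n)   # v1 - v0, drawn independently of v0
v1 = v0 + gain_dev
y0 = mu0 + v0
y1 = mu1 + v1

# Assumed rule: participate when the person-specific gain plus noise is positive.
# w depends on v1 - v0 but is independent of y0.
w = (gain_dev + rng.normal(0.0, 1.0, size=n) > 0.0).astype(int)
y = y0 + w * (y1 - y0)              # observed outcome, equation (18.3)

ate = mu1 - mu0
mean_gain_dev_treated = gain_dev[w == 1].mean()   # sample E(v1 - v0 | w = 1)
ate1_sample = (y1 - y0)[w == 1].mean()
diff_means = y[w == 1].mean() - y[w == 0].mean()

print(f"ATE: {ate:.3f}  ATE1: {ate1_sample:.3f}  diff in means: {diff_means:.3f}")
```

Here $ATE_1$ exceeds ATE by the average person-specific gain among participants, and the difference in means is close to $ATE_1$ because $y_0$ is independent of $w$ in this design.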
Fortunately, we can estimate ATE and $ATE_1$ under assumptions less restrictive than independence of $(y_0, y_1)$ and $w$. In most cases, we can collect data on individual characteristics and relevant pretreatment outcomes, sometimes a substantial amount of data. If, in an appropriate sense, treatment depends on the observables and not on the unobservables determining $(y_0, y_1)$, then we can estimate average treatment effects quite generally, as we show in the next section.
18.3 Methods Assuming Ignorability of Treatment
We adopt the framework of the previous section, and, in addition, we let $x$ denote a vector of observed covariates. Therefore, the population is described by $(y_0, y_1, w, x)$, and we observe $y$, $w$, and $x$, where $y$ is given by equation (18.3). When $w$ and $(y_0, y_1)$ are allowed to be correlated, we need an assumption in order to identify treatment effects. Rosenbaum and Rubin (1983) introduced the following assumption, which they called ignorability of treatment (given observed covariates $x$):
Assumption ATE.1: Conditional on $x$, $w$ and $(y_0, y_1)$ are independent.
For many purposes, it suffices to assume ignorability in a conditional mean independence sense:
Assumption ATE.1′: (a) $E(y_0 \mid x, w) = E(y_0 \mid x)$; and (b) $E(y_1 \mid x, w) = E(y_1 \mid x)$.
Naturally, Assumption ATE.1 implies Assumption ATE.1′. In practice, Assumption ATE.1′ might not afford much generality, although it does allow $\mathrm{Var}(y_0 \mid x, w)$ and $\mathrm{Var}(y_1 \mid x, w)$ to depend on $w$. The idea underlying Assumption ATE.1′ is this: if we can observe enough information (contained in $x$) that determines treatment, then $(y_0, y_1)$ might be mean independent of $w$, conditional on $x$. Loosely, even though $(y_0, y_1)$ and $w$ might be correlated, they are uncorrelated once we partial out $x$.
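The idea behind Assumption ATE.1′ can be illustrated with a single binary covariate. The sketch below is not an estimator taken from the text: the helper `stratified_ate`, the data-generating process, and the choice to average within-cell differences in means are all assumptions made for illustration. Treatment probability depends on $x$, and both potential outcomes depend on $x$, but within each value of $x$ treatment is assigned independently of $(y_0, y_1)$.

```python
import numpy as np

def stratified_ate(y, w, x):
    """Average the within-cell difference in means over the distribution of x.

    Illustrative helper for a discrete covariate; assumes every cell of x
    contains both treated and untreated units.
    """
    ate = 0.0
    for value in np.unique(x):
        cell = x == value
        diff = y[cell & (w == 1)].mean() - y[cell & (w == 0)].mean()
        ate += cell.mean() * diff   # weight by the sample share of the cell
    return ate

rng = np.random.default_rng(2024)
n = 200_000

x = rng.binomial(1, 0.4, size=n)                  # observed binary covariate
w = rng.binomial(1, 0.2 + 0.6 * x)                # treatment depends only on x
y0 = 5.0 + 4.0 * x + rng.normal(0.0, 1.0, size=n)
y1 = y0 + 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n)
y = y0 + w * (y1 - y0)                            # observed outcome, equation (18.3)

true_ate = 1.0 + 2.0 * 0.4                        # E(y1 - y0) in this design
naive = y[w == 1].mean() - y[w == 0].mean()       # ignores x
adjusted = stratified_ate(y, w, x)                # conditions on x

print(f"true ATE: {true_ate:.3f}  naive: {naive:.3f}  adjusted: {adjusted:.3f}")
```

In this design the naive difference in means is far from the ATE because treated units disproportionately have $x = 1$, while the estimate that conditions on $x$ is close to it. With continuous or high-dimensional $x$, cell-by-cell averaging is not practical.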
Assumption ATE.1 certainly holds if $w$ is a deterministic function of $x$, which has prompted some authors in econometrics to call assumptions like ATE.1 selection on observables; see, for example, Barnow, Cain, and Goldberger (1980, 1981), Heckman and Robb (1985), and Moffitt (1996). (We discussed a similar assumption in Section
...