We’ve already seen how we need two particular restrictions on the relationship between potential outcomes and the treatment assignment process for causal identification: (conditional) independence of potential outcomes from treatment status and SUTVA (or a similar restriction). We’ll now introduce an additional assumption, covariate overlap, which will allow us to use a particularly simple approach to causal estimation: running a regression.
First, let’s review conditional independence of potential outcomes from treatment status, otherwise known as unconfoundedness. This assumption states
\[(Y_i(0),Y_i(1))\perp D_i|X_i\]
People will often state this assumption as “treatment is as-good-as-randomly assigned, conditional on \(X_i\)”. However, another way to think of it that I find useful (and that I think will be particularly useful when we get to propensity score methods in a week or two) is that if you know \(X_i\), you cannot learn anything from \(Y_i(1-D_i)\) about how \(D_i\) was assigned. This is powerful because it means we can compare \(E[Y_i|D_i=0,X_i=x]\) and \(E[Y_i|D_i=1,X_i=x]\) and learn the average treatment effect at \(x\), since there cannot be any selection that is not accounted for by \(X_i\)!
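To see why, here is a short derivation; the only extra ingredient is the (usually implicit) consistency assumption that the observed outcome is the potential outcome for the received treatment, \(Y_i=Y_i(D_i)\):
\[
\begin{aligned}
E[Y_i|D_i=1,X_i=x]-E[Y_i|D_i=0,X_i=x] &= E[Y_i(1)|D_i=1,X_i=x]-E[Y_i(0)|D_i=0,X_i=x]\\
&= E[Y_i(1)|X_i=x]-E[Y_i(0)|X_i=x]\\
&= E[Y_i(1)-Y_i(0)|X_i=x],
\end{aligned}
\]
where the second equality drops the conditioning on \(D_i\) using unconfoundedness. Of course, both conditional expectations have to exist in the data, which is exactly where overlap comes in.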
In practice, it may be difficult to find a meaningful number of observations with each treatment status at every observed value of \(X_i\), and the assumptions we’ve made so far don’t even rule out that such comparisons are impossible everywhere. In particular, that is the case if the treated and untreated covariate values never overlap:
\[\{X_i:D_i=1\}\cap\{X_i:D_i=0\}=\emptyset\]
To make headway, we can bump up the unconfoundedness assumption to strong unconfoundedness by additionally assuming covariate overlap:
\[0 < P(D_i=1|X_i) < 1\]
Covariate overlap ensures that, at least in a sufficiently large sample, we have treated and untreated observations at every point in the covariate distribution. Equipped with strong unconfoundedness, we know we should always be able to recover \(E[Y_i(1)-Y_i(0)|X_i]\) from \(E[Y_i|X_i,D_i]\). One tool for such a task is OLS regression. As shown in the class notes, OLS is the lowest-MSE linear approximation to the CEF, and as such we might expect it to do a good job in this case. As also shown in the class notes, how well it does when we have a simple additive homogeneous treatment effect depends on how close to linear \(E[D_i|X_i]\) is. This is what we’ll explore via simulation.
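Concretely, here is the setup we simulate below (a sketch of the data-generating process that appears in the code): a constant additive effect of 10, a single covariate \(X_i\sim N(0,1)\), and a treatment probability \(p(X_i)\) that we will vary from linear to very non-linear,
\[Y_i(0)\sim N(0,10^2),\qquad Y_i(1)=Y_i(0)+10,\qquad P(D_i=1|X_i)=p(X_i).\]
In each case we run the regression
\[Y_i=\alpha+\tau D_i+\beta X_i+\varepsilon_i\]
and hope that the estimated \(\tau\) lands near the true ATE of 10.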
library(ggplot2)      # ggplot(), geom_point(), etc.
library(purrr)        # rbernoulli()
library(RColorBrewer) # brewer.pal()
set.seed(92021)
N <- 10000
X_i <- rnorm(N) # One continuous covariate drawn from N(0,1)
# Find min and max of X
minX <- min(X_i)
maxX <- max(X_i)
# Draw untreated POs from N(0, sd = 10); treated POs are Y(0) + 10 (constant effect of 10)
Y_i0 <- rnorm(N,0,10)
Y_i1 <- Y_i0+10
# Alternative with noisy effects: sapply(Y_i0, function(x) rnorm(1, 10 + x))
print(paste("ATE:",mean(Y_i1-Y_i0)))
## [1] "ATE: 10"
# F'n to plot the prob of treatment by X
plotD <- function(Dfn){ # Dfn is a function which assigns treatment prob to X
  D <- sapply(X_i, Dfn) # Find treatment probs
  df <- data.frame("X" = X_i,
                   "Y0" = Y_i0,
                   "Y1" = Y_i1,
                   "D" = as.numeric(D),
                   "Y" = Y_i1*D + (1-D)*Y_i0) # Make a DF
  out <- ggplot(data = df) +
    geom_point(aes(x = X, y = D))
  return(out)
}
# F'n to compute the ATE estimate given a treatment vector
getATE <- function(D){
  df <- data.frame("X" = X_i,
                   "Y0" = Y_i0,
                   "Y1" = Y_i1,
                   "D" = D,
                   "Y" = Y_i1*D + (1-D)*Y_i0)
  ATE_lin <- lm(Y ~ D + X, df) # Reg Y on D and X
  return(summary(ATE_lin))
}
# First, let's try a nice linear form for E[D|X]
Dfn_lin <- function(x){
  # Treatment probability increasing linearly in x, bounded below 1
  return((x-minX)/((1+(maxX-minX)/10)*maxX-(1+(maxX-minX)/10)*minX))
}
D_lin <- sapply(X_i, function(x) rbernoulli(1,Dfn_lin(x)))
plotD(Dfn_lin)
getATE(D_lin)
##
## Call:
## lm(formula = Y ~ D + X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.568 -6.708 0.109 6.746 42.986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2005 0.1175 1.706 0.088 .
## DTRUE 9.8895 0.2231 44.321 <2e-16 ***
## X 0.1410 0.1002 1.407 0.160
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.906 on 9997 degrees of freedom
## Multiple R-squared: 0.1699, Adjusted R-squared: 0.1697
## F-statistic: 1023 on 2 and 9997 DF, p-value: < 2.2e-16
# Now let's try something a bit less linear, like log
Dfn_log <- function(x){
  return(log(x-(minX*1.1)+1)/log(maxX-(minX*1.1)+1))
}
D_log <- sapply(X_i, function(x) rbernoulli(1,Dfn_log(x)))
plotD(Dfn_log)
getATE(D_log)
##
## Call:
## lm(formula = Y ~ D + X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.497 -6.721 0.087 6.740 42.911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2788 0.1933 1.443 0.149
## DTRUE 9.8497 0.2275 43.295 <2e-16 ***
## X 0.1460 0.1008 1.448 0.148
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.906 on 9997 degrees of freedom
## Multiple R-squared: 0.1654, Adjusted R-squared: 0.1652
## F-statistic: 990.5 on 2 and 9997 DF, p-value: < 2.2e-16
# Even more non-linear, like an exponential
Dfn_exp <- function(x){
  return(exp(x)/exp(maxX))
}
D_exp <- sapply(X_i, function(x) rbernoulli(1,Dfn_exp(x)))
plotD(Dfn_exp)
getATE(D_exp)
##
## Call:
## lm(formula = Y ~ D + X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.561 -6.711 0.113 6.732 43.004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1901 0.1006 1.890 0.0588 .
## DTRUE 9.2853 0.5969 15.556 <2e-16 ***
## X 0.1517 0.1000 1.516 0.1296
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.906 on 9997 degrees of freedom
## Multiple R-squared: 0.02519, Adjusted R-squared: 0.02499
## F-statistic: 129.2 on 2 and 9997 DF, p-value: < 2.2e-16
# Now super non-linear: abs(sine)
Dfn_sin <- function(x){
  return(abs(.98*sin(x*2)))
}
D_sin <- sapply(X_i, function(x) rbernoulli(1,Dfn_sin(x)))
plotD(Dfn_sin)
getATE(D_sin)
##
## Call:
## lm(formula = Y ~ D + X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.438 -6.713 0.115 6.748 42.952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.07251 0.16053 0.452 0.651
## DTRUE 10.15616 0.20402 49.781 <2e-16 ***
## X 0.13358 0.09878 1.352 0.176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.906 on 9997 degrees of freedom
## Multiple R-squared: 0.1987, Adjusted R-squared: 0.1985
## F-statistic: 1239 on 2 and 9997 DF, p-value: < 2.2e-16
You can see that the non-linear forms of \(E[D_i|X_i]\) tend to produce larger deviations from the true ATE of 10 (although sine did pretty well), but in general all of the estimates are pretty close. This is in part due to the linearity of \(E[Y_i(0)|X_i]\) and \(E[Y_i(1)|X_i]\): even if we have few observations with a particular treatment status in parts of the covariate distribution, we can use information from other parts, along with the imposed linearity, to form good counterfactuals. This starts to break down when the potential outcome CEFs are non-linear and the treatment effect is heterogeneous. In the next example, both potential outcome CEFs are exponential in \(X_i\), with \(E[Y_i(0)|X_i]=-E[Y_i(1)|X_i]\), leading to larger treatment effects at higher values of \(X_i\).
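In this design the individual-level effect is \(Y_i(1)-Y_i(0)=2e^{X_i}\), so with \(X_i\sim N(0,1)\) the population ATE is, using the lognormal mean \(E[e^{X_i}]=e^{1/2}\),
\[E[Y_i(1)-Y_i(0)]=2e^{1/2}\approx 3.30,\]
consistent with the sample ATE of about 3.29 printed below.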
set.seed(92021)
N <- 10000
X_i <- rnorm(N) # One continuous covariate drawn from N(0,1)
# Find min and max of X
minX <- min(X_i)
maxX <- max(X_i)
# (Optional) PO errors from N(0,1); left commented out so the CEFs hold exactly
#e_i0 <- rnorm(N)
#e_i1 <- rnorm(N)
# E[Y(1)|X] is exponential in X, E[Y(0)|X] is the negative of that
Y_i0 <- -1*exp(X_i) #+e_i0
Y_i1 <- exp(X_i) #+e_i1
print(paste("ATE:",mean(Y_i1-Y_i0)))
## [1] "ATE: 3.29137271279047"
# First, let's try a nice linear form for E[D|X]
D_lin <- sapply(X_i, function(x) rbernoulli(1,Dfn_lin(x)))
getATE(D_lin)
##
## Call:
## lm(formula = Y ~ D + X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.777 -0.697 0.419 0.655 51.841
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.52761 0.02458 -62.15 <2e-16 ***
## DTRUE 3.71939 0.04636 80.22 <2e-16 ***
## X -0.48502 0.02090 -23.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.068 on 9997 degrees of freedom
## Multiple R-squared: 0.3955, Adjusted R-squared: 0.3954
## F-statistic: 3271 on 2 and 9997 DF, p-value: < 2.2e-16
# Now let's try something a bit less linear, like log
D_log <- sapply(X_i, function(x) rbernoulli(1,Dfn_log(x)))
getATE(D_log)
##
## Call:
## lm(formula = Y ~ D + X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.778 -0.653 -0.397 0.391 46.365
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.80688 0.03695 -21.84 <2e-16 ***
## DTRUE 2.50368 0.04327 57.86 <2e-16 ***
## X 1.02534 0.01900 53.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.866 on 9997 degrees of freedom
## Multiple R-squared: 0.4401, Adjusted R-squared: 0.44
## F-statistic: 3929 on 2 and 9997 DF, p-value: < 2.2e-16
# Even more non-linear, like an exponential
D_exp <- sapply(X_i, function(x) rbernoulli(1,Dfn_exp(x)))
getATE(D_exp)
##
## Call:
## lm(formula = Y ~ D + X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.474 -0.219 0.342 0.594 51.947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.60961 0.01630 -98.75 <2e-16 ***
## DTRUE 7.14538 0.09242 77.31 <2e-16 ***
## X -1.35766 0.01624 -83.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.603 on 9997 degrees of freedom
## Multiple R-squared: 0.5238, Adjusted R-squared: 0.5237
## F-statistic: 5498 on 2 and 9997 DF, p-value: < 2.2e-16
# Now super non-linear: abs(sine)
D_sin <- sapply(X_i, function(x) rbernoulli(1,Dfn_sin(x)))
getATE(D_sin)
##
## Call:
## lm(formula = Y ~ D + X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.948 -0.855 -0.107 0.723 48.872
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.63194 0.03390 -48.13 <2e-16 ***
## DTRUE 3.28530 0.04320 76.04 <2e-16 ***
## X 0.40210 0.02095 19.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.101 on 9997 degrees of freedom
## Multiple R-squared: 0.3811, Adjusted R-squared: 0.381
## F-statistic: 3078 on 2 and 9997 DF, p-value: < 2.2e-16
Here we see some real movement in the estimated ATEs. Interestingly, though, our highly non-linear sine form works quite well, even better than the linear one. What’s going on here? The answer lies in how much error the linear approximations to the CEFs \(E[Y_i|X_i]\) and \(E[D_i|X_i]\) make and, roughly, how steep the linear approximation to \(E[D_i|X_i]\) is.
# Function to plot each observation's two bias components from the OLS approximation
plot_EY <- function(D, Dfn){
  ED <- sapply(X_i, Dfn)
  df <- data.frame("X" = X_i,
                   "Y0" = Y_i0,
                   "Y1" = Y_i1,
                   "D" = D,
                   "ED" = ED,
                   "Y" = Y_i1*D + (1-D)*Y_i0)
  # Residualize Y on X (FWL step)
  regY <- lm(Y ~ X, df)
  df$Yfwl <- regY$residuals
  df$Y_hat <- predict(regY)
  # Residualize D on X (FWL step)
  regD <- lm(D ~ X, df)
  df$Dfwl <- regD$residuals
  # Component 1: gap between residualized Y and |Y| (in this DGP |Y_i| is half the individual effect)
  df$bias <- 2*(df$Yfwl-abs(df$Y))/nrow(df)
  # Component 2: residualized Y times the gap between its FWL weight and the uniform 2/N weight
  df$w <- df$Yfwl*(df$Dfwl*nrow(df)-2*sum(df$Dfwl^2))/(nrow(df)*sum(df$Dfwl^2))
  pal <- brewer.pal(3, "Set2")
  out <- ggplot(data = df) +
    geom_point(aes(x = bias, y = w)) +
    geom_abline(aes(intercept = 0, slope = -1)) +
    labs(x = "E[Y] Bias", y = "Weight Bias")
  return(out)
}
plot_EY(D_lin,Dfn_lin)
plot_EY(D_log,Dfn_log)
plot_EY(D_exp,Dfn_exp)
plot_EY(D_sin,Dfn_sin)
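To read these plots, recall the Frisch–Waugh–Lovell representation of the OLS coefficient on \(D_i\), where \(\tilde{Y}_i\) and \(\tilde{D}_i\) are the residuals from regressing \(Y_i\) and \(D_i\) on \(X_i\) (the Yfwl and Dfwl columns computed above):
\[\tau^{OLS}=\frac{\sum_{i=1}^N\tilde{D}_i\tilde{Y}_i}{\sum_{i=1}^N\tilde{D}_i^2},\]
so each observation contributes its residualized outcome with weight \(\tilde{D}_i/\sum_{j=1}^N\tilde{D}_j^2\).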
Roughly, the bias attributable to a single observation can be broken into 1) how far the residualized \(\tilde{Y}_i\) is from the actual value of \(Y_i\) and 2) how far the weight placed on the observation is from the proper weight. In particular, we can write
\[\tau^{OLS}-\tau = \sum_{i=1}^N\left[\frac{2}{N}\left(\tilde{Y}_i-Y_i\right)+\frac{N\tilde{D}_i-2\sum_{j=1}^N\tilde{D}_j^2}{N\sum_{j=1}^N\tilde{D}_j^2}\tilde{Y}_i\right]\]
These two sources of bias are plotted in the figures above. From the plot for our sine CEF, we see that a number of observations with negative weights offset the generally slightly larger weight bias. For the linear and exponential CEFs the bias is generally positive due to observations receiving too much weight. Finally, the logarithmic CEF has a number of observations with large negative weights, biasing the estimate downwards.
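As a rough numerical check (a sketch using the objects already defined above; check_decomp is a hypothetical helper, not part of the original code), we can recompute the two components for any of the treatment draws, sum them, and compare the total to \(\tau^{OLS}\) minus the sample ATE. One detail worth flagging: the first component uses \(|Y_i|\), which in this particular design equals \(e^{X_i}\), exactly half the individual effect; that is what makes the two components sum to the total bias here.
# Hypothetical helper (not in the original notes): the two bias components
# should sum to the OLS coefficient on D minus the sample ATE
check_decomp <- function(D){
  df <- data.frame(X = X_i, D = D, Y = Y_i1*D + (1-D)*Y_i0)
  Ytil <- resid(lm(Y ~ X, df))          # residualized Y (FWL)
  Dtil <- resid(lm(D ~ X, df))          # residualized D (FWL)
  tau_ols <- sum(Dtil*Ytil)/sum(Dtil^2) # equals the coefficient on D in lm(Y ~ D + X)
  n <- nrow(df)
  bias1 <- 2*(Ytil - abs(df$Y))/n                        # E[Y] component
  bias2 <- Ytil*(n*Dtil - 2*sum(Dtil^2))/(n*sum(Dtil^2)) # weight component
  c(sum_of_components = sum(bias1 + bias2),
    direct = tau_ols - mean(Y_i1 - Y_i0))
}
check_decomp(D_exp) # the two numbers should agree up to floating-point error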