In this tutorial we’ll analyze the effect of going to Catholic school, as opposed to public school, on student achievement. Because students who attend Catholic school on average are different from students who attend public school, we will use propensity score matching to get more credible causal estimates of Catholic schooling.
UPDATE: Many people have asked for the data used in this tutorial. To get the dataset used below (ecls.csv), please follow the instructions outlined here.
To examine the effect of going to Catholic school (“Treated”) versus public school (“Control”) on student achievement using matching we will go through the following steps:
In addition, before we implement a matching method, we’ll conduct the following analyses using the non-matched data:
Before we start, load a few packages and read in ecls.csv:
library(MatchIt)
library(dplyr)
library(ggplot2)
ecls <- read.csv("ecls.csv")
Here is some basic information about public and catholic school students in terms of math achievement. Note that we’re using students’ standardized math score (c5r2mtsc_std
) – with a mean of 0 and standard deviation of 1 – as the outcome variable of interest. The independent variable of interest is catholic
(1 = student went to catholic school; 0 = student went to public school).
ecls %>%
group_by(catholic) %>%
summarise(n_students = n(),
mean_math = mean(c5r2mtsc_std),
std_error = sd(c5r2mtsc_std) / sqrt(n_students))
## # A tibble: 2 × 4
## catholic n_students mean_math std_error
## <int> <int> <dbl> <dbl>
## 1 0 9568 -0.03059583 0.01038536
## 2 1 1510 0.19386817 0.02235282
Note that the outcome variable has been standardized (mean = 0, sd = 1). This is common in education research. The summary table above indicates that 3rd grade Catholic school students’ average math score is more than 20% of a standard deviation higher than that of public school students. This could have been calculated using the non-standardized outcome variable as follows:
ecls %>%
mutate(test = (c5r2mtsc - mean(c5r2mtsc)) / sd(c5r2mtsc)) %>% #this is how the math score is standardized
group_by(catholic) %>%
summarise(mean_math = mean(test))
## # A tibble: 2 × 2
## catholic mean_math
## <int> <dbl>
## 1 0 -0.03059583
## 2 1 0.19386817
The difference-in-means is statistically significant at conventional levels of confidence (as is also evident from the small standard error above):
with(ecls, t.test(c5r2mtsc_std ~ catholic))
##
## Welch Two Sample t-test
##
## data: c5r2mtsc_std by catholic
## t = -9.1069, df = 2214.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2727988 -0.1761292
## sample estimates:
## mean in group 0 mean in group 1
## -0.03059583 0.19386817
We’ll work with the following covariates for now:
race_white
: Is the student white (1) or not (0)?p5hmage
: Mother’s agew3income
: Family incomep5numpla
: Number of places the student has lived for at least 4 monthsw3momed_hsb
: Is the mother’s education level high-school or below (1) or some college or more (0)?Let’s calculate the mean for each covariate by the treatment status:
ecls_cov <- c('race_white', 'p5hmage', 'w3income', 'p5numpla', 'w3momed_hsb')
ecls %>%
group_by(catholic) %>%
select(one_of(ecls_cov)) %>%
summarise_all(funs(mean(., na.rm = T)))
## Adding missing grouping variables: `catholic`
## # A tibble: 2 × 6
## catholic race_white p5hmage w3income p5numpla w3momed_hsb
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.5561246 37.56097 54889.16 1.132669 0.4640918
## 2 1 0.7251656 39.57516 82074.30 1.092701 0.2272069
What do you see? Take a moment to reflect on what these differences suggest for the relationship of interest (that between Catholic schooling and student achievement).
We can carry out t-tests to evaluate whether these means are statistically distinguishable:
lapply(ecls_cov, function(v) {
t.test(ecls[, v] ~ ecls[, 'catholic'])
})
## [[1]]
##
## Welch Two Sample t-test
##
## data: ecls[, v] by ecls[, "catholic"]
## t = -13.453, df = 2143.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1936817 -0.1444003
## sample estimates:
## mean in group 0 mean in group 1
## 0.5561246 0.7251656
##
##
## [[2]]
##
## Welch Two Sample t-test
##
## data: ecls[, v] by ecls[, "catholic"]
## t = -12.665, df = 2186.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.326071 -1.702317
## sample estimates:
## mean in group 0 mean in group 1
## 37.56097 39.57516
##
##
## [[3]]
##
## Welch Two Sample t-test
##
## data: ecls[, v] by ecls[, "catholic"]
## t = -20.25, df = 1825.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -29818.10 -24552.18
## sample estimates:
## mean in group 0 mean in group 1
## 54889.16 82074.30
##
##
## [[4]]
##
## Welch Two Sample t-test
##
## data: ecls[, v] by ecls[, "catholic"]
## t = 4.2458, df = 2233.7, p-value = 2.267e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.02150833 0.05842896
## sample estimates:
## mean in group 0 mean in group 1
## 1.132669 1.092701
##
##
## [[5]]
##
## Welch Two Sample t-test
##
## data: ecls[, v] by ecls[, "catholic"]
## t = 18.855, df = 2107.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2122471 0.2615226
## sample estimates:
## mean in group 0 mean in group 1
## 0.4640918 0.2272069
If you don’t understand the code for the t-tests above, you can also use:
with(ecls, t.test(race_white ~ catholic)) #(repeat for each covariate)
##
## Welch Two Sample t-test
##
## data: race_white by catholic
## t = -13.453, df = 2143.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1936817 -0.1444003
## sample estimates:
## mean in group 0 mean in group 1
## 0.5561246 0.7251656
We estimate the propensity score by running a logit model (probit also works) where the outcome variable is a binary variable indicating treatment status. What covariates should you include? For the matching to give you a causal estimate in the end, you need to include any covariate that is related to both the treatment assignment and potential outcomes. I choose just a few covariates below—they are unlikely to capture all covariates that should be included. You’ll be asked to come up with a potentially better model on your own later.
ecls <- ecls %>% mutate(w3income_1k = w3income / 1000)
m_ps <- glm(catholic ~ race_white + w3income_1k + p5hmage + p5numpla + w3momed_hsb,
family = binomial(), data = ecls)
summary(m_ps)
##
## Call:
## glm(formula = catholic ~ race_white + w3income_1k + p5hmage +
## p5numpla + w3momed_hsb, family = binomial(), data = ecls)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1883 -0.6140 -0.4508 -0.3336 2.5659
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.2125519 0.2379826 -13.499 < 2e-16 ***
## race_white 0.3145014 0.0700895 4.487 7.22e-06 ***
## w3income_1k 0.0073038 0.0006495 11.245 < 2e-16 ***
## p5hmage 0.0292168 0.0050771 5.755 8.69e-09 ***
## p5numpla -0.1439392 0.0912255 -1.578 0.115
## w3momed_hsb -0.6935868 0.0743207 -9.332 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7701.3 on 9266 degrees of freedom
## Residual deviance: 7168.8 on 9261 degrees of freedom
## (1811 observations deleted due to missingness)
## AIC: 7180.8
##
## Number of Fisher Scoring iterations: 5
Using this model, we can now calculate the propensity score for each student. It is simply the student’s predicted probability of being Treated, given the estimates from the logit model. Below, I calculate this propensity score using predict()
and create a dataframe that has the propensity score as well as the student’s actual treatment status.
prs_df <- data.frame(pr_score = predict(m_ps, type = "response"),
catholic = m_ps$model$catholic)
head(prs_df)
## pr_score catholic
## 1 0.2292928 0
## 2 0.1801360 0
## 4 0.2092957 0
## 5 0.2154022 1
## 6 0.3604931 0
## 7 0.1080608 0
After estimating the propensity score, it is useful to plot histograms of the estimated propensity scores by treatment status:
labs <- paste("Actual school type attended:", c("Catholic", "Public"))
prs_df %>%
mutate(catholic = ifelse(catholic == 1, labs[1], labs[2])) %>%
ggplot(aes(x = pr_score)) +
geom_histogram(color = "white") +
facet_wrap(~catholic) +
xlab("Probability of going to Catholic school") +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
A simple method for estimating the treatment effect of Catholic schooling is to restrict the sample to observations within the region of common support, and then to divide the sample within the region of common support into 5 quintiles, based on the estimated propensity score. Within each of these 5 quintiles, we can then estimate the mean difference in student achievement by treatment status. Rubin and others have argued that this is sufficient to eliminate 95% of the bias due to confounding of treatment status with a covariate.
However, most matching algorithms adopt slightly more complex methods. The method we use below is to find pairs of observations that have very similar propensity scores, but that differ in their treatment status. We use the package MatchIt
for this. This package estimates the propensity score in the background and then matches observations based on the method of choice (“nearest” in this case).
ecls_nomiss <- ecls %>% # MatchIt does not allow missing values
select(c5r2mtsc_std, catholic, one_of(ecls_cov)) %>%
na.omit()
mod_match <- matchit(catholic ~ race_white + w3income + p5hmage + p5numpla + w3momed_hsb,
method = "nearest", data = ecls_nomiss)
We can get some information about how successful the matching was using summary(mod_match)
and plot(mod_match)
(try this yourself).
To create a dataframe containing only the matched observations, use the match.data()
function:
dta_m <- match.data(mod_match)
dim(dta_m)
## [1] 2704 9
Note that the final dataset is smaller than the original: it contains 2704 observations, meaning that 1352 pairs of treated and control observations were matched. Also note that the final dataset contains a variable called distance
, which is the propensity score.
We’ll do three things to assess covariate balance in the matched sample:
It is useful to plot the mean of each covariate against the estimated propensity score, separately by treatment status. If matching is done well, the treatment and control groups will have (near) identical means of each covariate at each value of the propensity score.
Below is an example using the four covariates in our model. Here I use a loess smoother to estimate the mean of each covariate, by treatment status, at each value of the propensity score.
fn_bal <- function(dta, variable) {
dta$variable <- dta[, variable]
if (variable == 'w3income') dta$variable <- dta$variable / 10^3
dta$catholic <- as.factor(dta$catholic)
support <- c(min(dta$variable), max(dta$variable))
ggplot(dta, aes(x = distance, y = variable, color = catholic)) +
geom_point(alpha = 0.2, size = 1.3) +
geom_smooth(method = "loess", se = F) +
xlab("Propensity score") +
ylab(variable) +
theme_bw() +
ylim(support)
}
library(gridExtra)
grid.arrange(
fn_bal(dta_m, "w3income"),
fn_bal(dta_m, "p5numpla") + theme(legend.position = "none"),
fn_bal(dta_m, "p5hmage"),
fn_bal(dta_m, "w3momed_hsb") + theme(legend.position = "none"),
fn_bal(dta_m, "race_white"),
nrow = 3, widths = c(1, 0.8)
)
The means below indicate that we have attained a high degree of balance on the five covariates included in the model.
dta_m %>%
group_by(catholic) %>%
select(one_of(ecls_cov)) %>%
summarise_all(funs(mean))
## Adding missing grouping variables: `catholic`
## # A tibble: 2 × 6
## catholic race_white p5hmage w3income p5numpla w3momed_hsb
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.7470414 39.5503 81403.99 1.076183 0.2152367
## 2 1 0.7411243 39.5932 82568.94 1.091716 0.2233728
You can test this more formally using t-tests. Ideally, we should not be able to reject the null hypothesis of no mean difference for each covariate:
lapply(ecls_cov, function(v) {
t.test(dta_m[, v] ~ dta_m$catholic)
})
## [[1]]
##
## Welch Two Sample t-test
##
## data: dta_m[, v] by dta_m$catholic
## t = 0.35243, df = 2701.8, p-value = 0.7245
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.02700440 0.03883872
## sample estimates:
## mean in group 0 mean in group 1
## 0.7470414 0.7411243
##
##
## [[2]]
##
## Welch Two Sample t-test
##
## data: dta_m[, v] by dta_m$catholic
## t = -0.21331, df = 2702, p-value = 0.8311
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4372485 0.3514496
## sample estimates:
## mean in group 0 mean in group 1
## 39.5503 39.5932
##
##
## [[3]]
##
## Welch Two Sample t-test
##
## data: dta_m[, v] by dta_m$catholic
## t = -0.64787, df = 2701.9, p-value = 0.5171
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4690.731 2360.845
## sample estimates:
## mean in group 0 mean in group 1
## 81403.99 82568.94
##
##
## [[4]]
##
## Welch Two Sample t-test
##
## data: dta_m[, v] by dta_m$catholic
## t = -1.339, df = 2699.5, p-value = 0.1807
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.038278301 0.007213213
## sample estimates:
## mean in group 0 mean in group 1
## 1.076183 1.091716
##
##
## [[5]]
##
## Welch Two Sample t-test
##
## data: dta_m[, v] by dta_m$catholic
## t = -0.51108, df = 2701.5, p-value = 0.6093
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.03935185 0.02307966
## sample estimates:
## mean in group 0 mean in group 1
## 0.2152367 0.2233728
As a measure of the average imbalance, we can calculate the following:
\(\bar{\left|\frac{\beta}{\sigma}\right|} = \frac{1}{k}\sum_x \frac{|\beta_x|}{\sigma_x}\)
where \(\beta_x\) is the difference between the covariate means in the treated and control groups in the matched sample. An average absolute standardized difference that is close to 0 is preferable, since that indicates small differences between the control and treatment groups in the matched sample.
Try to implement a function that calculates the absolute standardized difference for any covariate in the matched sample. Then take the average for all the covariates.
Estimating the treatment effect is simple once we have a matched sample that we are happy with. We can use a t-test:
with(dta_m, t.test(c5r2mtsc_std ~ catholic))
##
## Welch Two Sample t-test
##
## data: c5r2mtsc_std by catholic
## t = 4.1788, df = 2670.1, p-value = 3.025e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.07834911 0.21688388
## sample estimates:
## mean in group 0 mean in group 1
## 0.3572844 0.2096679
Or we can use OLS with or without covariates:
lm_treat1 <- lm(c5r2mtsc_std ~ catholic, data = dta_m)
summary(lm_treat1)
##
## Call:
## lm(formula = c5r2mtsc_std ~ catholic, data = dta_m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5089 -0.5754 0.0431 0.6167 3.0764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.35728 0.02498 14.304 < 2e-16 ***
## catholic -0.14762 0.03533 -4.179 3.02e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9185 on 2702 degrees of freedom
## Multiple R-squared: 0.006421, Adjusted R-squared: 0.006054
## F-statistic: 17.46 on 1 and 2702 DF, p-value: 3.024e-05
lm_treat2 <- lm(c5r2mtsc_std ~ catholic + race_white + p5hmage +
I(w3income / 10^3) + p5numpla + w3momed_hsb, data = dta_m)
summary(lm_treat2)
##
## Call:
## lm(formula = c5r2mtsc_std ~ catholic + race_white + p5hmage +
## I(w3income/10^3) + p5numpla + w3momed_hsb, data = dta_m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6234 -0.5464 0.0336 0.6106 3.1875
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.4973086 0.1490892 -3.336 0.000863 ***
## catholic -0.1452595 0.0333418 -4.357 1.37e-05 ***
## race_white 0.2843161 0.0387899 7.330 3.03e-13 ***
## p5hmage 0.0151427 0.0032496 4.660 3.32e-06 ***
## I(w3income/10^3) 0.0030040 0.0003774 7.959 2.53e-15 ***
## p5numpla -0.1106781 0.0555234 -1.993 0.046323 *
## w3momed_hsb -0.3815903 0.0415457 -9.185 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8664 on 2697 degrees of freedom
## Multiple R-squared: 0.1175, Adjusted R-squared: 0.1155
## F-statistic: 59.85 on 6 and 2697 DF, p-value: < 2.2e-16
Interpret these estimates!
In groups of 2 or 3, come up with a different model to estimate the propensity score. Evaluate whether it is better or worse than the model we have worked with so far by comparing the average absolute standardized difference (average imbalance) from the two models. Once you have settled on a model, report the following:
If we have time, repeat the exercise using a different matching algorithm than nearest-neighbor.
The code used in this tutorial is available here.