In this tutorial we’ll use the ANES data we used in Tutorial 2. Read in the file and clean it like we did in the Preliminaries section of that tutorial.
You should then have:
dim(anes)
## [1] 49760 12
Also make sure you load the following packages:
libs <- c("plyr", "dplyr", "ggplot2", "arm")
sapply(libs, require, character.only = TRUE)
## plyr dplyr ggplot2 arm
## TRUE TRUE TRUE TRUE
The only package we haven’t used before is arm
. It allows us to display OLS output more cleanly and to extract standard errors.
Make note of two graphing approaches of OLS results below:
Let’s start by a simple model that predicts democratic feeling ratings given the respondent’s gender.
levels(anes$gender) <- c(NA, "Male", "Female")
lm1 <- lm(demtherm ~ gender, data = anes)
require(arm) #for display function
display(lm1)
## lm(formula = demtherm ~ gender, data = anes)
## coef.est coef.se
## (Intercept) 58.99 0.22
## genderFemale 5.65 0.29
## ---
## n = 26405, k = 2
## residual sd = 23.74, R-Squared = 0.01
How do we interpret the output? Compare the output with:
ddply(anes, .(gender), summarise, mean(demtherm, na.rm = T))
## gender mean(demtherm, na.rm = T)
## 1 Male 58.99
## 2 Female 64.64
## 3 <NA> NaN
Does this make sense?
Graphing can aid in the interpretation of OLS estimates. The benefits to graphing aren’t obvious in this simple model, but let’s use it for illustrative purposes. We’ll use two approaches.
We can extract the coefficients from the model using coef()
and the standard errors using se.coef()
. Using these we can calculate confidence intervals and graph the results. We can also use confint()
to find the confidence intervals directly.
ci1 <- coef(lm1)[2] + c(-1, 1) * se.coef(lm1)[2] * 1.96 #or equivalently:
ci1 <- confint(lm1, level = 0.95)[2, ]
est1 <- data.frame(est = coef(lm1)[2],
lb = ci1[1],
ub = ci1[2],
model = "Model 1")
ggplot(est1, aes(x = model, y = est)) +
geom_point() +
geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.1) +
geom_hline(yintercept = 0, lty = 2, color = "red") +
xlab("") +
ylab("Female Democrat Thermometer Estimate (Relative to Males)")
Here’s a different approach using R’s predict()
function. Before you use this command, however, ask yourself the following:
pred1 <- predict(lm1,
newdata = data.frame(gender = c("Male", "Female")),
se.fit = T,
interval = "confidence")
# The result is a list with estimates, 95% CIs, and other info:
pred1
## $fit
## fit lwr upr
## 1 58.99 58.56 59.42
## 2 64.64 64.25 65.02
##
## $se.fit
## 1 2
## 0.2190 0.1961
##
## $df
## [1] 26403
##
## $residual.scale
## [1] 23.74
# Extract part of the list with estimates and confidence interval, and add gender info:
pred1 <- data.frame(pred1$fit, gender = c("Male", "Female"))
ggplot(pred1, aes(x = gender, y = fit, color = gender)) +
geom_point() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.1) +
xlab("") +
ylab("Predicted Democrat Thermometer")
The newdata
argument of the predict()
function lets you supply the values for which you want a prediction. You have to use the same variable name(s) as those of the variable(s) included in lm()
.
Let’s add income and age to the model. Income in particular may be a confounder here, since women have lower income and income is negatively correlated with support for the democratic party.
levels(anes$income)[1] <- NA
levels(anes$income) <- with(anes, substr(levels(income), 4, nchar(levels(income))))
anes$age[anes$age == 0] <- NA
lm2 <- lm(demtherm ~ gender + age + income, data = anes)
display(lm2)
## lm(formula = demtherm ~ gender + age + income, data = anes)
## coef.est coef.se
## (Intercept) 64.00 0.59
## genderFemale 3.89 0.30
## age 0.08 0.01
## income17 to 33 percentile -3.99 0.51
## income34 to 67 percentile -8.50 0.45
## income68 to 95 percentile -12.48 0.47
## income96 to 100 percentile -17.94 0.74
## ---
## n = 23686, k = 7
## residual sd = 22.92, R-Squared = 0.06
** Exercise**: Interpret each of the coefficient estimates from this model.
Again, we can graph the estimate on female. For comparison, let’s put it next to the estimate from the bivariate model above.
ci2 <- confint(lm2, level = 0.95)[2, ]
est2 <- data.frame(est = coef(lm2)[2],
lb = ci2[1],
ub = ci2[2],
model = "Model 2")
est <- rbind(est1, est2) #combine with estimates from first model
ggplot(est, aes(x = model, y = est)) +
geom_point() +
geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.1) +
geom_hline(yintercept = 0, lty = 2, color = "red") +
xlab("") +
ylab("Female Democrat Thermometer Estimate (Relative to Males)")
We can also graph the predicted values for males and females for “average” values of the covariates.
Mean age is 46 years:
mean(anes$age, na.rm = T)
## [1] 45.84
The most common income category is “34 to 67 percentile”:
prop.table(table(anes$income))
##
## 0 to 16 percentile 17 to 33 percentile 34 to 67 percentile
## 0.17108 0.16785 0.32710
## 68 to 95 percentile 96 to 100 percentile
## 0.27925 0.05473
So let’s use these (constant) values for each of our covariates, while varying gender.
** Exercise**: Before you use the predict()
function, calculate by hand the predicted values for males and females at the average values of age
and income
.
Now let’s graph the predicted values:
newdta <- data.frame(gender = c("Male", "Female"),
age = rep(mean(anes$age, na.rm = T), 2),
income = rep("34 to 67 percentile", 2))
pred2 <- predict(lm2,
newdata = newdta,
se.fit = T,
interval = "confidence")
pred2 <- data.frame(pred2$fit, gender = c("Male", "Female"))
ggplot(pred2, aes(x = gender, y = fit, color = gender)) +
geom_point() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.1) +
xlab("\nNote: Estimates condition on income and age") +
ylab("Predicted Democrat Thermometer")
The previous models group respondents from all ANES surveys starting in 1978. (The democratic rating was not asked before 1978, and lm()
automatically excludes missing observations.)
Let’s examine the relationship between gender and the democratic rating over time, while still conditioning on age and income.
First run a separate regression for each year starting in 1978. I use dlply()
here: it takes a data frame, splits it by year, and returns a list with the model estimates for each year:
anes_sub <- subset(anes, year >= 1978 & year != 2002)
lm3 <- dlply(anes_sub, .(year), function(x) {
lm(demtherm ~ gender + age + income, data = x)
})
display(lm3$"1990") #estimates for year 1990
## lm(formula = demtherm ~ gender + age + income, data = x)
## coef.est coef.se
## (Intercept) 61.10 1.98
## genderFemale 4.54 1.05
## age 0.11 0.03
## income17 to 33 percentile -1.16 1.76
## income34 to 67 percentile -4.33 1.55
## income68 to 95 percentile -8.66 1.56
## income96 to 100 percentile -20.20 2.73
## ---
## n = 1792, k = 7
## residual sd = 21.68, R-Squared = 0.07
Now grab the estimate on gender as well as the associated 95% confidence interval for each year. Now I use ldply
: it takes a list and returns a data frame:
est_time <- ldply(lm3, function(x) {
c(coef(x)[2], confint(x)[2, ])
})
names(est_time) <- c("year", "est", "lb", "ub")
est_time
## year est lb ub
## 1 1978 3.485 1.5872 5.382
## 2 1980 1.668 -0.7725 4.108
## 3 1982 3.407 1.0232 5.792
## 4 1984 2.177 0.2555 4.098
## 5 1986 3.763 1.7877 5.739
## 6 1988 4.140 1.9031 6.376
## 7 1990 4.544 2.4844 6.603
## 8 1992 4.107 2.2672 5.947
## 9 1994 3.367 1.1190 5.615
## 10 1996 4.329 1.9330 6.725
## 11 1998 5.192 2.5626 7.822
## 12 2000 4.728 2.2240 7.231
## 13 2004 3.246 0.3541 6.138
## 14 2008 5.180 3.0823 7.278
Now graph these estimates:
ggplot(est_time, aes(x = year, y = est)) +
geom_point() +
geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.3) +
geom_hline(yintercept = 0, lty = 2, color = "red") +
xlab("") +
ylab("Female Democrat Thermometer Estimate (Relative to Males)")
Again, we can graph the predicted values instead:
pred_time <- ldply(lm3, function(x) {
predict(x, newdata = newdta, se.fit = T, interval = "confidence")$fit
})
pred_time$gender <- rep(c("Male", "Female"), length(pred_time)/2)
ggplot(pred_time, aes(x = year, y = fit, color = gender)) +
geom_point() +
geom_line() +
geom_ribbon(aes(ymax = upr, ymin = lwr, fill = gender),
alpha = 0.15, color = NA) +
xlab("\nNote: Income and age held at average values") +
ylab("Predicted Democrat Thermometer")
Graphing can dramatically improve the interpretability of interaction terms in OLS models. Say, for example, we interacted gender and age to study whether democratic rating by gender changes with age:
lm4 <- lm(demtherm ~ gender + age + income + age:gender, data = anes)
display(lm4)
## lm(formula = demtherm ~ gender + age + income + age:gender, data = anes)
## coef.est coef.se
## (Intercept) 62.17 0.73
## genderFemale 7.26 0.84
## age 0.12 0.01
## income17 to 33 percentile -4.05 0.51
## income34 to 67 percentile -8.59 0.45
## income68 to 95 percentile -12.59 0.47
## income96 to 100 percentile -18.07 0.74
## genderFemale:age -0.08 0.02
## ---
## n = 23686, k = 8
## residual sd = 22.91, R-Squared = 0.06
Try interpreting the interaction term here?
Graphing makes this simpler:
agerange <- 18:85
newdta <- data.frame(expand.grid(gender = c("Male", "Female"),
age = agerange),
income = "34 to 67 percentile")
pred4 <- predict(lm4,
newdata = newdta,
se.fit = T,
interval = "confidence")
pred4 <- data.frame(cbind(pred4$fit, newdta))
head(pred4)
## fit lwr upr gender age income
## 1 55.80 54.90 56.69 Male 18 34 to 67 percentile
## 2 61.70 60.88 62.52 Female 18 34 to 67 percentile
## 3 55.92 55.04 56.80 Male 19 34 to 67 percentile
## 4 61.75 60.94 62.55 Female 19 34 to 67 percentile
## 5 56.04 55.19 56.90 Male 20 34 to 67 percentile
## 6 61.79 61.00 62.58 Female 20 34 to 67 percentile
ggplot(pred4, aes(x = age, y = fit, color = gender)) +
geom_point() +
geom_line(aes(y = lwr), lty = 3) +
geom_line(aes(y = upr), lty = 3) +
ylab("Democrat Thermometer Ratings") +
xlab("Age") +
theme_bw()
We make a pretty strong linearity assumption on age (by gender) here. We can relax this by creating a factor variable that groups respondents into smaller age categories:
anes_sub <- subset(anes, age %in% 20:70)
anes_sub$age_cat <- cut(anes_sub$age, breaks = seq(20, 70, by = 5))
with(anes_sub, head(data.frame(age, age_cat), 10))
## age age_cat
## 1 25 (20,25]
## 2 33 (30,35]
## 3 26 (25,30]
## 4 63 (60,65]
## 5 66 (65,70]
## 6 48 (45,50]
## 7 70 (65,70]
## 8 25 (20,25]
## 9 30 (25,30]
## 10 35 (30,35]
lm5 <- lm(demtherm ~ gender + income + age_cat + + age_cat:gender,
data = anes_sub)
display(lm5)
## lm(formula = demtherm ~ gender + income + age_cat + +age_cat:gender,
## data = anes_sub)
## coef.est coef.se
## (Intercept) 64.70 0.81
## genderFemale 3.96 0.97
## income17 to 33 percentile -3.38 0.58
## income34 to 67 percentile -8.42 0.51
## income68 to 95 percentile -12.59 0.53
## income96 to 100 percentile -18.53 0.79
## age_cat(25,30] -0.60 0.98
## age_cat(30,35] 2.25 0.96
## age_cat(35,40] 2.04 0.97
## age_cat(40,45] 1.44 1.01
## age_cat(45,50] 3.66 1.06
## age_cat(50,55] 4.51 1.08
## age_cat(55,60] 6.03 1.11
## age_cat(60,65] 5.98 1.14
## age_cat(65,70] 4.93 1.23
## genderFemale:age_cat(25,30] 2.54 1.32
## genderFemale:age_cat(30,35] 0.06 1.29
## genderFemale:age_cat(35,40] 0.13 1.31
## genderFemale:age_cat(40,45] 2.51 1.38
## genderFemale:age_cat(45,50] 0.64 1.44
## genderFemale:age_cat(50,55] -1.11 1.47
## genderFemale:age_cat(55,60] -0.65 1.50
## genderFemale:age_cat(60,65] -1.48 1.54
## genderFemale:age_cat(65,70] -1.44 1.63
## ---
## n = 20426, k = 24
## residual sd = 22.71, R-Squared = 0.06
Let’s again graph the predicted values (holding income constant):
agerange <- sort(as.character(unique(na.omit(anes_sub$age_cat))))
newdta <- data.frame(expand.grid(gender = c("Male", "Female"),
age_cat = agerange),
income = "34 to 67 percentile")
pred5 <- predict(lm5,
newdata = newdta,
se.fit = T,
interval = "confidence")
pred5 <- data.frame(cbind(pred5$fit, newdta))
ggplot(pred5, aes(x = age_cat, y = fit, color = gender)) +
geom_point() +
geom_line(aes(y = lwr, group = gender), lty = 3) +
geom_line(aes(y = upr, group = gender), lty = 3) +
geom_line(aes(y = fit, group = gender), lty = 1, alpha = 0.5) +
ylab("Democrat Thermometer Ratings") +
xlab("Age Category") +
theme_bw()
The code used in this tutorial is available here.