In this tutorial we’ll use the ANES cumulative file available here and on Coursework. Before we begin, let’s take care of some preliminary tasks, including loading the packages we need and importing this file.
# Load packages (install before if haven't already)
libs <- c("foreign", "plyr", "dplyr", "ggplot2")
sapply(libs, require, character.only = TRUE)
## foreign plyr dplyr ggplot2
## TRUE TRUE TRUE TRUE
# Import the anescum.dta file
setwd("~/dropbox/155/tutorial2")
system("unzip ANEScumul_data_codebook.zip") #unzip (using batch command)
anes <- read.dta("anescum.dta", warn.missing.labels = F) #read file
system("rm -r anescum* __M*") #delete unzipped files (using batch command)
# Keep only some variables and rename them
# see https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
# for the combined select/rename function used here
anes <- anes %>%
select(year = VCF0004,
age = VCF0101,
gender = VCF0104,
race = VCF0106,
edu = VCF0140,
south = VCF0113,
income = VCF0114,
partyid = VCF0301,
interest = VCF0310,
govtrust = VCF0604,
abortion = VCF0838,
demtherm = VCF0218)
To learn more about the variables we kept, consult the ANES codebook included in the .zip file above.
You can learn more about using batch commands with system()
here. These commands may work slightly different in Windows.
Tutorial 1 focused heavily on data manipulation. It is worth spending a little more time on this. In particular, let’s go over how to recode factor variables into numeric variables.
The abortion variable is a factor variable with 6 categories, 4 of which are meaningful for our purposes:
class(anes$abortion)
## [1] "factor"
summary(anes$abortion)
## 0. NA; no Post IW; version NEW (2008)
## 1594
## 1. By law, abortion should never be permitted
## 2757
## 2. The law should permit abortion only in case of rape,
## 6726
## 3. The law should permit abortion for reasons other than
## 3710
## 4. By law, a woman should always be able to obtain an
## 8841
## 9. DK; other
## 473
## NA's
## 25659
Say we want to create a dummy variable that equals 1 if the respondent believes women should always be able to obtain abortion, and 0 otherwise (coding categories 0 and 9 as NA). Here are three ways of doing this:
# Option 1 (using ifelse)
anes$abortionD <- with(anes,
ifelse(abortion == "4. By law, a woman should always be able to obtain an", 1,
ifelse(abortion == "1. By law, abortion should never be permitted" |
abortion == "2. The law should permit abortion only in case of rape," |
abortion == "3. The law should permit abortion for reasons other than", 0, NA)))
table(anes$abortionD)
##
## 0 1
## 13193 8841
# Option 2 (using ifelse and %in%)
anes$abortionD <- with(anes,
ifelse(abortion == "4. By law, a woman should always be able to obtain an", 1,
ifelse(abortion %in% c("1. By law, abortion should never be permitted",
"2. The law should permit abortion only in case of rape,",
"3. The law should permit abortion for reasons other than"), 0, NA)))
table(anes$abortionD)
##
## 0 1
## 13193 8841
# Option 3 (recode factor levels, then convert to numeric)
anes$abortionD <- anes$abortion #clone variable
levels(anes$abortionD) <- c(NA, 0, 0, 0, 1, NA) #recode levels
anes$abortionD <- as.numeric(as.character(anes$abortionD)) #convert to numeric
table(anes$abortionD)
##
## 0 1
## 13193 8841
Options 1 and 2 both use ifelse()
, which turns out to be quite cumbersome here because of the long labels. Option 3 requires less code, and may therefore be preferred.
However, when using Option 3, make sure you correctly convert the variable from factor to numeric. If you want to convert numbers stored as factor levels in R
to numeric values, you need to first convert them to character, then convert them to numeric values. Above, this is done using as.numeric(as.character(anes$abortionD))
.
If we were to convert the variable from factor to numeric without the intermediate step of converting them to character the following happens:
anes$abortionD <- anes$abortion #clone variable
levels(anes$abortionD) <- c(NA, 0, 0, 0, 1, NA) #recode levels
anes$abortionD <- as.numeric(anes$abortionD) #FAIL
table(anes$abortionD)
##
## 1 2
## 13193 8841
R
converts the numbers to ‘1’ and ‘2’ instead of ‘0’ and ‘1’. This has to do with how R
stores factor levels internally. If you want to convert a factor variable to numeric, always remember to convert factors using as.numeric(as.character(var))
where var
is your variable of interest.
incomeD
which recodes income
in the anes
data frame into a (numeric) dummy variable that equals 1 if the respondent’s income is in the 68th percentile or above and 0 otherwise.Let’s now begin some basic data analysis. As part of this, it is important to get a sense for what variables look like separately (their univariate distributions). The univariate descriptive statistics you produce will depend on whether the variable is categorical or numeric.
Let’s first recode the 7-point party ID variable to a 3-point party ID variable:
table(anes$partyid)
##
## 0. DK; NA; other; refused to answer; no Pre IW
## 968
## 1. Strong Democrat
## 9320
## 2. Weak Democrat
## 10390
## 3. Independent - Democrat
## 5649
## 4. Independent - Independent
## 5617
## 5. Independent - Republican
## 4772
## 6. Weak Republican
## 6790
## 7. Strong Republican
## 5592
anes$partyid3 <- anes$partyid
levels(anes$partyid3) <- c(NA, "Democrat", "Democrat", "Democrat",
"Independent", "Republican", "Republican",
"Republican")
table(anes$partyid3)
##
## Democrat Independent Republican
## 25359 5617 17154
As is already apparent, table()
is a useful function for getting the distribution of a categorical variable. To get the proportion of respondents in each category, you can add prop.table()
:
prop.table(table(anes$partyid3))
##
## Democrat Independent Republican
## 0.5269 0.1167 0.3564
We’ll use the package ggplot2
for most of our graphing (see below). It produces high-quality, customizable graphs for publication. However, if you want to quickly examine the distribution of a categorical variable, you can use barplot()
in combination with table()
and prop.table()
:
barplot(table(anes$partyid3))
barplot(prop.table(table(anes$partyid3)))
There are many commands that will help you learn about the distribution of a variable—e.g., mean()
, median()
, min()
, max()
, and sd()
. Remember to include na.rm = T
if the variable includes NA
values (though also beware of the biases missing data may introduce). Here are some examples, using the demtherm
variable (a feeling thermometer for the democratic party).
# First recode the demtherm variable (values 98 and 99 indicate missing data)
anes$demtherm <- ifelse(anes$demtherm %in% 98:99, NA, identity(anes$demtherm))
# Find a few different descriptive statistics
mean(anes$demtherm, na.rm = T)
## [1] 60.34
median(anes$demtherm, na.rm = T)
## [1] 60
sd(anes$demtherm, na.rm = T)
## [1] 23.07
summary()
provides many of these statistics in one command:
summary(anes$demtherm)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 50 60 60 75 97 24599
To find specific percentiles of a numeric variable, use quantile()
:
quantile(anes$demtherm, probs = c(0.1, 0.5, 0.9), na.rm = T) #10th, 50th, and 90th percentiles
## 10% 50% 90%
## 30 60 95
quantile(anes$demtherm, probs = seq(0, 1, by = 0.05), na.rm = T) #percentiles in increments of 5
## 0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70%
## 0 15 30 40 40 50 50 50 50 60 60 60 70 70 70
## 75% 80% 85% 90% 95% 100%
## 75 85 85 95 97 97
Specify the percentiles you want in a vector using the option probs
in the function.
We can quickly get a sense for the distribution of a continuous variable by plotting a histogram of it (though use ggplot
for your papers):
hist(anes$demtherm)
Dummy variables are a special case of numeric, categorical variables. Because they are so useful, it is worth going over what taking the mean of a dummy variable means. For example, interpret the mean of the abortion dummy variable that we created above:
anes$abortionD <- anes$abortion
levels(anes$abortionD) <- c(NA, 0, 0, 0, 1, NA)
anes$abortionD <- as.numeric(as.character(anes$abortionD))
mean(anes$abortionD, na.rm = T)
## [1] 0.4012
interest
variable (a categorical variable measuring interest in politics). More specifically, find the number and proportion of observations for each category of the variable.demtherm
or abortionD
, which we examined above): mean, median, standard deviation, min and max value, and the 5th and 95th percentiles.Examining the relationship between two (or more) variables is usually more interesting. Basic bivariate descriptive statistics come in the form of conditional means, crosstabs, and graphing. (Graphing is covered in a different section below.) What type of descriptive statistics you provide depends on the nature of your variables.
Let’s examine the relationship between gender and party identification. First clean the gender variable:
levels(anes$gender) <- c(NA, "Male", "Female")
table(anes$gender)
##
## Male Female
## 22017 27640
Here’s a simple crosstab:
table(anes$gender, anes$partyid3)
##
## Democrat Independent Republican
## Male 10803 2596 8039
## Female 14556 3021 9115
with(anes, table(gender, partyid3)) #same as above
## partyid3
## gender Democrat Independent Republican
## Male 10803 2596 8039
## Female 14556 3021 9115
prop.table()
with the correct margin specification is more interesting:
t <- with(anes, table(gender, partyid3))
prop.table(t, margin = 1)
## partyid3
## gender Democrat Independent Republican
## Male 0.5039 0.1211 0.3750
## Female 0.5453 0.1132 0.3415
Compare with:
prop.table(t)
## partyid3
## gender Democrat Independent Republican
## Male 0.22445 0.05394 0.16703
## Female 0.30243 0.06277 0.18938
prop.table(t, margin = 2)
## partyid3
## gender Democrat Independent Republican
## Male 0.4260 0.4622 0.4686
## Female 0.5740 0.5378 0.5314
What is the difference between prop.table(t)
, prop.table(t, margin = 1)
, and prop.table(t, margin = 2)
? Make sure you can interpret the output.
An alternative way to find bivariate relationships when you have two categorical variables is to recode your proposed dependent variable to a dummy variable and use ddply()
from package plyr
. (I personally use this approach much more frequently than prop.table()
.)
anes$democrat <- anes$partyid3
levels(anes$democrat) <- c(1, 0, 0)
anes$democrat <- as.numeric(as.character(anes$democrat))
ddply(anes, .(gender), summarise, mean = mean(democrat, na.rm = T))
## gender mean
## 1 Male 0.5039
## 2 Female 0.5453
## 3 <NA> NaN
In the code above I first created a dummy variable called democrat
that equals 1 if the respondent (weakly) identifies with the democratic party and 0 otherwise. I then found the mean of this variable by gender using ddply()
.
Notes about ddply
, a very useful function:
anes
.summarise
when we want to find a summary statistic.mean()
. We take the mean of the dependent variable.We can specify more than one function in the last argument:
ddply(anes, .(gender), summarise,
mean = mean(democrat, na.rm = T),
median = median(democrat, na.rm = T))
## gender mean median
## 1 Male 0.5039 1
## 2 Female 0.5453 1
## 3 <NA> NaN NA
We can also find the mean by more than one variable. For example, we can find mean democratic identification by gender and year (say for years 1970-2000 to reduce the output):
anes_sub <- subset(anes, year %in% 1970:1990)
ddply(anes_sub, .(year, gender), summarise, mean = mean(democrat, na.rm = T))
## year gender mean
## 1 1970 Male 0.5437
## 2 1970 Female 0.5447
## 3 1972 Male 0.4845
## 4 1972 Female 0.5336
## 5 1974 Male 0.4970
## 6 1974 Female 0.5278
## 7 1976 Male 0.4941
## 8 1976 Female 0.5230
## 9 1978 Male 0.5215
## 10 1978 Female 0.5457
## 11 1980 Male 0.4870
## 12 1980 Female 0.5479
## 13 1982 Male 0.5032
## 14 1982 Female 0.5912
## 15 1984 Male 0.4549
## 16 1984 Female 0.4960
## 17 1986 Male 0.4730
## 18 1986 Female 0.5288
## 19 1988 Male 0.4263
## 20 1988 Female 0.5043
## 21 1990 Male 0.4966
## 22 1990 Female 0.5322
When we have a categorical independent variable and a continuous dependent variable, finding conditional means using ddply()
again is useful. Take for example the relationship between income and the democratic feeling thermometer:
ddply(anes, .(income), summarise, mean = mean(demtherm, na.rm = T),
median = median(demtherm, na.rm = T))
## income mean median
## 1 0. DK; NA; refused to answer; no Pre IW 60.17 60
## 2 1. 0 to 16 percentile 67.67 70
## 3 2. 17 to 33 percentile 64.58 65
## 4 3. 34 to 67 percentile 59.88 60
## 5 4. 68 to 95 percentile 55.92 60
## 6 5. 96 to 100 percentile 50.84 50
## 7 <NA> NaN NA
The relationship between two continuous variables is most commonly investigated using scatter plots (see graphing section below). As a complement, you may want to find the Pearson correlation between the two variables. Let’s find the correlation between age
and demtherm
(after fixing age
):
sort(unique(anes$age)) #age variable has 0-year olds
## [1] 0 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
## [24] 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61
## [47] 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84
## [70] 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
anes$age[anes$age == 0] <- NA #get rid of 0-year olds
# Pearson correlation
cor(anes$age, anes$demtherm, use = "complete.obs")
## [1] 0.08352
Create a crosstab of south
and the 3-point party ID variable, separately for 1952 and 2008. (Recode the 7-point party ID variable if you haven’t already.) The crosstab should be able to tell you to what extent people in the south are Democrat, Independent, and Republican (and likewise for the non-south). How did this change between 1952 and 2008?
Now create a Democrat dummy variable from the party ID variable. The variable should equal 1 if the respondent (weakly) identifies with the Democratic party and 0 if the respondent is Republican or (purely) Independent. Find the mean of this variable for people in the south and non-south using ddply()
, again for years 1952 and 2008. Compare with the results from Exercise 1.
Use ddply()
to find mean Democratic identification, by South and year—that is, by whether the respondent lives in the south for every year in the data set. Use the dummy variable from Exercise 2.
Use ddply()
to find both mean Democratic identification and the standard deviation of this variable, by South and year. Use ddply()
only once.
Let’s get started with ggplot2
, a very powerful graphic package for R
. What follows is not a comprehensive survey of every possible graphing technique with ggplot2
—rather it provides three examples aimed to get you started using the package. See the end of this section for more resources.
Say we wanted to plot the proportion of ANES respondents who say they identify with the Democratic party, separately for men and women and for each year in the dataset.
First, let’s use ddply()
, where the variables to split the data by are year
and gender
, and where the democrat
dummy variable we created above is the dependent variable of interest:
t <- ddply(anes, .(year, gender), summarise, mean = mean(democrat, na.rm = T))
tail(t)
## year gender mean
## 55 2002 Male 0.4501
## 56 2002 Female 0.5025
## 57 2004 Male 0.4462
## 58 2004 Female 0.5369
## 59 2008 Male 0.5520
## 60 2008 Female 0.6336
Now create a very basic plot with year on the x-axis, proportion who identify with the democratic party on the y-axis, and a color scheme that differentiates women from men:
ggplot(t, aes(x = year, y = mean, color = gender)) +
geom_point()
The code above consists of two functions added together using +
:
ggplot()
specifies
t
),x
, y
, and color
) specified inside aes()
. Note that these attributes reference different column names in t
.geom_point()
tells R
to add points.The plot can be improved by adding lines:
ggplot(t, aes(x = year, y = mean, color = gender)) +
geom_point() +
geom_line()
Change x- and y-labels and add a black and white theme:
ggplot(t, aes(x = year, y = mean, color = gender)) +
geom_point() +
geom_line() +
xlab("") +
ylab("Proportion Identifying as Democrat") +
theme_bw()
Change the legend title, points/lines colors, and x-axis breaks:
ggplot(t, aes(x = year, y = mean, color = gender)) +
geom_point() +
geom_line() +
xlab("") +
ylab("Proportion Identifying as Democrat") +
theme_bw() +
scale_colour_manual("Respondent's\nGender", values = c("black", "red")) +
scale_x_continuous(breaks = seq(1950, 2010, by = 10))
Basic density plot of demtherm
by gender:
ggplot(anes, aes(demtherm, fill = gender)) +
geom_density()
Change the sensitivity of the density estimates (adjust = 2
), make fill colors transparent (alpha = 0.3
), and remove the black color outlines (colour = NA
):
ggplot(anes, aes(demtherm, fill = gender)) +
geom_density(alpha = 0.3, adjust = 2, colour = NA)
Add a few more adjustments:
ggplot(anes, aes(demtherm, fill = gender)) +
geom_density(alpha = 0.3, adjust = 2, colour = NA) +
xlab("Democratic Thermometer") +
ylab("Density") +
scale_fill_manual(values = c("black", "darkblue")) +
theme_bw()
Same density plot for eight years in the data set using facet_wrap(~year)
:
years <- seq(1980, 2008, by = 4)
anes_sub <- subset(anes, year %in% years)
ggplot(anes_sub, aes(demtherm, fill = gender)) +
geom_density(alpha = 0.3, adjust = 1.8, colour = NA) +
xlab("Democratic Thermometer") +
ylab("Density") +
scale_fill_manual(values = c("black", "darkblue")) +
theme_bw() +
facet_wrap(~year)
Here’s a scatter plot examining the relationship between a respondent’s age and feeling thermometer for the democratic party in 2004 and 2008:
anes_sub <- subset(anes, year %in% c(2004, 2008))
ggplot(anes_sub, aes(x = age, y = demtherm)) +
geom_point(size = 0.6) +
xlab("Respondent's Age") +
ylab("Feeling Thermometer for Democratic Party") +
theme_bw() +
facet_wrap(~year)
Add LOESS smoothers:
ggplot(anes_sub, aes(x = age, y = demtherm)) +
geom_point(size = 0.6) +
xlab("Respondent's Age") +
ylab("Feeling Thermometer for Democratic Party") +
theme_bw() +
facet_wrap(~year) +
geom_smooth(method = "loess")
Note that the clustering of points in horizontal lines has to do with respondents tending to choose even values of the democratic feeling thermometer (calling into question whether it is entirely appropriate to treat this variable as continuous).
Here are a few additional resources to consult for graphing with ggplot2
:
If you need help with something more specific, try googling it! There are a lot of online resources for getting help with ggplot
.
demtherm
) for 1978.facet_wrap()
. Inlcude LOESS smoothers.ggplot
to graph these estimates. Time should be on the x-axis, the mean estimates on the y-axis, and the points should be color coded according to the South variable. Use lines to connect the points. In one sentence, describe what you see.demtherm
variable using ggplot
. We didn’t go through exactly how to do this above, so consult online resources. In one sentence, describe what the histogram is telling you.Next week we’ll start using Monte Carlo simulations to demonstrate properties of different estimators. To implement such simulations, we’ll need to use for loops.
For loops can be used for other purpose as well. However, compared to vectorization, for loops in R
are relatively slow.
To learn to use for loops, let’s start with a very simple example. Let’s say you had the following vector v
with individuals’ height in centimeters:
v1 <- c(175, 182, 150, 187, 165)
We can convert the values in v1
from centimeters to meters using a for loop:
v2 <- rep(NA, 5) #create vector v2 with NA values
for(i in 1:5) { #loop over elements in v1 and store in v2
v2[i] <- v1[i] / 100
}
v2 #v2 after the for loop
## [1] 1.75 1.82 1.50 1.87 1.65
Note that we could have done this using a vectorized function, which is faster (when we have a lot of data) and more compact:
v2 <- v1 / 100
However, this example illustrates some points about for loops that will be useful later:
Begin by creating an object that can store the results of your for loop. In the example above, we created v2
for this purpose. With vectors, we need to specify how many elements we want to eventually store, in this case 5. (This is not true if you wanted to store the results in a list.)
The basic structure of the loop usually is:
for(i in 1:n) {
#commands to execute
}
Here ‘n’ represents the number of times you want to iterate the loop. The loop will run from 1 to n by an integer count. If you instead wanted the loop to iterate from 1 to n but skip every other number you could use seq(1, n, by = 2)
in place of 1:n
.
3. Within the for loop we want to save the output of each iteration to an element of the vector (or list) we created initially (in this case v2
).
Here’s a more general approach accomplishing the same thing, but where we keep the number of iterations flexible depending on how many elements v1
contains:
v1 <- c(175, 182, 150, 187, 165)
n <- length(v1)
v2 <- rep(NA, n)
for(i in 1:n) {
v2[i] <- v1[i] / 100
}
v2
## [1] 1.75 1.82 1.50 1.87 1.65
Of course, you can store outputs from the for loop in a vector within a data frame. Say we had the following data frame with names and heights:
ppl <- data.frame(person = letters[1:5], height_cm = v1)
ppl
## person height_cm
## 1 a 175
## 2 b 182
## 3 c 150
## 4 d 187
## 5 e 165
Let’s add a variable that expresses height in inches instead:
ppl$height_inch <- NA #add variable of NAs
n <- nrow(ppl) #get number of observations to loop over
for(i in 1:n){
ppl$height_inch[i] <- ppl$height_cm[i] * 0.393701
}
ppl
## person height_cm height_inch
## 1 a 175 68.90
## 2 b 182 71.65
## 3 c 150 59.06
## 4 d 187 73.62
## 5 e 165 64.96
Note that when adding a constant or NA
values to a vector within a data frame, R
(correctly) assumes that you want to add this constant to every element of the variable, so you don’t need to specify how many times you want to add NA
in ppl$height_inch <- NA
.
To preview what’s to come, let’s use a for loop to draw repeated samples from a population. In the process, we’ll illustrate a good property about means.
Say we have a population of 10 individuals with the following heights:
v <- c(175, 182, 150, 187, 165, 177, 200, 198, 157, 165)
mean(v) #population mean
## [1] 175.6
Unfortunately, for whatever reason, we do not know the heights of all of these individuals. We can only (randomly) sample 5 of them. From this random sample of five individuals we estimate the height of all 10 individuals. We can draw a sample of 5 from v
and take the mean of this sample using the following code:
v <- c(175, 182, 150, 187, 165, 177, 200, 198, 157, 165)
smpl <- sample(v, 5)
mean(smpl)
## [1] 169.4
Would we on average expect to estimate the mean of the population accurately? Let’s use a Monte Carlo simulation to find out. We’ll draw 10,000 random samples of five from v
and take the mean of each of these samples. With an unbiased estimator we would, on average, expect the sample estimand (in this case the mean) to equal the population parameter of interest (175.6).
n <- 10000
smpl_means <- rep(NA, n)
for(i in 1:n){
smpl <- sample(v, 5)
smpl_means[i] <- mean(smpl)
}
mean(smpl_means)
## [1] 175.6
The mean of the sample means (175.6487) is very close to the population mean (175.6)—on average, we’re accurately estimating the population mean with our random sample of five individuals.
Note, though, that in some cases our estimate is quite far from the population mean. To illustrate this, we can plot a histogram of the sample means:
ggplot(data.frame(smpl_means), aes(x = smpl_means)) +
geom_histogram(binwidth = 2) +
geom_vline(xintercept = mean(v), color = "red", linetype = 2) +
xlab("Sample mean (n = 5)") +
ylab("Number of samples") +
theme_bw()
The dashed red line indicates the population mean. While our sample estimates are centered around this mean (good news!), the range of the estimates is quite large. In fact, about 10% of the time we’d get an estimate of the mean that is either almost 9 centimeters (3.5 inches) below the actual mean or almost 9 centimeters above the actual mean:
quantile(smpl_means, probs = c(0.05, 0.95))
## 5% 95%
## 166.8 184.4
Use a for loop to take the square root of each value in the following vector: vec1 <- c(4, 9, 81, 100, 1000, 10^6)
. Save the results to a new vector called vec2
.
Monte Carlo Simulation: Imagine that the values in the vector pop
below represent vote shares for a presidential candidate across the 3,144 counties in the United States. If we were to take a sample of 50 counties and estimate mean support for the presidential candidate, would we, on average, estimate the vote share across all counties accurately? (Don’t worry about the fact that we really should be weighing counties by their population size to estimate overall support.) Draw 10,000 samples of 50 counties from pop
and estimate mean support for each sample, saving each mean estimate into a vector called smpl_means
. How does the mean of the sample means compare to the population mean? Do we, on average, do a good job of estimating the population mean?
pop <- runif(n = 3144, min = 0, max = 1)