In this tutorial we’ll use the ANES cumulative file available here and on Coursework. Before we begin, let’s take care of some preliminary tasks, including loading the packages we need and importing this file.
# Load packages (install before if haven't already)
libs <- c("foreign", "plyr", "dplyr", "ggplot2")
sapply(libs, require, character.only = TRUE)
## foreign plyr dplyr ggplot2
## TRUE TRUE TRUE TRUE
# Import the anescum.dta file
setwd("~/dropbox/155/tutorial2")
system("unzip ANEScumul_data_codebook.zip") #unzip (using batch command)
anes <- read.dta("anescum.dta", warn.missing.labels = F) #read file
system("rm -r anescum* __M*") #delete unzipped files (using batch command)
# Keep only some variables and rename them
# see https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
# for the combined select/rename function used here
anes <- anes %>%
select(year = VCF0004,
age = VCF0101,
gender = VCF0104,
race = VCF0106,
edu = VCF0140,
south = VCF0113,
income = VCF0114,
partyid = VCF0301,
interest = VCF0310,
govtrust = VCF0604,
abortion = VCF0838,
demtherm = VCF0218)
To learn more about the variables we kept, consult the ANES codebook included in the .zip file above.
You can learn more about using batch commands with system()
here. These commands may work slightly different in Windows.
Tutorial 1 focused heavily on data manipulation. It is worth spending a little more time on this. In particular, let’s go over how to recode factor variables into numeric variables.
The abortion variable is a factor variable with 6 categories, 4 of which are meaningful for our purposes:
class(anes$abortion)
## [1] "factor"
summary(anes$abortion)
## 0. NA; no Post IW; version NEW (2008)
## 1594
## 1. By law, abortion should never be permitted
## 2757
## 2. The law should permit abortion only in case of rape,
## 6726
## 3. The law should permit abortion for reasons other than
## 3710
## 4. By law, a woman should always be able to obtain an
## 8841
## 9. DK; other
## 473
## NA's
## 25659
Say we want to create a dummy variable that equals 1 if the respondent believes women should always be able to obtain abortion, and 0 otherwise (coding categories 0 and 9 as NA). Here are three ways of doing this:
# Option 1 (using ifelse)
anes$abortionD <- with(anes,
ifelse(abortion == "4. By law, a woman should always be able to obtain an", 1,
ifelse(abortion == "1. By law, abortion should never be permitted" |
abortion == "2. The law should permit abortion only in case of rape," |
abortion == "3. The law should permit abortion for reasons other than", 0, NA)))
table(anes$abortionD)
##
## 0 1
## 13193 8841
# Option 2 (using ifelse and %in%)
anes$abortionD <- with(anes,
ifelse(abortion == "4. By law, a woman should always be able to obtain an", 1,
ifelse(abortion %in% c("1. By law, abortion should never be permitted",
"2. The law should permit abortion only in case of rape,",
"3. The law should permit abortion for reasons other than"), 0, NA)))
table(anes$abortionD)
##
## 0 1
## 13193 8841
# Option 3 (recode factor levels, then convert to numeric)
anes$abortionD <- anes$abortion #clone variable
levels(anes$abortionD) <- c(NA, 0, 0, 0, 1, NA) #recode levels
anes$abortionD <- as.numeric(as.character(anes$abortionD)) #convert to numeric
table(anes$abortionD)
##
## 0 1
## 13193 8841
Options 1 and 2 both use ifelse()
, which turns out to be quite cumbersome here because of the long labels. Option 3 requires less code, and may therefore be preferred.
However, when using Option 3, make sure you correctly convert the variable from factor to numeric. If you want to convert numbers stored as factor levels in R
to numeric values, you need to first convert them to character, then convert them to numeric values. Above, this is done using as.numeric(as.character(anes$abortionD))
.
If we were to convert the variable from factor to numeric without the intermediate step of converting them to character the following happens:
anes$abortionD <- anes$abortion #clone variable
levels(anes$abortionD) <- c(NA, 0, 0, 0, 1, NA) #recode levels
anes$abortionD <- as.numeric(anes$abortionD) #FAIL
table(anes$abortionD)
##
## 1 2
## 13193 8841
R
converts the numbers to ‘1’ and ‘2’ instead of ‘0’ and ‘1’. This has to do with how R
stores factor levels internally. If you want to convert a factor variable to numeric, always remember to convert factors using as.numeric(as.character(var))
where var
is your variable of interest.
incomeD
which recodes income
in the anes
data frame into a (numeric) dummy variable that equals 1 if the respondent’s income is in the 68th percentile or above and 0 otherwise.Let’s now begin some basic data analysis. As part of this, it is important to get a sense for what variables look like separately (their univariate distributions). The univariate descriptive statistics you produce will depend on whether the variable is categorical or numeric.
Let’s first recode the 7-point party ID variable to a 3-point party ID variable:
table(anes$partyid)
##
## 0. DK; NA; other; refused to answer; no Pre IW
## 968
## 1. Strong Democrat
## 9320
## 2. Weak Democrat
## 10390
## 3. Independent - Democrat
## 5649
## 4. Independent - Independent
## 5617
## 5. Independent - Republican
## 4772
## 6. Weak Republican
## 6790
## 7. Strong Republican
## 5592
anes$partyid3 <- anes$partyid
levels(anes$partyid3) <- c(NA, "Democrat", "Democrat", "Democrat",
"Independent", "Republican", "Republican",
"Republican")
table(anes$partyid3)
##
## Democrat Independent Republican
## 25359 5617 17154
As is already apparent, table()
is a useful function for getting the distribution of a categorical variable. To get the proportion of respondents in each category, you can add prop.table()
:
prop.table(table(anes$partyid3))
##
## Democrat Independent Republican
## 0.5269 0.1167 0.3564
We’ll use the package ggplot2
for most of our graphing (see below). It produces high-quality, customizable graphs for publication. However, if you want to quickly examine the distribution of a categorical variable, you can use barplot()
in combination with table()
and prop.table()
:
barplot(table(anes$partyid3))
barplot(prop.table(table(anes$partyid3)))
There are many commands that will help you learn about the distribution of a variable—e.g., mean()
, median()
, min()
, max()
, and sd()
. Remember to include na.rm = T
if the variable includes NA
values (though also beware of the biases missing data may introduce). Here are some examples, using the demtherm
variable (a feeling thermometer for the democratic party).
# First recode the demtherm variable (values 98 and 99 indicate missing data)
anes$demtherm <- ifelse(anes$demtherm %in% 98:99, NA, identity(anes$demtherm))
# Find a few different descriptive statistics
mean(anes$demtherm, na.rm = T)
## [1] 60.34
median(anes$demtherm, na.rm = T)
## [1] 60
sd(anes$demtherm, na.rm = T)
## [1] 23.07
summary()
provides many of these statistics in one command:
summary(anes$demtherm)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 50 60 60 75 97 24599
To find specific percentiles of a numeric variable, use quantile()
:
quantile(anes$demtherm, probs = c(0.1, 0.5, 0.9), na.rm = T) #10th, 50th, and 90th percentiles
## 10% 50% 90%
## 30 60 95
quantile(anes$demtherm, probs = seq(0, 1, by = 0.05), na.rm = T) #percentiles in increments of 5
## 0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70%
## 0 15 30 40 40 50 50 50 50 60 60 60 70 70 70
## 75% 80% 85% 90% 95% 100%
## 75 85 85 95 97 97
Specify the percentiles you want in a vector using the option probs
in the function.
We can quickly get a sense for the distribution of a continuous variable by plotting a histogram of it (though use ggplot
for your papers):
hist(anes$demtherm)
Dummy variables are a special case of numeric, categorical variables. Because they are so useful, it is worth going over what taking the mean of a dummy variable means. For example, interpret the mean of the abortion dummy variable that we created above:
anes$abortionD <- anes$abortion
levels(anes$abortionD) <- c(NA, 0, 0, 0, 1, NA)
anes$abortionD <- as.numeric(as.character(anes$abortionD))
mean(anes$abortionD, na.rm = T)
## [1] 0.4012
interest
variable (a categorical variable measuring interest in politics). More specifically, find the number and proportion of observations for each category of the variable.demtherm
or abortionD
, which we examined above): mean, median, standard deviation, min and max value, and the 5th and 95th percentiles.Examining the relationship between two (or more) variables is usually more interesting. Basic bivariate descriptive statistics come in the form of conditional means, crosstabs, and graphing. (Graphing is covered in a different section below.) What type of descriptive statistics you provide depends on the nature of your variables.
Let’s examine the relationship between gender and party identification. First clean the gender variable:
levels(anes$gender) <- c(NA, "Male", "Female")
table(anes$gender)
##
## Male Female
## 22017 27640
Here’s a simple crosstab:
table(anes$gender, anes$partyid3)
##
## Democrat Independent Republican
## Male 10803 2596 8039
## Female 14556 3021 9115
with(anes, table(gender, partyid3)) #same as above
## partyid3
## gender Democrat Independent Republican
## Male 10803 2596 8039
## Female 14556 3021 9115
prop.table()
with the correct margin specification is more interesting:
t <- with(anes, table(gender, partyid3))
prop.table(t, margin = 1)
## partyid3
## gender Democrat Independent Republican
## Male 0.5039 0.1211 0.3750
## Female 0.5453 0.1132 0.3415
Compare with:
prop.table(t)
## partyid3
## gender Democrat Independent Republican
## Male 0.22445 0.05394 0.16703
## Female 0.30243 0.06277 0.18938
prop.table(t, margin = 2)
## partyid3
## gender Democrat Independent Republican
## Male 0.4260 0.4622 0.4686
## Female 0.5740 0.5378 0.5314
What is the difference between prop.table(t)
, prop.table(t, margin = 1)
, and prop.table(t, margin = 2)
? Make sure you can interpret the output.
An alternative way to find bivariate relationships when you have two categorical variables is to recode your proposed dependent variable to a dummy variable and use ddply()
from package plyr
. (I personally use this approach much more frequently than prop.table()
.)
anes$democrat <- anes$partyid3
levels(anes$democrat) <- c(1, 0, 0)
anes$democrat <- as.numeric(as.character(anes$democrat))
ddply(anes, .(gender), summarise, mean = mean(democrat, na.rm = T))
## gender mean
## 1 Male 0.5039
## 2 Female 0.5453
## 3 <NA> NaN
In the code above I first created a dummy variable called democrat
that equals 1 if the respondent (weakly) identifies with the democratic party and 0 otherwise. I then found the mean of this variable by gender using ddply()
.
Notes about ddply
, a very useful function:
anes
.summarise
when we want to find a summary statistic.mean()
. We take the mean of the dependent variable.We can specify more than one function in the last argument:
ddply(anes, .(gender), summarise,
mean = mean(democrat, na.rm = T),
median = median(democrat, na.rm = T))
## gender mean median
## 1 Male 0.5039 1
## 2 Female 0.5453 1
## 3 <NA> NaN NA
We can also find the mean by more than one variable. For example, we can find mean democratic identification by gender and year (say for years 1970-2000 to reduce the output):
anes_sub <- subset(anes, year %in% 1970:1990)
ddply(anes_sub, .(year, gender), summarise, mean = mean(democrat, na.rm = T))
## year gender mean
## 1 1970 Male 0.5437
## 2 1970 Female 0.5447
## 3 1972 Male 0.4845
## 4 1972 Female 0.5336
## 5 1974 Male 0.4970
## 6 1974 Female 0.5278
## 7 1976 Male 0.4941
## 8 1976 Female 0.5230
## 9 1978 Male 0.5215
## 10 1978 Female 0.5457
## 11 1980 Male 0.4870
## 12 1980 Female 0.5479
## 13 1982 Male 0.5032
## 14 1982 Female 0.5912
## 15 1984 Male 0.4549
## 16 1984 Female 0.4960
## 17 1986 Male 0.4730
## 18 1986 Female 0.5288
## 19 1988 Male 0.4263
## 20 1988 Female 0.5043
## 21 1990 Male 0.4966
## 22 1990 Female 0.5322
When we have a categorical independent variable and a continuous dependent variable, finding conditional means using ddply()
again is useful. Take for example the relationship between income and the democratic feeling thermometer:
ddply(anes, .(income), summarise, mean = mean(demtherm, na.rm = T),
median = median(demtherm, na.rm = T))
## income mean median
## 1 0. DK; NA; refused to answer; no Pre IW 60.17 60
## 2 1. 0 to 16 percentile 67.67 70
## 3 2. 17 to 33 percentile 64.58 65
## 4 3. 34 to 67 percentile 59.88 60
## 5 4. 68 to 95 percentile 55.92 60
## 6 5. 96 to 100 percentile 50.84 50
## 7 <NA> NaN NA
The relationship between two continuous variables is most commonly investigated using scatter plots (see graphing section below). As a complement, you may want to find the Pearson correlation between the two variables. Let’s find the correlation between age
and demtherm
(after fixing age
):
sort(unique(anes$age)) #age variable has 0-year olds
## [1] 0 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
## [24] 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61
## [47] 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84
## [70] 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
anes$age[anes$age == 0] <- NA #get rid of 0-year olds
# Pearson correlation
cor(anes$age, anes$demtherm, use = "complete.obs")
## [1] 0.08352
Create a crosstab of south
and the 3-point party ID variable, separately for 1952 and 2008. (Recode the 7-point party ID variable if you haven’t already.) The crosstab should be able to tell you to what extent people in the south are Democrat, Independent, and Republican (and likewise for the non-south). How did this change between 1952 and 2008?
Now create a Democrat dummy variable from the party ID variable. The variable should equal 1 if the respondent (weakly) identifies with the Democratic party and 0 if the respondent is Republican or (purely) Independent. Find the mean of this variable for people in the south and non-south using ddply()
, again for years 1952 and 2008. Compare with the results from Exercise 1.
Use ddply()
to find mean Democratic identification, by South and year—that is, by whether the respondent lives in the south for every year in the data set. Use the dummy variable from Exercise 2.
Use ddply()
to find both mean Democratic identification and the standard deviation of this variable, by South and year. Use ddply()
only once.
Let’s get started with ggplot2
, a very powerful graphic package for R
. What follows is not a comprehensive survey of every possible graphing technique with ggplot2
—rather it provides three examples aimed to get you started using the package. See the end of this section for more resources.
Say we wanted to plot the proportion of ANES respondents who say they identify with the Democratic party, separately for men and women and for each year in the dataset.
First, let’s use ddply()
, where the variables to split the data by are year
and gender
, and where the democrat
dummy variable we created above is the dependent variable of interest:
t <- ddply(anes, .(year, gender), summarise, mean = mean(democrat, na.rm = T))
tail(t)
## year gender mean
## 55 2002 Male 0.4501
## 56 2002 Female 0.5025
## 57 2004 Male 0.4462
## 58 2004 Female 0.5369
## 59 2008 Male 0.5520
## 60 2008 Female 0.6336
Now create a very basic plot with year on the x-axis, proportion who identify with the democratic party on the y-axis, and a color scheme that differentiates women from men:
ggplot(t, aes(x = year, y = mean, color = gender)) +
geom_point()
The code above consists of two functions added together using +
:
ggplot()
specifies
t
),x
, y
, and color
) specified inside aes()
. Note that these attributes reference different column names in t
.geom_point()
tells R
to add points.The plot can be improved by adding lines:
ggplot(t, aes(x = year, y = mean, color = gender)) +
geom_point() +
geom_line()
Change x- and y-labels and add a black and white theme:
ggplot(t, aes(x = year, y = mean, color = gender)) +
geom_point() +
geom_line() +
xlab("") +
ylab("Proportion Identifying as Democrat") +
theme_bw()