1 Preliminaries

In this tutorial we’ll use the ANES cumulative file available here and on Coursework. Before we begin, let’s take care of some preliminary tasks, including loading the packages we need and importing this file.

# Load packages (install before if haven't already)
libs <- c("foreign", "plyr", "dplyr", "ggplot2")
sapply(libs, require, character.only = TRUE)
## foreign    plyr   dplyr ggplot2
##    TRUE    TRUE    TRUE    TRUE
# Import the anescum.dta file
setwd("~/dropbox/155/tutorial2")
system("unzip ANEScumul_data_codebook.zip")                #unzip (using batch command)
anes <- read.dta("anescum.dta", warn.missing.labels = F)   #read file
system("rm -r anescum* __M*")                              #delete unzipped files (using batch command)

# Keep only some variables and rename them
# see https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
# for the combined select/rename function used here
anes <- anes %>%
        select(year = VCF0004,
               age = VCF0101,
               gender = VCF0104,
               race = VCF0106,
               edu = VCF0140,
               south = VCF0113,
               income = VCF0114,
               partyid = VCF0301,
               interest = VCF0310,
               govtrust = VCF0604,
               abortion = VCF0838,
               demtherm = VCF0218)

To learn more about the variables we kept, consult the ANES codebook included in the .zip file above.

You can learn more about using batch commands with system() here. These commands may work slightly different in Windows.


2 Basic Data Manipulation (Cont.)

Tutorial 1 focused heavily on data manipulation. It is worth spending a little more time on this. In particular, let’s go over how to recode factor variables into numeric variables.

The abortion variable is a factor variable with 6 categories, 4 of which are meaningful for our purposes:

class(anes$abortion)
## [1] "factor"
summary(anes$abortion)
##                    0. NA; no Post IW; version NEW (2008)
##                                                     1594
##            1. By law, abortion should never be permitted
##                                                     2757
##  2. The law should permit abortion only in case of rape,
##                                                     6726
## 3. The law should permit abortion for reasons other than
##                                                     3710
##    4. By law, a woman should always be able to obtain an
##                                                     8841
##                                             9. DK; other
##                                                      473
##                                                     NA's
##                                                    25659

Say we want to create a dummy variable that equals 1 if the respondent believes women should always be able to obtain abortion, and 0 otherwise (coding categories 0 and 9 as NA). Here are three ways of doing this:

# Option 1 (using ifelse)
anes$abortionD <- with(anes,
   ifelse(abortion == "4. By law, a woman should always be able to obtain an", 1,
   ifelse(abortion == "1. By law, abortion should never be permitted" |
          abortion == "2. The law should permit abortion only in case of rape," |
          abortion == "3. The law should permit abortion for reasons other than", 0, NA)))
table(anes$abortionD)
##
##     0     1
## 13193  8841
# Option 2 (using ifelse and %in%)
anes$abortionD <- with(anes,
   ifelse(abortion == "4. By law, a woman should always be able to obtain an", 1,
   ifelse(abortion %in% c("1. By law, abortion should never be permitted",
                          "2. The law should permit abortion only in case of rape,",
                          "3. The law should permit abortion for reasons other than"), 0, NA)))
table(anes$abortionD)
##
##     0     1
## 13193  8841
# Option 3 (recode factor levels, then convert to numeric)
anes$abortionD <- anes$abortion                               #clone variable
levels(anes$abortionD) <- c(NA, 0, 0, 0, 1, NA)               #recode levels
anes$abortionD <- as.numeric(as.character(anes$abortionD))    #convert to numeric
table(anes$abortionD)
##
##     0     1
## 13193  8841

Options 1 and 2 both use ifelse(), which turns out to be quite cumbersome here because of the long labels. Option 3 requires less code, and may therefore be preferred.

However, when using Option 3, make sure you correctly convert the variable from factor to numeric. If you want to convert numbers stored as factor levels in R to numeric values, you need to first convert them to character, then convert them to numeric values. Above, this is done using as.numeric(as.character(anes$abortionD)).

If we were to convert the variable from factor to numeric without the intermediate step of converting them to character the following happens:

anes$abortionD <- anes$abortion                               #clone variable
levels(anes$abortionD) <- c(NA, 0, 0, 0, 1, NA)               #recode levels
anes$abortionD <- as.numeric(anes$abortionD)                  #FAIL
table(anes$abortionD)
##
##     1     2
## 13193  8841

R converts the numbers to ‘1’ and ‘2’ instead of ‘0’ and ‘1’. This has to do with how R stores factor levels internally. If you want to convert a factor variable to numeric, always remember to convert factors using as.numeric(as.character(var)) where var is your variable of interest.


2.1 Exercises

  1. Create a new variable called incomeD which recodes income in the anes data frame into a (numeric) dummy variable that equals 1 if the respondent’s income is in the 68th percentile or above and 0 otherwise.

3 Univariate Descriptive Statistics

Let’s now begin some basic data analysis. As part of this, it is important to get a sense for what variables look like separately (their univariate distributions). The univariate descriptive statistics you produce will depend on whether the variable is categorical or numeric.


3.1 Categorical variables

Let’s first recode the 7-point party ID variable to a 3-point party ID variable:

table(anes$partyid)
##
## 0. DK; NA; other; refused to answer; no Pre IW
##                                            968
##                             1. Strong Democrat
##                                           9320
##                               2. Weak Democrat
##                                          10390
##                      3. Independent - Democrat
##                                           5649
##                   4. Independent - Independent
##                                           5617
##                    5. Independent - Republican
##                                           4772
##                             6. Weak Republican
##                                           6790
##                           7. Strong Republican
##                                           5592
anes$partyid3 <- anes$partyid
levels(anes$partyid3) <- c(NA, "Democrat", "Democrat", "Democrat",
                           "Independent", "Republican", "Republican",
                           "Republican")

table(anes$partyid3)
##
##    Democrat Independent  Republican
##       25359        5617       17154

As is already apparent, table() is a useful function for getting the distribution of a categorical variable. To get the proportion of respondents in each category, you can add prop.table():

prop.table(table(anes$partyid3))
##
##    Democrat Independent  Republican
##      0.5269      0.1167      0.3564


We’ll use the package ggplot2 for most of our graphing (see below). It produces high-quality, customizable graphs for publication. However, if you want to quickly examine the distribution of a categorical variable, you can use barplot() in combination with table() and prop.table():

barplot(table(anes$partyid3))

plot of chunk unnamed-chunk-7

barplot(prop.table(table(anes$partyid3)))

plot of chunk unnamed-chunk-7


3.2 Numeric variables

There are many commands that will help you learn about the distribution of a variable—e.g., mean(), median(), min(), max(), and sd(). Remember to include na.rm = T if the variable includes NA values (though also beware of the biases missing data may introduce). Here are some examples, using the demtherm variable (a feeling thermometer for the democratic party).

# First recode the demtherm variable (values 98 and 99 indicate missing data)
anes$demtherm <- ifelse(anes$demtherm %in% 98:99, NA, identity(anes$demtherm))

# Find a few different descriptive statistics
mean(anes$demtherm, na.rm = T)
## [1] 60.34
median(anes$demtherm, na.rm = T)
## [1] 60
sd(anes$demtherm, na.rm = T)
## [1] 23.07

summary() provides many of these statistics in one command:

summary(anes$demtherm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##       0      50      60      60      75      97   24599


To find specific percentiles of a numeric variable, use quantile():

quantile(anes$demtherm, probs = c(0.1, 0.5, 0.9), na.rm = T)        #10th, 50th, and 90th percentiles 
## 10% 50% 90%
##  30  60  95
quantile(anes$demtherm, probs = seq(0, 1, by = 0.05), na.rm = T)    #percentiles in increments of 5
##   0%   5%  10%  15%  20%  25%  30%  35%  40%  45%  50%  55%  60%  65%  70%
##    0   15   30   40   40   50   50   50   50   60   60   60   70   70   70
##  75%  80%  85%  90%  95% 100%
##   75   85   85   95   97   97

Specify the percentiles you want in a vector using the option probs in the function.

We can quickly get a sense for the distribution of a continuous variable by plotting a histogram of it (though use ggplot for your papers):

hist(anes$demtherm)

plot of chunk unnamed-chunk-11

3.2.1 Dummy variables

Dummy variables are a special case of numeric, categorical variables. Because they are so useful, it is worth going over what taking the mean of a dummy variable means. For example, interpret the mean of the abortion dummy variable that we created above:

anes$abortionD <- anes$abortion
levels(anes$abortionD) <- c(NA, 0, 0, 0, 1, NA)
anes$abortionD <- as.numeric(as.character(anes$abortionD))
mean(anes$abortionD, na.rm = T)
## [1] 0.4012

3.3 Exercises

  1. Find the distribution of the interest variable (a categorical variable measuring interest in politics). More specifically, find the number and proportion of observations for each category of the variable.
  2. Find the following descriptive statistics of a numeric variable of your choice (though not demtherm or abortionD, which we examined above): mean, median, standard deviation, min and max value, and the 5th and 95th percentiles.

4 Bivariate Descriptive Statistics

Examining the relationship between two (or more) variables is usually more interesting. Basic bivariate descriptive statistics come in the form of conditional means, crosstabs, and graphing. (Graphing is covered in a different section below.) What type of descriptive statistics you provide depends on the nature of your variables.


4.1 Categorical IV and DV

Let’s examine the relationship between gender and party identification. First clean the gender variable:

levels(anes$gender) <- c(NA, "Male", "Female")
table(anes$gender)
##
##   Male Female
##  22017  27640

Here’s a simple crosstab:

table(anes$gender, anes$partyid3)      
##
##          Democrat Independent Republican
##   Male      10803        2596       8039
##   Female    14556        3021       9115
with(anes, table(gender, partyid3))    #same as above
##         partyid3
## gender   Democrat Independent Republican
##   Male      10803        2596       8039
##   Female    14556        3021       9115

prop.table() with the correct margin specification is more interesting:

t <- with(anes, table(gender, partyid3))
prop.table(t, margin = 1)
##         partyid3
## gender   Democrat Independent Republican
##   Male     0.5039      0.1211     0.3750
##   Female   0.5453      0.1132     0.3415

Compare with:

prop.table(t)
##         partyid3
## gender   Democrat Independent Republican
##   Male    0.22445     0.05394    0.16703
##   Female  0.30243     0.06277    0.18938
prop.table(t, margin = 2)
##         partyid3
## gender   Democrat Independent Republican
##   Male     0.4260      0.4622     0.4686
##   Female   0.5740      0.5378     0.5314

What is the difference between prop.table(t), prop.table(t, margin = 1), and prop.table(t, margin = 2)? Make sure you can interpret the output.


An alternative way to find bivariate relationships when you have two categorical variables is to recode your proposed dependent variable to a dummy variable and use ddply() from package plyr. (I personally use this approach much more frequently than prop.table().)

anes$democrat <- anes$partyid3
levels(anes$democrat) <- c(1, 0, 0)
anes$democrat <- as.numeric(as.character(anes$democrat))

ddply(anes, .(gender), summarise, mean = mean(democrat, na.rm = T))
##   gender   mean
## 1   Male 0.5039
## 2 Female 0.5453
## 3   <NA>    NaN

In the code above I first created a dummy variable called democrat that equals 1 if the respondent (weakly) identifies with the democratic party and 0 otherwise. I then found the mean of this variable by gender using ddply().

Notes about ddply, a very useful function:

  • It takes four (or more) arguments.
  • The first argument is the data frame, in this case anes.
  • The second argument specifies the independent variable, or the variable we want to split the data by.
  • The third argument is always summarise when we want to find a summary statistic.
  • The fourth argument takes a function, in this case mean(). We take the mean of the dependent variable.

We can specify more than one function in the last argument:

ddply(anes, .(gender), summarise,
         mean = mean(democrat, na.rm = T),
         median = median(democrat, na.rm = T))
##   gender   mean median
## 1   Male 0.5039      1
## 2 Female 0.5453      1
## 3   <NA>    NaN     NA

We can also find the mean by more than one variable. For example, we can find mean democratic identification by gender and year (say for years 1970-2000 to reduce the output):

anes_sub <- subset(anes, year %in% 1970:1990)
ddply(anes_sub, .(year, gender), summarise, mean = mean(democrat, na.rm = T))
##    year gender   mean
## 1  1970   Male 0.5437
## 2  1970 Female 0.5447
## 3  1972   Male 0.4845
## 4  1972 Female 0.5336
## 5  1974   Male 0.4970
## 6  1974 Female 0.5278
## 7  1976   Male 0.4941
## 8  1976 Female 0.5230
## 9  1978   Male 0.5215
## 10 1978 Female 0.5457
## 11 1980   Male 0.4870
## 12 1980 Female 0.5479
## 13 1982   Male 0.5032
## 14 1982 Female 0.5912
## 15 1984   Male 0.4549
## 16 1984 Female 0.4960
## 17 1986   Male 0.4730
## 18 1986 Female 0.5288
## 19 1988   Male 0.4263
## 20 1988 Female 0.5043
## 21 1990   Male 0.4966
## 22 1990 Female 0.5322

4.2 Categorical IV, Continuous DV

When we have a categorical independent variable and a continuous dependent variable, finding conditional means using ddply() again is useful. Take for example the relationship between income and the democratic feeling thermometer:

ddply(anes, .(income), summarise, mean = mean(demtherm, na.rm = T),
                                  median = median(demtherm, na.rm = T))
##                                    income  mean median
## 1 0. DK; NA; refused to answer; no Pre IW 60.17     60
## 2                   1. 0 to 16 percentile 67.67     70
## 3                  2. 17 to 33 percentile 64.58     65
## 4                  3. 34 to 67 percentile 59.88     60
## 5                  4. 68 to 95 percentile 55.92     60
## 6                 5. 96 to 100 percentile 50.84     50
## 7                                    <NA>   NaN     NA

4.3 Continuous IV and DV

The relationship between two continuous variables is most commonly investigated using scatter plots (see graphing section below). As a complement, you may want to find the Pearson correlation between the two variables. Let’s find the correlation between age and demtherm (after fixing age):

sort(unique(anes$age))            #age variable has 0-year olds 
##  [1]  0 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
## [24] 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61
## [47] 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84
## [70] 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
anes$age[anes$age == 0] <- NA     #get rid of 0-year olds

# Pearson correlation
cor(anes$age, anes$demtherm, use = "complete.obs")
## [1] 0.08352

4.4 Exercises

  1. Create a crosstab of south and the 3-point party ID variable, separately for 1952 and 2008. (Recode the 7-point party ID variable if you haven’t already.) The crosstab should be able to tell you to what extent people in the south are Democrat, Independent, and Republican (and likewise for the non-south). How did this change between 1952 and 2008?

  2. Now create a Democrat dummy variable from the party ID variable. The variable should equal 1 if the respondent (weakly) identifies with the Democratic party and 0 if the respondent is Republican or (purely) Independent. Find the mean of this variable for people in the south and non-south using ddply(), again for years 1952 and 2008. Compare with the results from Exercise 1.

  3. Use ddply() to find mean Democratic identification, by South and year—that is, by whether the respondent lives in the south for every year in the data set. Use the dummy variable from Exercise 2.

  4. Use ddply() to find both mean Democratic identification and the standard deviation of this variable, by South and year. Use ddply() only once.


5 Graphing with ggplot

Let’s get started with ggplot2, a very powerful graphic package for R. What follows is not a comprehensive survey of every possible graphing technique with ggplot2—rather it provides three examples aimed to get you started using the package. See the end of this section for more resources.


5.1 Mean Democratic Identification, by Gender and Year

Say we wanted to plot the proportion of ANES respondents who say they identify with the Democratic party, separately for men and women and for each year in the dataset.

First, let’s use ddply(), where the variables to split the data by are year and gender, and where the democrat dummy variable we created above is the dependent variable of interest:

t <- ddply(anes, .(year, gender), summarise, mean = mean(democrat, na.rm = T))
tail(t)
##    year gender   mean
## 55 2002   Male 0.4501
## 56 2002 Female 0.5025
## 57 2004   Male 0.4462
## 58 2004 Female 0.5369
## 59 2008   Male 0.5520
## 60 2008 Female 0.6336

Now create a very basic plot with year on the x-axis, proportion who identify with the democratic party on the y-axis, and a color scheme that differentiates women from men:

ggplot(t, aes(x = year, y = mean, color = gender)) +
    geom_point() 

plot of chunk unnamed-chunk-23

The code above consists of two functions added together using +:

  1. ggplot() specifies
    • the data frame we want to use for plotting (in this case t),
    • the aesthetic attributes of the plot (in this case x, y, and color) specified inside aes(). Note that these attributes reference different column names in t.
  2. geom_point() tells R to add points.

The plot can be improved by adding lines:

ggplot(t, aes(x = year, y = mean, color = gender)) +
    geom_point() +
    geom_line()

plot of chunk unnamed-chunk-24

Change x- and y-labels and add a black and white theme:

ggplot(t, aes(x = year, y = mean, color = gender)) +
    geom_point() +
    geom_line() +
    xlab("") +
    ylab("Proportion Identifying as Democrat") +
    theme_bw()

plot of chunk unnamed-chunk-25

Change the legend title, points/lines colors, and x-axis breaks:

ggplot(t, aes(x = year, y = mean, color = gender)) +
    geom_point() +
    geom_line() +
    xlab("") +
    ylab("Proportion Identifying as Democrat") +
    theme_bw() +
    scale_colour_manual("Respondent's\nGender", values = c("black", "red")) +
    scale_x_continuous(breaks = seq(1950, 2010, by = 10))

plot of chunk unnamed-chunk-26


5.2 Democratic Thermometer Density Plots

Basic density plot of demtherm by gender:

ggplot(anes, aes(demtherm, fill = gender)) +
    geom_density()

plot of chunk unnamed-chunk-27

Change the sensitivity of the density estimates (adjust = 2), make fill colors transparent (alpha = 0.3), and remove the black color outlines (colour = NA):

ggplot(anes, aes(demtherm, fill = gender)) +
    geom_density(alpha = 0.3, adjust = 2, colour = NA)

plot of chunk unnamed-chunk-28

Add a few more adjustments:

ggplot(anes, aes(demtherm, fill = gender)) +
    geom_density(alpha = 0.3, adjust = 2, colour = NA) +
    xlab("Democratic Thermometer") +
    ylab("Density") +
    scale_fill_manual(values = c("black", "darkblue")) +
    theme_bw()

plot of chunk unnamed-chunk-29

Same density plot for eight years in the data set using facet_wrap(~year):

years <- seq(1980, 2008, by = 4)
anes_sub <- subset(anes, year %in% years)

ggplot(anes_sub, aes(demtherm, fill = gender)) +
    geom_density(alpha = 0.3, adjust = 1.8, colour = NA) +
    xlab("Democratic Thermometer") +
    ylab("Density") +
    scale_fill_manual(values = c("black", "darkblue")) +
    theme_bw() +
    facet_wrap(~year)

plot of chunk unnamed-chunk-30


5.3 Scatter Plots

Here’s a scatter plot examining the relationship between a respondent’s age and feeling thermometer for the democratic party in 2004 and 2008:

anes_sub <- subset(anes, year %in% c(2004, 2008))

ggplot(anes_sub, aes(x = age, y = demtherm)) +
    geom_point(size = 0.6) +
    xlab("Respondent's Age") +
    ylab("Feeling Thermometer for Democratic Party") +
    theme_bw() +
    facet_wrap(~year)

plot of chunk unnamed-chunk-31

Add LOESS smoothers:

ggplot(anes_sub, aes(x = age, y = demtherm)) +
    geom_point(size = 0.6) +
    xlab("Respondent's Age") +
    ylab("Feeling Thermometer for Democratic Party") +
    theme_bw() +
    facet_wrap(~year) +
    geom_smooth(method = "loess")

plot of chunk unnamed-chunk-32

Note that the clustering of points in horizontal lines has to do with respondents tending to choose even values of the democratic feeling thermometer (calling into question whether it is entirely appropriate to treat this variable as continuous).


5.4 Online Resources

Here are a few additional resources to consult for graphing with ggplot2:

If you need help with something more specific, try googling it! There are a lot of online resources for getting help with ggplot.


5.5 Exercises

  1. Create a scatter plot of age and the democratic feeling thermometer (demtherm) for 1978.
  2. Same thing as in Exercise 1 but for each of years 1978, 1986, 1994, and 1998. Use facet_wrap(). Inlcude LOESS smoothers.
  3. Go back to Exercise 3 above where you found mean Democratic identification by South and year. Use ggplot to graph these estimates. Time should be on the x-axis, the mean estimates on the y-axis, and the points should be color coded according to the South variable. Use lines to connect the points. In one sentence, describe what you see.
  4. Create a histogram of the demtherm variable using ggplot. We didn’t go through exactly how to do this above, so consult online resources. In one sentence, describe what the histogram is telling you.

6 For Loops

Next week we’ll start using Monte Carlo simulations to demonstrate properties of different estimators. To implement such simulations, we’ll need to use for loops.

For loops can be used for other purpose as well. However, compared to vectorization, for loops in R are relatively slow.

To learn to use for loops, let’s start with a very simple example. Let’s say you had the following vector v with individuals’ height in centimeters:

v1 <- c(175, 182, 150, 187, 165)

We can convert the values in v1 from centimeters to meters using a for loop:

v2 <- rep(NA, 5)               #create vector v2 with NA values
for(i in 1:5) {                #loop over elements in v1 and store in v2
    v2[i] <- v1[i] / 100
}
v2                             #v2 after the for loop
## [1] 1.75 1.82 1.50 1.87 1.65

Note that we could have done this using a vectorized function, which is faster (when we have a lot of data) and more compact:

v2 <- v1 / 100

However, this example illustrates some points about for loops that will be useful later:

  1. Begin by creating an object that can store the results of your for loop. In the example above, we created v2 for this purpose. With vectors, we need to specify how many elements we want to eventually store, in this case 5. (This is not true if you wanted to store the results in a list.)

  2. The basic structure of the loop usually is:

for(i in 1:n) {
    #commands to execute
}

Here ‘n’ represents the number of times you want to iterate the loop. The loop will run from 1 to n by an integer count. If you instead wanted the loop to iterate from 1 to n but skip every other number you could use seq(1, n, by = 2) in place of 1:n.

3. Within the for loop we want to save the output of each iteration to an element of the vector (or list) we created initially (in this case v2).

Here’s a more general approach accomplishing the same thing, but where we keep the number of iterations flexible depending on how many elements v1 contains:

v1 <- c(175, 182, 150, 187, 165)
n <- length(v1)
v2 <- rep(NA, n)
for(i in 1:n) {
    v2[i] <- v1[i] / 100
}
v2
## [1] 1.75 1.82 1.50 1.87 1.65

Of course, you can store outputs from the for loop in a vector within a data frame. Say we had the following data frame with names and heights:

ppl <- data.frame(person = letters[1:5], height_cm = v1)
ppl
##   person height_cm
## 1      a       175
## 2      b       182
## 3      c       150
## 4      d       187
## 5      e       165

Let’s add a variable that expresses height in inches instead:

ppl$height_inch <- NA                                     #add variable of NAs
n <- nrow(ppl)                                            #get number of observations to loop over
for(i in 1:n){
    ppl$height_inch[i] <- ppl$height_cm[i] * 0.393701
}
ppl
##   person height_cm height_inch
## 1      a       175       68.90
## 2      b       182       71.65
## 3      c       150       59.06
## 4      d       187       73.62
## 5      e       165       64.96

Note that when adding a constant or NA values to a vector within a data frame, R (correctly) assumes that you want to add this constant to every element of the variable, so you don’t need to specify how many times you want to add NA in ppl$height_inch <- NA.


6.1 For Loops for Repeated Sampling

To preview what’s to come, let’s use a for loop to draw repeated samples from a population. In the process, we’ll illustrate a good property about means.

Say we have a population of 10 individuals with the following heights:

v <- c(175, 182, 150, 187, 165, 177, 200, 198, 157, 165)
mean(v)    #population mean
## [1] 175.6

Unfortunately, for whatever reason, we do not know the heights of all of these individuals. We can only (randomly) sample 5 of them. From this random sample of five individuals we estimate the height of all 10 individuals. We can draw a sample of 5 from v and take the mean of this sample using the following code:

v <- c(175, 182, 150, 187, 165, 177, 200, 198, 157, 165)
smpl <- sample(v, 5)
mean(smpl)
## [1] 169.4

Would we on average expect to estimate the mean of the population accurately? Let’s use a Monte Carlo simulation to find out. We’ll draw 10,000 random samples of five from v and take the mean of each of these samples. With an unbiased estimator we would, on average, expect the sample estimand (in this case the mean) to equal the population parameter of interest (175.6).

n <- 10000
smpl_means <- rep(NA, n)
for(i in 1:n){
    smpl <- sample(v, 5)
    smpl_means[i] <- mean(smpl)
}

mean(smpl_means)
## [1] 175.6

The mean of the sample means (175.6487) is very close to the population mean (175.6)—on average, we’re accurately estimating the population mean with our random sample of five individuals.

Note, though, that in some cases our estimate is quite far from the population mean. To illustrate this, we can plot a histogram of the sample means:

ggplot(data.frame(smpl_means), aes(x = smpl_means)) +
    geom_histogram(binwidth = 2) +
    geom_vline(xintercept = mean(v), color = "red", linetype = 2) +
    xlab("Sample mean (n = 5)") +
    ylab("Number of samples") +
    theme_bw()

plot of chunk unnamed-chunk-43

The dashed red line indicates the population mean. While our sample estimates are centered around this mean (good news!), the range of the estimates is quite large. In fact, about 10% of the time we’d get an estimate of the mean that is either almost 9 centimeters (3.5 inches) below the actual mean or almost 9 centimeters above the actual mean:

quantile(smpl_means, probs = c(0.05, 0.95))
##    5%   95%
## 166.8 184.4

6.2 Exercises

  1. Use a for loop to take the square root of each value in the following vector: vec1 <- c(4, 9, 81, 100, 1000, 10^6). Save the results to a new vector called vec2.

  2. Monte Carlo Simulation: Imagine that the values in the vector pop below represent vote shares for a presidential candidate across the 3,144 counties in the United States. If we were to take a sample of 50 counties and estimate mean support for the presidential candidate, would we, on average, estimate the vote share across all counties accurately? (Don’t worry about the fact that we really should be weighing counties by their population size to estimate overall support.) Draw 10,000 samples of 50 counties from pop and estimate mean support for each sample, saving each mean estimate into a vector called smpl_means. How does the mean of the sample means compare to the population mean? Do we, on average, do a good job of estimating the population mean?

pop <- runif(n = 3144, min = 0, max = 1)