In Tutorial 2 we learned to run for loops of the following type:
v1 <- c(175, 182, 150, 187, 165)
v2 <- rep(NA, 5)
for(i in 1:5) {
v2[i] <- v1[i] / 100
}
v2
## [1] 1.75 1.82 1.50 1.87 1.65
Here, we stored the output into a new vector v2
. I’ve included examples of more involved for loops with vector outputs in the exercises below.
While storing the output of for loops in vectors works for many purposes, sometimes we want to use lists instead of vectors. Lists are more flexible than vectors in two ways:
In the examples below, I use ANES data with only three variables, available here.
dat <- read.csv("nesexample2.csv") #(after setting my working directory)
head(dat)
## therm.unions therm.dems year
## 1 97 97 1980
## 2 85 60 1980
## 3 NA 85 1980
## 4 NA 70 1980
## 5 NA 60 1980
## 6 60 85 1980
Say we wanted to create a separate data frame for every year in this dataset. First find the years in the dataset (sorted) and store them into a vector:
years <- sort(unique(dat$year))
years
## [1] 1980 1984 1986 1988 1990 1992 1994 1996 1998 2000 2004 2008
We can then run a loop that stores a data frame for each year in a list:
l <- list() #create an empty list
n <- length(years) #define number of iterations of loop
for(i in 1:n){
l[[i]] <- subset(dat, year == years[i])
}
Many things here are similar to how we run for loops with vector outputs. Notice two differences:
[[ ]]
) to call a specific element of a list. With vectors, we use single brackets ([ ]
).l <- list()
. (Of course, you can call the list something other than l
.)Let’s look at the first few rows of a few elements in l
:
head(l[[1]])
## therm.unions therm.dems year
## 1 97 97 1980
## 2 85 60 1980
## 3 NA 85 1980
## 4 NA 70 1980
## 5 NA 60 1980
## 6 60 85 1980
head(l[[5]])
## therm.unions therm.dems year
## 7235 60 40 1990
## 7236 NA 85 1990
## 7237 97 70 1990
## 7238 85 70 1990
## 7239 NA 50 1990
## 7240 97 50 1990
head(l[[12]])
## therm.unions therm.dems year
## 18661 15 40 2008
## 18662 40 30 2008
## 18663 50 85 2008
## 18664 40 30 2008
## 18665 40 50 2008
## 18666 15 30 2008
The function lapply()
conveniently lets us operate on each element of the list. For example, we can get the dimensions of each data frame in l
like this:
lapply(l, dim)
## [[1]]
## [1] 1386 3
##
## [[2]]
## [1] 1917 3
##
## [[3]]
## [1] 2163 3
##
## [[4]]
## [1] 1768 3
##
## [[5]]
## [1] 1965 3
##
## [[6]]
## [1] 2246 3
##
## [[7]]
## [1] 1791 3
##
## [[8]]
## [1] 1533 3
##
## [[9]]
## [1] 1278 3
##
## [[10]]
## [1] 1552 3
##
## [[11]]
## [1] 1061 3
##
## [[12]]
## [1] 2092 3
Using a for loop, create a new variable in dat
called demtherm_scaled
that is a transformation of therm.dems
—it should range between 0 and 0.97 instead of 0 and 97. As a reminder to yourself, also use a vectorized function (no for loop) to create this variable. Using vectorized functions when you can is preferred; the for loop is just for practice. (No need to use a list here.)
Use a for loop to grab every other element of therm.unions
in dat
. Store the output into a new vector called thu2
. That is, if the first 6 elements of therm.unions
are 97 85 NA NA NA 60
, then the first 3 elements of thu2
should be 97 NA NA
. Output the first 6 elements of thu2
and confirm that it has half the number of elements as dat$therm.unions
.
Using a for loop, store (in a list) one scatterplot of therm.unions
versus therm.dems
for each year in dat
. Each scatter plot should include a LOESS smoother. Graph the first, fourth, and tenth element of the list. (In your write-up, include these graphs.)
Let’s begin by a class exercise based on Tuesday’s lecture. The exercise is meant to illustrate how well we do when we estimate a population mean with a sample mean. Let’s first introduce some notation.
Let \(\mu\) represent the population mean and \(\bar{x}\) the sample mean of a variable \(X\). We calculate the sample mean using
\(\frac{1}{n} \displaystyle \sum_{i=1}^n x_i\)
where \(x_i\) is each value in variable we sampled (that is, \(X\)).
We calculate confidence intervals around the sample mean using
\(\bar{x} \pm z \cdot \frac{s}{\sqrt{n}}\)
where \(s\) is the standard deviation of the sampled variable of interest, \(n\) is the number of observations in the sample, and \(z\) is the z-score associated with the area under the standard-normal distribution we want to use for a given level of confidence. The standard deviation is calculated using
\(s = \sqrt{\frac{\sum^n_{n=1}(x_i - \bar{x})^2}{n}}\)
The z-score depends on what level of confidence we want. The standard level of confidence in the social sciences is 95%. We can find the z-score for this level of confidence in R
:
qnorm(0.975)
## [1] 1.96
Note that we have to input the proportion of the standard-normal that we want to cover on either side of the mean. For 95% confidence, we use 0.975 in qnorm()
. For 90% confidence we would use:
qnorm(0.95)
## [1] 1.645
Note that \(\frac{s}{\sqrt{n}}\) above is usually referred to as the standard error of the sample mean.
We’ll use the same ANES data as above.
dat <- read.csv("nesexample2.csv")
Let’s work with the democratic feeling thermometer.
summary(dat$therm.dems)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 50.0 60.0 59.8 75.0 97.0 707
thd <- na.omit(dat$therm.dems) #omit NA's in and store variable in vector
length(thd) #20,045 valid answers
## [1] 20045
To illustrate how well we do when we estimate a population mean with a sample mean, we will pretend that the mean of thd
is the population mean. This is clearly not true and only used for illustrative purposes.
mean(thd)
## [1] 59.84
Now let’s draw a random sample of 900 from this variable, and calculate the sample mean and confidence interval associated with the sample mean. For illustrative purposes, we’ll construct 75% confidence intervals.
n <- 900
# Generate a sample and calculate the sample mean
smpl <- sample(thd, n)
mean(smpl)
## [1] 60.87
# Generate 75% confidence intervals
std.err <- function(x) sd(x) / sqrt(length(x))
z <- qnorm(0.875) #find z-score
mean(smpl) - z * std.err(smpl) #lower CI
## [1] 59.96
mean(smpl) + z * std.err(smpl) #upper CI
## [1] 61.78
If each of you drew a different random sample, estimated the mean, and calculated 75% confidence intervals, how many of you would have confidence intervals that cover the population mean (59.8379)? Let’s find out what happens in practice!
(Exercise to complete on your own) Now run a Monte Carlo simulation repeating the process we went through above 10,000 times. Use 95% confidence intervals. Calculate the proportion of confidence intervals that cover the population mean.
Functions come in handy for both data management and analysis. In addition, using functions will make your code cleaner. For example, above we created a function called std.err
that calculates the standard error of a sample mean, which we then used to generate confidence intervals.
thd <- na.omit(dat$therm.dems)
n <- 900
smpl <- sample(thd, n)
std.err <- function(x) sd(x) / sqrt(length(x))
z <- qnorm(0.875) #find z-score
mean(smpl) - z * std.err(smpl) #lower CI
## [1] 59.5
mean(smpl) + z * std.err(smpl) #upper CI
## [1] 61.3
The last two lines equivalently could have been written:
mean(smpl) - z * (sd(smpl) / sqrt(length(smpl)))
## [1] 59.5
mean(smpl) + z * (sd(smpl) / sqrt(length(smpl)))
## [1] 61.3
Although the real gains come when you have to execute more complex operations, the advantage of using a function is clear even here. The last two lines are longer and harder to interpret than when we used the std.err
function.
Further, the nice thing about this function is that it works for any vector that we feed the function, provided that the vector doesn’t have missing values and that it has more than one observation. To make the function work for missing values we could specify it as follows:
std.err2 <- function(x) sd(x, na.rm = T) / sqrt(na.omit(length(x)))
std.err2(dat$therm.dems)
## [1] 0.1623
# Compare with original function from above:
std.err <- function(x) sd(x) / sqrt(length(x))
std.err(dat$therm.dems) #no meaningful output for vector with NAs
## [1] NA
std.err(na.omit(dat$therm.dems)) #this works
## [1] 0.1651
To write a function in R
:
Start by thinking about what you want the input(s) and output(s) of the function to be. In the std.err
and std.err2
functions, the input is a vector x and the output is a scalar (the standard error of the sample mean).
Specify the input in general terms. Above we called it x. We didn’t call it something specific, like dat$demtherm
. That’s because we want the function to work across different vectors, not just dat$demtherm
.
The inputs go in the parentheses in function()
(e.g., function(x)
). We then specify what we want to do to the inputs. We can have more than one input. For example, here’s a function that estimates the covariance between two variables (which must have the same number of observations):
covar <- function(x, y) {
df <- na.omit(data.frame(x, y))
X <- df[, 1]
Y <- df[, 2]
covariance <- (1 / (length(X) - 1)) * sum((X - mean(X)) * (Y - mean(Y)))
return(covariance)
}
This example also illustrates that a function can have more than one line, and that, if it does, we have to specify what we want the function to return. Also note that I’m defining new objects in the function, for example df
. These objects only exist within the function, and are not saved to the workspace in R
.
Let’s double check that the function works, comparing the output with the output from R
’s built-in function (something we can’t always do):
covar(dat$therm.unions, dat$therm.dems) #our user-written function
## [1] 207.8
cov(dat$therm.unions, dat$therm.dems, use = "complete") #R's built in function
## [1] 207.8
Write your own function for finding the mean of any variable. Call it mean2
. The function should work even if the variable has missing values (NA
’s). Confirm that your function works by applying it to therm.unions
in dat
. Compare the output of your function to the output of R
s mean()
function.
Write your own function for finding the standard deviation of any variable. Call it sd2
. The function should work even if the variable has missing values (NA
’s). Confirm that your function works by applying it to therm.unions
in dat
. Compare the output of your function to the output of R
s sd()
function.
The code used in this tutorial is available here.