Simon Ejdemyr January, 2016

For Loops


Contents
Summary For loops can be useful when you want to iterate a process in R — e.g., run a simulation. This tutorial explains how to write for loops and shows how to use them to run Monte Carlo simulations. For loops are neat, but it’s worth emphasizing that you should avoid them and instead use vectorization — which is much faster — when possible.

Writing a for loop

Let’s start with a very simple example. Let’s say you have the following vector v1 with individuals’ height in centimeters:

v1 <- c(175, 182, 150, 187, 165)

We can convert the values in v1 from centimeters to meters using a for loop:

v2 <- rep(NA, 5)               #create vector v2 with NA values 
for(i in 1:5) {                #loop over elements in v1 and store in v2  
    v2[i] <- v1[i] / 100
}
v2                             #v2 after the for loop
## [1] 1.75 1.82 1.50 1.87 1.65

Note that we could have done this using vectorization, which is more compact and faster (when we have a lot of data):

v2 <- v1 / 100

However, the example illustrates the following points about writing a for loop:

  1. Begin by creating an object that can store the results of your for loop. In the example above, we created v2 for this purpose. With vectors, we need to specify how many elements we want to eventually store, in this case 5. (This is not true if you wanted to store the results in a list.)

  2. The basic structure of the loop usually is:

for(i in 1:n) {
    #commands to execute for iteration i
}

Here n represents the number of times you want to iterate the loop. The loop will run from 1 to n by an integer count. If you instead wanted the loop to iterate from 1 to n but skip every other number you could use seq(1, n, by = 2) in place of 1:n.

3. Within the for loop we want to save the output of each iteration to an element of the vector (or list) we created initially (in this case v2).

Here’s a more general approach accomplishing the same thing, but where we keep the number of iterations flexible depending on how many elements v1 contains:

v1 <- c(175, 182, 150, 187, 165)
n <- length(v1)
v2 <- rep(NA, n)
for(i in 1:n) {
    v2[i] <- v1[i] / 100
}
v2
## [1] 1.75 1.82 1.50 1.87 1.65

Of course, you can store outputs from the for loop in a vector within a data frame. Say we had the following data frame with names and heights:

ppl <- data.frame(person = letters[1:5], height_cm = v1)
ppl
##   person height_cm
## 1      a       175
## 2      b       182
## 3      c       150
## 4      d       187
## 5      e       165

Let’s add a variable that expresses height in inches instead:

ppl$height_inch <- NA                                     #add variable of NAs
n <- nrow(ppl)                                            #get number of observations to loop over
for(i in 1:n){
    ppl$height_inch[i] <- ppl$height_cm[i] * 0.393701
}
ppl
##   person height_cm height_inch
## 1      a       175    68.89768
## 2      b       182    71.65358
## 3      c       150    59.05515
## 4      d       187    73.62209
## 5      e       165    64.96067

Note that when adding a constant or NA values to a vector within a data frame, R (correctly) assumes that you want to add this constant to every element of the variable, so you don’t need to specify how many times you want to add NA in ppl$height_inch <- NA.

An appliction

For loops can be used to carry out Monte Carlo simulations. In the example below, we’ll draw repeated samples from a population, calculate the mean for each sample, and test whether we on average do a good job of estimating the population mean.

Say the population consists of 10 individuals with the following heights:

v <- c(175, 182, 150, 187, 165, 177, 200, 198, 157, 165)
mean(v)    #population mean
## [1] 175.6

Unfortunately, for whatever reason, we do not know the heights of all of these individuals. We can only (randomly) sample 5 of them. From this random sample of five individuals we estimate the height of all 10 individuals. We can draw a sample of 5 from v and take the mean of this sample using the following code:

v <- c(175, 182, 150, 187, 165, 177, 200, 198, 157, 165)
smpl <- sample(v, 5)
mean(smpl)
## [1] 173.8

Would we on average expect to estimate the mean of the population accurately? Let’s use a Monte Carlo simulation to find out. We’ll draw 10,000 random samples of five from v and take the mean of each of these samples. With an unbiased estimator we would, on average, expect the sample estimate to equal the population parameter of interest.

n <- 10000
smpl_means <- rep(NA, n)
for(i in 1:n){
    smpl <- sample(v, 5)
    smpl_means[i] <- mean(smpl)
}

mean(smpl_means)
## [1] 175.674

The mean of the sample means (175.67396) is very close to the population mean (175.6): on average, we’re accurately estimating the population mean with our random sample of five individuals.

Note, though, that in some cases our estimate is quite far from the population mean. To illustrate this, we can plot a histogram of the sample means:

require(ggplot2)
ggplot(data.frame(smpl_means), aes(x = smpl_means)) +
    geom_histogram(binwidth = 2) +
    geom_vline(xintercept = mean(v), color = "red", linetype = 2) +
    xlab("Sample mean (n = 5)") +
    ylab("Number of samples") + 
    theme_bw()

The dashed red line indicates the population mean. While our sample estimates are centered around this mean (good news!), the range of the estimates is quite large. In fact, about 10% of the time we’d get an estimate of the mean that is either almost 9 centimeters (3.5 inches) below the actual mean or almost 9 centimeters above the actual mean:

quantile(smpl_means, probs = c(0.05, 0.95))
##    5%   95%
## 166.8 184.4

Exercises

  1. Use a for loop to take the square root of each value in the following vector: vec1 <- c(4, 9, 81, 100, 1000, 10^6). Save the results to a new vector called vec2.

  2. Monte Carlo Simulation: Imagine that the values in the vector pop below represent vote shares for a presidential candidate across the 3,144 counties in the United States. If we were to take a sample of 50 counties and estimate mean support for the presidential candidate, would we, on average, estimate the vote share across all counties accurately? (Don’t worry about the fact that we really should be weighing counties by their population size to estimate overall support.) Draw 10,000 samples of 50 counties from pop and estimate mean support for each sample, saving each mean estimate into a vector called smpl_means. How does the mean of the sample means compare to the population mean? Do we, on average, do a good job of estimating the population mean?

pop <- runif(n = 3144, min = 0, max = 1)