Creating vectors
We’ll create three types of vectors: numeric, character, and logical. Here are examples of each type:
# Numeric vectors
n1 <- 20
n2 <- c(20, 25, 60, 55)
# Character vectors
c1 <- "Blue"
c2 <- c("Red", "Green", "Purple")
# Logical vectors
l1 <- TRUE
l2 <- c(TRUE, FALSE, TRUE)
Note that vectors can consist of one or many elements. Three common ways to create vectors with more than one element are c(), seq(), and rep().
c()
As illustrated above, one very common way to create vectors with more than one element is to use c() (“concatenate”), which simply combines whatever values you specify in the parentheses.
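As a small sketch of what “combining” means here (the object names are made up for illustration), vectors passed to c() are flattened into one vector rather than nested inside each other:

```r
# Hypothetical vectors, used only for illustration
first_half <- c(20, 25)
second_half <- c(60, 55)

# c() flattens its arguments into a single vector
combined <- c(first_half, second_half, 70)
combined
## [1] 20 25 60 55 70
length(combined)
## [1] 5
```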
seq()
seq() applies to numeric vectors only:
n1 <- seq(from = 0, to = 10, by = 2) #using 'by'
n1
## [1] 0 2 4 6 8 10
n2 <- seq(from = 0, to = 10, length.out = 5) #using 'length.out'
n2
## [1] 0.0 2.5 5.0 7.5 10.0
n3 <- seq(1, 2, 0.1) #no argument names specified; automatically uses 'from', 'to', 'by'
n3
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
n4 <- 1:5 #shortcut for integer sequence; same values as 'seq(1, 5, 1)'
n4
## [1] 1 2 3 4 5
seq() typically takes three arguments: a starting value, an end value, and an increment (“by”), which can be replaced with the desired number of elements (“length.out”). Integer sequences can be created using a colon.
General note about argument names: A function’s argument names need not be specified, as illustrated when we created n3. If they are not specified, R uses arguments based on a default order. One way to learn about this order is to use ? (e.g., ?seq). If you specify the argument names, the order doesn’t matter. Putting all this together, hopefully it’s obvious why seq(by = -1, from = 10, to = 2) is the same as seq(10, 2, -1).
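One way to check this claim yourself is to store both results and compare them with identical(). This is only a sketch; the object names by_name and by_position are made up for illustration:

```r
by_name <- seq(by = -1, from = 10, to = 2)  # names given, order shuffled
by_position <- seq(10, 2, -1)               # no names, default order (from, to, by)

# identical() returns TRUE only if the two objects are exactly the same
identical(by_name, by_position)
## [1] TRUE
```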
rep()
Vectors can also be created using rep(). As the name implies, this function is useful if you want to repeat an element or elements.
rep(1, 5)
## [1] 1 1 1 1 1
rep("blue", 3)
## [1] "blue" "blue" "blue"
rep(TRUE, 4)
## [1] TRUE TRUE TRUE TRUE
As should be obvious, the first parameter in the function specifies the element to repeat, and the second the number of times to repeat it.
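The examples above rely on argument order. As a hedged aside, rep() also accepts the named arguments times and each, which control whether the whole vector or each individual element is repeated. The object names below are hypothetical:

```r
# 'times' repeats the whole vector; 'each' repeats every element in turn
whole <- rep(c("blue", "red"), times = 2)
whole
## [1] "blue" "red" "blue" "red"
element_wise <- rep(c("blue", "red"), each = 2)
element_wise
## [1] "blue" "blue" "red" "red"
```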
Using more than one function
Perhaps the most powerful use of these functions comes from combining them. Here are two examples:
rep(c("blue", "red"), 3)
## [1] "blue" "red" "blue" "red" "blue" "red"
c(rep(seq(0, 6, 2), 2), 4:1)
## [1] 0 2 4 6 0 2 4 6 4 3 2 1
The second example is somewhat hard to follow, and is probably at the limit of complexity in terms of how many functions we want to combine. Separating a task into multiple lines of code can help.
s <- rep(seq(0, 6, 2), 2)
c(s, 4:1)
## [1] 0 2 4 6 0 2 4 6 4 3 2 1
Subsetting vectors
Extracting a subset of elements from a vector is an extremely important task, not least because it generalizes nicely to datasets (which are at the heart of data science). This process — whether applied to a vector or a dataset — is often referred to as “taking a subset”, “subsetting”, or “filtering”. If there is one skill you need to master as quickly as possible, it’s this.
In R, there are three ways to filter a vector: using a separate logical vector, using indexing, and using names. I tend to use the first method most, but all three are useful.
Subsetting with logicals
Let’s jump right into an example. Say we have a character vector with only two elements (“apple” and “banana”). Subsetting it to “apple” could be done like so:
fruits <- c("apple", "banana")
fruits[c(TRUE, FALSE)]
## [1] "apple"
Note the use of brackets, []; this is common when filtering. Within these brackets is a logical vector with the same number of elements as the vector you want to subset. Elements across the two vectors are matched by order: elements that match with TRUE are kept while elements that match with FALSE are dropped.
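To make the matching by order concrete, here is a minimal sketch with a three-element vector. The names my_colors and keep are made up for illustration:

```r
my_colors <- c("red", "green", "purple")  # hypothetical example vector
keep <- c(TRUE, FALSE, TRUE)              # one logical value per element, in order

# Position 1 and 3 match TRUE and are kept; position 2 matches FALSE and is dropped
my_colors[keep]
## [1] "red" "purple"
```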
This process is extremely useful when combined with a logical operation. Please familiarize yourself with the logical operations listed here. For example, using a logical operation we can filter a large vector of oranges, apples and bananas:
# Create a vector with 30 fruits
fruits <- rep(c("orange", "apple", "banana"), 10)
fruits
## [1] "orange" "apple" "banana" "orange" "apple" "banana" "orange"
## [8] "apple" "banana" "orange" "apple" "banana" "orange" "apple"
## [15] "banana" "orange" "apple" "banana" "orange" "apple" "banana"
## [22] "orange" "apple" "banana" "orange" "apple" "banana" "orange"
## [29] "apple" "banana"
# Create a logical vector for dropping bananas
# Note: I'm creating the exact same logical vector three times (overwriting it each time)
# This is for illustrative purposes; using one of these is sufficient
lv <- fruits == "orange" | fruits == "apple"
lv <- fruits != "banana"
lv <- fruits %in% c("orange", "apple")
lv
## [1] TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
## [12] FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE
## [23] TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE
# Carry out the subset
fruits[lv]
## [1] "orange" "apple" "orange" "apple" "orange" "apple" "orange"
## [8] "apple" "orange" "apple" "orange" "apple" "orange" "apple"
## [15] "orange" "apple" "orange" "apple" "orange" "apple"
We applied the same logic as above: We have a vector (fruits) that we want to subset. We do so using a logical vector (lv), where elements that match with TRUE are kept. The only difference here is that we create the logical vector with a logical operation. The logical operators (e.g., != and |) used here are discussed in the link above, with the exception of %in%.
General note about %in%: This operator is extremely useful as an alternative to repeated “or” (|) statements. For example, say you have a vector with 10 types of fruits and you want to keep elements that are equal to “orange”, “apple”, “mango”, “mandarin”, or “kiwi”. You could accomplish this by creating a logical vector like so: lv <- fruits == "orange" | fruits == "apple" | fruits == "mango" | fruits == "mandarin" | fruits == "kiwi".
What a nightmarishly long statement compared to the %in% option that accomplishes the exact same thing: lv <- fruits %in% c("orange", "apple", "mango", "mandarin", "kiwi").
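If you want to convince yourself that the two approaches agree, a small check along these lines may help. The vector basket and its contents are made-up example data, not part of the examples above:

```r
basket <- c("orange", "kiwi", "pear", "mango", "plum")  # hypothetical data
keepers <- c("orange", "apple", "mango", "mandarin", "kiwi")

# The long version with repeated "or" statements
lv_or <- basket == "orange" | basket == "apple" | basket == "mango" |
  basket == "mandarin" | basket == "kiwi"
# The compact version with %in%
lv_in <- basket %in% keepers

identical(lv_or, lv_in)
## [1] TRUE
basket[lv_in]
## [1] "orange" "kiwi" "mango"
```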
Of course, subsetting using logicals can also be done on numeric vectors. Here are a few examples:
# Create a numeric vector
numbers <- seq(0, 100, by = 10)
numbers
## [1] 0 10 20 30 40 50 60 70 80 90 100
# Illustrate three different filters
numbers[numbers <= 50 & numbers != 30]
## [1] 0 10 20 40 50
numbers[numbers == 0 | numbers == 100]
## [1] 0 100
numbers[numbers > 100] #returns an empty vector
## numeric(0)
Note that I didn’t create logical objects to carry out the subsets here, as opposed to above where we explicitly defined lv. I find it more compact and intuitive to take subsets without first creating a logical vector.
Subsetting using indexing
A different way to subset a vector is to specify the index or indices you want to keep, again using brackets. Here are a few examples:
fruits <- c("apple", "banana")
fruits[1]
## [1] "apple"
fruits <- rep(c("orange", "apple", "banana"), 10)
fruits[c(10, 20)]
## [1] "orange" "apple"
fruits[seq(1, 30, by = 5)]
## [1] "orange" "banana" "apple" "orange" "banana" "apple"
I sometimes use this when I want to inspect or modify an element that I know occurs at a specific index in the vector, a more manual approach than using logical statements.
Subsetting using indexing can also be used in random sampling, which has many important applications (for example, in experiments and when you want to test-run code on a representative subset of your data). So, let’s introduce the sample() function:
# Draw 10 elements at random from 1 to 100
sample(1:100, size = 10)
## [1] 32 60 48 58 90 50 72 100 62 46
The function takes a vector of values (often successive integer values) and an argument that specifies how many values to draw at random from this vector. We can use the resulting values as indices to subset another vector:
fruits <- rep(c("orange", "apple", "banana"), 10)
fruits[sample(1:30, size = 5)]
## [1] "orange" "apple" "orange" "orange" "orange"
Here, we’re drawing a random sample of five elements from the vector fruits. Why did I specify 1:30? Well, fruits consists of 30 elements, so specifying something like 1:100 likely would have resulted in sampled values outside the bounds of the vector (e.g., fruits[35] doesn’t exist, so R returns NA for it). Specifying 1:30 gives every element in fruits an equal chance of being included in the sample.
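The following sketch shows what an out-of-bounds index returns and checks that indices drawn from 1:30 stay within the vector. The call to set.seed() with the arbitrary value 123 is my addition; it only makes the random draw repeatable, and the name idx is hypothetical:

```r
fruits <- rep(c("orange", "apple", "banana"), 10)

# An index beyond the 30 elements gives a missing value rather than an error
fruits[35]
## [1] NA

set.seed(123)                   # arbitrary seed so the same draw can be repeated
idx <- sample(1:30, size = 5)   # five indices, all guaranteed to be within 1 to 30
all(idx >= 1 & idx <= 30)
## [1] TRUE
length(fruits[idx])
## [1] 5
```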
Subsetting using names
Lastly, we can assign names to each element in a vector and take a subset based on the names.
age <- c(50, 55, 80)
names(age) <- c("mom", "dad", "grandpa")
age #note that values now have names
## mom dad grandpa
## 50 55 80
age[c("dad", "grandpa")] #subset
## dad grandpa
## 55 80
That is, we have a vector representing the age of three family members. We assign names to each value, and then keep the values associated with two of the family members.
Modifying vectors
The subsetting logic from above can be used to modify vectors. The idea here is that instead of keeping elements that meet a logical condition or occur at a specific index, we can change them. For example, what if we had mis-entered grandpa’s age above? We can fix it using indexing, a logical statement, or naming.
# Recreate vector with age values from above
age <- c(50, 55, 80)
names(age) <- c("mom", "dad", "grandpa")
# Three ways of changing grandpa's age
# Note: you'd only need to use one of these
age[age == 80] <- 82 #using a logical statement
age[3] <- 82 #using indexing
age["grandpa"] <- 82 #using naming
age
## mom dad grandpa
## 50 55 82
A logical statement is most efficient when we need to change a lot of elements.
fruits <- rep(c("orange", "apple", "bamama"), 5)
fruits #bamamas anyone?
## [1] "orange" "apple" "bamama" "orange" "apple" "bamama" "orange"
## [8] "apple" "bamama" "orange" "apple" "bamama" "orange" "apple"
## [15] "bamama"
# Let's fix the misspelled element
fruits[fruits == "bamama"] <- "banana"
fruits
## [1] "orange" "apple" "banana" "orange" "apple" "banana" "orange"
## [8] "apple" "banana" "orange" "apple" "banana" "orange" "apple"
## [15] "banana"
Vector arithmetic
We can modify or create new numeric vectors using arithmetic operations. Three common types of operations involve:
- A vector with more than one element and a vector with only one element.
- Two vectors with the same number of elements. Elements are matched based on index.
- A vector modified by a function.
In all cases, we can modify all elements of a vector or only a subset of elements using the bracket notation we learned above.
numbers <- 1:10
numbers
## [1] 1 2 3 4 5 6 7 8 9 10
# One value modifying all values in a vector
numbers <- numbers / 10
numbers
## [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
# One value modifying a subset of a vector
numbers[numbers > 0.5] <- numbers[numbers > 0.5] * 100
numbers
## [1] 0.1 0.2 0.3 0.4 0.5 60.0 70.0 80.0 90.0 100.0
# Two vectors with the same number of elements
numbers1 <- 1:10
numbers2 <- 10:1
numbers3 <- numbers2 - numbers1
numbers3
## [1] 9 7 5 3 1 -1 -3 -5 -7 -9
# Replacing a subset of a vector using another vector
numbers <- 1:10
numbers[numbers > 5] <- 5:1
numbers
## [1] 1 2 3 4 5 5 4 3 2 1
# Modify a vector (or a subset of a vector) using a function
numbers <- 1:10
sqrt(numbers) #square root
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
## [8] 2.828427 3.000000 3.162278
exp(numbers) #exponentiate
## [1] 2.718282 7.389056 20.085537 54.598150 148.413159
## [6] 403.428793 1096.633158 2980.957987 8103.083928 22026.465795
log(numbers[c(1, 5, 10)]) #natural log
## [1] 0.000000 1.609438 2.302585
Vector arithmetic can also be carried out in R on two multi-value vectors with different numbers of elements. Such operations use the recycling rule.
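As a brief sketch of the recycling rule (the names long_vec and short_vec are made up), the shorter vector is repeated until it is as long as the longer one. R gives a warning if the longer length is not a multiple of the shorter one:

```r
long_vec <- 1:6
short_vec <- c(10, 20)

# short_vec is recycled as 10, 20, 10, 20, 10, 20 before the addition
recycled_sum <- long_vec + short_vec
recycled_sum
## [1] 11 22 13 24 15 26
```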
Summarizing vectors
We often want to get summary statistics from a vector — that is, learn something general about it by looking beyond its constituent elements. If we have a vector in which each element represents a person’s height, we may want to know who the shortest or tallest person is, what the median or mean height is, what the standard deviation is, etc. Here are common summary facts for vectors:
numbers <- sample(1:1000, 10)
numbers
## [1] 301 797 560 949 335 472 556 566 990 358
class(numbers) #check the class
## [1] "integer"
length(numbers) #number of elements
## [1] 10
max(numbers) #maximum value
## [1] 990
min(numbers) #minimum value
## [1] 301
sum(numbers) #sum of all values in the vector
## [1] 5884
mean(numbers) #mean
## [1] 588.4
median(numbers) #median
## [1] 558
var(numbers) #variance
## [1] 61181.16
sd(numbers) #standard deviation
## [1] 247.3482
quantile(numbers) #percentiles in intervals of .25
## 0% 25% 50% 75% 100%
## 301.00 386.50 558.00 739.25 990.00
quantile(numbers, probs = seq(0, 1, 0.1)) #percentiles in intervals of 0.1
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 301.0 331.6 353.4 437.8 522.4 558.0 562.4 635.3 827.4 953.1 990.0
summary(numbers) #function that contains many summary stats from above
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 301.0 386.5 558.0 588.4 739.2 990.0
If you forget one of these functions or if I haven’t included one here that you need, Google almost surely has the answer for you. Also note that some of the operations above, most notably class() and length(), apply to non-numeric vectors.
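For instance, here is a small sketch with a character vector and a logical vector (pets and flags are hypothetical names). It also shows that sum() works on logical vectors, because TRUE counts as 1 and FALSE as 0:

```r
pets <- c("cat", "dog", "fish")   # hypothetical character vector
class(pets)
## [1] "character"
length(pets)
## [1] 3

flags <- c(TRUE, FALSE, TRUE)     # hypothetical logical vector
class(flags)
## [1] "logical"
sum(flags)                        # number of TRUE values
## [1] 2
```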
Code generalization
We want our code to be as general as possible so that it can be reapplied to a different coding task or if the data change. Commands that summarize vectors can be useful to accomplish this.
Remember above when we found a random sample of fruits? Here is more or less the code we used:
fruits <- rep(c("orange", "apple", "banana"), 10)
length(fruits)
## [1] 30
random_sample <- fruits[sample(1:30, size = 5)]
random_sample
## [1] "orange" "banana" "apple" "orange" "apple"
The third line, where we create random_sample, is not very general. Why? In this case, fruits has 30 elements. What if it instead had 50 elements? Then the third line would not give us a random sample of the whole vector. More precisely, it would give us a random sample of the first 30 elements of fruits; the last 20 elements would have no chance of being included. We could modify the third line to read random_sample <- fruits[sample(1:50, size = 5)]. But if we then modified fruits to have a different number of elements again, we’d end up with the same problem.
Here’s the solution: find the number of elements of fruits using length() and then input this as an argument in the sample() function.
fruits <- rep(c("orange", "apple", "banana"), 100)
n <- length(fruits) #store the result of length() in an object
n
## [1] 300
random_sample <- fruits[sample(1:n, size = 5)] #now use 'n' in the sample() function
random_sample
## [1] "orange" "orange" "orange" "banana" "orange"
# Or we could have used length() directly in the sample() function
# Note: Accomplishes the same thing as first creating 'n'
random_sample <- fruits[sample(1:length(fruits), size = 5)]
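Two other ways to write this are sketched below, offered as alternatives rather than as the approach used above. seq_along() returns the indices 1, 2, ..., length(fruits), and sample() can also draw directly from the vector itself. The names idx and random_sample2 are made up for illustration:

```r
fruits <- rep(c("orange", "apple", "banana"), 100)

# seq_along(fruits) produces the indices 1, 2, ..., length(fruits)
idx <- sample(seq_along(fruits), size = 5)
random_sample <- fruits[idx]
length(random_sample)
## [1] 5

# sample() can also draw elements directly, skipping the index step
random_sample2 <- sample(fruits, size = 5)
length(random_sample2)
## [1] 5
```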
Exercises
Create a vector that represents the age of at least four different family members or friends. You can name it whatever you want.
What is the mean age of the people in your vector? Find out in two ways, with and without using the mean() command.
How old is the youngest person in your vector? (Use an R command to find out.)
What is the age gap between the youngest person and the oldest person in your vector? (Again use R to find out, and try to be as general as possible in the sense that your code should work even if the elements in your vector, or their order, change.)
How many people in your vector are above age 25? (Again, try to make your code work even in the case that your vector changes.)
Replace the age of the oldest person in your vector with the age of someone else you know.
Create a new vector that indicates how old each person in your vector will be in 10 years.
Create a new vector that indicates what year each person in your vector will turn 100 years old.
Create a new vector with a random sample of 3 individuals from your original vector. What is the mean age of the people in this new vector?