R is an incredibly powerful tool for data management and analysis.
Data management involves
Data analysis helps us assess what the data we’ve compiled tell us about the world. To do so, we can use descriptive statistics, graphs, or multivariate statistical tests.
As an example of data analysis, here’s a graph generated in R (using the ggvis package). It shows how the weight, number of cylinders, and miles per gallon of 32 different car models are related.
You’ll be able to create similar graphs soon. But first, let’s jump into the basics of R.
Four common object types that store data are:
Scalars: store a single numeric value.
Strings: store a set of one or more characters.
Vectors: store several scalar or string elements.
Data frames: store several vectors (meaning that they contain several rows and columns).
To create any object in R, we use the assignment operator <-.
The following line of code creates a scalar named a that stores the value 9:
a <- 9
This scalar can then be used to create a different scalar b:
b <- a + 1
b
## [1] 10
Any type of object can be overwritten. What value does b contain after running the following command?
b <- b - a
Objects need not be numeric. The following code creates an object c of mode character rather than numeric:
c <- "Hello world"
We call the object c a “string”. To check the class of an object, use the class() command.
class(a)
## [1] "numeric"
class(c)
## [1] "character"
Vectors have several elements (whether numeric or non-numeric), and are created in the following way:
v <- c(1, 2, 3, 4)
c stands for “concatenate.” The object v now contains the vector {1, 2, 3, 4}, as can be seen if you call the object:
v
## [1] 1 2 3 4
A shortcut for creating a vector with an integer sequence is:
v <- 1:4
v
## [1] 1 2 3 4
Non-integer sequences can be created using the seq() command:
v <- seq(from = 0, to = 0.5, by = 0.1)
v
## [1] 0.0 0.1 0.2 0.3 0.4 0.5
You can also use scalar objects to create vectors:
a <- 0
b <- 22
v <- c(b, 1:4, a)
v
## [1] 22 1 2 3 4 0
Use the length() and mean() commands to find out how many elements a vector contains and the mean of a vector, respectively:
v <- c(1, 2, 3, 4)
length(v)
## [1] 4
mean(v)
## [1] 2.5
Of course, these values can themselves be stored as scalar objects:
length_v <- length(v)
mean_v <- mean(v)
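Arithmetic on a vector is applied element by element, which is why functions such as mean() can summarize a whole vector in one call. The short sketch below illustrates this with a made-up vector v_demo (a name chosen here for illustration, not part of the tutorial's objects):

```r
# v_demo is a made-up vector used only for illustration
v_demo <- c(1, 2, 3, 4)

v_plus  <- v_demo + 10   # adds 10 to every element
v_twice <- v_demo * 2    # doubles every element
v_sum   <- sum(v_demo)   # adds up all elements

v_plus
v_twice
v_sum
```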
Again, vectors need not contain numerical values. The following line of code creates a vector of strings:
v_colors <- c("blue", "yellow", "light green")
v_colors
## [1] "blue" "yellow" "light green"
Getting a particular element or elements of a vector can be very useful. This is done using brackets.
v_colors[2] #get the second element of v_colors
## [1] "yellow"
v_colors[c(1, 3)] #get the first and third elements
## [1] "blue" "light green"
You can also reassign an element or elements of a vector:
v_colors[2:3] <- c("red", "purple")
v_colors
## [1] "blue" "red" "purple"
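Brackets accept more than positive positions. As a small sketch (using a made-up vector v_demo, not one of the tutorial's objects), negative indices drop elements and a logical condition keeps only the elements for which it is TRUE:

```r
# v_demo is a hypothetical vector used only for illustration
v_demo <- c(10, 20, 30, 40, 50)

dropped_first <- v_demo[-1]               # everything except the first element
above_25      <- v_demo[v_demo > 25]      # elements greater than 25
last_element  <- v_demo[length(v_demo)]   # the last element

dropped_first
above_25
last_element
```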
Data frames are extremely useful. Think of them as datasets where each column represents a variable and each row represents a unit of observation.
To create a data frame, it is useful to go through two steps:
Create the vectors (variables) that you want the data frame to contain.
Piece these vectors together using the data.frame() command.
For example, say I wanted to create a data frame containing information about students’ names, heights (in centimeters), and GPAs:
name <- c("Harry", "Ron", "Hermione", "Hagrid", "Voldemort")
height <- c(176, 175, 167, 230, 180)
gpa <- c(3.4, 2.8, 4.0, 2.2, 3.4)
df_students <- data.frame(name, height, gpa) #piecing vectors together
The output in the console if we execute df_students now is:
df_students
## name height gpa
## 1 Harry 176 3.4
## 2 Ron 175 2.8
## 3 Hermione 167 4.0
## 4 Hagrid 230 2.2
## 5 Voldemort 180 3.4
df_students is a data frame with three vectors. Put differently, df_students is a dataset with three variables: name (a nominal variable), height (a continuous variable), and gpa (a continuous variable), where the unit of observation is a student/individual.
We can create data frames without first creating vectors. The following creates the same data frame (df_students) as above:
df_students <- data.frame(name = c("Harry", "Ron", "Hermione", "Hagrid", "Voldemort"),
                          height = c(176, 175, 167, 230, 180),
                          gpa = c(3.4, 2.8, 4.0, 2.2, 3.4))
df_students
## name height gpa
## 1 Harry 176 3.4
## 2 Ron 175 2.8
## 3 Hermione 167 4.0
## 4 Hagrid 230 2.2
## 5 Voldemort 180 3.4
Say we wanted to add a dummy variable (also called an indicator variable) that equals 1 if the individual is good and 0 if he or she is evil. We can do this using the $ operator:
df_students$good <- c(1, 1, 1, 1, 0)
df_students
## name height gpa good
## 1 Harry 176 3.4 1
## 2 Ron 175 2.8 1
## 3 Hermione 167 4.0 1
## 4 Hagrid 230 2.2 1
## 5 Voldemort 180 3.4 0
To get the dimensions of a data frame, use the dim() command:
dim(df_students)
## [1] 5 4
The data frame df_students has 5 rows and 4 columns.
Again, we can get particular elements or sets of elements of a data frame using brackets. The first number indicates the row and the second the column of the data frame:
df_students[2, 3] #Ron's GPA
## [1] 2.8
You could also use
df_students$gpa[2] #Ron's GPA
## [1] 2.8
We can get a full row or set of rows by leaving out the column number:
df_students[5, ]
## name height gpa good
## 5 Voldemort 180 3.4 0
df_students[3:5, ]
## name height gpa good
## 3 Hermione 167 4.0 1
## 4 Hagrid 230 2.2 1
## 5 Voldemort 180 3.4 0
Likewise, we can get a full column or set of columns:
df_students[, 2]
## [1] 176 175 167 230 180
df_students$height
## [1] 176 175 167 230 180
df_students[, 1:3]
## name height gpa
## 1 Harry 176 3.4
## 2 Ron 175 2.8
## 3 Hermione 167 4.0
## 4 Hagrid 230 2.2
## 5 Voldemort 180 3.4
As with vectors, we can reassign a given element or elements:
df_students[4, 2] <- 255 #reassign Hagrid's height
df_students$height[4] <- 255 #same thing as above
df_students
## name height gpa good
## 1 Harry 176 3.4 1
## 2 Ron 175 2.8 1
## 3 Hermione 167 4.0 1
## 4 Hagrid 255 2.2 1
## 5 Voldemort 180 3.4 0
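Positions are not the only way to index a data frame. The sketch below rebuilds the students data under a different name (df_demo, chosen here so the block runs on its own) and selects columns by name and rows by a logical condition:

```r
# Rebuild the tutorial's data frame so this block runs on its own
df_demo <- data.frame(
  name   = c("Harry", "Ron", "Hermione", "Hagrid", "Voldemort"),
  height = c(176, 175, 167, 230, 180),
  gpa    = c(3.4, 2.8, 4.0, 2.2, 3.4)
)

# Select columns by name instead of by position
name_gpa <- df_demo[, c("name", "gpa")]
name_gpa

# Select the rows that satisfy a condition: students with a GPA above 3
high_gpa <- df_demo[df_demo$gpa > 3, ]
high_gpa
```

Selecting by name tends to be more robust than selecting by position, because the code keeps working if columns are later added or reordered.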
R:
v, then you can add a number c to each element of that vector using v + c.
name of each of your family members, their age, and their gender.
R to find the class of each of these variables.
R
to answer this.)
Note: For the next part of the tutorial we’ll be working with a dataset called world_small.csv, which you can download here.
You now know how to create some simple but common data objects in R. Oftentimes, however, we’ll want to read an existing dataset into R. Datasets come in many formats, e.g., .csv, .txt, .dta, .RData, and online data structures (HTML tables). R can read most data formats as is, but sometimes it may be necessary to manually reformat some elements in the file or even to convert the whole file to a different format (e.g., using Stat/Transfer). For now, we’ll assume that the file is in a readable format.
To read a file you need to
Specify where the file is located on your computer. This is referred to as setting your working directory.
Execute a command that will read the file from your working directory.
You can set your working directory manually. In RStudio, go to Session –> Set Working Directory –> Choose Directory… and find the folder in which your file is located.
While this works, good coding practice requires that you always also include a line of code that sets the working directory at the beginning of your .R file when you need to read a file. To do this, use the command setwd(path-to-dir), where path-to-dir is the path to the folder in which the file is located. One way to find this path is to set your working directory manually first. The path to the directory then shows up in the R console.
To set my working directory for world_small.csv, I include the following line somewhere in the beginning of my .R file:
setwd("~/dropbox/155/tutorial1")
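Note that the path to your working directory may look different from mine. On Windows you may see backslashes instead of forward slashes; inside an R string, write such a path with forward slashes or doubled backslashes, because a single backslash starts an escape sequence.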
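Two base R helpers can make working directories less error-prone. In the sketch below, getwd() prints the directory R is currently using, and file.path() assembles a path from its parts. The folder names simply mirror the example path above and will differ on your machine:

```r
# Print the current working directory, e.g. to confirm what setwd() did
getwd()

# Build a path from its parts; file.path() joins them with forward slashes
# (the folder names mirror the tutorial's example and are placeholders)
data_dir <- file.path("~", "dropbox", "155", "tutorial1")
data_dir

# Check whether a file exists before trying to read it
file.exists(file.path(data_dir, "world_small.csv"))
```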
Now that we’ve told R where to look for the file we want to read, it’s time to actually read the file. Different commands are used to read different types of files. This is the syntax used for reading a .csv file:
world <- read.csv("world_small.csv", stringsAsFactors = TRUE) #store text columns as factors (the default before R 4.0)
I’m reading the file from the working directory and assigning it to an object called world, which becomes of class “data.frame”:
class(world)
## [1] "data.frame"
Let’s check if the file was read correctly, using dim() (returns the dimensions), head() (returns the top six rows), and summary() (returns summary information about each variable):
dim(world)
## [1] 145 4
head(world) #same as: world[1:6, ]
## country region gdppcap08 polityIV
## 1 Albania C&E Europe 7715 17.8
## 2 Algeria Africa 8033 10.0
## 3 Angola Africa 5899 8.0
## 4 Argentina S. America 14333 18.0
## 5 Armenia C&E Europe 6070 15.0
## 6 Australia Asia-Pacific 35677 20.0
summary(world)
## country region gdppcap08 polityIV
## Albania : 1 Africa :42 Min. : 188 Min. : 0.00
## Algeria : 1 C&E Europe :25 1st Qu.: 2153 1st Qu.: 7.67
## Angola : 1 Asia-Pacific:24 Median : 7271 Median :16.00
## Argentina: 1 S. America :19 Mean :13252 Mean :13.41
## Armenia : 1 Middle East :16 3rd Qu.:19330 3rd Qu.:19.00
## Australia: 1 W. Europe :12 Max. :85868 Max. :20.00
## (Other) :139 (Other) : 7
Everything looks as we would have hoped.
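If you want to try the read-and-check workflow without the course file, the following self-contained sketch writes a tiny made-up dataset (toy, with invented values) to a temporary .csv file and reads it back. Note that in R 4.0 and later, read.csv() keeps text columns as character unless stringsAsFactors = TRUE is set:

```r
# Write a tiny made-up dataset to a temporary .csv file
toy <- data.frame(country = c("A", "B", "C"),
                  region  = c("North", "South", "North"),
                  score   = c(1.5, 2.0, 3.5))
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

# Read it back; stringsAsFactors = TRUE stores text columns as factors
toy_in <- read.csv(tmp, stringsAsFactors = TRUE)

dim(toy_in)   # 3 rows, 3 columns
str(toy_in)   # compact overview of each column's type
```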
world_small.csv file from a directory on your computer. Put it in a directory that will allow you to keep your files organized throughout the quarter (not on your Desktop).
R is open-source, meaning that anyone can write a package that extends its functionality. We’ll make use of many packages in this class. To use a package you must first install it (once) and then load it (in every new session).
When you close R, it will no longer have packages you loaded in memory the next time you open it. If you want to use a package, you therefore need to load it again after closing down R.
To install packages plyr, dplyr, and ggplot2, run
install.packages(c("plyr", "dplyr", "ggplot2"), dependencies = TRUE)
To load these packages:
require(plyr)
require(dplyr)
require(ggplot2)
An alternative, more compact way of loading packages:
sapply(c("plyr", "dplyr", "ggplot2"), require, character.only = TRUE)
## plyr dplyr ggplot2
## TRUE TRUE TRUE
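It can help to know how require() differs from library(), which is the other common way to load a package. The sketch below uses a deliberately made-up package name ("notARealPackage123") so that loading fails: require() returns FALSE with a warning, while library() stops with an error.

```r
# "notARealPackage123" is a made-up name chosen so that loading fails
ok <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)
ok  # require() reports the failure but lets the script continue

# library() raises an error instead; tryCatch() captures it here
res <- tryCatch(
  library("notARealPackage123", character.only = TRUE),
  error = function(e) "library() stopped with an error"
)
res
```

Because library() fails loudly, it is often the safer choice at the top of a script: a missing package is reported immediately rather than several lines later.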
plyr, dplyr, and ggplot2.
There are many ways to subset data frames in R. Here are three ways to subset the world data frame to countries in Africa:
setwd("~/dropbox/155/tutorial1")
world <- read.csv("world_small.csv", stringsAsFactors = TRUE) #store text columns as factors (the default before R 4.0)
afr1 <- world[world$region == "Africa", ] #option 1: use brackets
dim(afr1)
## [1] 42 4
afr1 <- subset(world, region == "Africa") #option 2: use subset()
dim(afr1)
## [1] 42 4
require(dplyr)
afr1 <- filter(world, region == "Africa") #option 3: use filter() from package dplyr
dim(afr1)
## [1] 42 4
Subset to African countries with a polity score of at least 15:
afr2 <- world[world$region == "Africa" & world$polityIV >= 15, ] #option 1
afr2 <- subset(world, region == "Africa" & polityIV >= 15) #option 2
afr2 <- filter(world, region == "Africa", polityIV >= 15) #option 3
Same as above, keeping only variables “country” and “polityIV”:
afr3 <- world[world$region == "Africa" & world$polityIV >= 15, c(1, 4)] #option 1
afr3 <- subset(world, region == "Africa" & polityIV >= 15, #option 2
select = c("country", "polityIV"))
afr3 <- filter(world, region == "Africa", polityIV >= 15) %>% #option 3
select(country, polityIV)
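The options agree on complete data, but brackets and subset() behave differently when the column being tested contains missing values. The sketch below uses a made-up data frame (toy) with one NA to show the difference:

```r
# toy is a made-up data frame with one missing score
toy <- data.frame(id = 1:4, score = c(12, NA, 18, 9))

by_bracket <- toy[toy$score >= 10, ]    # the NA comparison yields an all-NA row
by_subset  <- subset(toy, score >= 10)  # rows where the test is NA are dropped

nrow(by_bracket)
nrow(by_subset)

# Wrapping the test in which() drops NAs, so brackets then match subset()
by_which <- toy[which(toy$score >= 10), ]
by_which
```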
Notes about how we produced these subsets:
We used the operators ==, &, and >=.
The difference between a double equal sign (==) and a single equal sign (=) is important. A single equal sign is equivalent to the assignment operator, so a <- 3 and a = 3 do the same thing. On the other hand, a == 3 tests whether a is equal to 3 and returns either TRUE or FALSE, given that a has been defined.
The most common way to add a variable to a dataset is to use the $ operator followed by a new variable name:
world$gdp_log <- log(world$gdppcap08) #add logged gdp per cap variable
world$democ <- ifelse(world$polityIV > 10, 1, 0) #create democracy dummy variable
head(world)
## country region gdppcap08 polityIV gdp_log democ
## 1 Albania C&E Europe 7715 17.8 8.951 1
## 2 Algeria Africa 8033 10.0 8.991 0
## 3 Angola Africa 5899 8.0 8.683 0
## 4 Argentina S. America 14333 18.0 9.570 1
## 5 Armenia C&E Europe 6070 15.0 8.711 1
## 6 Australia Asia-Pacific 35677 20.0 10.482 1
We used ifelse() to create the variable democ, a dummy variable that equals 1 if a country has a Polity IV score above 10 and 0 otherwise.
ifelse() works in the following way:
The first argument is the logical test, here whether polityIV is greater than 10.
The second argument tells R what to do when the test is TRUE. In this case, R assigns a ‘1’ to the variable democ.
The third argument tells R what to do when the test is FALSE. In this case, R assigns a ‘0’ to the variable democ.
mutate() from package dplyr is another way to create new variables. Using the data management functions included in dplyr has many advantages (see below).
The following code accomplishes the same thing as the code above using mutate():
world <- mutate(world, gdp_log = log(gdppcap08),
                democ = ifelse(polityIV > 10, 1, 0))
head(world)
## country region gdppcap08 polityIV gdp_log democ
## 1 Albania C&E Europe 7715 17.8 8.951 1
## 2 Algeria Africa 8033 10.0 8.991 0
## 3 Angola Africa 5899 8.0 8.683 0
## 4 Argentina S. America 14333 18.0 9.570 1
## 5 Armenia C&E Europe 6070 15.0 8.711 1
## 6 Australia Asia-Pacific 35677 20.0 10.482 1
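ifelse() is vectorized: it applies the test to every element and returns a vector of the same length. The sketch below uses a hypothetical vector (scores) standing in for a column such as polityIV, and includes a missing value to show how it is handled:

```r
# scores is a hypothetical vector standing in for a column like polityIV
scores <- c(4, 10, 15, NA, 20)

# 1 where the score is above 10, 0 otherwise
dummy <- ifelse(scores > 10, 1, 0)
dummy

# A missing score makes the test NA, so the result is NA as well
is.na(dummy)
```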
In world, the variable region is a factor variable:
class(world$region)
## [1] "factor"
Factor variables are an important class of variables. They are simply categorical variables. Here are the categories (called “levels” in R) of region:
levels(world$region)
## [1] "Africa" "Asia-Pacific" "C&E Europe" "Middle East"
## [5] "N. America" "S. America" "Scandinavia" "W. Europe"
While this variable could have been stored in character mode, storing it as a factor makes life easier. For example, note that region splits European countries into three categories: “C&E Europe”, “Scandinavia”, and “W. Europe”. Factor variables are easy to recode. Let’s create a new region variable that groups all European countries together:
world$region2 <- world$region #create new region variable identical to 'region'
levels(world$region2) <- c("Africa", "Asia-Pacific", "Europe", "Middle East",
"N. America", "S. America", "Europe", "Europe") #relevel
table(world$region) #number of countries by region (original variable)
##
## Africa Asia-Pacific C&E Europe Middle East N. America
## 42 24 25 16 3
## S. America Scandinavia W. Europe
## 19 4 12
table(world$region2) #number of countries by region (recoded variable)
##
## Africa Asia-Pacific Europe Middle East N. America
## 42 24 41 16 3
## S. America
## 19
Note that the number of European countries in region2 is equal to the combined number of European countries (“C&E Europe”, “Scandinavia”, and “W. Europe”) in region.
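Relabeling levels can be tried on a small scale first. In this sketch, f is a made-up factor; the new labels are assigned in the same order as levels(f), and levels that receive the same label are merged:

```r
# f is a made-up factor; its levels are sorted alphabetically by default
f <- factor(c("W. Europe", "Africa", "Scandinavia", "Africa", "C&E Europe"))
levels(f)

# Assign new labels in the same order as levels(f); repeated labels are merged
f2 <- f
levels(f2) <- c("Africa", "Europe", "Europe", "Europe")

levels(f2)
table(f2)
```

Because the assignment depends on the order of the existing levels, it is worth printing levels() before relabeling.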
The easiest way to re-order a data frame is to use arrange() from dplyr. The function takes a data frame and a set of column names to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
head(world) #original order
## country region gdppcap08 polityIV gdp_log democ region2
## 1 Albania C&E Europe 7715 17.8 8.951 1 Europe
## 2 Algeria Africa 8033 10.0 8.991 0 Africa
## 3 Angola Africa 5899 8.0 8.683 0 Africa
## 4 Argentina S. America 14333 18.0 9.570 1 S. America
## 5 Armenia C&E Europe 6070 15.0 8.711 1 Europe
## 6 Australia Asia-Pacific 35677 20.0 10.482 1 Asia-Pacific
world <- arrange(world, gdppcap08) #order by gdp per cap
head(world)
## country region gdppcap08 polityIV gdp_log democ region2
## 1 Zimbabwe Africa 188 6.00 5.236 0 Africa
## 2 Congo Kinshasa Africa 321 15.00 5.771 1 Africa
## 3 Liberia Africa 388 10.00 5.961 0 Africa
## 4 Guinea-Bissau Africa 538 11.00 6.288 1 Africa
## 5 Eritrea Africa 632 3.00 6.449 0 Africa
## 6 Niger Africa 684 15.33 6.528 1 Africa
world <- arrange(world, desc(gdppcap08)) #order by gdp per cap (descending)
head(world)
## country region gdppcap08 polityIV gdp_log democ region2
## 1 Qatar Middle East 85868 0 11.36 0 Middle East
## 2 Norway Scandinavia 58138 20 10.97 1 Europe
## 3 Singapore Asia-Pacific 49284 8 10.81 0 Asia-Pacific
## 4 United States N. America 46716 20 10.75 1 N. America
## 5 Ireland W. Europe 44200 20 10.70 1 Europe
## 6 Switzerland W. Europe 42536 20 10.66 1 Europe
world <- arrange(world, region, country) #order by region, then country
head(world)
## country region gdppcap08 polityIV gdp_log democ region2
## 1 Algeria Africa 8033 10.0 8.991 0 Africa
## 2 Angola Africa 5899 8.0 8.683 0 Africa
## 3 Benin Africa 1468 16.2 7.292 1 Africa
## 4 Botswana Africa 13392 19.0 9.502 1 Africa
## 5 Burkina Faso Africa 1161 10.0 7.057 0 Africa
## 6 Cameroon Africa 2215 6.0 7.703 0 Africa
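For comparison, the same kinds of ordering can be done in base R with order(), which returns the row positions in sorted order. This is a rough sketch on a made-up data frame (toy), not the tutorial's dataset:

```r
# toy is a made-up data frame used only for illustration
toy <- data.frame(country = c("C", "A", "B", "D"),
                  region  = c("South", "North", "South", "North"),
                  gdp     = c(300, 100, 400, 200),
                  stringsAsFactors = FALSE)

asc       <- toy[order(toy$gdp), ]                  # ascending, like arrange(toy, gdp)
desc_gdp  <- toy[order(-toy$gdp), ]                 # descending, like arrange(toy, desc(gdp))
by_region <- toy[order(toy$region, toy$country), ]  # region first, ties broken by country

asc
desc_gdp
by_region
```

Unlike arrange(), bracket indexing with order() keeps the original row names, so the row labels in the printed output will appear shuffled.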
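dplyr makes it possible to write beautiful, fast code that combines different data management tasks using the %>% (piping) operator. Say we wanted to
subset world to South American countries with a polity score above 10,
create the variables gdp_log and democ,
keep only the variables country, gdppcap08, gdp_log, and democ, and
sort the result by logged GDP per capita in descending order.
With dplyr, our code might look something like this: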
samr <- world %>%
filter(region == "S. America", polityIV > 10) %>% #subset
mutate(gdp_log = log(gdppcap08), #create new variables
democ = ifelse(polityIV > 10, 1, 0)) %>%
select(country, gdppcap08, gdp_log, democ) %>% #keep four variables
arrange(desc(gdp_log)) #sort based on logged gdp
samr
## country gdppcap08 gdp_log democ
## 1 Chile 14465 9.579 1
## 2 Argentina 14333 9.570 1
## 3 Venezuela 12804 9.458 1
## 4 Uruguay 12734 9.452 1
## 5 Costa Rica 11241 9.327 1
## 6 Brazil 10296 9.240 1
## 7 Colombia 8885 9.092 1
## 8 Peru 8507 9.049 1
## 9 Ecuador 8009 8.988 1
## 10 Jamaica 7705 8.950 1
## 11 El Salvador 6794 8.824 1
## 12 Guatemala 4760 8.468 1
## 13 Paraguay 4709 8.457 1
## 14 Bolivia 4278 8.361 1
## 15 Honduras 3965 8.285 1
## 16 Nicaragua 2682 7.894 1
## 17 Guyana 2542 7.841 1
Here’s a short explanation:
Because we specify world on the first line, every subsequent line of code will operate on that data frame.
The %>% operator tells R to execute each line from top to bottom and update world accordingly.
The result is stored in samr.
I recommend using this piping functionality whenever you can.
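To see what the pipe saves you, here is a rough base R sketch of the same kind of pipeline written with intermediate objects. It runs on a made-up data frame (toy, with invented values) rather than world, and the step names are chosen here for illustration:

```r
# toy is a made-up data frame that mimics the columns used above
toy <- data.frame(country   = c("A", "B", "C", "D"),
                  region    = c("S. America", "Africa", "S. America", "S. America"),
                  polityIV  = c(18, 15, 9, 12),
                  gdppcap08 = c(5000, 2000, 7000, 9000))

step1 <- subset(toy, region == "S. America" & polityIV > 10)  # like filter()
step1$gdp_log <- log(step1$gdppcap08)                         # like mutate()
step2 <- step1[, c("country", "gdppcap08", "gdp_log")]        # like select()
result <- step2[order(-step2$gdp_log), ]                      # like arrange(desc())
result
```

Each intermediate object here corresponds to one stage of a piped chain; with %>% the output of one stage is passed straight to the next, so no temporary names are needed.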
world:
polityIV from 0-20 to -10-10.
gdppcap08 you think is reasonable.
world to European countries.
region variable (keep the rest).
dplyr
’s piping functionality.
Many, many times when coding you’ll have an idea of what you want to do but won’t know how to do it in R. This happens even for experienced coders. With the right strategies, you’ll be able to solve a majority of issues you run into yourself. Not having to ask someone else every time you run into a problem will save you a lot of time.
When you’re stuck, consult class material (handouts, textbook, etc.). Perhaps more efficiently, google what you’re trying to do. For example, if you want to find the mean of a variable, try googling “how to find mean in R” and there likely will be tons of explanations of how to do this.
R also has a nifty help feature that is called using the following syntax: ?commandname, where commandname is the name of the command that you need help with. For example, ?mean will bring up a help dialog box with information about how to use R’s mean() command.
R command to find the mean of the vector x defined below, ignoring NA values. If you try mean(x), R will return NA, and we don’t want this. Do not change the state of the vector in any way. (Use google and/or ?mean to figure this one out.)
x <- sample(c(rep(NA, 200), runif(800)), 500)
Link to .R file with code used in this tutorial (with minimal commenting)