Overview
The table below gives an overview of very common data management tasks and their corresponding dplyr function.
Task | Function |
---|---|
*Covered in a separate tutorial. | |
Subsetting |
filter()
|
Sorting |
arrange()
|
Dropping variables |
select()
|
Renaming variables |
rename()
|
Extracting unique values |
distinct()
|
Creating new variables |
mutate()
|
Collapsing data* |
group_by() and summarize()
|
All of these tasks can also be accomplished using R’s base functions, but this usually requires more involved code. An additional advantage of using dplyr functions is that they can be combined in an elegant way using chaining.
To demonstrate these functions, we’ll rely on a small data frame called countries
that you can create as follows:
countries <- data.frame(
expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
gdp_pc = round(runif(9, 1000, 20000), 0)
)
countries$country <- as.character(countries$country) #factor --> character
countries
## country year gdp_pc
## 1 USA 1994 9519
## 2 China 1994 8898
## 3 Sudan 1994 19708
## 4 USA 1995 14481
## 5 China 1995 15620
## 6 Sudan 1995 4375
## 7 USA 1996 6387
## 8 China 1996 2893
## 9 Sudan 1996 16812
We’ll also make use of world-small.csv
, which you can download here.
Before we begin, let’s also load two packages we’ll need. (Of course, they need to be installed first — remind yourself here.)
require(plyr)
## Loading required package: plyr
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Subsetting
The best way to subset a dataset is by using a logical statement. The idea is that we want to keep only some rows in a dataset that meet some logical condition.
Say we wanted to subset the countries
dataset to (1) a certain country, (2) a certain year, (3) a certain country and year, or (4) a certain GDP range. Here’s how we can do this using filter()
.
filter(countries, country == "China")
## country year gdp_pc
## 1 China 1994 8898
## 2 China 1995 15620
## 3 China 1996 2893
filter(countries, year == 1996)
## country year gdp_pc
## 1 USA 1996 6387
## 2 China 1996 2893
## 3 Sudan 1996 16812
filter(countries, country == "USA" & year == 1995)
## country year gdp_pc
## 1 USA 1995 14481
filter(countries, gdp_pc > 5000 & gdp_pc < 14000)
## country year gdp_pc
## 1 USA 1994 9519
## 2 China 1994 8898
## 3 USA 1996 6387
As should be obvious, the first argument in the filter()
function specifies the dataset on which to carry out the operation. The second argument specifies the logical operation used to filter the data.
In fact, the first argument in all dplyr functions is a dataset. Using dplyr we can move this first argument outside the function and use the special operator %>%
to chain statements. The benefits of this will become more obvious in the section on chaining. The code below accomplishes the same thing as that above using %>%
and moving the data frame outside the filter()
function.
countries %>% filter(country == "China")
## country year gdp_pc
## 1 China 1994 8898
## 2 China 1995 15620
## 3 China 1996 2893
countries %>% filter(year == 1996)
## country year gdp_pc
## 1 USA 1996 6387
## 2 China 1996 2893
## 3 Sudan 1996 16812
countries %>% filter(country == "USA" & year == 1995)
## country year gdp_pc
## 1 USA 1995 14481
countries %>% filter(gdp_pc > 5000 & gdp_pc < 14000)
## country year gdp_pc
## 1 USA 1994 9519
## 2 China 1994 8898
## 3 USA 1996 6387
One last note that applies throughout most of this tutorial: I’m not saving the subsets to any object here. You would normally want to do this. (I’m not doing it here so that you can see the output of the function without extra lines of code.) Just to be clear, if you wanted to take a subset and save the result into an object you’d use the assignment operator as usual.
china <- countries %>% filter(country == "China")
china
## country year gdp_pc
## 1 China 1994 8898
## 2 China 1995 15620
## 3 China 1996 2893
Sorting
Use arrange()
to sort a dataset. Here are a few examples. (I’ll use the chaining operator %>%
from now on — if you’re confused about this remember that you can just bring the data frame countries
inside the function.)
# Sort by country names
countries %>% arrange(country)
## country year gdp_pc
## 1 China 1994 8898
## 2 China 1995 15620
## 3 China 1996 2893
## 4 Sudan 1994 19708
## 5 Sudan 1995 4375
## 6 Sudan 1996 16812
## 7 USA 1994 9519
## 8 USA 1995 14481
## 9 USA 1996 6387
# Sort by GDP (ascending is default)
countries %>% arrange(gdp_pc)
## country year gdp_pc
## 1 China 1996 2893
## 2 Sudan 1995 4375
## 3 USA 1996 6387
## 4 China 1994 8898
## 5 USA 1994 9519
## 6 USA 1995 14481
## 7 China 1995 15620
## 8 Sudan 1996 16812
## 9 Sudan 1994 19708
# Sort by GDP (descending)
countries %>% arrange(desc(gdp_pc))
## country year gdp_pc
## 1 Sudan 1994 19708
## 2 Sudan 1996 16812
## 3 China 1995 15620
## 4 USA 1995 14481
## 5 USA 1994 9519
## 6 China 1994 8898
## 7 USA 1996 6387
## 8 Sudan 1995 4375
## 9 China 1996 2893
# Sort by country name, then GDP
countries %>% arrange(country, gdp_pc)
## country year gdp_pc
## 1 China 1996 2893
## 2 China 1994 8898
## 3 China 1995 15620
## 4 Sudan 1995 4375
## 5 Sudan 1996 16812
## 6 Sudan 1994 19708
## 7 USA 1996 6387
## 8 USA 1994 9519
## 9 USA 1995 14481
Dropping
To only keep some variables in a data frame use select()
:
# Keep country and GDP
countries %>% select(country, gdp_pc)
## country gdp_pc
## 1 USA 9519
## 2 China 8898
## 3 Sudan 19708
## 4 USA 14481
## 5 China 15620
## 6 Sudan 4375
## 7 USA 6387
## 8 China 2893
## 9 Sudan 16812
# Same thing using '-', implying you want to delete a variable
countries %>% select(-year)
## country gdp_pc
## 1 USA 9519
## 2 China 8898
## 3 Sudan 19708
## 4 USA 14481
## 5 China 15620
## 6 Sudan 4375
## 7 USA 6387
## 8 China 2893
## 9 Sudan 16812
# Selecting and renaming in one
countries %>% select(country_name = country, gdp_pc)
## country_name gdp_pc
## 1 USA 9519
## 2 China 8898
## 3 Sudan 19708
## 4 USA 14481
## 5 China 15620
## 6 Sudan 4375
## 7 USA 6387
## 8 China 2893
## 9 Sudan 16812
Renaming
As illustrated in the last line of code above you can rename variables using select()
. But this can also be done using rename()
:
# Rename GDP per capita
countries %>% rename(GDP.PC = gdp_pc)
## country year GDP.PC
## 1 USA 1994 9519
## 2 China 1994 8898
## 3 Sudan 1994 19708
## 4 USA 1995 14481
## 5 China 1995 15620
## 6 Sudan 1995 4375
## 7 USA 1996 6387
## 8 China 1996 2893
## 9 Sudan 1996 16812
Unique values
Removing duplicate observations can be useful, but be careful: entire rows will be deleted. Use distinct()
:
countries %>% distinct(country)
## country year gdp_pc
## 1 USA 1994 9519
## 2 China 1994 8898
## 3 Sudan 1994 19708
Note that the function keeps the first non-duplicate. This is more useful when you suspect that duplicate values of the following kind have slipped into the dataset:
countries2 <- rbind(
data.frame(country = "USA", year = 1994, gdp_pc = 10000),
countries
)
countries2
## country year gdp_pc
## 1 USA 1994 10000
## 2 USA 1994 9519
## 3 China 1994 8898
## 4 Sudan 1994 19708
## 5 USA 1995 14481
## 6 China 1995 15620
## 7 Sudan 1995 4375
## 8 USA 1996 6387
## 9 China 1996 2893
## 10 Sudan 1996 16812
Note that we all of a sudden have two USA 1994 observations. Delete one of them using distinct()
.
countries2 %>% distinct(country, year)
## country year gdp_pc
## 1 USA 1994 10000
## 2 China 1994 8898
## 3 Sudan 1994 19708
## 4 USA 1995 14481
## 5 China 1995 15620
## 6 Sudan 1995 4375
## 7 USA 1996 6387
## 8 China 1996 2893
## 9 Sudan 1996 16812
New variables
Use mutate()
to create new variables or to modify existing variables. Here are a few examples.
# Create a new variable that has GDP per capita in 1000s
countries %>% mutate(gdppc_1k = gdp_pc / 1000)
## country year gdp_pc gdppc_1k
## 1 USA 1994 9519 9.519
## 2 China 1994 8898 8.898
## 3 Sudan 1994 19708 19.708
## 4 USA 1995 14481 14.481
## 5 China 1995 15620 15.620
## 6 Sudan 1995 4375 4.375
## 7 USA 1996 6387 6.387
## 8 China 1996 2893 2.893
## 9 Sudan 1996 16812 16.812
# Create a new variable with lower-case country names
countries %>% mutate(country_lc = tolower(country))
## country year gdp_pc country_lc
## 1 USA 1994 9519 usa
## 2 China 1994 8898 china
## 3 Sudan 1994 19708 sudan
## 4 USA 1995 14481 usa
## 5 China 1995 15620 china
## 6 Sudan 1995 4375 sudan
## 7 USA 1996 6387 usa
## 8 China 1996 2893 china
## 9 Sudan 1996 16812 sudan
# Both in one statement
countries %>% mutate(gdppc_1k = gdp_pc / 1000,
country_lc = tolower(country))
## country year gdp_pc gdppc_1k country_lc
## 1 USA 1994 9519 9.519 usa
## 2 China 1994 8898 8.898 china
## 3 Sudan 1994 19708 19.708 sudan
## 4 USA 1995 14481 14.481 usa
## 5 China 1995 15620 15.620 china
## 6 Sudan 1995 4375 4.375 sudan
## 7 USA 1996 6387 6.387 usa
## 8 China 1996 2893 2.893 china
## 9 Sudan 1996 16812 16.812 sudan
A lot of times it makes more sense to just overwrite an existing variable rather than adding a variable.
countries %>% mutate(country = tolower(country))
## country year gdp_pc
## 1 usa 1994 9519
## 2 china 1994 8898
## 3 sudan 1994 19708
## 4 usa 1995 14481
## 5 china 1995 15620
## 6 sudan 1995 4375
## 7 usa 1996 6387
## 8 china 1996 2893
## 9 sudan 1996 16812
mutate()
can be combined with two other useful functions: ifelse()
and revalue()
.
ifelse()
ifelse()
is a logical function that is useful for modifying variables in datasets (or individual vectors). Here’s an illustration of how it works:
numbers <- 1:10
ifelse(numbers > 5, numbers * 10, numbers / 10)
## [1] 0.1 0.2 0.3 0.4 0.5 60.0 70.0 80.0 90.0 100.0
The function takes three arguments: (1) a logical test, (2) what to do if the test is true, and (3) what to do if the test is false. Thus, in the code above numbers greater than 5 are multiplied by 10 and numbers less than or equal to 5 are divided by 10.
We can apply this to variables in data frames. Let’s use the world-small.csv
dataset to illustrate (download here).
# Read world-small dataset
world <- read.csv("world-small.csv")
head(world)
## country region gdppcap08 polityIV
## 1 Albania C&E Europe 7715 17.8
## 2 Algeria Africa 8033 10.0
## 3 Angola Africa 5899 8.0
## 4 Argentina S. America 14333 18.0
## 5 Armenia C&E Europe 6070 15.0
## 6 Australia Asia-Pacific 35677 20.0
# Create a new variable equal to "democracy" if a country has
# a polity score >= 15 and "autocracy" otherwise
world <- world %>% mutate(democracy = ifelse(polityIV >= 15, "democracy", "autocracy"))
head(world)
## country region gdppcap08 polityIV democracy
## 1 Albania C&E Europe 7715 17.8 democracy
## 2 Algeria Africa 8033 10.0 autocracy
## 3 Angola Africa 5899 8.0 autocracy
## 4 Argentina S. America 14333 18.0 democracy
## 5 Armenia C&E Europe 6070 15.0 democracy
## 6 Australia Asia-Pacific 35677 20.0 democracy
# Or represent this information as a dummy variable instead
world <- world %>% mutate(democracy = ifelse(polityIV >= 15, 1, 0))
head(world)
## country region gdppcap08 polityIV democracy
## 1 Albania C&E Europe 7715 17.8 1
## 2 Algeria Africa 8033 10.0 0
## 3 Angola Africa 5899 8.0 0
## 4 Argentina S. America 14333 18.0 1
## 5 Armenia C&E Europe 6070 15.0 1
## 6 Australia Asia-Pacific 35677 20.0 1
We first created a new variable called democracy
equal to “democracy” if a country has a Polity IV score of at least 15 and “autocracy” otherwise. We then overwrote this variable with a “dummy variable” representing the same information using zeroes and ones.
General note about dummy variables: Dummy variables — sometimes called “indicator variables” or “binary variables” — are extremely useful for representing binary information. Their usefulness in part stems from their applicability in regression models and in part from how they can be used to summarize information. For example, taking the mean of the dummy variable democracy
above — that is, mean(world$democracy)
— gives us the proportion of countries that we classified as “democracy”.
revalue()
We often want to recode the categories of a variable. For example, take a look at the region
variable in world
:
table(world$region)
##
## Africa Asia-Pacific C&E Europe Middle East N. America
## 42 24 25 16 3
## S. America Scandinavia W. Europe
## 19 4 12
Currently this variable has eight categories. Say we wanted to simplify it a little bit by combining the three regions in Europe. We also want to rename the two regions in America. This can be accomplished with revalue()
from the plyr package.
world$region <- revalue(world$region, c(
"C&E Europe" = "Europe",
"Scandinavia" = "Europe",
"W. Europe" = "Europe",
"N. America" = "North America",
"S. America" = "South America"))
table(world$region)
##
## Africa Asia-Pacific Europe Middle East North America
## 42 24 41 16 3
## South America
## 19
This can also be accomplished with ifelse()
, but we would have to nest several ifelse()
statements. For this reason I prefer revalue()
.
A note on factor variables
The read.csv()
function automatically converts variables that contain strings to the class “factor”. Note, for example, that both the country
and region
variables in the world
dataset are represented as factors.
class(world$country)
## [1] "factor"
class(world$region)
## [1] "factor"
These variables are useful in regression models in R. For example, including region
in an OLS model (using command lm()
) would automatically represent all regions as dummy variables except one left as the reference category. If that’s not making sense to you at this stage, don’t worry at all.
But factor variables can also cause issues if you’re not careful. In particular, factor variables are sometimes displayed to the user as a certain set of numbers, yet are stored internally as an entirely different set of numbers. For example, say the gdppcap08
variable in world
had been read as a factor. (Fortunately, in reality it was read correctly as numeric, but this sometimes isn’t the case.)
# Recode gdp per capita to factor
# NOTE: Only for illustrative purposes!
world <- world %>% mutate(gdppcap08 = as.factor(gdppcap08))
head(world)
## country region gdppcap08 polityIV democracy
## 1 Albania Europe 7715 17.8 1
## 2 Algeria Africa 8033 10.0 0
## 3 Angola Africa 5899 8.0 0
## 4 Argentina South America 14333 18.0 1
## 5 Armenia Europe 6070 15.0 1
## 6 Australia Asia-Pacific 35677 20.0 1
class(world$gdppcap08)
## [1] "factor"
Ok, the GDP variable is now a factor variable. Note, though, that it looks just like a numeric variable. You might therefore be tempted to try to do things like mean(world$gdppcap08)
, yet this would return NA
instead of an actual value as before. You might also be tempted to recode the GDP variable from factor to numeric using the following code:
head(world %>% mutate(gdppcap08 = as.numeric(gdppcap08)))
## country region gdppcap08 polityIV democracy
## 1 Albania Europe 76 17.8 1
## 2 Algeria Africa 79 10.0 0
## 3 Angola Africa 67 8.0 0
## 4 Argentina South America 99 18.0 1
## 5 Armenia Europe 69 15.0 1
## 6 Australia Asia-Pacific 132 20.0 1
Something has gone seriously wrong — look at the values of gdppcap08
. Here’s the learning lesson: to convert a factor variable to numeric, you have to recode it to character first. Use as.numeric(as.character(variable))
, where variable
is the factor variable to convert.
world <- world %>% mutate(gdppcap08 = as.numeric(as.character(gdppcap08)))
head(world)
## country region gdppcap08 polityIV democracy
## 1 Albania Europe 7715 17.8 1
## 2 Algeria Africa 8033 10.0 0
## 3 Angola Africa 5899 8.0 0
## 4 Argentina South America 14333 18.0 1
## 5 Armenia Europe 6070 15.0 1
## 6 Australia Asia-Pacific 35677 20.0 1
class(world$gdppcap08)
## [1] "numeric"
Chaining
What if you wanted to apply multiple functions to one data frame? This is where chaining is elegant and facilitates debugging.
More specifically, say we wanted to use countries
to create a new data frame called countries_new
, which should have observations from years 1995 and 1996 (dropping 1994), should be sorted by country name (in lower case), and should have a new variable equal to GDP per capita in 1000s.
Here’s how we could do this without chaining:
countries_new <- filter(countries, year != 1994) #drop year 1994
countries_new <- arrange(countries_new, country) #sort by country names
countries_new <- mutate(countries_new, country = tolower(country), #convert name to lower-case
gdppc_1k = gdp_pc / 1000) #create GDP pc in 1000s
countries_new
## country year gdp_pc gdppc_1k
## 1 china 1995 15620 15.620
## 2 china 1996 2893 2.893
## 3 sudan 1995 4375 4.375
## 4 sudan 1996 16812 16.812
## 5 usa 1995 14481 14.481
## 6 usa 1996 6387 6.387
Here’s the same thing using chaining:
countries_new <- countries %>%
filter(year != 1994) %>%
arrange(country) %>%
mutate(country = tolower(country), gdppc_1k = gdp_pc / 1000)
countries_new
## country year gdp_pc gdppc_1k
## 1 china 1995 15620 15.620
## 2 china 1996 2893 2.893
## 3 sudan 1995 4375 4.375
## 4 sudan 1996 16812 16.812
## 5 usa 1995 14481 14.481
## 6 usa 1996 6387 6.387
I find code that makes use of chaining more readable. Chaining always begins with specifying the data frame we want to operate on (e.g., countries
). Every subsequent statement will operate on this data frame, starting with the function that comes right after the data frame and working its way down. In our case, the first thing we’ll do to countries
is to subset it. We’ll then sort it by country name. Lastly, we’ll overwrite the country name to be lower-case and create a new variable representing GDP per capita in 1000s.
Exercises
Read the
world-small.csv
dataset into R and store it in an object calledworld
.Subset
world
to European countries. Save this subset as a new data frame calledeurope
.- Add two variables to
europe
:- A variable that recodes
polityIV
from 0-20 to -10-10. - A variable that categorizes a country as “rich” or “poor” based on some cutoff of
gdppcap08
you think is reasonable.
- A variable that recodes
Drop the
region
variable ineurope
(keep the rest).Sort
europe
based on Polity IV.Repeat Exercises 2-5 using chaining.
What was the world’s mean GDP per capita in 2008? Polity IV score?
What was Africa’s mean GDP per capita and Polity IV score?
What was the poorest country in the world in 2008? Richest?
How many countries in Europe are “rich” according to your coding? How many are poor? What percentage have Polity IV scores of at least 18?