Up to this point, you have been given datasets to work with. The real world is messier. Many, many times you’ll have to merge and append datasets to have something to work with.
Merging means matching rows based on one or more ID variables. Why would you want to do this? Say you have economic development data on all countries in the world. You’re interested in how economic development affects democracy. Unfortunately, the dataset doesn’t have a variable measuring levels of democracy, so you go out and find another dataset that measures democracy and match this second dataset with the first one. You can now look at the correlation between democracy and economic development.
To illustrate, say you had two datasets that look like this:
# Generate toy dataset 1:
demo <- data.frame(country = c("USA", "Albania", "Turkey", "China", "Sudan"),
democracy_score = c(19, 16, 16, 5, 10))
demo
## country democracy_score
## 1 USA 19
## 2 Albania 16
## 3 Turkey 16
## 4 China 5
## 5 Sudan 10
# Generate toy dataset 2:
econ <- data.frame(country = c("China", "Albania", "Turkey", "USA", "Sudan"),
gdp_pc = c(12000, 10000, 9000, 20000, 500))
econ
## country gdp_pc
## 1 China 12000
## 2 Albania 10000
## 3 Turkey 9000
## 4 USA 20000
## 5 Sudan 500
We can merge these datasets using R’s merge()
command, where by
specifies the variable the two datasets has in common (usually called an “ID” variable):
df <- merge(demo, econ, by = "country")
df
## country democracy_score gdp_pc
## 1 Albania 16 10000
## 2 China 5 12000
## 3 Sudan 10 500
## 4 Turkey 16 9000
## 5 USA 19 20000
You can merge on more than one variable. Say you had two datasets that look like this:
# Generate toy dataset 1:
demo <- data.frame(expand.grid(country = c("USA", "China", "Sudan"),
year = 1994:1996),
democracy_score = round(runif(9, 0, 20), 0))
demo
## country year democracy_score
## 1 USA 1994 11
## 2 China 1994 19
## 3 Sudan 1994 9
## 4 USA 1995 16
## 5 China 1995 13
## 6 Sudan 1995 15
## 7 USA 1996 17
## 8 China 1996 3
## 9 Sudan 1996 1
# Generate toy dataset 2:
econ <- data.frame(expand.grid(year = 1994:1996,
country = c("USA", "China", "Sudan")),
gdp_pc = round(runif(9, 1000, 20000), 0))
econ
## year country gdp_pc
## 1 1994 USA 9769
## 2 1995 USA 15519
## 3 1996 USA 6386
## 4 1994 China 7077
## 5 1995 China 15253
## 6 1996 China 1541
## 7 1994 Sudan 3290
## 8 1995 Sudan 5117
## 9 1996 Sudan 15018
merge(demo, econ, by = c("country", "year"))
## country year democracy_score gdp_pc
## 1 China 1994 19 7077
## 2 China 1995 13 15253
## 3 China 1996 3 1541
## 4 Sudan 1994 9 3290
## 5 Sudan 1995 15 5117
## 6 Sudan 1996 1 15018
## 7 USA 1994 11 9769
## 8 USA 1995 16 15519
## 9 USA 1996 17 6386
Here, we’re merging on both country and year.
If one of the data frames has missing observations, only observations the two data frames have in common will be kept when merging, unless you specify all.x
, all.y
, or all
:
econ <- econ[-c(5, 9), ] #delete obs. 5 & 9 for illustrative purposes
econ
## year country gdp_pc
## 1 1994 USA 9769
## 2 1995 USA 15519
## 3 1996 USA 6386
## 4 1994 China 7077
## 6 1996 China 1541
## 7 1994 Sudan 3290
## 8 1995 Sudan 5117
dim(demo); dim(econ) #different number of observations (rows)
## [1] 9 3
## [1] 7 3
merge(demo, econ, by = c("country", "year")) #keep only matching observations
## country year democracy_score gdp_pc
## 1 China 1994 19 7077
## 2 China 1996 3 1541
## 3 Sudan 1994 9 3290
## 4 Sudan 1995 15 5117
## 5 USA 1994 11 9769
## 6 USA 1995 16 15519
## 7 USA 1996 17 6386
merge(demo, econ, by = c("country", "year"), all.x = TRUE) #keep all observations in 'demo'
## country year democracy_score gdp_pc
## 1 USA 1994 11 9769
## 2 USA 1995 16 15519
## 3 USA 1996 17 6386
## 4 China 1994 19 7077
## 5 China 1995 13 NA
## 6 China 1996 3 1541
## 7 Sudan 1994 9 3290
## 8 Sudan 1995 15 5117
## 9 Sudan 1996 1 NA
In the previous examples the ID variables we used for merging uniquely identified observations in each data frame. This is called one-to-one merging.
Sometimes we need to do things slightly differently, using one-to-many merging. Say, for example, we have the following two data frames:
demo <- data.frame(expand.grid(country = c("USA", "China", "Sudan"),
year = 1994:1996),
democracy_score = round(runif(9, 0, 20), 0))
region <- data.frame(country = c("USA", "China", "Sudan"),
region = c("America", "Asia", "Africa"))
demo
## country year democracy_score
## 1 USA 1994 4
## 2 China 1994 8
## 3 Sudan 1994 4
## 4 USA 1995 15
## 5 China 1995 14
## 6 Sudan 1995 16
## 7 USA 1996 7
## 8 China 1996 16
## 9 Sudan 1996 2
region
## country region
## 1 USA America
## 2 China Asia
## 3 Sudan Africa
A one-to-many merge of these data frames would look like this:
merge(demo, region, by = "country")
## country year democracy_score region
## 1 China 1994 8 Asia
## 2 China 1995 14 Asia
## 3 China 1996 16 Asia
## 4 Sudan 1995 16 Africa
## 5 Sudan 1994 4 Africa
## 6 Sudan 1996 2 Africa
## 7 USA 1994 4 America
## 8 USA 1996 7 America
## 9 USA 1995 15 America
Exercise: Test yourself by merging the three data frames defined below. Your final data frame should have 9 rows and 5 columns. Hint: merge only two data frames at a time.
df1 <- data.frame(name = c("Mary", "Thor", "Sven", "Jane", "Ake", "Stephan",
"Bjorn", "Oden", "Dennis"),
treatment_gr = c(rep(c(1, 2, 3), each = 3)),
weight_p1 = round(runif(9, 100, 200), 0))
df2 <- data.frame(name = c("Sven", "Jane", "Ake", "Mary", "Thor", "Stephan",
"Oden", "Bjorn"),
weight_p2 = round(runif(8, 100, 200), 0))
df3 <- data.frame(treatment_gr = c(1, 2, 3),
type = c("dog-lovers", "cat-lovers", "all-lovers"))
Appending means matching datasets vertically. We can do this in R using rbind()
. The two dataframes you’re appending must have identical variable names. Here’s an example:
df1 <- data.frame(year = rep(1990:1995, 2),
country = c(rep("country1", 6), rep("country2", 6)))
df2 <- data.frame(year = rep(1996:2000, 2),
country = c(rep("country1", 5), rep("country2", 5)))
df <- rbind(df1, df2) #append
require(dplyr) #to display output in certain order
arrange(df, country, year)
## year country
## 1 1990 country1
## 2 1991 country1
## 3 1992 country1
## 4 1993 country1
## 5 1994 country1
## 6 1995 country1
## 7 1996 country1
## 8 1997 country1
## 9 1998 country1
## 10 1999 country1
## 11 2000 country1
## 12 1990 country2
## 13 1991 country2
## 14 1992 country2
## 15 1993 country2
## 16 1994 country2
## 17 1995 country2
## 18 1996 country2
## 19 1997 country2
## 20 1998 country2
## 21 1999 country2
## 22 2000 country2
A prominent economic theory predicts that higher income inequality should be associated with more redistribution from the rich to the poor (Meltzer and Richard 1981).
Let’s create a dataset that will allow us to test this prediction using U.S. state-level data.
Inequality data are available here. Use the dataset with six measures of inequality.
Tax data that can be used to measure fiscal capacity/redistribution are available here. Use the State Government Tax Collections
link.
Download the data from these sites. Take a moment to familiarize yourself with the codebooks. Then read each dataset into R.
Keep only the Year, State, Name, Total Taxes, and Total Income Taxes variables in the tax dataset, and rename these if necessary. Keep all the variables in the inequality dataset. Subset both datasets to be in the year range 1960-2012. (Though not necessary, try to use the piping functionality we talked about during Week 1 to execute several commands to the same data frames.)
Merge the two datasets. Take a moment to think about how to do this. Your final dataset should have 2650 rows and 12 variables. (Keep only data on the 50 states; 50 states x 53 years = 2650 observations.) Hint: You may find this dataset with state identifiers helpful.
The Total Tax and Total Income Tax variables from the tax dataset have commas in them, and are therefore not numeric. Remove the commas and convert the variables to numeric. Hint: gsub()
.
The code used in this tutorial is available here.