1 Merging and Appending Data

Up to this point, you have been given datasets to work with. The real world is messier. Many, many times you’ll have to merge and append datasets to have something to work with.

1.1 Merging

Merging means matching rows based on one or more ID variables. Why would you want to do this? Say you have economic development data on all countries in the world. You’re interested in how economic development affects democracy. Unfortunately, the dataset doesn’t have a variable measuring levels of democracy, so you go out and find another dataset that measures democracy and match this second dataset with the first one. You can now look at the correlation between democracy and economic development.

1.1.1 One-to-one merging

To illustrate, say you had two datasets that look like this:

# Generate toy dataset 1:
demo <- data.frame(country = c("USA", "Albania", "Turkey", "China", "Sudan"),
                   democracy_score = c(19, 16, 16, 5, 10))
demo

##   country democracy_score
## 1     USA              19
## 2 Albania              16
## 3  Turkey              16
## 4   China               5
## 5   Sudan              10

# Generate toy dataset 2:
econ <- data.frame(country = c("China", "Albania", "Turkey", "USA", "Sudan"),
                   gdp_pc = c(12000, 10000, 9000, 20000, 500))
econ

##   country gdp_pc
## 1   China  12000
## 2 Albania  10000
## 3  Turkey   9000
## 4     USA  20000
## 5   Sudan    500

We can merge these datasets using R’s merge() command, where by specifies the variable the two datasets has in common (usually called an “ID” variable):

df <- merge(demo, econ, by = "country")
df

##   country democracy_score gdp_pc
## 1 Albania              16  10000
## 2   China               5  12000
## 3   Sudan              10    500
## 4  Turkey              16   9000
## 5     USA              19  20000

You can merge on more than one variable. Say you had two datasets that look like this:

# Generate toy dataset 1:
demo <- data.frame(expand.grid(country = c("USA", "China", "Sudan"),
                               year = 1994:1996),
                   democracy_score = round(runif(9, 0, 20), 0))
demo

##   country year democracy_score
## 1     USA 1994              11
## 2   China 1994              19
## 3   Sudan 1994               9
## 4     USA 1995              16
## 5   China 1995              13
## 6   Sudan 1995              15
## 7     USA 1996              17
## 8   China 1996               3
## 9   Sudan 1996               1

# Generate toy dataset 2:
econ <- data.frame(expand.grid(year = 1994:1996,
                               country = c("USA", "China", "Sudan")),
                   gdp_pc = round(runif(9, 1000, 20000), 0))
econ

##   year country gdp_pc
## 1 1994     USA   9769
## 2 1995     USA  15519
## 3 1996     USA   6386
## 4 1994   China   7077
## 5 1995   China  15253
## 6 1996   China   1541
## 7 1994   Sudan   3290
## 8 1995   Sudan   5117
## 9 1996   Sudan  15018

merge(demo, econ, by = c("country", "year"))

##   country year democracy_score gdp_pc
## 1   China 1994              19   7077
## 2   China 1995              13  15253
## 3   China 1996               3   1541
## 4   Sudan 1994               9   3290
## 5   Sudan 1995              15   5117
## 6   Sudan 1996               1  15018
## 7     USA 1994              11   9769
## 8     USA 1995              16  15519
## 9     USA 1996              17   6386

Here, we’re merging on both country and year.

If one of the data frames has missing observations, only observations the two data frames have in common will be kept when merging, unless you specify all.x, all.y, or all:

econ <- econ[-c(5, 9), ]       #delete obs. 5 & 9 for illustrative purposes
econ

##   year country gdp_pc
## 1 1994     USA   9769
## 2 1995     USA  15519
## 3 1996     USA   6386
## 4 1994   China   7077
## 6 1996   China   1541
## 7 1994   Sudan   3290
## 8 1995   Sudan   5117

dim(demo); dim(econ)           #different number of observations (rows)

## [1] 9 3

## [1] 7 3

merge(demo, econ, by = c("country", "year"))                 #keep only matching observations

##   country year democracy_score gdp_pc
## 1   China 1994              19   7077
## 2   China 1996               3   1541
## 3   Sudan 1994               9   3290
## 4   Sudan 1995              15   5117
## 5     USA 1994              11   9769
## 6     USA 1995              16  15519
## 7     USA 1996              17   6386

merge(demo, econ, by = c("country", "year"), all.x = TRUE)   #keep all observations in 'demo'

##   country year democracy_score gdp_pc
## 1     USA 1994              11   9769
## 2     USA 1995              16  15519
## 3     USA 1996              17   6386
## 4   China 1994              19   7077
## 5   China 1995              13     NA
## 6   China 1996               3   1541
## 7   Sudan 1994               9   3290
## 8   Sudan 1995              15   5117
## 9   Sudan 1996               1     NA

1.1.2 One-to-many merging

In the previous examples the ID variables we used for merging uniquely identified observations in each data frame. This is called one-to-one merging.

Sometimes we need to do things slightly differently, using one-to-many merging. Say, for example, we have the following two data frames:

demo <- data.frame(expand.grid(country = c("USA", "China", "Sudan"),
                               year = 1994:1996),
                   democracy_score = round(runif(9, 0, 20), 0))

region <- data.frame(country = c("USA", "China", "Sudan"),
                     region = c("America", "Asia", "Africa"))
demo

##   country year democracy_score
## 1     USA 1994               4
## 2   China 1994               8
## 3   Sudan 1994               4
## 4     USA 1995              15
## 5   China 1995              14
## 6   Sudan 1995              16
## 7     USA 1996               7
## 8   China 1996              16
## 9   Sudan 1996               2

region

##   country  region
## 1     USA America
## 2   China    Asia
## 3   Sudan  Africa

A one-to-many merge of these data frames would look like this:

merge(demo, region, by = "country")

##   country year democracy_score  region
## 1   China 1994               8    Asia
## 2   China 1995              14    Asia
## 3   China 1996              16    Asia
## 4   Sudan 1995              16  Africa
## 5   Sudan 1994               4  Africa
## 6   Sudan 1996               2  Africa
## 7     USA 1994               4 America
## 8     USA 1996               7 America
## 9     USA 1995              15 America

Exercise: Test yourself by merging the three data frames defined below. Your final data frame should have 9 rows and 5 columns. Hint: merge only two data frames at a time.

df1 <- data.frame(name = c("Mary", "Thor", "Sven", "Jane", "Ake", "Stephan",
                           "Bjorn", "Oden", "Dennis"),
                  treatment_gr = c(rep(c(1, 2, 3), each = 3)),
                  weight_p1 = round(runif(9, 100, 200), 0))
df2 <- data.frame(name = c("Sven", "Jane", "Ake", "Mary", "Thor", "Stephan",
                           "Oden", "Bjorn"),
                  weight_p2 = round(runif(8, 100, 200), 0))
df3 <- data.frame(treatment_gr = c(1, 2, 3),
                  type = c("dog-lovers", "cat-lovers", "all-lovers"))

1.2 Appending

Appending means matching datasets vertically. We can do this in R using rbind(). The two dataframes you’re appending must have identical variable names. Here’s an example:

df1 <- data.frame(year = rep(1990:1995, 2),
                  country = c(rep("country1", 6), rep("country2", 6)))
df2 <- data.frame(year = rep(1996:2000, 2),
                  country = c(rep("country1", 5), rep("country2", 5)))

df <- rbind(df1, df2)         #append

require(dplyr)                #to display output in certain order
arrange(df, country, year)

##    year  country
## 1  1990 country1
## 2  1991 country1
## 3  1992 country1
## 4  1993 country1
## 5  1994 country1
## 6  1995 country1
## 7  1996 country1
## 8  1997 country1
## 9  1998 country1
## 10 1999 country1
## 11 2000 country1
## 12 1990 country2
## 13 1991 country2
## 14 1992 country2
## 15 1993 country2
## 16 1994 country2
## 17 1995 country2
## 18 1996 country2
## 19 1997 country2
## 20 1998 country2
## 21 1999 country2
## 22 2000 country2

1.3 Exercises

A prominent economic theory predicts that higher income inequality should be associated with more redistribution from the rich to the poor (Meltzer and Richard 1981).

Let’s create a dataset that will allow us to test this prediction using U.S. state-level data.

Inequality data are available here. Use the dataset with six measures of inequality.
Tax data that can be used to measure fiscal capacity/redistribution are available here. Use the State Government Tax Collections link.

Download the data from these sites. Take a moment to familiarize yourself with the codebooks. Then read each dataset into R.
Keep only the Year, State, Name, Total Taxes, and Total Income Taxes variables in the tax dataset, and rename these if necessary. Keep all the variables in the inequality dataset. Subset both datasets to be in the year range 1960-2012. (Though not necessary, try to use the piping functionality we talked about during Week 1 to execute several commands to the same data frames.)
Merge the two datasets. Take a moment to think about how to do this. Your final dataset should have 2650 rows and 12 variables. (Keep only data on the 50 states; 50 states x 53 years = 2650 observations.) Hint: You may find this dataset with state identifiers helpful.
The Total Tax and Total Income Tax variables from the tax dataset have commas in them, and are therefore not numeric. Remove the commas and convert the variables to numeric. Hint: gsub().

The code used in this tutorial is available here.