Merging
Say you’re interested in how economic development affects democracy across countries, but all you have are two separate datasets on these two factors. What you need to do is to combine the two. When you’re done with this, you also need information on what world region each country is located in. These tasks can be accomplished with one-to-one merging and one-to-many merging.
One-to-one merging
To illustrate, say you had two datasets that look like this:
# Generate toy dataset 1:
demo <- data.frame(
country = c("USA", "Albania", "Turkey", "China", "Sudan"),
democracy_score = c(19, 16, 16, 5, 10)
)
demo
## country democracy_score
## 1 USA 19
## 2 Albania 16
## 3 Turkey 16
## 4 China 5
## 5 Sudan 10
# Generate toy dataset 2:
econ <- data.frame(
country = c("China", "Albania", "Turkey", "USA", "Sudan"),
gdp_pc = c(12000, 10000, 9000, 20000, 500)
)
econ
## country gdp_pc
## 1 China 12000
## 2 Albania 10000
## 3 Turkey 9000
## 4 USA 20000
## 5 Sudan 500
We can merge these datasets using R’s merge()
command, where by
specifies the variable the two datasets has in common (usually called an “ID” variable):
df <- merge(demo, econ, by = "country")
df
## country democracy_score gdp_pc
## 1 Albania 16 10000
## 2 China 5 12000
## 3 Sudan 10 500
## 4 Turkey 16 9000
## 5 USA 19 20000
You can merge on more than one variable. Say you had two datasets that look like this:
# Generate toy dataset 1:
demo <- data.frame(
expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
democracy_score = round(runif(9, 0, 20), 0)
)
demo
## country year democracy_score
## 1 USA 1994 9
## 2 China 1994 10
## 3 Sudan 1994 4
## 4 USA 1995 19
## 5 China 1995 7
## 6 Sudan 1995 17
## 7 USA 1996 16
## 8 China 1996 13
## 9 Sudan 1996 5
# Generate toy dataset 2:
econ <- data.frame(
expand.grid(year = 1994:1996, country = c("USA", "China", "Sudan")),
gdp_pc = round(runif(9, 1000, 20000), 0)
)
econ
## year country gdp_pc
## 1 1994 USA 10230
## 2 1995 USA 11714
## 3 1996 USA 11904
## 4 1994 China 9735
## 5 1995 China 2908
## 6 1996 China 9009
## 7 1994 Sudan 8489
## 8 1995 Sudan 17027
## 9 1996 Sudan 19729
merge(demo, econ, by = c("country", "year"))
## country year democracy_score gdp_pc
## 1 China 1994 10 9735
## 2 China 1995 7 2908
## 3 China 1996 13 9009
## 4 Sudan 1994 4 8489
## 5 Sudan 1995 17 17027
## 6 Sudan 1996 5 19729
## 7 USA 1994 9 10230
## 8 USA 1995 19 11714
## 9 USA 1996 16 11904
Here, we’re merging on both country and year.
If one of the data frames has missing observations, only observations the two data frames have in common will be kept when merging, unless you specify all.x
, all.y
, or all
:
econ <- econ[-c(5, 9), ] #delete obs. 5 & 9 for illustrative purposes
econ
## year country gdp_pc
## 1 1994 USA 10230
## 2 1995 USA 11714
## 3 1996 USA 11904
## 4 1994 China 9735
## 6 1996 China 9009
## 7 1994 Sudan 8489
## 8 1995 Sudan 17027
dim(demo) #different number of observations (rows) than 'econ'
## [1] 9 3
dim(econ) #different number of observations (rows) than 'demo'
## [1] 7 3
merge(demo, econ, by = c("country", "year")) #keep only matching observations
## country year democracy_score gdp_pc
## 1 China 1994 10 9735
## 2 China 1996 13 9009
## 3 Sudan 1994 4 8489
## 4 Sudan 1995 17 17027
## 5 USA 1994 9 10230
## 6 USA 1995 19 11714
## 7 USA 1996 16 11904
merge(demo, econ, by = c("country", "year"), all.x = TRUE) #keep all observations in 'demo'
## country year democracy_score gdp_pc
## 1 USA 1994 9 10230
## 2 USA 1995 19 11714
## 3 USA 1996 16 11904
## 4 China 1994 10 9735
## 5 China 1995 7 NA
## 6 China 1996 13 9009
## 7 Sudan 1994 4 8489
## 8 Sudan 1995 17 17027
## 9 Sudan 1996 5 NA
One-to-many merging
In the previous examples the ID variables we used for merging uniquely identified observations in each data frame. This is called one-to-one merging.
Sometimes we need to do things slightly differently, using one-to-many merging. Say, for example, we have the following two data frames:
demo <- data.frame(
expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
democracy_score = round(runif(9, 0, 20), 0)
)
demo
## country year democracy_score
## 1 USA 1994 7
## 2 China 1994 16
## 3 Sudan 1994 6
## 4 USA 1995 4
## 5 China 1995 5
## 6 Sudan 1995 15
## 7 USA 1996 16
## 8 China 1996 5
## 9 Sudan 1996 14
region <- data.frame(
country = c("USA", "China", "Sudan"),
region = c("America", "Asia", "Africa")
)
region
## country region
## 1 USA America
## 2 China Asia
## 3 Sudan Africa
A one-to-many merge of these data frames would look like this:
merge(demo, region, by = "country")
## country year democracy_score region
## 1 China 1994 16 Asia
## 2 China 1995 5 Asia
## 3 China 1996 5 Asia
## 4 Sudan 1995 15 Africa
## 5 Sudan 1994 6 Africa
## 6 Sudan 1996 14 Africa
## 7 USA 1994 7 America
## 8 USA 1996 16 America
## 9 USA 1995 4 America
Appending
Appending means matching datasets vertically. We can do this in R using rbind()
. The two dataframes you’re appending must have identical variable names. Here’s an example:
df1 <- data.frame(
year = rep(1990:1995, 2),
country = c(rep("country1", 6), rep("country2", 6))
)
df2 <- data.frame(
year = rep(1996:2000, 2),
country = c(rep("country1", 5), rep("country2", 5))
)
df <- rbind(df1, df2) #append
require(dplyr) #to display output in certain order
arrange(df, country, year)
## year country
## 1 1990 country1
## 2 1991 country1
## 3 1992 country1
## 4 1993 country1
## 5 1994 country1
## 6 1995 country1
## 7 1996 country1
## 8 1997 country1
## 9 1998 country1
## 10 1999 country1
## 11 2000 country1
## 12 1990 country2
## 13 1991 country2
## 14 1992 country2
## 15 1993 country2
## 16 1994 country2
## 17 1995 country2
## 18 1996 country2
## 19 1997 country2
## 20 1998 country2
## 21 1999 country2
## 22 2000 country2
Exercises
Warm-up: Merge the three data frames defined below. Your final data frame should have 9 rows and 5 columns. Hint: merge only two data frames at a time.
df1 <- data.frame(
name = c("Mary", "Thor", "Sven", "Jane", "Ake", "Stephan",
"Bjorn", "Oden", "Dennis"),
treatment_gr = c(rep(c(1, 2, 3), each = 3)),
weight_p1 = round(runif(9, 100, 200), 0)
)
df2 <- data.frame(
name = c("Sven", "Jane", "Ake", "Mary", "Thor", "Stephan",
"Oden", "Bjorn"),
weight_p2 = round(runif(8, 100, 200), 0)
)
df3 <- data.frame(
treatment_gr = c(1, 2, 3),
type = c("dog-lovers", "cat-lovers", "all-lovers")
)
More extensive: A prominent economic theory predicts that higher income inequality should be associated with more redistribution from the rich to the poor (Meltzer and Richard 1981). Let’s create a dataset that will allow us to test this prediction using U.S. state-level data.
- Download data on inequality and tax collection (links below). Take a moment to familiarize yourself with the codebooks. Then read each dataset into R.
- Dataset with six measures of inequality (originally from Mark W. Frank).
- Tax data that can be used to measure fiscal capacity/redistribution. Use the “State Government Tax Collections” link.
Keep only the Year, State, Name, Total Taxes, and Total Income Taxes variables in the tax dataset, and rename these if necessary. Keep all the variables in the inequality dataset. Subset both datasets to be in the year range 1960-2012.
Merge the two datasets. Take a moment to think about how to do this. Your final dataset should have 2650 rows and 12 variables. (Keep only data on the 50 states; 50 states x 53 years = 2650 observations.) Hint: You may find this dataset with state identifiers helpful.
The Total Tax and Total Income Tax variables from the tax dataset have commas in them, and are therefore not numeric. Remove the commas and convert the variables to numeric. Hint:
gsub()
.