Wide versus long data
Illustrating the difference between wide and long datasets is easiest using an example. Here are two datasets with the exact same information represented in wide and long form respectively (imagine that avgtemp
represents average temperature in Celsius):
# Create long dataset
country_long <- data.frame(
expand.grid(country = c("Sweden", "Denmark", "Norway"), year = 1994:1996),
avgtemp = round(runif(9, 3, 12), 0)
)
country_long
## country year avgtemp
## 1 Sweden 1994 6
## 2 Denmark 1994 6
## 3 Norway 1994 3
## 4 Sweden 1995 5
## 5 Denmark 1995 8
## 6 Norway 1995 11
## 7 Sweden 1996 7
## 8 Denmark 1996 8
## 9 Norway 1996 7
# Create wide dataset
country_wide <- data.frame(
country = c("Sweden", "Denmark", "Norway"),
avgtemp.1994 = country_long$avgtemp[1:3],
avgtemp.1995 = country_long$avgtemp[4:6],
avgtemp.1996 = country_long$avgtemp[7:9])
country_wide
## country avgtemp.1994 avgtemp.1995 avgtemp.1996
## 1 Sweden 6 5 7
## 2 Denmark 6 8 8
## 3 Norway 3 11 7
As is obvious, the long dataset separates the unit of analysis (country-year) into two separate variables. The wide dataset combines one of the keys (year
) with the value variable (avgtemp
).
A case for long data
There are many reasons to prefer datasets structured in long form. Repeating some of the points made in Hadley Wickham’s excellent paper on the topic, here are three reasons why you should attempt to structure your data in long form:
If you have many value variables, it is difficult to summarize wide-form datasets at a glance (which in turn makes it hard to identify mistakes in the data). For example, imagine we have a dataset with 50 years and 10 value variables of interest — this would result in 500 columns in wide form. Summarizing each column to look for strange observations, or simply understanding which variables are included in the dataset, becomes difficult in this case.
Structuring data as key-value pairs — as is done in long-form datasets — facilitates conceptual clarity. For example, in
country_long
above, it is clear that the unit of analysis is country-year — or, put differently, that the variablescountry
andyear
jointly constitute the key in the dataset. In wide-form datasets, one of the variables that constitutes the unit of analysis is mixed with a variable that holds values. (Read more about this in Hadley’s paper referenced above.)Long-form datasets are often required for advanced statistical analysis and graphing. For example, if you wanted to run a regression with year and/or country fixed effects, you would have to structure your data in long form. Furthermore, many graphing packages, including
ggplot
, rely on your data being in long form.
Wide-to-long conversion
To illustrate how to convert a dataset from wide to long format, we’ll use a UNICEF dataset on under-five child mortality across 196 countries. Download the dataset here; it is based on data that can be found at www.childmortality.org. The under-five mortality rate is expressed as the number of under-five deaths per 1,000 live births.
Set your working directory and read the file:
u5mr <- read.csv("unicef-u5mr.csv")
This dataset has 196 rows, one for each country, and 67 variables:
dim(u5mr) #dimensions of the data frame
## [1] 196 67
names(u5mr) #the variable names
## [1] "CountryName" "U5MR.1950" "U5MR.1951" "U5MR.1952" "U5MR.1953"
## [6] "U5MR.1954" "U5MR.1955" "U5MR.1956" "U5MR.1957" "U5MR.1958"
## [11] "U5MR.1959" "U5MR.1960" "U5MR.1961" "U5MR.1962" "U5MR.1963"
## [16] "U5MR.1964" "U5MR.1965" "U5MR.1966" "U5MR.1967" "U5MR.1968"
## [21] "U5MR.1969" "U5MR.1970" "U5MR.1971" "U5MR.1972" "U5MR.1973"
## [26] "U5MR.1974" "U5MR.1975" "U5MR.1976" "U5MR.1977" "U5MR.1978"
## [31] "U5MR.1979" "U5MR.1980" "U5MR.1981" "U5MR.1982" "U5MR.1983"
## [36] "U5MR.1984" "U5MR.1985" "U5MR.1986" "U5MR.1987" "U5MR.1988"
## [41] "U5MR.1989" "U5MR.1990" "U5MR.1991" "U5MR.1992" "U5MR.1993"
## [46] "U5MR.1994" "U5MR.1995" "U5MR.1996" "U5MR.1997" "U5MR.1998"
## [51] "U5MR.1999" "U5MR.2000" "U5MR.2001" "U5MR.2002" "U5MR.2003"
## [56] "U5MR.2004" "U5MR.2005" "U5MR.2006" "U5MR.2007" "U5MR.2008"
## [61] "U5MR.2009" "U5MR.2010" "U5MR.2011" "U5MR.2012" "U5MR.2013"
## [66] "U5MR.2014" "U5MR.2015"
Let’s convert it to long format, where the unit of analysis is country-year. That is, we’ll have three variables indicating country, year, and U5MR. This can be done using gather()
in the tidyr
package.
require(tidyr)
## Loading required package: tidyr
u5mr_long <- u5mr %>% gather(year, u5mr, U5MR.1950:U5MR.2015)
tail(u5mr_long)
## CountryName year u5mr
## 12931 Uruguay U5MR.2015 10.1
## 12932 Uzbekistan U5MR.2015 39.1
## 12933 Venezuela U5MR.2015 14.9
## 12934 Samoa U5MR.2015 17.5
## 12935 Yemen U5MR.2015 41.9
## 12936 Zambia U5MR.2015 64.0
gather()
takes three arguments. The first two specify a key-value pair: year is the key and u5mr the value. The third argument specifies which variables in the original data to convert into the key-value combination (in this case, all variables from U5MR.1950
to U5MR.2015
).
Note that it would be better if year
indicated years as numeric values. This only requires one more line of code, which uses mutate()
from the dplyr
package, gsub()
to remove instances of “U5MR.
”, and as.numeric()
to convert the variable from character to numeric.
require(dplyr)
u5mr_long <- u5mr %>%
gather(year, u5mr, U5MR.1950:U5MR.2015) %>%
mutate(year = as.numeric(gsub("U5MR.", "", year)))
tail(u5mr_long)
## CountryName year u5mr
## 12931 Uruguay 2015 10.1
## 12932 Uzbekistan 2015 39.1
## 12933 Venezuela 2015 14.9
## 12934 Samoa 2015 17.5
## 12935 Yemen 2015 41.9
## 12936 Zambia 2015 64.0
Learn more about gather()
and other excellent tidyr
functions that facilitate tidy data here and here.
Exercises
Go to www.childmortality.org and download the dataset under “Estimates for under-five, infant and neonatal mortality”. (You can also download it here.) The dataset is in wide form and contains six value variables of interest: under-five (0-4 years) mortality, infant (0-1 years) mortality, neonatal (0-1 month) mortality, as well as the number of under-five, infant, and neonatal deaths.
Read the dataset into R. Note that it is in .xlsx format — there are R packages that will allow you to read such files, or you can convert it to .csv using software such as Excel and then use
read.csv()
. Note that you don’t need the first few rows of the dataset.Subset the dataset to the median estimate for each country (i.e., drop rows representing lower and upper uncertainty bounds).
- Convert the dataset to long form. The final dataset should have four variables:
country
(a character variable with the country name)year
(a numeric variable with the year)type
(a character or factor variable with six categories: “U5MR”, “IMR”, “NMR”, “Under five deaths”, “Infant deaths”, and “Neonatal deaths”).value
(a numeric variable with the value for the given country, year, and type)
For example, here are what 12 observations for Afghanistan (displaying only years 1990-91) should look like:
## country year type value
## 1 Afghanistan 1990 U5MR 181.0
## 2 Afghanistan 1990 IMR 122.5
## 3 Afghanistan 1990 NMR 52.8
## 4 Afghanistan 1990 Under five deaths 100437.0
## 5 Afghanistan 1990 Infant deaths 68718.0
## 6 Afghanistan 1990 Neonatal deaths 29658.0
## 7 Afghanistan 1991 U5MR 174.2
## 8 Afghanistan 1991 IMR 118.3
## 9 Afghanistan 1991 NMR 51.9
## 10 Afghanistan 1991 Under five deaths 101417.0
## 11 Afghanistan 1991 Infant deaths 69482.0
## 12 Afghanistan 1991 Neonatal deaths 30537.0
- How many under-five, infant, and neonatal deaths occurred in total in the world in years 1990, 1995, 2000, 2005, and 2015? (Hint: see the tutorial on collapsing data.)