Creating datasets
Let’s start by learning how to create a dataset in R. This turns out to be very simple: just combine vectors using the data.frame() command.
# Create three vectors
name <- c("al", "bea", "carol")
age <- c(6, 7, 4)
hair <- c("brown", "green", "blond")
# Create data frame
children <- data.frame(name, age, hair)
children
## name age hair
## 1 al 6 brown
## 2 bea 7 green
## 3 carol 4 blond
# Creating a data frame can also be done without first saving vectors
children <- data.frame(
  name = c("al", "bea", "carol"),
  age = c(6, 7, 4),
  hair = c("brown", "green", "blond")
)
children
## name age hair
## 1 al 6 brown
## 2 bea 7 green
## 3 carol 4 blond
We created a dataset called children, which has 3 rows and 3 columns. We used two approaches that differ in whether they first save vectors to R’s memory.
Dataset structure
More important than learning the mechanics of creating a dataset in R is understanding the general structure of datasets:
- Each column should consist of a vector that gives some fact about the world (e.g., age in years). We usually refer to these columns as variables.
- At least one column should identify who or what the information in the data is about. Such a variable is called an “id” variable or “key”. In the children dataset above this variable is name. The remaining variables have the facts or measurements that we care about. For example, we gather from the dataset that Al is 6 years old (one fact) and that Al has brown hair (a second fact).
To better understand the proper structure of datasets, let’s create a second data frame. Suppose here that gdp_pc is a measure of a country’s GDP per capita in a given year. (Use ?expand.grid and ?runif to learn more about these functions, though that is not a priority right now.)
countries <- data.frame(
  expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
  gdp_pc = round(runif(9, 1000, 20000), 0)
)
countries
## country year gdp_pc
## 1 USA 1994 16454
## 2 China 1994 13753
## 3 Sudan 1994 16899
## 4 USA 1995 16964
## 5 China 1995 2358
## 6 Sudan 1995 4262
## 7 USA 1996 5995
## 8 China 1996 8699
## 9 Sudan 1996 3242
This time around our dataset has two id variables: country and year. Why two and not one? One way to think about it is that country by itself wouldn’t be sufficient to uniquely identify a row, because there are three rows for each country (and likewise with year). Combined, however, country and year uniquely identify each row. In other words, GDP per capita (the only fact or measurement in this dataset) describes a given country in a given year.
We can say that the unit of analysis in the dataset countries is country-year. This means that two id variables (country and year) are required to uniquely identify each row. In the children dataset above the unit of analysis is “child” or “person”.
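One way to check a claimed unit of analysis is to test whether the id variables really are unique. The sketch below rebuilds the countries data with the GDP values copied from the printout above (typed in by hand, since runif() would give different numbers on each run) and uses duplicated() to look for repeats. The object names country_repeats and pair_repeats are made up for this illustration.

```r
# Rebuild the country-year dataset with fixed values (copied from the printout above)
countries <- data.frame(
  expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
  gdp_pc = c(16454, 13753, 16899, 16964, 2358, 4262, 5995, 8699, 3242)
)

# country by itself repeats, so it cannot identify a row
country_repeats <- any(duplicated(countries$country))
country_repeats

# country and year together never repeat, so the pair identifies each row
pair_repeats <- any(duplicated(countries[, c("country", "year")]))
pair_repeats
```

If the second check returned TRUE, the dataset would contain more than one row for some country-year, and country-year could not be its unit of analysis.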
Basic commands
Here are some commands that are useful for getting to know your data and for understanding dataset structures in general.
Dimensions
The first is dim(), which gives the dimensions of a data frame. The number of rows is listed first, the number of columns second.
dim(countries)
## [1] 9 3
Use nrow() and ncol() to get the number of rows or columns separately. These commands are useful for code generalization.
nrow(countries)
## [1] 9
ncol(countries)
## [1] 3
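As a rough illustration of what “code generalization” can mean here: instead of hard-coding the number 9, we can ask R for the last row, whatever its position happens to be. The sketch assumes the countries data with the values printed earlier; last_row and last_col are names invented for the example.

```r
# countries with fixed values (copied from the earlier printout)
countries <- data.frame(
  expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
  gdp_pc = c(16454, 13753, 16899, 16964, 2358, 4262, 5995, 8699, 3242)
)

# Last row, without knowing in advance how many rows there are
last_row <- countries[nrow(countries), ]
last_row

# Last column, without knowing in advance how many columns there are
last_col <- countries[, ncol(countries)]
last_col
```

This code keeps working if rows or columns are added later, which would not be true of countries[9, ] or countries[, 3].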
Snapshots
Use head() and tail() to look at the first and last few rows of a dataset, respectively. This is more useful when we have datasets with many observations.
head(countries)
## country year gdp_pc
## 1 USA 1994 16454
## 2 China 1994 13753
## 3 Sudan 1994 16899
## 4 USA 1995 16964
## 5 China 1995 2358
## 6 Sudan 1995 4262
tail(countries)
## country year gdp_pc
## 4 USA 1995 16964
## 5 China 1995 2358
## 6 Sudan 1995 4262
## 7 USA 1996 5995
## 8 China 1996 8699
## 9 Sudan 1996 3242
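Both functions show six rows by default, but they also take an n argument to change that. A small sketch, again assuming the countries data with the values printed above (first_three and last_two are illustrative names):

```r
# countries with fixed values (copied from the earlier printout)
countries <- data.frame(
  expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
  gdp_pc = c(16454, 13753, 16899, 16964, 2358, 4262, 5995, 8699, 3242)
)

# First three rows only
first_three <- head(countries, n = 3)
first_three

# Last two rows only
last_two <- tail(countries, n = 2)
last_two
```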
Other useful commands to get to know variables better include summary(), table(), and prop.table().
# Get some summary information about each variable
summary(countries)
##   country      year          gdp_pc
##  USA  :3   Min.   :1994   Min.   : 2358
##  China:3   1st Qu.:1994   1st Qu.: 4262
##  Sudan:3   Median :1995   Median : 8699
##            Mean   :1995   Mean   : 9847
##            3rd Qu.:1996   3rd Qu.:16454
##            Max.   :1996   Max.   :16964
# Number of observations by country
table(countries$country)
##
## USA China Sudan
## 3 3 3
# Proportion of observations by country
prop.table(table(countries$country))
##
## USA China Sudan
## 0.3333333 0.3333333 0.3333333
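table() also accepts two variables, which gives a quick way to see the unit of analysis: if every country-year cell holds a count of 1, each combination appears exactly once. The sketch below assumes the countries data with the values printed earlier; counts and shares are names chosen for the illustration.

```r
# countries with fixed values (copied from the earlier printout)
countries <- data.frame(
  expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
  gdp_pc = c(16454, 13753, 16899, 16964, 2358, 4262, 5995, 8699, 3242)
)

# Cross-tabulate country by year: one observation per cell
counts <- table(countries$country, countries$year)
counts

# Proportions by country, rounded to two decimals for readability
shares <- round(prop.table(table(countries$country)), 2)
shares
```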
Accessing specific rows and columns
Like with vectors, brackets ([]) can be used to access data in datasets. But unlike with vectors, we need to input two arguments, separated by a comma, into the brackets. The first argument always applies to rows while the second applies to columns.
countries <- data.frame(
  expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
  gdp_pc = round(runif(9, 1000, 20000), 0)
)
countries
## country year gdp_pc
## 1 USA 1994 9592
## 2 China 1994 9890
## 3 Sudan 1994 9554
## 4 USA 1995 9159
## 5 China 1995 7681
## 6 Sudan 1995 7619
## 7 USA 1996 6121
## 8 China 1996 9116
## 9 Sudan 1996 2136
# Access row 2, col 3
countries[2, 3]
## [1] 9890
# Access entire row 5
countries[5, ] #note: leaving second argument blank
## country year gdp_pc
## 5 China 1995 7681
# Access entire column 3
countries[, 3] #note: leaving first argument blank
## [1] 9592 9890 9554 9159 7681 7619 6121 9116 2136
In general, though, accessing rows and columns by index is bad for code generalization. It particularly causes problems when you add or delete rows/columns, because then the indexing will change (e.g., column 3 representing GDP per capita may now be in column 4).
For this reason, it’s better to access columns using column names.
# Access a column using column/variable name (two equivalent approaches)
countries$year
## [1] 1994 1994 1994 1995 1995 1995 1996 1996 1996
countries[, "year"]
## [1] 1994 1994 1994 1995 1995 1995 1996 1996 1996
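A few related forms are worth knowing. The sketch below assumes the countries data with the values from the most recent printout (typed in by hand); the object names are invented for the example. It shows double brackets as a third way to get a column as a vector, drop = FALSE to keep a single column as a data frame, and a character vector of names to select several columns at once.

```r
# countries with fixed values (copied from the most recent printout)
countries <- data.frame(
  expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
  gdp_pc = c(9592, 9890, 9554, 9159, 7681, 7619, 6121, 9116, 2136)
)

# Double brackets return the column as a vector, like $ does
year_vec <- countries[["year"]]

# drop = FALSE keeps the result as a one-column data frame
year_df <- countries[, "year", drop = FALSE]

# Several columns by name; the result is a data frame
country_gdp <- countries[, c("country", "gdp_pc")]
head(country_gdp, n = 3)
```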
Note that when we’re accessing a column this way, it’s just a vector and all the things we’ve learned about vectors apply. For example:
# Get mean gdp per cap
mean(countries$gdp_pc)
## [1] 7874.222
To access rows, it’s best to use a logical statement, which is covered in more detail in a separate tutorial on modifying data. But just to give an example, here’s how we can access a row using bracket notation and a logical statement:
countries[countries$year == 1995 & countries$country == "USA", ]
## country year gdp_pc
## 4 USA 1995 9159
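It can help to build the logical statement in steps and look at what goes inside the brackets. The sketch below assumes the countries data with the values from the most recent printout; is_1995, is_usa, keep, and usa_sudan_1996 are illustrative names, and the %in% example is an extra that the text above does not cover.

```r
# countries with fixed values (copied from the most recent printout)
countries <- data.frame(
  expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
  gdp_pc = c(9592, 9890, 9554, 9159, 7681, 7619, 6121, 9116, 2136)
)

# Each comparison gives one TRUE/FALSE value per row
is_1995 <- countries$year == 1995
is_usa <- countries$country == "USA"

# & keeps only the rows where both conditions are TRUE
keep <- is_1995 & is_usa
sum(keep)          # how many rows match

# Rows where keep is TRUE are returned
countries[keep, ]

# %in% matches any of several values
usa_sudan_1996 <- countries[countries$country %in% c("USA", "Sudan") &
                              countries$year == 1996, ]
usa_sudan_1996
```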
Reading data
Note: In this section we’ll be working with a dataset called world-small.csv, which you can download here.
So far we’ve created datasets ourselves. Often, however, we’ll want to read a dataset into R from a file. Datasets come in many formats (e.g., .csv, .txt, .dta, and .RData). R can read most data formats as is, but sometimes it may be necessary to manually reformat some elements in the file or even to convert the whole file to a different format (e.g., using Stat/Transfer). For now, we’ll assume that the file is in a readable format.
To read a file you need to:
- Specify where the file is located on your computer. This is referred to as setting your working directory.
- Execute a command that will read the file from your working directory.
Setting the working directory
You can set your working directory manually. In RStudio, go to Session –> Set Working Directory –> Choose Directory… and find the folder in which your file is located.
While this works, you should also set the working directory using code. Use setwd(path-to-dir), where path-to-dir is the path to the folder in which the file is located. How can you find this path? Here are instructions for Windows and Mac. If you’re still not sure how to do this, take a look at this tutorial.
To check that your working directory includes the file you want to read, use dir() without anything in the parentheses. This function outputs all the files in your working directory into the R console. So, if you want to read the world-small.csv file that you downloaded above, you should see this file listed when you execute dir().
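Here is a minimal sketch of that workflow. So that it runs on any computer, it uses R’s temporary folder (tempdir()) as a stand-in for your own data folder and creates an empty placeholder file there; in practice you would pass the path to the folder holding world-small.csv instead. The file name placeholder.csv and the object names are made up for the example.

```r
# Remember the current working directory so we can return to it
old_dir <- getwd()

# Stand-in for your data folder (assumption: replace with your own path)
demo_dir <- tempdir()
setwd(demo_dir)

# Create an empty placeholder file so there is something to find
file.create("placeholder.csv")

# Two ways to check that the file is in the working directory
found_in_listing <- "placeholder.csv" %in% dir()
found_by_name <- file.exists("placeholder.csv")
found_in_listing
found_by_name

# Go back to where we started
setwd(old_dir)
```

getwd() is the counterpart of setwd(): it reports the current working directory, which is handy when a file cannot be found.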
Reading the file
Now that we’ve told R where to look for our file, it’s time to read it. Different commands are used to read different types of files. This is the syntax used for reading a .csv file:
world <- read.csv("world-small.csv")
I’m reading the file from the working directory and assigning it to the object world, which has class data.frame.
class(world)
## [1] "data.frame"
Let’s check if the file was read correctly, using dim() (returns the dimensions), head() (returns the top six rows), and summary() (returns summary information about each variable):
dim(world) #the number of rows and columns
## [1] 145 4
head(world) #the first few rows of the dataset
## country region gdppcap08 polityIV
## 1 Albania C&E Europe 7715 17.8
## 2 Algeria Africa 8033 10.0
## 3 Angola Africa 5899 8.0
## 4 Argentina S. America 14333 18.0
## 5 Armenia C&E Europe 6070 15.0
## 6 Australia Asia-Pacific 35677 20.0
summary(world) #a summary of the variables in the dataset
##       country           region     gdppcap08        polityIV
##  Albania  :  1   Africa      :42   Min.   :  188   Min.   : 0.000
##  Algeria  :  1   C&E Europe  :25   1st Qu.: 2153   1st Qu.: 7.667
##  Angola   :  1   Asia-Pacific:24   Median : 7271   Median :16.000
##  Argentina:  1   S. America  :19   Mean   :13252   Mean   :13.408
##  Armenia  :  1   Middle East :16   3rd Qu.:19330   3rd Qu.:19.000
##  Australia:  1   W. Europe   :12   Max.   :85868   Max.   :20.000
##  (Other)  :139   (Other)     : 7
Everything looks as we would have hoped.
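One detail to watch for: the summary above counts the values of country and region, which suggests these columns were read as factors. That was the default before R 4.0.0; in later versions read.csv() leaves text columns as character unless told otherwise. The sketch below shows both behaviors on a tiny stand-in file written to a temporary location (the toy data frame copies two rows from the printout above and is not the real world-small.csv).

```r
# A two-row stand-in for the real file (values copied from the printout above)
toy <- data.frame(
  country = c("Albania", "Algeria"),
  region = c("C&E Europe", "Africa"),
  gdppcap08 = c(7715, 8033)
)

# Write it to a temporary .csv file, without row names
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read text columns as character
as_text <- read.csv(csv_path, stringsAsFactors = FALSE)
class(as_text$region)

# Read text columns as factors
as_factor <- read.csv(csv_path, stringsAsFactors = TRUE)
class(as_factor$region)
```

Setting stringsAsFactors explicitly makes the code behave the same way regardless of the R version it runs on.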
Exercises
1. Read the world-small.csv data into R and store it in an object called world. (Set your working directory using code first.)
2. (Conceptual) What is the unit of analysis in the dataset? What’s the name of the dataset’s id variable?
3. How many observations does world have? How many variables? Use an R command to find out.
4. Use brackets and a logical statement to inspect all the values for Nigeria and United States. That is, your code should return two entire rows of the dataset.
5. Use R to return China’s Polity IV score. As in question 4, use a logical statement and brackets, but don’t return the entire row. Rather, return a single value with the Polity IV score.
6. What is the lowest GDP per capita in the dataset? (Use R to return only the value.)
7. What country has the lowest GDP per capita? (Your code should return the country name and be general enough that if the observations in the dataset, or their order, change, it still returns the country with the lowest GDP per capita.)