Set-up
Let’s load some packages we’ll need throughout:
pkgs <- c("ggplot2", "gridExtra", "grid", "RColorBrewer", "rgdal", "rgeos",
"maptools", "seg", "splancs", "dplyr")
sapply(pkgs, require, character.only = TRUE)
## ggplot2 gridExtra grid RColorBrewer rgdal
## TRUE TRUE TRUE TRUE TRUE
## rgeos maptools seg splancs dplyr
## TRUE TRUE TRUE TRUE TRUE
If you don’t have them installed, use install.packages
before loading them. Let’s also set some graphing parameters for ggplot
:
theme <- theme_bw() + theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
axis.ticks = element_blank(),
axis.text = element_blank(),
strip.background = element_blank(),
strip.text.x = element_text(size = 10)
)
Data for a stylized example
Let’s begin by generating a spatial grid that serves as a “city” with 16 “neighborhoods”:
# Set dimensions of grid, cell size, and number of localities
grid.dim <- c(4, 4)
cell.size <- 1
no.loc <- grid.dim[1] * grid.dim[2]
# Generate grid
grd <- GridTopology(c(0, 0), c(cell.size, cell.size), grid.dim)
grd.sp <- as.SpatialPolygons.GridTopology(grd)
polyg <- fortify(grd.sp) #for ggplot
Here’s how to plot this grid using both R
’s base functions and ggplot
. (Side note: ggplot
looks intimidating but is much more powerful when graphing complex spatial data.)
plot(grd.sp, main = "(base)")
ggplot(grd.sp, aes(x = long, y = lat)) +
geom_polygon(aes(group = id), color = "black", fill = "white") +
coord_equal() +
theme +
xlab("") + ylab("") + ggtitle("(ggplot)")
Let’s also create a second grid with only four neighborhoods:
ids <- c(rep(c(1, 1, 2, 2), 2), rep(c(3, 3, 4, 4), 2))
grd.sp2 <- gUnionCascaded(grd.sp, id = ids)
polyg2 <- fortify(grd.sp2) #for ggplot
Now we’ll generate a population for each of six cities, assuming we know the demographic makeup of neighborhoods. The result is graphed below.
# Set number of members of each group
g1 <- 1000
g2 <- 1000
# Data for three segregated cities
dta.seg1 <- data.frame(
id = 1:no.loc,
group1 = c(rep(g1 / (no.loc/2), no.loc/2), rep(0, no.loc/2)),
group2 = c(rep(0, no.loc/2), rep(rep(g2 / (no.loc/2), no.loc/2))),
j = "(a)"
)
dta.seg2 <- data.frame(
id = 1:no.loc,
group1 = c(rep(g1 / (no.loc/4), no.loc/4), rep(0, no.loc * 3/4)),
group2 = c(rep(0, no.loc/2), rep(rep(g2 / (no.loc/2), no.loc/2))),
j = "(b)"
)
dta.seg3 <- data.frame(
id = 1:no.loc,
group1 = c(rep(0, no.loc * 1/4), rep(g1 / (no.loc/4), no.loc/4), rep(0, no.loc / 2)),
group2 = c(rep(0, no.loc/2), rep(rep(g2 / (no.loc/2), no.loc/2))),
j = "(c)"
)
# Integrated city
dta.int <- data.frame(
id = 1:no.loc,
group1 = rep(g1 / no.loc, no.loc),
group2 = rep(g2 / no.loc, no.loc),
j = "(d)"
)
# Two checkerboard cities
dta.cb1 <- data.frame(
id = 1:no.loc,
group1 = rep(c(g1 / (no.loc/2), 0), no.loc/2),
group2 = rep(c(0, g2 / (no.loc/2)), no.loc/2),
j = "(e)"
)
dta.cb1$group1[c(5:8, 13:16)] <- dta.cb1$group1[c(2:5, 2:5)]
dta.cb1$group2[c(5:8, 13:16)] <- dta.cb1$group2[c(2:5, 2:5)]
dta.cb2 <- dta.cb1 %>% mutate(id = ids, j = "(f)")
dta.cb2.collapse <- dta.cb2 %>% group_by(id) %>% select(-j) %>% summarise_each(funs(sum)) %>% mutate(j = "(f)")
# Plot
dta.list <- list(dta.seg1, dta.seg2, dta.seg3, dta.int, dta.cb1, dta.cb2)
plots <- lapply(dta.list, function(dta) {
g1 <- dotsInPolys(grd.sp, as.integer(dta$group1))
g2 <- dotsInPolys(grd.sp, as.integer(dta$group2))
points <- rbind(data.frame(coordinates(g1), group = "group 1"),
data.frame(coordinates(g2), group = "group 2"))
if(length(unique(dta$id)) < no.loc) polyg <- polyg2
ggplot(data = polyg, aes(x = long, y = lat)) +
geom_polygon(aes(group = id), color = "black", fill = "white", size = 1.1) +
geom_point(data = points, aes(x = x, y = y, color = group, shape = group), size = 1.4) +
scale_color_manual(values = c("black", "#1f78b4"), guide = F) +
scale_shape_manual(values = c(16, 0), guide = F) +
coord_equal() +
labs(x = NULL, y = NULL) +
ggtitle(as.character(dta$j[1])) +
theme
})
do.call(grid.arrange, c(plots, nrow = 2))
So, we have six cities with different degrees of segregation. Imagine that each point represents an individual and that dots and squares represent different groups. The black lines within each city delineate neighborhoods. (a)-(e) have 16 neighborhoods, while (f) has four.
Note that individuals in each neighborhood are randomly distributed (except in f). In reality, we don’t know where people live in each neighborhood.
In (f), which is the only city with only four neighborhoods, we assume that we actually have some idea where people live within each neighborhood. But, for whatever reason, we only have data at a more aggregate level (across four neighborhoods instead of 16). Below, I show how things can go very wrong in this case, illustrating an issue known as the modifiable areal unit problem (MAUP).
Calculate measures of segregation
Using our hypothetical cities (a)-(f), we can now calculate measures of segregation for each. Intuitively, if we were to order the cities according to how segregated they are, we’d probably come up with the following ranking:
(b) > (a) > (c) > (e), (f) > (d)
That is, (b) is most segregated and (d) is least segregated. Do measures of segregation actually capture these differences? To find out, we’ll calculate the Dissimiliarity Index – the most common “non-spatial” measure of segregation – as well as two “spatial” measures of segregation: Spatial Dissimilarity and Information Theory. See Reardon and O’Sullivan, “Measures of Spatial Segregation” for a detailed discussion of spatial segregation measures.
All these measures range between 0 (complete integration) and 1 (complete segregation). The Dissimilarity Index captures the deviation between a city’s demographic makeup and the demographic makeup of each of its neighborhoods. In the case of complete integration, all neighborhoods would have the same diversity as the city as a whole.
Spatial segregation measures are different in that they also account for clustering of individuals across neighborhoods. For example, the Spatial Dissimilarity Index measures the deviation between the makeup of the city and individuals’ “local environment,” where the local environment can consist of (parts of) several neighboring neighborhoods. See Reardon and O’Sullivan for further details.
Let’s write a function that outputs a set of segregation measures for a given city. The dissim
and spseg
functions used here are from package seg
.
calculate_seg <- function(polyg, dta, gr1, gr2, bandwidth) {
# Subset the data to the groups of interest
dta.sub <- dta[, c(gr1, gr2)]
# Find non-spatial dissimilarity index using 'dissim' from pkg 'seg'
dis.nspat <- dissim(data = dta.sub)$d
# Find two spatial segregation measures
sp.seg <- spseg(polyg, dta.sub, smoothing = "kernel", sigma = bandwidth)
dis.kern <- round(as.list(sp.seg)$d, 3) #dissimilarity
inf.kern <- round(as.list(sp.seg)$h, 3) #information theory
data.frame(
city = dta$j[1],
measure = c("Non-Spatial Dissimilarity", "Spatial Dissimilarity", "Spatial Information Theory"),
bandwidth = c(NA, rep(bandwidth, 2)),
value = c(dis.nspat, dis.kern, inf.kern)
)
}
The function takes as its inputs:
A
SpatialPolygons
object.Data associated with the
SpatialPolygons
object. The association between the polygons and data is by index, i.e., order. A different approach would involve using aSpatialPolygonsDataFrame
in which each observation in the dataframe is directly associated with a polygon.The groups of interest: segregation will be calculated for
gr1
with respect togr2
.A bandwidth, which specifies how much to smooth the population when calculating spatial segregation measures (not applicable for non-spatial segregation measures). A larger bandwidth will usually result in a smaller segregation score, since larger bandwidths incorporate larger areas and thus a larger share of the population (which usually will be more diverse).
Let’s test it using city (a) from above:
calculate_seg(grd.sp, dta.seg1, "group1", "group2", 0.5)
## city measure bandwidth value
## 1 (a) Non-Spatial Dissimilarity NA 1.000
## 2 (a) Spatial Dissimilarity 0.5 0.655
## 3 (a) Spatial Information Theory 0.5 0.399
Let’s run this function for all cities, and graph the result:
dta.list <- list(dta.seg1, dta.seg2, dta.seg3, dta.int, dta.cb1, dta.cb2)
p <- lapply(dta.list, function(dta) {
g1 <- dotsInPolys(grd.sp, as.integer(dta$group1))
g2 <- dotsInPolys(grd.sp, as.integer(dta$group2))
points <- rbind(data.frame(coordinates(g1), group = "group 1"),
data.frame(coordinates(g2), group = "group 2"))
dta2 <- dta
if(length(unique(dta$id)) < no.loc) {
polyg <- polyg2
grd.sp <- grd.sp2
dta2 <- dta.cb2.collapse
}
g1 <- ggplot(data = polyg, aes(x = long, y = lat)) +
geom_polygon(aes(group = id), color = "black", fill = "white", size = 1.1) +
geom_point(data = points, aes(x = x, y = y, color = group, shape = group)) +
scale_color_manual(values = c("black", "#1f78b4"), guide = F) +
scale_shape_manual(values = c(16, 0), guide = F) +
ylab("") +
xlab("") +
ggtitle(as.character(dta$j[1])) +
coord_equal() +
theme +
theme(plot.margin=unit(c(0,0,-0.5,0), "cm"))
seg <- rbind(calculate_seg(grd.sp, dta2, "group1", "group2", 0.2),
calculate_seg(grd.sp, dta2, "group1", "group2", 0.7)[-c(1), ])
g2 <- tableGrob(seg, rows = rep("", 5))
arrangeGrob(g1, g2, ncol = 1, heights = c(0.7, 0.3))
})
plots <- arrangeGrob(p[[1]], p[[2]], p[[3]], p[[4]], p[[5]], p[[6]], nrow = 2)
grid.newpage()
grid.draw(plots)
Take a look at the table below each figure and see whether the segregation scores make intuitive sense. The exercise highlights a few important points:
Non-spatial measures of segregation don’t capture important differences in segregation across cities. Cities (a), (b), (c), and (e) have the same Non-Spatial Dissimilarity score – 1 – yet very different levels of segregation. The reason for the identical scores is that Non-Spatial Dissimilarity, like all non-spatial measures of segregation, captures the deviation between a city’s diversity and the diversity of each of its neighborhoods without taking into account where the neighborhoods themselves are located.
Spatial measures do capture some important differences across cities. For example, according to both spatial measures, (b) is more segregated than (a), which in turn is more segregated than (c). This is how we intuitively ranked these cities above.
A higher bandwidth means lower segregation. For example, Spatial Dissimilarity in (a) is 0.730 using a bandwidth of 0.2 and 0.624 using a bandwidth of 0.7.
MAUP is a serious issue regardless of measure. To your computer, (f) is identical to (d) – resulting in a segregation score of 0 for both cities. You can spend a lot of time trying to pick a good measure of segregation, but if your data are measured at the wrong level you will be drawing inaccurate conclusions. Put differently, good measurement can’t make up for bad data.