Measures of segregation and other indices of place-based inequality have been fundamental to documenting and understanding the causes and consequences of residential patterns of racial separation. In this guide you will learn how to calculate neighborhood segregation using R. The objectives of the guide are as follows:

  1. Calculate the Dissimilarity index, a measure of residential evenness.
  2. Calculate the Interaction index, a measure of residential exposure.
  3. Calculate the Location Quotient for Racial Residential Segregation, a measure of neighborhood-level concentration.

To accomplish these objectives, you will be working with Census tract data for four of the largest cities in California: Fresno, San Diego, San Jose, and San Francisco.

This lab guide follows closely and supplements the material presented in Chapters 4.1 and 4.2 in the textbook Geocomputation with R (GWR) and Handout 7.


Assignment 7 is due by 10:00 am, February 28th on Canvas. See here for assignment guidelines. You must submit an .Rmd file and its associated .html file. Name the files: yourLastName_firstInitial_asgn07. For example: brazil_n_asgn07.

Open up an R Markdown file


Download the Lab template into an appropriate folder on your hard drive (preferably, a folder named ‘Lab 7’), open it in R Studio, and type and run your code there. The template is also located on Canvas under Files. Change the title (“Lab 7”) and insert your name and date. Don’t change anything else inside the YAML (the stuff at the top in between the ---). Also keep the grey chunk after the YAML. For a rundown on the use of R Markdown, see the assignment guidelines.

Installing and loading packages


We will not be using any new packages in this lab. You’ll need to load the following packages. Unlike installing, you will always need to load packages whenever you start a new R session. As such, you’ll always need to use library() in your R Markdown file.

library(sf)
library(tidyverse)
library(tidycensus)
library(tigris)
library(tmap)
library(rmapshaper)
library(flextable)
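
If a library() call fails, the package is most likely not installed yet. The following is a minimal sketch, not part of the lab template, for checking which of the packages listed above are installed before you try to load them. The object names pkgs, installed, and missing_pkgs are made up for this illustration.

```r
# Packages loaded in this lab (taken from the library() calls above)
pkgs <- c("sf", "tidyverse", "tidycensus", "tigris", "tmap",
          "rmapshaper", "flextable")

# requireNamespace() returns TRUE if a package is installed and FALSE if not,
# without attaching it to the session the way library() does
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need a one-time install.packages()
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Install once with install.packages(): ",
          paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are installed; load them with library().")
}
```

Installing is a one-time step per computer, whereas the library() calls must run in every new R session.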

Read in the data


The following code uses the Census API to bring in demographic tract-level data for four of the most populated cities in California: San Diego, San Jose, San Francisco, and Fresno. We won’t go through each line of code in detail because we’ve covered all of these operations and functions in prior labs. We’ve embedded comments within the code that briefly explain what each chunk is doing. Go back to prior guides (or RDS/GWR) if you need further help.

# Bring in 2018-2022 (5-year ACS) census tract data using the Census API
ca.tracts <- get_acs(geography = "tract", 
              year = 2022,
              variables = c(tpop = "B03002_001", 
                            nhwhite = "B03002_003", nhblk = "B03002_004",
                            nhasn = "B03002_006", hisp = "B03002_012"),
              state = "CA",
              survey = "acs5",
              output = "wide",
              geometry = TRUE)

# Calculate, rename and keep essential vars. 
ca.tracts <- ca.tracts %>% 
  mutate(pnhwhite = 100*(nhwhiteE/tpopE), pnhasn = 100*(nhasnE/tpopE), 
        pnhblk = 100*(nhblkE/tpopE), phisp = 100*(hispE/tpopE)) %>%
  rename(nhwhite = nhwhiteE, nhasn = nhasnE, nhblk = nhblkE,
         hisp = hispE, tpop = tpopE) %>%
  select(GEOID,tpop, pnhwhite, pnhasn, pnhblk, phisp,
           nhwhite, nhasn, nhblk, hisp)  

# Bring in city boundaries
pl <- places(state = "CA", year = 2022, cb = TRUE)

# Keep four large cities in CA
large.cities <- filter(pl, NAME == "San Diego" |
                         NAME == "San Jose" | NAME == "San Francisco" |
                         NAME == "Fresno")

#Clip tracts in large cities 
large.tracts <- ms_clip(target = ca.tracts, 
                        clip = large.cities, remove_slivers = TRUE)

Make sure to take a look at the final outcome.

glimpse(large.tracts)
## Rows: 961
## Columns: 11
## $ GEOID    <chr> "06019003804", "06019004405", "06019004801", "06019000501", "…
## $ tpop     <dbl> 6865, 3575, 4442, 2990, 7903, 4435, 2922, 4565, 8409, 6178, 5…
## $ pnhwhite <dbl> 11.6970138, 66.7412587, 23.7055380, 7.3244147, 15.1841073, 5.…
## $ pnhasn   <dbl> 26.365623, 6.237762, 2.678973, 4.347826, 22.611666, 41.984216…
## $ pnhblk   <dbl> 8.8565186, 2.3776224, 8.5997299, 15.7525084, 6.6683538, 3.607…
## $ phisp    <dbl> 49.395484, 22.349650, 60.423233, 70.936455, 49.867139, 47.170…
## $ nhasn    <dbl> 1810, 223, 119, 130, 1787, 1862, 907, 843, 4229, 3790, 1785, …
## $ nhwhite  <dbl> 803, 2386, 1053, 219, 1200, 230, 1336, 438, 2514, 276, 18, 12…
## $ nhblk    <dbl> 608, 85, 382, 471, 527, 160, 67, 7, 405, 230, 41, 23, 0, 435,…
## $ hisp     <dbl> 3391, 799, 2684, 2121, 3941, 2092, 409, 3115, 452, 1655, 3748…
## $ geometry <POLYGON [°]> POLYGON ((-119.8668 36.7863..., POLYGON ((-119.7795 3…

The object large.tracts contains the census tracts located in the four cities. When you view the dataset, you’ll notice that we don’t have any variable indicating which city each tract belongs to. We need the city identifier to calculate segregation for each city. The city GEOID and NAME are in the object large.cities, which we will need to append to each tract in the object large.tracts.

We do this by using the st_join() function, which is a part of the sf package. The function will join the variables from large.cities to large.tracts based on geographic location. That is, if a tract is located within a city, that city’s values from large.cities will be appended to that tract.

First, look at the variables already in large.tracts.

names(large.tracts)
##  [1] "GEOID"    "tpop"     "pnhwhite" "pnhasn"   "pnhblk"   "phisp"   
##  [7] "nhasn"    "nhwhite"  "nhblk"    "hisp"     "geometry"

Then look at the variables in large.cities

names(large.cities)
##  [1] "STATEFP"    "PLACEFP"    "PLACENS"    "AFFGEOID"   "GEOID"     
##  [6] "NAME"       "NAMELSAD"   "STUSPS"     "STATE_NAME" "LSAD"      
## [11] "ALAND"      "AWATER"     "geometry"

Then st_join()

large.tracts <- large.tracts %>%
                st_join(large.cities)

This function joins the variables from large.cities to the object large.tracts.

names(large.tracts)
##  [1] "GEOID.x"    "tpop"       "pnhwhite"   "pnhasn"     "pnhblk"    
##  [6] "phisp"      "nhasn"      "nhwhite"    "nhblk"      "hisp"      
## [11] "STATEFP"    "PLACEFP"    "PLACENS"    "AFFGEOID"   "GEOID.y"   
## [16] "NAME"       "NAMELSAD"   "STUSPS"     "STATE_NAME" "LSAD"      
## [21] "ALAND"      "AWATER"     "geometry"

Note that when the two files have the same variable names, R attaches .x and .y to the end of the variable names such as GEOID.x and GEOID.y, which represent the tract and city GEOIDs, respectively.
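
To see the spatial join and the .x/.y suffixes on a small scale, here is a hedged toy sketch. The two "cities" and three "tracts" are made-up squares with no coordinate reference system, and make_square(), toy.cities, toy.tracts, and toy.joined are hypothetical names used only for this illustration.

```r
library(sf)

# Hypothetical helper for this sketch: build a square polygon from its
# lower-left corner (x0, y0) and side length
make_square <- function(x0, y0, size) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + size, y0), c(x0 + size, y0 + size),
    c(x0, y0 + size), c(x0, y0)
  )))
}

# Two made-up cities that do not touch each other
toy.cities <- st_sf(
  GEOID = c("C1", "C2"),
  NAME = c("City A", "City B"),
  geometry = st_sfc(make_square(0, 0, 10), make_square(20, 0, 10))
)

# Three made-up tracts: two inside City A, one inside City B
toy.tracts <- st_sf(
  GEOID = c("T1", "T2", "T3"),
  tpop = c(100, 200, 300),
  geometry = st_sfc(make_square(1, 1, 2), make_square(5, 5, 2),
                    make_square(22, 2, 2))
)

# Each tract picks up the attributes of the city it intersects
toy.joined <- st_join(toy.tracts, toy.cities)

names(toy.joined)   # GEOID appears twice, as GEOID.x (tract) and GEOID.y (city)
toy.joined$NAME     # the city each tract was matched to
```

Because both inputs have a column called GEOID, the joined object keeps both copies and tells them apart with the suffixes, just as in large.tracts.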

We don’t need all of these new variables, so let’s use select() to remove the variables we don’t need.

large.tracts <- large.tracts %>%
                select(-(STATEFP:AFFGEOID), -(NAMELSAD:AWATER))

Make sure we’ve kept the variables we need

names(large.tracts)
##  [1] "GEOID.x"  "tpop"     "pnhwhite" "pnhasn"   "pnhblk"   "phisp"   
##  [7] "nhasn"    "nhwhite"  "nhblk"    "hisp"     "GEOID.y"  "NAME"    
## [13] "geometry"

Mapping


Before calculating segregation, you should map neighborhood racial/ethnic composition in order to gain a visual understanding of how race/ethnic groups are spatially distributed in your study region. For example, let’s map percent Hispanic in Fresno.

large.tracts %>%
  filter(NAME == "Fresno") %>%
  tm_shape(unit = "mi") +
  tm_polygons(col = "phisp", style = "quantile", palette = "Reds",
              border.alpha = 0, title = "") +
  tm_scale_bar(breaks = c(0, 1, 2), text.size = 0.75, position = c("right", "bottom")) +
  tm_compass(type = "4star", position = c("left", "top")) +
  tm_layout(main.title = "Percent Hispanic in Fresno City Tracts",
            main.title.size = 0.9, frame = FALSE)

How does this spatial distribution compare to percent non-Hispanic white?

large.tracts %>%
  filter(NAME == "Fresno") %>%
  tm_shape(unit = "mi") +
  tm_polygons(col = "pnhwhite", style = "quantile", palette = "Reds",
              border.alpha = 0, title = "") +
  tm_scale_bar(breaks = c(0, 1, 2), text.size = 0.75, position = c("right", "bottom")) +
  tm_compass(type = "4star", position = c("left", "top")) +
  tm_layout(main.title = "Percent White in Fresno City Tracts",
            main.title.size = 0.9, frame = FALSE)

It looks like a North/South divide. Map the other two race/ethnic groups in Fresno and all the groups in the other three cities.

Dissimilarity Index


The most common measure of residential evenness is the Dissimilarity Index D. To calculate D, we’ll follow the Dissimilarity index formula on page 3 of Handout 7. We will calculate Black/White, Hispanic/White, and Asian/White Dissimilarity.

We already have the values \(t_{im}\) and \(t_{ik}\), which are the populations of race/ethnic groups \(m\) and \(k\) in each census tract \(i\). But we don’t have the total population of race/ethnic groups \(m\) and \(k\) for each city. These are the values \(T_m\) and \(T_k\) in the formula. To calculate these values, we use the group_by() and mutate() functions.

large.tracts <- large.tracts %>%
      group_by(NAME) %>%
      mutate(nhwhitec = sum(nhwhite), nhasnc = sum(nhasn), 
             nhblkc = sum(nhblk), hispc = sum(hisp), 
             tpopc = sum(tpop)) %>%
      ungroup()

We already covered group_by() in Lab 4, but as a reminder, the group_by() function tells R that all future functions on large.tracts will be grouped according to the variable NAME, which is the city name. We use the sum() function within the mutate() function to sum up, for example, the non-Hispanic white population nhwhite for each city. We name this variable nhwhitec. If you type in View(large.tracts), you should find that the variable nhwhitec provides the same value for all tracts within the same city. We do this for all the other race/ethnic groups.

The function ungroup() at the end of the code tells R to stop the grouping. It’s always good practice to ungroup() a data set if you are saving it for future use (rather than using it as a summary table as we’ve been doing so far in the class).
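
If you want to convince yourself that a grouped sum really repeats the city total on every tract row, the following toy sketch reproduces the idea with made-up numbers. It is only a cross-check: it uses base R's ave() in place of group_by() and mutate() so that it runs without loading any packages, and the data frame toy is hypothetical.

```r
# Made-up data: three tracts in City A and two tracts in City B
toy <- data.frame(
  NAME = c("City A", "City A", "City A", "City B", "City B"),
  nhwhite = c(100, 50, 50, 80, 20)
)

# ave() applies sum() within each city and returns one value per row,
# which mirrors group_by(NAME) followed by mutate(nhwhitec = sum(nhwhite))
toy$nhwhitec <- ave(toy$nhwhite, toy$NAME, FUN = sum)

toy
```

Every City A row carries the City A total (200) and every City B row carries the City B total (100), which is the pattern you should see for nhwhitec when you view large.tracts.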

Now we can calculate the rest of the formula, breaking it down piece-by-piece like we did in the handout and in lecture.

large.tracts %>%
      group_by(NAME) %>%
      mutate(d.wb = abs(nhblk/nhblkc-nhwhite/nhwhitec),
              d.wa = abs(nhasn/nhasnc-nhwhite/nhwhitec), 
              d.wh = abs(hisp/hispc-nhwhite/nhwhitec)) %>%
      summarize(BWD = 0.5*sum(d.wb, na.rm=TRUE), AWD = 0.5*sum(d.wa, na.rm=TRUE),
                HWD = 0.5*sum(d.wh, na.rm=TRUE)) %>%
      ungroup()
## Simple feature collection with 4 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -123.0139 ymin: 32.53486 xmax: -116.9057 ymax: 37.86334
## Geodetic CRS:  NAD83
## # A tibble: 4 × 5
##   NAME            BWD   AWD   HWD                                       geometry
##   <chr>         <dbl> <dbl> <dbl>                             <MULTIPOLYGON [°]>
## 1 Fresno        0.463 0.423 0.388 (((-119.8897 36.67738, -119.8895 36.68476, -1…
## 2 San Diego     0.555 0.483 0.509 (((-117.1706 32.7007, -117.1671 32.69915, -11…
## 3 San Francisco 0.546 0.404 0.402 (((-122.3885 37.7897, -122.3923 37.79389, -12…
## 4 San Jose      0.466 0.472 0.468 (((-121.8237 37.20721, -121.8184 37.20481, -1…


Let’s break the code down so we’re all on the same page.

  • We use group_by() because we want to calculate Dissimilarity for each city, which is indicated by the variable NAME.
  • We use mutate() to calculate the tract level contributions to the index, i.e. the value \(\left|\frac{t_{im}}{T_m} - \frac{t_{ik}}{T_k}\right|\) in Equation 1 in Handout 7 for each neighborhood \(i\).
  • Next, we turn to summarize() to finish the rest of the job. Within summarize(), we use the function sum() to add the neighborhood specific values in Equation 1 in Handout 7. In other words, sum() is performing the \(\sum\limits_{i}^{N}\) that adds up \(\left|\frac{t_{im}}{T_m} - \frac{t_{ik}}{T_k}\right|\).
  • Finally, multiply the summed up value by 0.5 to get the final index values.
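
The steps above can be checked by hand on a toy city. This is a minimal sketch with made-up counts for three tracts, written in base R so it runs without any packages. The function name dissimilarity() and the toy vectors are hypothetical and are not part of the lab code.

```r
# Hypothetical helper: two-group Dissimilarity index from tract counts
dissimilarity <- function(group_m, group_k) {
  # Tract-level contributions |t_im / T_m - t_ik / T_k|
  contrib <- abs(group_m / sum(group_m) - group_k / sum(group_k))
  # Sum over tracts and multiply by 0.5
  0.5 * sum(contrib, na.rm = TRUE)
}

# Made-up counts for a three-tract city
toy.black <- c(10, 30, 60)
toy.white <- c(100, 50, 50)

# Tract shares: Black 0.10, 0.30, 0.60; white 0.50, 0.25, 0.25
# Absolute differences: 0.40, 0.05, 0.35, which sum to 0.80
toy.D <- dissimilarity(toy.black, toy.white)
toy.D   # 0.4

# When both groups are spread across tracts in the same proportions, D is 0
toy.even <- dissimilarity(c(10, 20, 30), c(100, 200, 300))
toy.even   # 0
```

Working through the three absolute differences by hand and comparing them with the output is a quick way to confirm you are reading the formula the same way the code does.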


The resulting values provide the Dissimilarity indices for Black/White (BWD), Asian/White (AWD), and Hispanic/White (HWD). In all of these cases, we calculate segregation from white residents, but you can calculate segregation for any race/ethnicity combination (e.g. Black/Hispanic). Instead of just copying and pasting the chunk of code above into your console, make sure you understand what each line of code is doing. Not only will it help you become a more seasoned R coder, but it will also help you better understand the underlying math behind the Dissimilarity index.


The results table we got above is a little messy. Let’s clean it up by doing three things: (1) drop the geometry column using st_drop_geometry(), which is a part of the sf package, so that the summarized result is no longer spatial (large.tracts itself is unchanged); (2) use the flextable() function to make a nicely formatted table; and (3) save the resulting table in an object named dis.table.

dis.table <- large.tracts %>%
      group_by(NAME) %>%
      mutate(d.wb = abs(nhblk/nhblkc-nhwhite/nhwhitec),
              d.wa = abs(nhasn/nhasnc-nhwhite/nhwhitec), 
              d.wh = abs(hisp/hispc-nhwhite/nhwhitec)) %>%
      summarize(BWD = 0.5*sum(d.wb, na.rm=TRUE), AWD = 0.5*sum(d.wa, na.rm=TRUE),
                HWD = 0.5*sum(d.wh, na.rm=TRUE)) %>%
      ungroup() %>%
      st_drop_geometry() %>%
      flextable() 

dis.table %>%
  colformat_double(j = c("BWD", "AWD", "HWD"), digits = 3)

NAME            BWD     AWD     HWD
Fresno          0.463   0.423   0.388
San Diego       0.555   0.483   0.509
San Francisco   0.546   0.404   0.402
San Jose        0.466   0.472   0.468


Looks much better. The Dissimilarity index for Black/White in Fresno is 0.463. The interpretation of this value is that 46.3% of black residents would need to move neighborhoods in order to achieve a uniform distribution of black and white residents across neighborhoods in the city.

Interaction Index


The most common measure of exposure is the Interaction Index \(P^*\). Let’s calculate the exposure of black (BWI), Asian (AWI), and Hispanic (HWI) residents to white residents using the formula on page 6 of Handout 7.

int.table <- large.tracts %>%
      group_by(NAME) %>%
      mutate(i.wb = (nhblk/nhblkc)*(nhwhite/tpop),
              i.wa = (nhasn/nhasnc)*(nhwhite/tpop), 
              i.wh = (hisp/hispc)*(nhwhite/tpop)) %>%
      summarize(BWI = sum(i.wb, na.rm=TRUE), AWI = sum(i.wa, na.rm=TRUE),
                HWI = sum(i.wh, na.rm=TRUE)) %>%
      ungroup() %>%
      st_drop_geometry() %>%
      flextable()

Look at the Interaction index equation in Handout 7. The mutate() function is creating the tract specific values \(\frac{t_{im}}{T_m} * \frac{t_{ik}}{t_i}\). We then turn to summarize() to perform the \(\sum\limits_{i}^{N}\).
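
As with Dissimilarity, it can help to work the formula on a toy city first. This is a hedged sketch with made-up counts, written in base R so it runs without any packages. The function name interaction_index() and the toy vectors are hypothetical.

```r
# Hypothetical helper: exposure of group m to group k
interaction_index <- function(group_m, group_k, total) {
  # Tract-level terms (t_im / T_m) * (t_ik / t_i), summed over tracts
  sum((group_m / sum(group_m)) * (group_k / total), na.rm = TRUE)
}

# Made-up counts for a three-tract city
toy.tpop  <- c(200, 100, 150)
toy.white <- c(100, 50, 50)
toy.black <- c(10, 30, 60)

# Exposure of Black residents to white residents:
# 0.10 * 0.50 + 0.30 * 0.50 + 0.60 * (50 / 150) = 0.05 + 0.15 + 0.20
toy.bw <- interaction_index(toy.black, toy.white, toy.tpop)
toy.bw   # 0.4

# Exposure of white residents to Black residents uses the same tracts
# but swaps the roles of the two groups
toy.wb <- interaction_index(toy.white, toy.black, toy.tpop)
toy.wb   # 0.2
```

The two values differ because each one weights tracts by a different group's distribution across the city, so swapping the groups changes the answer.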

We present the results in a nice table using the function flextable().

int.table %>%
  colformat_double(j = c("BWI", "AWI", "HWI"), digits = 3)

NAME            BWI     AWI     HWI
Fresno          0.208   0.238   0.213
San Diego       0.304   0.357   0.282
San Francisco   0.290   0.305   0.332
San Jose        0.251   0.205   0.204


The probability of a Black resident “interacting” with a white person in his or her neighborhood is about 20.8% in Fresno. We can also interpret this to mean that 21 of every 100 people a Black person meets in his or her neighborhood will be white. Remember that interaction is not symmetric. Calculate the interaction of white residents with Black residents in the other cities and see if there are major differences with the values we calculated above.

Location Quotient


The Dissimilarity and Interaction indices are city-level indices. In the handout, we covered one neighborhood-level measure: Location Quotient for Racial Residential Segregation (LQRSS), which captures neighborhood racial/ethnic concentration.

Let’s zoom into the City of Fresno and calculate the LQRSS for each of its tracts. First, keep Fresno tracts from large.tracts using the filter() command and calculate the LQRSS for blacks, Asians, Hispanics, and whites using equation (3) in this week’s handout.

fresno.tracts <- large.tracts %>%
  filter(NAME == "Fresno") %>%
  mutate(blklq = (nhblk/tpop)/(nhblkc/tpopc), 
        asnlq = (nhasn/tpop)/(nhasnc/tpopc),
        hisplq = (hisp/tpop)/(hispc/tpopc),
        whitelq = (nhwhite/tpop)/(nhwhitec/tpopc))

The census tract with GEOID of 06019004217 has a black LQ of 3.96. In your own words, what does this value represent?
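
To get a feel for the scale of the location quotient, here is a small sketch on a made-up three-tract city, written in base R so it runs without any packages. The function name location_quotient() and the toy vectors are hypothetical and are not the Fresno data.

```r
# Hypothetical helper: tract share of a group divided by the city-wide share
location_quotient <- function(group, total) {
  (group / total) / (sum(group) / sum(total))
}

# Made-up counts for a three-tract city
toy.tpop  <- c(200, 100, 150)
toy.black <- c(10, 30, 60)

# City-wide share: 100 / 450, about 0.222
# Tract shares: 0.05, 0.30, 0.40
toy.lq <- location_quotient(toy.black, toy.tpop)
round(toy.lq, 3)   # 0.225 1.350 1.800
```

Values below 1 mark tracts where the group is under-represented relative to the city as a whole, and values above 1 mark tracts where it is over-represented.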

You can visualize the distribution using a histogram (or boxplot). For example, a histogram of the black LQ looks like

fresno.tracts %>% 
  ggplot() + 
    geom_histogram(mapping = aes(x=blklq), na.rm=TRUE) +
    xlab("Black Location Quotient") 

The skewness of the distribution indicates significant concentration of the black population in Fresno. We can also map the LQRSS. Let’s use the viewing feature in tmap so we can zoom in and out, and identify the GEOIDs of the tracts with high or low Black location quotients.

tmap_mode("view")
tm_shape(fresno.tracts, unit = "mi") +
  tm_polygons(col = "blklq", style = "quantile",palette = "Reds", 
              border.alpha = 0, title = "Black Location Quotient") 

The map indicates that there are some neighborhoods in the city that have a percent black population that is as high as 4 times the overall percent black population in the city.

Assignment 7


Download and open the Assignment 7 R Markdown Script. The script can also be found on Canvas (Files - Week 7 - Assignment). Any response requiring a data analysis task must be supported by code you generate to produce your result. (Just examining your various objects in the “Environment” section of R Studio is insufficient; you must use scripted commands.)


  1. In Assignment 6, we descriptively examined the claim that Houston is the most racially integrated city in the United States. Let’s employ the segregation tools we learned this week to explore this claim even further. Let’s also compare Houston to Sacramento, a city that has also been proclaimed as among the most racially diverse in the nation. Read in the Houston and Sacramento shapefiles houstondems.shp and sacdems.shp, which can be found on Canvas in the zipped folder Assignment7.zip (Files - Week 7 - Assignment). A record layout of the data can be found here.
     a. Calculate the Black/White, Hispanic/White and Asian/White Dissimilarity Indices for Houston and Sacramento. Present these values in a presentation-ready table(s). (3 points)
     b. Calculate the Black/White, Hispanic/White and Asian/White Interaction Indices for Houston and Sacramento. Present these values in a presentation-ready table(s). (3 points)
     c. Based on your answers to questions (a) and (b), which city is most segregated? Why? (2 points)
     d. Instead of examining segregation at the city level, let’s find where it exists at the neighborhood level. Calculate the Location Quotient for Racial Residential Segregation (LQRSS) for the Hispanic, White, Black and Asian populations for each city. (1 point)
     e. Show presentation-ready maps of the Hispanic LQRSS for each city. (2 points)
     f. Let’s examine the socioeconomic variables that may be correlated with neighborhood Hispanic concentration in each city. Calculate the correlation between the Hispanic LQRSS and percent of residents under 18 years old, percent of residents between 22 and 34, and percent foreign born in Houston. Do the same for Sacramento. Summarize the results in your own words, noting differences and similarities between the two cities. (2 points)


  2. You will be calculating two-group segregation indices for the cities of Detroit and Los Angeles. Read in the Detroit and Los Angeles data files detroitrace.csv and losangelesrace.csv, which can be found on Canvas in the zipped folder Assignment7.zip (Files - Week 7 - Assignment). You do not need to convert these data sets into sf objects. Keep them as regular data frames (tibbles). A record layout of the data can be found here.
     a. Calculate the Black/White, Hispanic/White, and Asian/White Dissimilarity indices for Detroit and Los Angeles. Present these values in a presentation-ready table(s). (3 points)
     b. Calculate the Black/White, Hispanic/White, and Asian/White Interaction indices for Detroit and Los Angeles. Present these values in a presentation-ready table(s). (3 points)
     c. Intuitively, if you get a high Dissimilarity index, you should get a low Interaction index. Comparing Detroit and Los Angeles Asian/White and Hispanic/White segregation, we find this to be the case. However, we find that Los Angeles has a larger Black/White Dissimilarity index than Detroit, but has a higher Black/White Interaction index. What is a good explanation for this finding? (1 point)

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Website created and maintained by Noli Brazil