Visualizing Texas High School SAT Math Scores with Bubble Grids

[This article was first published on r on Tony ElHabr, and kindly contributed to R-bloggers.]

Two awesome things inspired this post:

As Jonas implies, using color as a visual encoding is not always the best option, a notion with which I strongly agree. Cartograms try to address the ambiguity of color encoding with distortion of land area/distance, but I think the result can be difficult to interpret. Bubble grid maps seem to me to be an interesting alternative that can potentially display information in a more direct manner.

With that being said, I decided to adapt Jonas’ code to visualize the Texas high school SAT/ACT data that I’ve looked at in other posts. To simplify the visual encoding of information, I’ll filter the data down to a single statistic: the SAT math scores for the year 2015. (For other applications, the statistic might be population density, average median household income, etc.) For the geo-spatial data, I downloaded the shapefiles for schools and counties provided by the Texas Education Agency (TEA). Additionally, I downloaded shapefiles for Texas cities and for Texas highways, provided by the Texas Department of Transportation (TxDOT). Plotting the major cities and roadways in the state should make the “sparsely” populated areas evident, which can explain why there doesn’t appear to be any data in some regions. Finally, I’ll also use the Texas state and county border data provided by ggplot2::map_data() (which essentially just serves as a wrapper for extracting data provided in the {maps} package).

library("tidyverse")
library("teplot")
library("sf")

I’ll skip over the data collection and munging steps and just show the cleaned data that I’m using. (See the GitHub repository for the full code.)

schools_tea_filt %>% glimpse()

## Observations: 1,567
## Variables: 7
## $ test     <chr> "SAT", "SAT", "SAT", "SAT", "SAT", "SAT", "SAT", "SAT...
## $ year     <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,...
## $ school   <chr> "A C JONES", "A M CONS", "A MACEO SMITH NEW TECH", "A...
## $ district <chr> "BEEVILLE ISD", "COLLEGE STATION ISD", "DALLAS ISD", ...
## $ county   <chr> "BEE", "BRAZOS", "DALLAS", "DALLAS", "HILL", "HALE", ...
## $ city     <chr> "CORPUS CHRISTI", "HUNTSVILLE", "RICHARDSON", "RICHAR...
## $ value    <dbl> 458, 567, 411, 428, 539, 533, 482, 538, 507, 428, 428...

schools_sf %>% glimpse()

## Observations: 8,701
## Variables: 11
## $ schl_nm  <int> 20901109, 58905001, 15909001, 15915119, 101907149, 10...
## $ school   <fct> HOOD-CASE EL, KLONDIKE ISD, SOMERSET, HOWSMAN EL, WAR...
## $ distrct  <fct> ALVIN ISD, KLONDIKE ISD, SOMERSET ISD, NORTHSIDE ISD,...
## $ city     <fct> ALVIN, LAMESA, SOMERSET, SAN ANTONIO, CYPRESS, HOUSTO...
## $ county   <fct> BRAZORIA, DAWSON, BEXAR, BEXAR, HARRIS, HARRIS, HUTCH...
## $ regn_nm  <int> 4, 17, 20, 20, 4, 4, 16, 11, 11, 4, 11, 1, 10, 11, 19...
## $ grd_grp  <fct> EE PK KG 01 02 03 04 05, EE PK KG 01 02 03 04 05 06 0...
## $ grd_gr_  <int> 1, 5, 4, 1, 1, 2, 4, 4, 4, 1, 1, 1, 4, 4, 1, 2, 1, 2,...
## $ instr_t  <fct> REGULAR INSTRUCTIONAL, REGULAR INSTRUCTIONAL, REGULAR...
## $ magnet   <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N...
## $ geometry <POINT [°]> POINT (-95 29), POINT (-102 33), POINT (-99 29...

counties_sf %>% glimpse()

## Observations: 254
## Variables: 2
## $ county   <fct> DALLAM, BORDEN, FISHER, SHERMAN, STEPHENS, HANSFORD, ...
## $ geometry <POLYGON [°]> POLYGON ((-102 37, -102 37,..., POLYGON ((-1...

cities_sf %>% glimpse()

## Observations: 9
## Variables: 15
## $ objectid  <int> 3028, 3212, 2291, 2984, 258, 623, 863, 1402, 1658
## $ gid       <int> 2703, 3058, 2835, 2659, 40, 1165, 1405, 1074, 1988
## $ city_nm   <fct> San Antonio, Laredo, Corpus Christi, Houston, Lubboc...
## $ city_nbr  <int> 37450, 24000, 9800, 19750, 25650, 2100, 13400, 10850...
## $ inc_flag  <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes
## $ cnty_seat <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes
## $ city_fips <fct> 4865000, 4841464, 4817000, 4835000, 4845000, 4805000...
## $ pop1990   <int> 935933, 122899, 257453, 1630553, 186206, 465622, 515...
## $ pop2000   <int> 1144646, 176576, 277454, 1953631, 199564, 656562, 56...
## $ pop2010   <int> 1327407, 236091, 305215, 2099451, 229573, 790390, 64...
## $ cnty_nbr  <fct> 15, 240, 178, 102, 152, 227, 72, 57, 69
## $ dist_nbr  <fct> 15, 22, 16, 12, 5, 14, 24, 18, 6
## $ x         <dbl> -98, -100, -97, -95, -102, -98, -106, -97, -102
## $ y         <dbl> 29, 28, 28, 30, 34, 30, 32, 33, 32
## $ geometry  <POINT [°]> POINT (-98 29), POINT (-100 28), POINT (-97 2...

hwys_sf %>% glimpse()

## Observations: 225
## Variables: 15
## $ fid        <dbl> 1, 8, 9, 14, 15, 19, 20, 22, 25, 26, 30, 38, 53, 69...
## $ rte_nm     <fct> SH0155-KG, US0190-KG, SH0155-KG, IH0014-KG, SH0155-...
## $ rte_prfx   <fct> SH, US, SH, IH, SH, US, SH, FM, IH, US, BS, SH, SH,...
## $ rte_nbr    <dbl> 155, 190, 155, 14, 155, 190, 155, 510, 14, 77, 71, ...
## $ rdbd_type  <fct> KG, KG, KG, KG, KG, KG, KG, KG, KG, KG, KG, KG, KG,...
## $ begin_dfo  <dbl> 78.750, 308.163, 41.640, 283.029, 55.883, 255.923, ...
## $ end_dfo    <dbl> 123.1, 340.2, 55.0, 302.3, 68.4, 277.5, 40.1, 22.2,...
## $ asset_nm   <fct> Memorial Highway, Memorial Highway, Memorial Highwa...
## $ memorial_h <fct> Blue Star Memorial Highways, Port to Plains Highway...
## $ asset_cmnt <fct> Assigned by Minute Order, H. C. R.# 157, 5/7/85, As...
## $ des_type   <fct> Other, Other, Other, Other, Other, Other, Other, Lo...
## $ system     <fct> On, On, On, On, On, On, On, On, On, On, On, On, On,...
## $ shape_len  <dbl> 0.6718, 0.5125, 0.2142, 0.3193, 0.1902, 0.3563, 0.6...
## $ n          <int> 20, 2, 20, 2, 20, 2, 20, 1, 2, 1, 2, 20, 1, 6, 4, 4...
## $ geometry   <LINESTRING [°]> LINESTRING (-95 32, -95 32,..., LINESTR...

tx_border %>% glimpse()

## Observations: 1,088
## Variables: 6
## $ long      <dbl> -94, -94, -94, -94, -94, -94, -94, -94, -94, -94, -9...
## $ lat       <dbl> 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, ...
## $ group     <dbl> 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, ...
## $ order     <int> 12203, 12204, 12205, 12206, 12207, 12208, 12209, 122...
## $ region    <chr> "texas", "texas", "texas", "texas", "texas", "texas"...
## $ subregion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
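
For reference, a border data frame with these columns can be pulled straight from ggplot2::map_data(). This is only a minimal sketch: the name tx_border_sketch is hypothetical, and the exact call I used may differ (see the repository). It assumes the {maps} package is installed.

```r
library(ggplot2)

# map_data() extracts polygon outlines from the {maps} package.
# `tx_border_sketch` is an illustrative name, not the object used in the post.
tx_border_sketch <- map_data("state", region = "texas")

# Same six columns as in the glimpse above: long, lat, group, order, region, subregion.
str(tx_border_sketch)
```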

Here I create the grid of data that I’ll use for the visual. (Thanks to Jonas for his example.)

counties_grid_sf <-
  counties_sf %>%
  st_make_grid(n = c(40, 40))

schools_grid_sf <-
  counties_sf %>%
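  # NOTE: `county` is the only column the two data sets share, so join on it explicitly.
  left_join(schools_tea_filt, by = "county") %>%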
  select(value) %>%
  # NOTE: Set `extensive = FALSE` to get the mean. Otherwise, set `extensive = TRUE` for the sum.
  st_interpolate_aw(to = counties_grid_sf, extensive = FALSE) %>%
  st_centroid() %>%
  cbind(st_coordinates(.))

schools_grid_sf %>% glimpse()

## Observations: 855
## Variables: 5
## $ Group_1  <dbl> 26, 27, 28, 29, 30, 64, 65, 66, 67, 68, 69, 70, 103, ...
## $ value    <dbl> NA, NA, 465, 465, 465, NA, NA, NA, NA, 463, 465, 465,...
## $ X        <dbl> -98, -98, -98, -97, -97, -99, -99, -98, -98, -98, -97...
## $ Y        <dbl> 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 27, 2...
## $ geometry <POINT [°]> POINT (-98 26), POINT (-98 26), POINT (-98 26)...
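
To make the `extensive` switch a bit more concrete, here is a toy sketch (not the Texas data). The helper make_rect() and the two made-up "regions" are hypothetical and exist only for illustration. One target cell overlaps a quarter of each of two unit squares, so the area-weighted mean and the area-weighted sum come out differently.

```r
library(sf)

# Hypothetical helper: an axis-aligned rectangle as an sf polygon.
make_rect <- function(x0, y0, x1, y1) {
  st_polygon(list(rbind(
    c(x0, y0), c(x1, y0), c(x1, y1), c(x0, y1), c(x0, y0)
  )))
}

# Two adjacent unit squares with made-up values (no CRS, so plain planar geometry).
regions <- st_sf(
  value = c(400, 600),
  geometry = st_sfc(make_rect(0, 0, 1, 1), make_rect(1, 0, 2, 1))
)

# One target cell that covers a quarter of each square.
cell <- st_sfc(make_rect(0.5, 0, 1.5, 0.5))

# sf warns that attributes are assumed constant over each source area;
# that is fine for this toy, so the warning is suppressed.
cell_mean <- suppressWarnings(
  st_interpolate_aw(regions["value"], cell, extensive = FALSE)
)
cell_sum <- suppressWarnings(
  st_interpolate_aw(regions["value"], cell, extensive = TRUE)
)

cell_mean$value # area-weighted mean: (400 * 0.25 + 600 * 0.25) / 0.5 = 500
cell_sum$value  # area-weighted sum:   400 * 0.25 + 600 * 0.25       = 250
```

The actual map below uses schools_grid_sf as created above.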

viz_schools_grid <-
  schools_grid_sf %>%
  ggplot() +
  geom_polygon(
    data = tx_border,
    aes(x = long, y = lat, group = group),
    size = 1.5,
    color = "black",
    fill = NA
  ) +
  geom_sf(
    data = hwys_sf,
    linetype = "solid",
    size = 0.1
  ) +
  geom_point(
    aes(x = X, y = Y, size = value, color = value),
    # show.legend = FALSE,
    shape = 16
  ) +
  geom_sf(
    data = cities_sf,
    shape = 16,
    size = 2,
    fill = "black"
  ) +
  ggrepel::geom_label_repel(
    data = cities_sf,
    aes(x = x, y = y, label = city_nm)
  ) +
  coord_sf(datum = NA) +
  scale_color_viridis_c(option = "B", na.value = "#FFFFFF") +
  teplot::theme_map(legend.position = "bottom") +
  labs(
    title = str_wrap("Texas High School Math SAT Scores, 2015", 80),
    caption = "By Tony ElHabr."
  )
viz_schools_grid

Cool! I like this visualization because it seems to offer a finer level of detail compared to a choropleth. (In other words, it seems to emphasize specific areas in counties and not the entire county itself.) Nonetheless, there are some disadvantages to this technique.

  • There is subjectivity involved in the choice of precision for interpolation. The grid in my example seems a bit “too” granular around the San Antonio and Austin area, where it seems like there are no values at all! (Perhaps this is just an “operator error” on my part.)

  • sf::st_interpolate_aw() seems to only be capable of aggregating by sum (with extensive = TRUE) or mean (with extensive = FALSE). There are certainly some cases where other aggregation functions would be desirable. For my example, I actually would have preferred a maximum. An average is sensitive to areas with a relatively small number of schools (which, consequently, may be “over”-emphasized by the value encoding), and a sum may too strongly emphasize areas with a large number of schools without providing any insight into their scores.
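
If a maximum is what you're after, one possible workaround is to skip area-weighted interpolation entirely and aggregate point data per grid cell with sf's aggregate() method. Note that this is a different technique from the one used in this post: it is a rough sketch on made-up data, and it assumes you start from school points rather than county polygons. The names make_cell(), cells, and schools_toy are hypothetical.

```r
library(sf)

# Hypothetical helper: a unit square whose lower-left corner sits at (x0, 0).
make_cell <- function(x0) {
  st_polygon(list(rbind(
    c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0)
  )))
}

# Three grid cells in a row; the third one contains no schools.
cells <- st_sf(geometry = st_sfc(make_cell(0), make_cell(1), make_cell(2)))

# Made-up school points: two fall in the first cell, one in the second.
schools_toy <- st_sf(
  value = c(450, 520, 610),
  geometry = st_sfc(
    st_point(c(0.2, 0.2)),
    st_point(c(0.8, 0.6)),
    st_point(c(1.5, 0.5))
  )
)

# aggregate() groups the points by the cell they intersect and applies FUN.
cell_max <- aggregate(schools_toy["value"], by = cells, FUN = max)

# The first cell should report 520 and the second 610;
# the empty third cell is expected to come back as NA.
cell_max$value
```

The trade-off is that cells without any school get no value at all, whereas the area-weighted approach spreads county-level values across every cell a county touches.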

For comparison, let’s look at what a choropleth map would look like. I’ll need an additional data.frame for this exercise, schools_tea_filt_join, which is just the schools_tea_filt data joined with the county data that can be retrieved from a call to ggplot2::map_data().
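
The construction of schools_tea_filt_join isn't shown here (it's in the repository), but a minimal sketch of one way such a join could look is below. It rests on several assumptions: scores are averaged per county, county names are matched by upper-casing the subregion column, and a few toy rows (copied from the glimpse output earlier) stand in for the real data. Names ending in _toy are hypothetical, and some county spellings in {maps} may not match the TEA spellings exactly.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for schools_tea_filt, using the first few rows shown earlier.
schools_toy <- data.frame(
  county = c("BEE", "BRAZOS", "DALLAS", "DALLAS"),
  value = c(458, 567, 411, 428),
  stringsAsFactors = FALSE
)

# Assumption: one value per county, here the mean of its schools.
county_means_toy <- schools_toy %>%
  group_by(county) %>%
  summarise(value = mean(value))

# County outlines from {maps}; `subregion` holds lower-case county names.
tx_counties <- map_data("county", region = "texas") %>%
  mutate(county = toupper(subregion))

# Every polygon vertex keeps its row; counties without scores get NA.
schools_join_toy <- tx_counties %>%
  left_join(county_means_toy, by = "county")
```

The choropleth code below uses the real schools_tea_filt_join.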

viz_schools_chlr <-
  ggplot() +
  geom_polygon(
    data = tx_border,
    aes(x = long, y = lat, group = group),
    size = 1.5,
    color = "black",
    fill = NA
  ) +
  geom_polygon(
    data = schools_tea_filt_join,
    aes(x = long, y = lat, group = group, fill = value)
  ) +
  geom_sf(
    data = hwys_sf,
    linetype = "solid",
    size = 0.1
  ) +
  geom_sf(
    data = cities_sf,
    shape = 16,
    size = 2,
    fill = "black"
  ) +
  ggrepel::geom_label_repel(
    data = cities_sf,
    aes(x = x, y = y, label = city_nm)
  ) +
  coord_sf(datum = NA) +
  scale_fill_viridis_c(option = "B", na.value = "#FFFFFF") +
  teplot::theme_map(legend.position = "bottom") +
  labs(
    title = str_wrap("Texas High School Math SAT Scores, 2015", 80),
    caption = "By Tony ElHabr."
  )
viz_schools_chlr

This choropleth actually isn’t so bad, but I think I still prefer the bubble grid.
