Visualizing Texas High School SAT Math Scores with Bubble Grids

[This article was first published on r on Tony ElHabr, and kindly contributed to R-bloggers.]

Two awesome things inspired this post:

As Jonas implies, using color as a visual encoding is not always the
best option, a notion with which I strongly agree.
Cartograms try to address the
ambiguity of color encoding with distortion of land area/distance, but I
think the result can be difficult to interpret. Bubble grid maps seem to
me to be an interesting alternative that can potentially display
information in a more direct manner.

With that being said, I decided to adapt Jonas’ code to visualize the
Texas high school SAT/ACT data that I’ve looked at in other posts. To
simplify the visual encoding of information, I’ll filter the data down
to a single statistic: the SAT math scores for the year 2015. (For
other applications, the statistic might be population density, average
median household income, etc.) For the geo-spatial data, I downloaded
the shapefiles for schools and counties provided by the Texas Education
Agency (TEA). Additionally, I downloaded shapefiles for Texas cities
and for Texas highways, provided by the Texas Department of
Transportation (TxDOT). By plotting the major cities and roadways in
the state, the locations of “sparsely” populated areas should be
evident, which can explain why there doesn’t appear to be any data in
some regions. Finally, I’ll also use the Texas state and county border
data provided by ggplot2::map_data() (which essentially just serves as
a wrapper for extracting data provided in the {maps} package).

library("tidyverse")
library("teplot")
library("sf")

I’ll skip over the data collection and munging steps and just show the
cleaned data that I’m using. (See the GitHub repository for the full
code.)
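For readers who want a starting point without digging through the repository, here is a minimal sketch of how the state border data might be pulled. It is an assumption on my part that a simple filter on `region` is enough; the actual munging code may differ. The shapefile path in the comment is purely hypothetical.

```r
library(dplyr)
library(ggplot2) # map_data() also needs the {maps} package to be installed

# State outlines for the lower 48, filtered down to Texas.
tx_border <-
  map_data("state") %>%
  filter(region == "texas")

# The TEA/TxDOT shapefiles could be read in a similar spirit with sf::st_read(),
# e.g. sf::st_read("path/to/schools.shp"). (Hypothetical path, not the real one.)
head(tx_border)
```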

schools_tea_filt %>% glimpse()

## Observations: 1,567
## Variables: 7
## $ test     <chr> "SAT", "SAT", "SAT", "SAT", "SAT", "SAT", "SAT", "SAT...
## $ year     <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,...
## $ school   <chr> "A C JONES", "A M CONS", "A MACEO SMITH NEW TECH", "A...
## $ district <chr> "BEEVILLE ISD", "COLLEGE STATION ISD", "DALLAS ISD", ...
## $ county   <chr> "BEE", "BRAZOS", "DALLAS", "DALLAS", "HILL", "HALE", ...
## $ city     <chr> "CORPUS CHRISTI", "HUNTSVILLE", "RICHARDSON", "RICHAR...
## $ value    <dbl> 458, 567, 411, 428, 539, 533, 482, 538, 507, 428, 428...

schools_sf %>% glimpse()

## Observations: 8,701
## Variables: 11
## $ schl_nm  <int> 20901109, 58905001, 15909001, 15915119, 101907149, 10...
## $ school   <fct> HOOD-CASE EL, KLONDIKE ISD, SOMERSET, HOWSMAN EL, WAR...
## $ distrct  <fct> ALVIN ISD, KLONDIKE ISD, SOMERSET ISD, NORTHSIDE ISD,...
## $ city     <fct> ALVIN, LAMESA, SOMERSET, SAN ANTONIO, CYPRESS, HOUSTO...
## $ county   <fct> BRAZORIA, DAWSON, BEXAR, BEXAR, HARRIS, HARRIS, HUTCH...
## $ regn_nm  <int> 4, 17, 20, 20, 4, 4, 16, 11, 11, 4, 11, 1, 10, 11, 19...
## $ grd_grp  <fct> EE PK KG 01 02 03 04 05, EE PK KG 01 02 03 04 05 06 0...
## $ grd_gr_  <int> 1, 5, 4, 1, 1, 2, 4, 4, 4, 1, 1, 1, 4, 4, 1, 2, 1, 2,...
## $ instr_t  <fct> REGULAR INSTRUCTIONAL, REGULAR INSTRUCTIONAL, REGULAR...
## $ magnet   <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N...
## $ geometry <POINT [°]> POINT (-95 29), POINT (-102 33), POINT (-99 29...

counties_sf %>% glimpse()

## Observations: 254
## Variables: 2
## $ county   <fct> DALLAM, BORDEN, FISHER, SHERMAN, STEPHENS, HANSFORD, ...
## $ geometry <POLYGON [°]> POLYGON ((-102 37, -102 37,..., POLYGON ((-1...

cities_sf %>% glimpse()

## Observations: 9
## Variables: 15
## $ objectid  <int> 3028, 3212, 2291, 2984, 258, 623, 863, 1402, 1658
## $ gid       <int> 2703, 3058, 2835, 2659, 40, 1165, 1405, 1074, 1988
## $ city_nm   <fct> San Antonio, Laredo, Corpus Christi, Houston, Lubboc...
## $ city_nbr  <int> 37450, 24000, 9800, 19750, 25650, 2100, 13400, 10850...
## $ inc_flag  <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes
## $ cnty_seat <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes
## $ city_fips <fct> 4865000, 4841464, 4817000, 4835000, 4845000, 4805000...
## $ pop1990   <int> 935933, 122899, 257453, 1630553, 186206, 465622, 515...
## $ pop2000   <int> 1144646, 176576, 277454, 1953631, 199564, 656562, 56...
## $ pop2010   <int> 1327407, 236091, 305215, 2099451, 229573, 790390, 64...
## $ cnty_nbr  <fct> 15, 240, 178, 102, 152, 227, 72, 57, 69
## $ dist_nbr  <fct> 15, 22, 16, 12, 5, 14, 24, 18, 6
## $ x         <dbl> -98, -100, -97, -95, -102, -98, -106, -97, -102
## $ y         <dbl> 29, 28, 28, 30, 34, 30, 32, 33, 32
## $ geometry  <POINT [°]> POINT (-98 29), POINT (-100 28), POINT (-97 2...

hwys_sf %>% glimpse()

## Observations: 225
## Variables: 15
## $ fid        <dbl> 1, 8, 9, 14, 15, 19, 20, 22, 25, 26, 30, 38, 53, 69...
## $ rte_nm     <fct> SH0155-KG, US0190-KG, SH0155-KG, IH0014-KG, SH0155-...
## $ rte_prfx   <fct> SH, US, SH, IH, SH, US, SH, FM, IH, US, BS, SH, SH,...
## $ rte_nbr    <dbl> 155, 190, 155, 14, 155, 190, 155, 510, 14, 77, 71, ...
## $ rdbd_type  <fct> KG, KG, KG, KG, KG, KG, KG, KG, KG, KG, KG, KG, KG,...
## $ begin_dfo  <dbl> 78.750, 308.163, 41.640, 283.029, 55.883, 255.923, ...
## $ end_dfo    <dbl> 123.1, 340.2, 55.0, 302.3, 68.4, 277.5, 40.1, 22.2,...
## $ asset_nm   <fct> Memorial Highway, Memorial Highway, Memorial Highwa...
## $ memorial_h <fct> Blue Star Memorial Highways, Port to Plains Highway...
## $ asset_cmnt <fct> Assigned by Minute Order, H. C. R.# 157, 5/7/85, As...
## $ des_type   <fct> Other, Other, Other, Other, Other, Other, Other, Lo...
## $ system     <fct> On, On, On, On, On, On, On, On, On, On, On, On, On,...
## $ shape_len  <dbl> 0.6718, 0.5125, 0.2142, 0.3193, 0.1902, 0.3563, 0.6...
## $ n          <int> 20, 2, 20, 2, 20, 2, 20, 1, 2, 1, 2, 20, 1, 6, 4, 4...
## $ geometry   <LINESTRING [°]> LINESTRING (-95 32, -95 32,..., LINESTR...

tx_border %>% glimpse()

## Observations: 1,088
## Variables: 6
## $ long      <dbl> -94, -94, -94, -94, -94, -94, -94, -94, -94, -94, -9...
## $ lat       <dbl> 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, ...
## $ group     <dbl> 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, ...
## $ order     <int> 12203, 12204, 12205, 12206, 12207, 12208, 12209, 122...
## $ region    <chr> "texas", "texas", "texas", "texas", "texas", "texas"...
## $ subregion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...

Here I create the grid of data that I’ll use for the visual. (Thanks to
Jonas for his example.)

counties_grid_sf <-
  counties_sf %>%
  st_make_grid(n = c(40, 40))

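schools_grid_sf <-
  counties_sf %>%
  # Join explicitly on `county`, the only column the two data sets share.
  left_join(schools_tea_filt, by = "county") %>%
  select(value) %>%
  # NOTE: Set `extensive = FALSE` to get the mean. Otherwise, set `extensive = TRUE` for the sum.
  st_interpolate_aw(to = counties_grid_sf, extensive = FALSE) %>%
  # Collapse each grid cell to its center point and append the X/Y coordinates for plotting.
  st_centroid() %>%
  cbind(st_coordinates(.))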

schools_grid_sf %>% glimpse()

## Observations: 855
## Variables: 5
## $ Group_1  <dbl> 26, 27, 28, 29, 30, 64, 65, 66, 67, 68, 69, 70, 103, ...
## $ value    <dbl> NA, NA, 465, 465, 465, NA, NA, NA, NA, 463, 465, 465,...
## $ X        <dbl> -98, -98, -98, -97, -97, -99, -99, -98, -98, -98, -97...
## $ Y        <dbl> 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 27, 2...
## $ geometry <POINT [°]> POINT (-98 26), POINT (-98 26), POINT (-98 26)...
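Since the Texas data isn't bundled with this post, here is a small self-contained sketch of the same grid-and-interpolate idea using the North Carolina counties file that ships with {sf}. The `BIR74` column and the projected CRS (EPSG:32119) are stand-ins chosen for illustration; they are not part of my Texas workflow.

```r
library(sf)
library(dplyr)

# North Carolina counties ship with {sf}. Project to a planar CRS first.
nc <-
  st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE) %>%
  st_transform(32119)

# A 10 x 5 grid of rectangular cells covering the bounding box.
grid <- st_make_grid(nc, n = c(10, 5))

# Area-weighted interpolation of a county-level value onto the grid cells.
# `extensive = FALSE` treats the value as an average, not a count.
grid_mean <-
  nc %>%
  select(BIR74) %>%
  st_interpolate_aw(to = grid, extensive = FALSE)

# Replace each cell with its center point and append X/Y columns for geom_point().
grid_pts <- grid_mean
st_geometry(grid_pts) <- st_centroid(st_geometry(grid_mean))
grid_pts <- cbind(grid_pts, st_coordinates(grid_pts))
```

Changing `n` here is the quickest way to get a feel for how the grid resolution affects the final picture.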

viz_schools_grid <-
  schools_grid_sf %>%
  ggplot() +
  geom_polygon(
    data = tx_border,
    aes(x = long, y = lat, group = group),
    size = 1.5,
    color = "black",
    fill = NA
  ) +
  geom_sf(
    data = hwys_sf,
    linetype = "solid",
    size = 0.1
  ) +
  geom_point(
    aes(x = X, y = Y, size = value, color = value),
    # show.legend = FALSE,
    shape = 16
  ) +
  geom_sf(
    data = cities_sf,
    shape = 16,
    size = 2,
    fill = "black"
  ) +
  ggrepel::geom_label_repel(
    data = cities_sf,
    aes(x = x, y = y, label = city_nm)
  ) +
  coord_sf(datum = NA) +
  scale_color_viridis_c(option = "B", na.value = "#FFFFFF") +
  teplot::theme_map(legend.position = "bottom") +
  labs(
    title = str_wrap("Texas High School Math SAT Scores, 2015", 80),
    caption = "By Tony ElHabr."
  )
viz_schools_grid

Cool! I like this visualization because it seems to offer a finer level
of detail compared to a choropleth. (In other words, it seems to
emphasize specific areas in counties and not the entire county itself.)
Nonetheless, there are some disadvantages of this technique.

  • There is subjectivity involved in the choice of precision for
    interpolation. The grid in my example seems a bit “too” granular
    around the San Antonio and Austin area, where it seems like there
    are no values at all! (Perhaps this is just an “operator error” on
    my part.)

  • sf::st_interpolate_aw() seems to only be capable of aggregating by
    sum (with extensive = TRUE) or mean (with extensive = FALSE).
    There are certainly some cases where other aggregation functions
    would be desirable. For my example, I actually would have preferred
    a maximum. An average is sensitive to areas with a relatively small
    number of schools (that, consequently, may be “over”-emphasized by
    the value encoding); and a sum may too strongly emphasize areas with
    a large number of schools without providing any insight into their
    scores.
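One possible workaround for the aggregation limitation, which I have not applied to the Texas data, is to skip the areal interpolation and instead assign points to grid cells with a spatial join, then summarise with whatever function is wanted. The sketch below uses the North Carolina data bundled with {sf}, with county centroids standing in for school locations and `BIR74` standing in for a score; `cell_id` is a name I made up for the example.

```r
library(sf)
library(dplyr)

nc <-
  st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE) %>%
  st_transform(32119)

# Stand-in for point-level school data: one point per county.
pts <- st_sf(value = nc$BIR74, geometry = st_centroid(st_geometry(nc)))

# Grid cells as an sf object with an explicit id column.
g <- st_make_grid(nc, n = c(10, 5))
grid <- st_sf(cell_id = seq_along(g), geometry = g)

# Assign each point to the cell containing it, then take the max per cell.
cell_max <-
  pts %>%
  st_join(grid, join = st_within) %>%
  st_drop_geometry() %>%
  filter(!is.na(cell_id)) %>%
  group_by(cell_id) %>%
  summarise(value = max(value))

# Attach the result back to the cells and collapse them to center points.
grid_max <- inner_join(grid, cell_max, by = "cell_id")
st_geometry(grid_max) <- st_centroid(st_geometry(grid_max))
```

Swapping `max()` for `median()` or any other summary function is a one-line change, which is the main appeal over st_interpolate_aw() here. The trade-off is that cells without any points simply drop out instead of receiving an interpolated value.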

For comparison’s sake, let’s look at what a choropleth map would look
like. I’ll need an additional data.frame for this exercise,
schools_tea_filt_join, which is just the schools_tea_filt data joined
with the county data that can be retrieved from a call to
ggplot2::map_data().
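As a rough sketch of what such a join might look like (the real code lives in the repository and may differ), the county polygons from map_data() use lower-case names in `subregion`, so the upper-case `county` values need to be lower-cased first. Averaging to one value per county is my assumption here, and the toy rows below are just the first few values shown in the glimpse() output above.

```r
library(dplyr)
library(ggplot2) # map_data() also needs the {maps} package to be installed

# Toy stand-in for schools_tea_filt, using the first six rows printed earlier.
schools_toy <- tibble(
  county = c("BEE", "BRAZOS", "DALLAS", "DALLAS", "HILL", "HALE"),
  value = c(458, 567, 411, 428, 539, 533)
)

# One value per county (assumption: a simple mean), keyed by lower-case name.
schools_by_county <-
  schools_toy %>%
  mutate(subregion = tolower(county)) %>%
  group_by(subregion) %>%
  summarise(value = mean(value))

# County outlines for Texas, with the score attached to every polygon vertex.
joined <-
  map_data("county") %>%
  filter(region == "texas") %>%
  left_join(schools_by_county, by = "subregion")
```

Counties without a match keep `NA` for `value`, which is what the `na.value` argument in the fill scale is there to handle.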

viz_schools_chlr <-
  ggplot() +
  geom_polygon(
    data = tx_border,
    aes(x = long, y = lat, group = group),
    size = 1.5,
    color = "black",
    fill = NA
  ) +
  geom_polygon(
    data = schools_tea_filt_join,
    aes(x = long, y = lat, group = group, fill = value)
  ) +
  geom_sf(
    data = hwys_sf,
    linetype = "solid",
    size = 0.1
  ) +
  geom_sf(
    data = cities_sf,
    shape = 16,
    size = 2,
    fill = "black"
  ) +
  ggrepel::geom_label_repel(
    data = cities_sf,
    aes(x = x, y = y, label = city_nm)
  ) +
  coord_sf(datum = NA) +
  scale_fill_viridis_c(option = "B", na.value = "#FFFFFF") +
  teplot::theme_map(legend.position = "bottom") +
  labs(
    title = str_wrap("Texas High School Math SAT Scores, 2015", 80),
    caption = "By Tony ElHabr."
  )
viz_schools_chlr

This choropleth actually isn’t so bad, but I think I still prefer the
bubble grid.
