Correlations Between Texas High School Academic Competition Results and SAT/ACT Scores


Introduction

I wanted to do a follow-up on my series of posts about Texas high school
University Interscholastic League (UIL) academic competitions to more
closely evaluate the relationship between school performance in those
competitions and school-wide SAT and ACT scores. For those who may not
be familiar with these tests, they are the two most popular standardized
tests used for college admission in the United States.

In my introduction to that series, I stated the following: “School-wide
… scores on state- and national-standardized tests (e.g. the SAT)
certainly are the most common measure of academic strength, but I think
rankings by academic competitions may be more indicative.”

Essentially, I was implying that the academic UIL scores may not
correspond well, or at all, with standardized test scores. However, I
did not attempt to prove this hypothesis, which is what I set out to do
here. While I’m at it, I’ll show the code and provide some commentary to
explain my process.

Data Collection

While I have already collected and cleaned the UIL data that I’ll need
by virtue of my work for my series of posts analyzing the UIL
competitions, I did not retrieve data for standardized test scores. To
my delight, the Texas Education Agency’s website publishes Texas high
school SAT and ACT scores for the years 2011 through 2015. The task of
scraping from this source is a perfect use-case for the super-handy
{xml2} and {rvest} packages, as well as the {stringr} and {purrr}
packages in the {tidyverse}.

library("tidyverse")
library("rlang")
library("teplot") # Personal package.
urls_tea <-
  "https://tea.texas.gov/acctres/sat_act_index.html" %>%
  xml2::read_html() %>%
  rvest::html_nodes(xpath = "//tr //td //a") %>%
  rvest::html_attr("href") %>% 
  str_subset("^\\/acctres\\/[:alpha:]{3}_[Cc]ampus_[Dd]ata")
urls_tea
create_path_tea <-
  function(url_suffix, dir = "data-raw", ext = "csv") {
    if(!dir.exists(dir)) {
       dir.create(dir)
    }
    url_suffix %>%
      str_remove_all("acctres|\\/") %>% 
      paste0(".", ext) %>% 
      file.path(dir, .)
  }
# NOTE(s):
# + `urls_tea_dl` is actually the same as `urls_tea` because `purrr::walk()` returns its input invisibly.
# + `mode = "wb"` is important! Otherwise, the downloaded files have empty lines every other line
#   (due to the way that CRs and LFs are handled).
urls_tea_dl <-
  urls_tea %>%
  walk(
    ~download.file(
      url = paste0("https://tea.texas.gov/", .x),
      destfile = create_path_tea(url_suffix = .x),
      mode = "wb"
    )
  )

Data Cleaning

Next, I bind the data from all of the downloaded files together and do
some cleaning. I put these actions in function(s) because I plan on
re-using them in future posts where I explore this data set in other
ways.

One relatively significant choice that I make here is to only include
the data for the school-wide level (via the “All Students” filtering
criterion), although data for different demographics within each school
is provided. The other data set that I am evaluating—the academic UIL
data—does not have demographic-specific information, so I want to treat
the two sets as “equally” as possible.

Additionally, in order to better understand the resulting data set, the
reader should be made aware of some of the details of the tests. The SAT
has math, reading, and writing sections, each having minimum and
maximum scores of 200 and 800, meaning that the total can range from
600 to 2400. The ACT has math, reading, english, and science
sections, each scored on a scale of 1 to 36 and combined into a single
composite (compos) score that also ranges from 1 to 36. To eliminate
duplicate columns representing the same underlying “thing”, I don’t
distinguish the math and reading section scores for each test in
separate columns, and I rename the ACT’s compos score to total,
following the convention used for the SAT’s cumulative score. The other
sections—writing for the SAT and english and science for the ACT—
are not really analogous to sections in the other test, so they are
filled with NAs appropriately.
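
For reference, the section structure described above can be summarized
in a small lookup table (this tibble is purely illustrative; it is not
part of the data collected below):

# Illustrative only: the section score ranges described above.
# (Assumes the {tidyverse} is already loaded, as in the scraping chunk above.)
test_score_ranges <-
  tribble(
    ~test, ~section,  ~score_min, ~score_max,
    "SAT", "math",           200,        800,
    "SAT", "reading",        200,        800,
    "SAT", "writing",        200,        800,
    "SAT", "total",          600,       2400,
    "ACT", "math",             1,         36,
    "ACT", "reading",          1,         36,
    "ACT", "english",          1,         36,
    "ACT", "science",          1,         36,
    "ACT", "compos",           1,         36
  )
test_score_ranges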

Finally, for the interested reader, there are some details regarding the
code implementation that I document in comments (both for explaining
actions for myself and for the reader).

import_tea_data <-
  function(path, rgx_grp) {
    res <-
      path %>%
      readr::read_csv() %>%
      rename_all(funs(tolower))
    
    if(!is.null(rgx_grp)) {
      res <-
        res %>%
        filter(group %>% str_detect(rgx_grp))
    }
    res <-
      res %>%
      select(
        matches(
          "^group$|name$|math|reading|writing|total|english|science|compos"
        )
      )
    res
  }

import_tea_data_cleanly <-
  function(urls, rgx_grp, ...) {

    res <-
      urls %>%
      create_path_tea(...) %>%
      tibble(path = .) %>%
      mutate(
        test = stringr::str_extract(path, "([Ss][Aa][Tt])|([Aa][Cc][Tt])") %>% toupper(),
        year = stringr::str_extract(path, "[:digit:]+") %>% as.integer()
      ) %>%
      mutate(contents = purrr::map(path, ~import_tea_data(.x, rgx_grp = rgx_grp))) %>%
      unnest() %>%
      # NOTE: No longer need this column(s) any more.
      select(-path) %>%
      mutate_at(vars(total), funs(ifelse(test == "ACT", compos, .))) %>% 
      # NOTE: No longer need this column(s) any more.
      select(-compos) %>% 
      # NOTE: Rearranging score columns in a more logical fashion.
      select(-total, everything(), total) %>% 
      # NOTE: Renaming "important" columns.
      rename(school = campname,
             district = distname,
             county = cntyname,
             city = regnname) %>%
      mutate_if(is.character, funs(str_replace_all(., "=|\"", ""))) %>%
      mutate_at(vars(school, district, county, city), funs(toupper)) %>% 
      # NOTE: Some county names are truncated and end with COUN or COUNT.
      # (The max seems to be 18 characters).
      # Fortunately, there are no county names with COUN in their names, so the following
      # regular expression is sufficient.
      mutate_at(vars(county), funs(str_remove_all(., "\\s+COUN.*$"))) %>%
      # NOTE: Remove all HS/H S at the end of school names, as well as ampersands.
      # This seems to improve join percentages with other data sets.
      mutate_at(vars(school), funs(str_remove_all(., "([H]\\s*[S]$)|(\\s+\\&)") %>% str_trim())) %>%
      # NOTE: This is an attempt to resolve duplicates in the raw data.
      # group_by_at(vars(matches("test|year|school|district|county|city"))) %>% 
      # summarise_all(funs(max(., na.rm = TRUE))) %>% 
      # ungroup() %>% 
      arrange(test, year, school)
    res
  }
schools_tea <-
  urls_tea %>%
  import_tea_data_cleanly(rgx_grp = "All Students")
schools_tea
| test | year | school | district | county | city | math | reading | writing | english | science | total |
|------|------|--------|----------|--------|------|------|---------|---------|---------|---------|-------|
| ACT | 2011 | A C JONES | BEEVILLE ISD | BEE | CORPUS CHRISTI | 19 | 18 | NA | 17 | 19 | 18 |
| ACT | 2011 | A J MOORE ACAD | WACO ISD | MCLENNAN | WACO | 19 | 18 | NA | 16 | 18 | 18 |
| ACT | 2011 | A M CONS | COLLEGE STATION ISD | BRAZOS | HUNTSVILLE | 26 | 24 | NA | 23 | 24 | 24 |
| ACT | 2011 | A MACEO SMITH HIGH SCHOOL | DALLAS ISD | DALLAS | RICHARDSON | 16 | 14 | NA | 13 | 15 | 14 |
| ACT | 2011 | ABBOTT SCHOOL | ABBOTT ISD | HILL | WACO | 20 | 20 | NA | 19 | 21 | 20 |
| ACT | 2011 | ABERNATHY | ABERNATHY ISD | HALE | LUBBOCK | 22 | 20 | NA | 19 | 21 | 21 |
| ACT | 2011 | ABILENE | ABILENE ISD | TAYLOR | ABILENE | 21 | 21 | NA | 20 | 21 | 21 |
| ACT | 2011 | ACADEMY | ACADEMY ISD | BELL | WACO | 24 | 23 | NA | 21 | 24 | 23 |
| ACT | 2011 | ACADEMY HIGH SCHOOL | HAYS CISD | HAYS | AUSTIN | NA | NA | NA | NA | NA | NA |
| ACT | 2011 | ACADEMY OF CAREERS AND TECHNOLOGIE | ACADEMY OF CAREERS AND TECHNOLOGIE | BEXAR | SAN ANTONIO | 15 | 14 | NA | 12 | 14 | 14 |
| ACT | 2011 | ACADEMY OF CREATIVE ED | NORTH EAST ISD | BEXAR | SAN ANTONIO | NA | NA | NA | NA | NA | NA |
| ACT | 2011 | ADRIAN SCHOOL | ADRIAN ISD | OLDHAM | AMARILLO | 19 | 18 | NA | 20 | 19 | 19 |
| ACT | 2011 | ADVANTAGE ACADEMY | ADVANTAGE ACADEMY | DALLAS | RICHARDSON | 18 | 20 | NA | 19 | 16 | 18 |
| ACT | 2011 | AGUA DULCE | AGUA DULCE ISD | NUECES | CORPUS CHRISTI | 21 | 19 | NA | 18 | 20 | 19 |
| ACT | 2011 | AIM CENTER | VIDOR ISD | ORANGE | BEAUMONT | NA | NA | NA | NA | NA | NA |
| ACT | 2011 | AKINS | AUSTIN ISD | TRAVIS | AUSTIN | 19 | 17 | NA | 16 | 17 | 17 |
| ACT | 2011 | ALAMO HEIGHTS | ALAMO HEIGHTS ISD | BEXAR | SAN ANTONIO | 25 | 24 | NA | 24 | 24 | 24 |
| ACT | 2011 | ALBA-GOLDEN | ALBA-GOLDEN ISD | WOOD | KILGORE | 20 | 19 | NA | 18 | 20 | 19 |
| ACT | 2011 | ALBANY JR-SR | ALBANY ISD | SHACKELFORD | ABILENE | 24 | 22 | NA | 21 | 22 | 22 |
| ACT | 2011 | ALDINE | ALDINE ISD | HARRIS | HOUSTON | 19 | 17 | NA | 16 | 18 | 18 |

# of total rows: 15,073
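
As a quick sanity check (my own addition, not in the original post), one
might verify that both tests are represented in each of the five years:

# Hypothetical check: row counts by test and year.
schools_tea %>% 
  count(test, year) %>% 
  spread(year, n)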

EDA: Year-to-Year Correlations

First, before evaluating the primary concern at hand—the relationship
between the academic UIL scores and the SAT/ACT scores (available in the
schools_tea data created above)—I want to verify that there is some
non-trivial relationship among the scores for a given school on a given
test across years. (I would be surprised if this were not shown to be
true.)

schools_tea_cors_byyear <-
  schools_tea %>%
  distinct(test, year, school, .keep_all = TRUE) %>%
  filter(!is.na(total)) %>%
  unite(test_school, test, school) %>%
  widyr::pairwise_cor(
    feature = test_school,
    item = year,
    value = total
  ) %>% 
  rename(year1 = item1, year2 = item2, cor = correlation)
schools_tea_cors_byyear %>% 
  filter(year1 <= year2)
| year1 | year2 | cor |
|-------|-------|-----|
| 2011 | 2012 | 0.80 |
| 2011 | 2013 | 0.76 |
| 2012 | 2013 | 0.86 |
| 2011 | 2014 | 0.69 |
| 2012 | 2014 | 0.78 |
| 2013 | 2014 | 0.83 |
| 2011 | 2015 | 0.64 |
| 2012 | 2015 | 0.74 |
| 2013 | 2015 | 0.78 |
| 2014 | 2015 | 0.86 |

![](viz_schools_tea_cors_byyear_show-1.png)

As expected, there are some strong correlations among the years for
school-wide scores on these tests.
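
The plotting code for the figure above is not shown here; the following
is a minimal sketch (using the `schools_tea_cors_byyear` data frame
created above) of how such a pairwise-correlation plot might be drawn
with {ggplot2}. The original figure's aesthetics may differ.

# A minimal sketch of a year-to-year correlation tile plot (not necessarily the original code).
schools_tea_cors_byyear %>%
  filter(year1 <= year2) %>%
  ggplot(aes(x = factor(year1), y = factor(year2), fill = cor)) +
  geom_tile() +
  geom_text(aes(label = sprintf("%.2f", cor))) +
  labs(x = NULL, y = NULL, fill = "cor") +
  theme_minimal()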

Ok, now let’s bring in the “cleaned” school data (`schools_uil`) that I
collected and cleaned in my UIL analysis. I’ll subset the data to
include only the same years found in `schools_tea`—2011 through 2015.

| school | city | complvl_num | score | year | conf | complvl | comp | advanced | n_state | n_bycomp | prnk | n_defeat | w |
|--------|------|-------------|-------|------|------|---------|------|----------|---------|----------|------|----------|---|
| HASKELL | HASKELL | 13 | 616 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 8 | 1.00 | 7 | TRUE |
| POOLVILLE | POOLVILLE | 13 | 609 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 8 | 0.86 | 6 | FALSE |
| LINDSAY | LINDSAY | 17 | 553 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 1.00 | 6 | TRUE |
| PLAINS | PLAINS | 3 | 537 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 10 | 1.00 | 9 | TRUE |
| SAN ISIDRO | SAN ISIDRO | 32 | 534 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 4 | 1.00 | 3 | TRUE |
| CANADIAN | CANADIAN | 7 | 527 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 1.00 | 6 | TRUE |
| GARDEN CITY | GARDEN CITY | 10 | 518 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 8 | 1.00 | 7 | TRUE |
| WATER VALLEY | WATER VALLEY | 10 | 478 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 8 | 0.86 | 6 | FALSE |
| GRUVER | GRUVER | 7 | 464 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 7 | 0.83 | 5 | FALSE |
| YANTIS | YANTIS | 19 | 451 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 10 | 1.00 | 9 | TRUE |
| SHINER | SHINER | 27 | 450 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 9 | 1.00 | 8 | TRUE |
| WEST TEXAS | STINNETT | 7 | 443 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 7 | 0.67 | 4 | FALSE |
| HONEY GROVE | HONEY GROVE | 17 | 440 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 0.83 | 5 | FALSE |
| LATEXO | LATEXO | 23 | 439 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 10 | 1.00 | 9 | TRUE |
| MUENSTER | MUENSTER | 17 | 436 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 7 | 0.67 | 4 | FALSE |
| VAN HORN | VAN HORN | 1 | 436 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 1.00 | 6 | TRUE |
| SLOCUM | ELKHART | 23 | 415 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 10 | 0.89 | 8 | FALSE |
| ERA | ERA | 17 | 415 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 7 | 0.50 | 3 | FALSE |
| GOLDTHWAITE | GOLDTHWAITE | 15 | 413 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 1.00 | 6 | TRUE |
| NEWCASTLE | NEWCASTLE | 12 | 408 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 10 | 1.00 | 9 | TRUE |

# of total rows: 27,359

Now let’s try to evaluate whether or not year-to-year correlations also
exist with this data set.

Importantly, some choice about how to quantify performance needs to be
made. As I discussed in my long-form series of posts exploring the UIL
academic data, the evaluation of performance is somewhat subjective.
Should we use the number of times a school advanced to the next level of
competition in a given year? (Note that there are three competition
levels—District, Region, and State.) What about the number of other
schools it “defeated” in head-to-head competitions? In that separate
analysis, I made the choice to use the percentile rank (prnk) of the
school’s placings across all competition levels for a given competition
type (comp). I believe this measure best represents a school’s quality
of performance (where a higher value indicates better performance). As I
stated there when explaining my choice to use percent rank for
identifying “dominant” individuals: “I choose to use percent rank—which
is always a value between 0 and 1—because it inherently accounts for
the wide range in the number of competitors across all competitions. (For
this context, a percent rank of 1 corresponds to the highest score in a
given competition, and, conversely, a value of 0 corresponds to the
lowest score.)”
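
To illustrate how `percent_rank()` behaves (a toy example of my own, not
from the original analysis), consider the scores from a single
hypothetical competition:

# Toy example: the top score gets a percent rank of 1, the bottom score gets 0,
# regardless of how many competitors there are.
tibble(score = c(616, 609, 553, 537, 478)) %>%
  mutate(prnk = percent_rank(score)) %>%
  arrange(desc(score))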

Aside from this decision regarding performance evaluation in academic
UIL competitions, note that I treat the competition type (comp) in
schools_uil as analogous to the test variable indicating SAT or ACT
score in the schools_tea data set. For those who have not read through
my UIL analysis, note that scores for five different competition types
were collected—Calculator Applications, Computer Science,
Mathematics, Number Sense, and Science.

schools_uil_cors_byyear <-
  schools_uil %>% 
  select(year, school, city, comp, prnk) %>% 
  group_by(year, school, city, comp) %>% 
  summarise(prnk_sum = sum(prnk, na.rm = TRUE)) %>%
  ungroup() %>% 
  unite(comp_school, comp, school) %>%
  widyr::pairwise_cor(
    feature = comp_school,
    item = year,
    value = prnk_sum
  ) %>% 
  rename(year1 = item1, year2 = item2, cor = correlation)

schools_uil_cors_byyear %>% 
  filter(year1 <= year2) 

| year1 | year2 | cor |
|-------|-------|-----|
| 2011 | 2012 | 0.74 |
| 2011 | 2013 | 0.63 |
| 2012 | 2013 | 0.72 |
| 2011 | 2014 | 0.53 |
| 2012 | 2014 | 0.60 |
| 2013 | 2014 | 0.75 |
| 2011 | 2015 | 0.48 |
| 2012 | 2015 | 0.52 |
| 2013 | 2015 | 0.61 |
| 2014 | 2015 | 0.70 |

We can see that correlations among years do exist, as we would expect.
The strength of the correlations decreases for years that are farther
apart, which is also what we might expect.
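
One simple way to quantify that second observation (again, my own
addition) is to average the correlations by the size of the gap between
the two years:

# Average correlation by the number of years separating the pair.
# (Coercion guards against the year columns coming back as character from {widyr}.)
schools_uil_cors_byyear %>%
  mutate(gap = as.integer(as.character(year2)) - as.integer(as.character(year1))) %>%
  filter(gap > 0) %>%
  group_by(gap) %>%
  summarise(cor_mean = mean(cor)) %>%
  ungroup()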

“Final” Correlation Analysis

So, at this point, I have set myself up to do that which I set out to
do—evaluate the relationship between the academic UIL competition scores
and the national SAT/ACT scores.

In order to put the two sets of data on “equal grounds”, I only evaluate
math scores. In particular, I filter comp in the UIL data to just the
mathematically-based competitions—Calculator Applications,
Mathematics, and Number Sense—excluding Science and
Computer Science. And, for the SAT/ACT data, I select only the math
score, which is available for both tests, excluding the total and
reading scores available for each, as well as the writing, english,
and science scores available for only one or the other. (Perhaps the ACT’s
science score could be compared to the Science UIL scores, but I
choose not to do so here.)

schools_uil_math <-
  schools_uil %>%
  filter(str_detect(comp, "Calculator|Math|Number")) %>%
  group_by(year, school, city) %>% 
  summarise(prnk_sum = sum(prnk, na.rm = TRUE)) %>%
  ungroup() %>% 
  # NOTE: "Renormalize" `prnk_sum`.
  mutate(math_prnk = percent_rank(prnk_sum)) %>% 
  select(-prnk_sum)
schools_uil_math
| year | school | city | math_prnk |
|------|--------|------|-----------|
| 2011 | ABBOTT | ABBOTT | 0.82 |
| 2011 | ABERNATHY | ABERNATHY | 0.59 |
| 2011 | ABILENE | ABILENE | 0.00 |
| 2011 | ACADEMY | KINGSVILLE | 0.70 |
| 2011 | ACADEMY | LITTLE RIVER | 0.81 |
| 2011 | ACADEMY OF FINE ARTS | FORT WORTH | 0.55 |
| 2011 | ADAMS | DALLAS | 0.47 |
| 2011 | ADAMSON | DALLAS | 0.42 |
| 2011 | ADRIAN | ADRIAN | 0.36 |
| 2011 | ADVANTAGE ACADEMY | WAXAHACHE | 0.22 |
| 2011 | AGUA DULCE | AGUA DULCE | 0.57 |
| 2011 | ALAMO HEIGHTS | SAN ANTONIO | 0.70 |
| 2011 | ALBA-GOLDEN | ALBA | 0.72 |
| 2011 | ALBANY | ALBANY | 0.95 |
| 2011 | ALEDO | ALEDO | 0.89 |
| 2011 | ALEXANDER | LAREDO | 0.56 |
| 2011 | ALICE | ALICE | 0.10 |
| 2011 | ALLEN | ALLEN | 0.85 |
| 2011 | ALPINE | ALPINE | 0.57 |
| 2011 | ALTO | ALTO | 0.19 |

# of total rows: 5,596
schools_tea_math <-
  schools_tea %>%
  select(test, year, school, city, math) %>%
  filter(!is.na(math)) %>% 
  group_by(test) %>% 
  mutate(math_prnk = percent_rank(math)) %>%
  ungroup() %>% 
  group_by(year, school, city) %>% 
  summarise_at(vars(math_prnk), funs(mean(., na.rm = TRUE))) %>% 
  ungroup()
schools_tea_math
| year | school | city | math_prnk |
|------|--------|------|-----------|
| 2011 | A C JONES | CORPUS CHRISTI | 0.51 |
| 2011 | A J MOORE ACAD | WACO | 0.24 |
| 2011 | A M CONS | HUNTSVILLE | 0.97 |
| 2011 | A MACEO SMITH HIGH SCHOOL | RICHARDSON | 0.03 |
| 2011 | A+ ACADEMY | RICHARDSON | 0.14 |
| 2011 | ABBOTT SCHOOL | WACO | 0.72 |
| 2011 | ABERNATHY | LUBBOCK | 0.63 |
| 2011 | ABILENE | ABILENE | 0.60 |
| 2011 | ACADEMY | WACO | 0.91 |
| 2011 | ACADEMY HIGH SCHOOL | AUSTIN | 0.32 |
| 2011 | ACADEMY OF CAREERS AND TECHNOLOGIE | SAN ANTONIO | 0.03 |
| 2011 | ACADEMY OF CREATIVE ED | SAN ANTONIO | 0.48 |
| 2011 | ADRIAN SCHOOL | AMARILLO | 0.27 |
| 2011 | ADVANTAGE ACADEMY | RICHARDSON | 0.25 |
| 2011 | AGUA DULCE | CORPUS CHRISTI | 0.66 |
| 2011 | AKINS | AUSTIN | 0.25 |
| 2011 | ALAMO HEIGHTS | SAN ANTONIO | 0.95 |
| 2011 | ALBA-GOLDEN | KILGORE | 0.52 |
| 2011 | ALBANY JR-SR | ABILENE | 0.83 |
| 2011 | ALDINE | HOUSTON | 0.23 |

# of total rows: 7,730
schools_join_math <-
  schools_tea_math %>%
  rename_at(vars(matches("^math")), funs(paste0("tea_", .))) %>% 
  inner_join(schools_uil_math %>% 
             rename_at(vars(matches("^math")), funs(paste0("uil_", .))),
           by = c("year", "school", "city")) %>%
  select(year, school, city, matches("math"))
schools_join_math
| year | school | city | tea_math_prnk | uil_math_prnk |
|------|--------|------|---------------|---------------|
| 2011 | ABILENE | ABILENE | 0.60 | 0.00 |
| 2011 | ALAMO HEIGHTS | SAN ANTONIO | 0.95 | 0.70 |
| 2011 | AMERICAS | EL PASO | 0.31 | 0.69 |
| 2011 | ANDERSON | AUSTIN | 0.98 | 0.64 |
| 2011 | ANDRESS | EL PASO | 0.15 | 0.56 |
| 2011 | ARLINGTON HEIGHTS | FORT WORTH | 0.63 | 0.49 |
| 2011 | AUSTIN | AUSTIN | 0.89 | 0.22 |
| 2011 | AUSTIN | EL PASO | 0.22 | 0.68 |
| 2011 | AUSTIN | HOUSTON | 0.09 | 0.85 |
| 2011 | BEL AIR | EL PASO | 0.17 | 0.49 |
| 2011 | BERKNER | RICHARDSON | 0.80 | 0.50 |
| 2011 | BOSWELL | FORT WORTH | 0.83 | 0.22 |
| 2011 | BOWIE | AUSTIN | 0.97 | 0.70 |
| 2011 | BOWIE | EL PASO | 0.15 | 0.15 |
| 2011 | BRANDEIS | SAN ANTONIO | 0.79 | 0.39 |
| 2011 | BREWER | FORT WORTH | 0.48 | 0.19 |
| 2011 | BURBANK | SAN ANTONIO | 0.22 | 0.76 |
| 2011 | BURGES | EL PASO | 0.57 | 0.70 |
| 2011 | CALALLEN | CORPUS CHRISTI | 0.47 | 0.93 |
| 2011 | CANUTILLO | EL PASO | 0.18 | 0.81 |

# of total rows: 699
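
Before reducing the relationship to a single number, a scatterplot of
the two percentile ranks gives a visual sense of its strength. (This
plot is a sketch of my own; it does not appear in the original post.)

# A sketch: TEA (SAT/ACT) math percentile rank vs. UIL math percentile rank.
schools_join_math %>%
  ggplot(aes(x = tea_math_prnk, y = uil_math_prnk)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "TEA (SAT/ACT) math percentile rank",
    y = "UIL math percentile rank"
  ) +
  theme_minimal()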
schools_join_math_cors <-
  schools_join_math %>%
  select(-year) %>% 
  select_if(is.numeric) %>%
  corrr::correlate()
schools_join_math_cors
| rowname | tea_math_prnk | uil_math_prnk |
|---------|---------------|---------------|
| tea_math_prnk | NA | 0.36 |
| uil_math_prnk | 0.36 | NA |

So, this correlation value—0.36—seems fairly low. At face value, it
certainly does not provide any basis to claim that schools that do well
in the academic UIL competitions also do well with SAT/ACT tests.
However, perhaps if I used a different methodology, the result would be
different. Other metrics used to quantify academic UIL performance could
be tested in some kind of sensitivity analysis.
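
As a sketch of what such a sensitivity analysis might look like (this is
hypothetical and not part of the original analysis), one could swap the
percentile-rank metric for the summed number of schools “defeated”
(`n_defeat`) and recompute the same correlation:

# Hypothetical sensitivity check: quantify UIL math performance by summed
# `n_defeat` (renormalized with `percent_rank()`) instead of summed `prnk`.
schools_uil_math_alt <-
  schools_uil %>%
  filter(str_detect(comp, "Calculator|Math|Number")) %>%
  group_by(year, school, city) %>%
  summarise(n_defeat_sum = sum(n_defeat, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(uil_math_prnk_alt = percent_rank(n_defeat_sum)) %>%
  select(-n_defeat_sum)

schools_tea_math %>%
  rename(tea_math_prnk = math_prnk) %>%
  inner_join(schools_uil_math_alt, by = c("year", "school", "city")) %>%
  select_if(is.numeric) %>%
  select(-year) %>%
  corrr::correlate()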

EDA: Year-to-Year Correlations, Cont.

While I won’t do any kind of rigorous second evaluation here, I do want
to try to quantify the impact of the “missing” data dropped due to
mismatched school names. If all possible data had been used, would the
final correlation value have increased (or decreased)? Although finding
a direct answer to this question is impossible, we can evaluate the
difference between the year-to-year correlations of scores for the
schools that are successfully joined and the correlations calculated
for all schools in the “unjoined” schools_tea and schools_uil data sets.
If we find that there are large discrepancies (one way or the other),
then we may have some reason to believe that the 0.36 number found above
is misleading.

To perform this task, I create a couple of intermediary data sets, as
well as some functions.

schools_postjoin_math_tidy <-
  schools_join_math %>%
  unite(school_city, school, city) %>% 
  gather(metric, value, matches("prnk"))
pairwise_cor_f1 <-
  function(data, which = c("tea", "uil")) {
    which <- match.arg(which)
    data %>%
      filter(metric %>% str_detect(which)) %>% 
      # filter_at(vars(value), all_vars(!is.nan(.))) %>% 
      widyr::pairwise_cor(
        feature = school_city,
        item = year,
        value = value
      ) %>% 
      rename(year1 = item1, year2 = item2, cor = correlation) %>% 
      mutate(source = which %>% toupper())
  }

pairwise_cor_f2 <-
  function(data, which = c("tea", "uil")) {
    which <- match.arg(which)
    col <-
      data %>%
      names() %>% 
      str_subset("math")
    data %>%
      unite(school_city, school, city) %>% 
      rename(value = !!rlang::sym(col)) %>%
      widyr::pairwise_cor(
        feature = school_city,
        item = year,
        value = value
      ) %>% 
      rename(year1 = item1, year2 = item2, cor = correlation) %>% 
      mutate(source = which %>% toupper())
  }
schools_postjoin_math_cors_byyear <-
  bind_rows(
    schools_postjoin_math_tidy %>% 
      pairwise_cor_f1("tea"),
    schools_postjoin_math_tidy %>% 
      pairwise_cor_f1("uil")
  )

schools_prejoin_math_cors_byyear <-
  bind_rows(
    schools_tea_math %>%
      pairwise_cor_f2("tea"),
    schools_uil_math %>%
      pairwise_cor_f2("uil")
  )

schools_math_cors_byyear_diffs <-
  schools_postjoin_math_cors_byyear %>% 
  inner_join(schools_prejoin_math_cors_byyear, 
             by = c("year1", "year2", "source"), 
             suffix = c("_join", "_unjoin")) %>% 
  mutate(cor_diff = cor_join - cor_unjoin)

Ok, enough of the data munging—let’s review the results!

schools_math_cors_byyear_diffs_wide <-
  schools_math_cors_byyear_diffs %>% 
  filter(year1 <= year2) %>% 
  select(-matches("join$")) %>% 
  unite(year_pair, year1, year2) %>% 
  spread(source, cor_diff)
schools_math_cors_byyear_diffs_wide
| year_pair | TEA | UIL |
|-----------|-----|-----|
| 2011_2012 | 0.09 | -0.02 |
| 2011_2013 | 0.05 | -0.04 |
| 2011_2014 | 0.11 | -0.06 |
| 2011_2015 | 0.24 | -0.03 |
| 2012_2013 | 0.01 | 0.00 |
| 2012_2014 | 0.06 | 0.04 |
| 2012_2015 | 0.15 | 0.00 |
| 2013_2014 | -0.01 | -0.03 |
| 2013_2015 | 0.08 | 0.00 |
| 2014_2015 | 0.05 | -0.01 |

Note that the correlations in the joined data are a bit “stronger”—in
the sense that they are more positive—among the TEA SAT/ACT data,
although not dramatically so. Additionally, the differences for the UIL
data are trivial. Thus, we might say that the data lost to the join
would not have changed the singular correlation value found above—0.36—by
much at all.
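
To condense the table above even further (my own addition), one could
summarize the average difference, and the average absolute difference,
by source:

# Average (and average absolute) difference in correlation, by data source.
schools_math_cors_byyear_diffs %>%
  filter(year1 <= year2) %>%
  group_by(source) %>%
  summarise(
    cor_diff_mean = mean(cor_diff),
    cor_diff_abs_mean = mean(abs(cor_diff))
  ) %>%
  ungroup()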

Conclusion

So, my initial inclination in my analysis of academic UIL competitions
seems correct—there is no significant relationship between Texas high
school academic competition scores and standardized test scores (for
math, between 2011 and 2015). And, with that question answered, I intend
to explore this rich data set in other ways in future blog posts.
