An Analysis of Texas High School Academic Competition Results, Part 4 – Schools

[This article was first published on r on Tony ElHabr, and kindly contributed to R-bloggers.]

Having investigated individuals elsewhere, let’s now take a look at the schools.


(Although I began the examinations of competitions and individuals by looking at volume of participation to provide context, I’ll skip an analogous discussion here because the participation of schools is shown indirectly through those analyses.)

School Scores

Let’s begin by looking at some of the same metrics shown for individual students, but aggregated across all students for each school. To give the reader some insight into school performance, I’ll rank and show schools by a single performance metric. To be consistent, I’ll use the same metric used for ranking the individuals: the summed percentile rank of scores (prnk_sum).
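To make the metric concrete, here is a minimal Python sketch of how prnk_sum and prnk_mean could be computed per school. It is not the code behind this post. The percent-rank definition used here, (minimum rank - 1) / (n - 1), is an assumption, as are the record layout, the function names, and the toy data.

```python
from collections import defaultdict


def percent_rank(values):
    """Percent rank in [0, 1]; tied values share the lowest rank.

    Assumed definition: (min_rank - 1) / (n - 1).
    """
    n = len(values)
    if n < 2:
        return [0.0] * n  # assumption: a lone competitor gets 0
    first_idx = {}
    for i, v in enumerate(sorted(values)):
        # index of first occurrence = number of strictly smaller values
        first_idx.setdefault(v, i)
    return [first_idx[v] / (n - 1) for v in values]


def summarise_schools(records):
    """Aggregate percent ranks per school and sort by prnk_sum, descending."""
    by_comp = defaultdict(list)
    for comp, school, score in records:
        by_comp[comp].append((school, score))

    prnks = defaultdict(list)
    for entries in by_comp.values():
        ranks = percent_rank([score for _, score in entries])
        for (school, _), prnk in zip(entries, ranks):
            prnks[school].append(prnk)

    summary = {
        school: {"n": len(p), "prnk_sum": sum(p), "prnk_mean": sum(p) / len(p)}
        for school, p in prnks.items()
    }
    return sorted(summary.items(), key=lambda kv: kv[1]["prnk_sum"], reverse=True)


# Made-up toy records: (competition id, school, score). Not real UIL data.
records = [
    ("comp_a", "SCHOOL_X", 90),
    ("comp_a", "SCHOOL_Y", 70),
    ("comp_a", "SCHOOL_Z", 50),
    ("comp_b", "SCHOOL_X", 40),
    ("comp_b", "SCHOOL_Y", 80),
]

ranking = summarise_schools(records)
for school, stats in ranking:
    print(school, stats)
```

Because prnk_sum is a sum, it rewards both strong results and frequent participation, which is why prnk_mean is a useful companion column.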

NOTE: For the same reason stated before for showing my own scores among the individuals, I’ll include the numbers for my high school (“CLEMENS”) in applicable contexts.

| rnk | school | city | n | prnk_sum | prnk_mean | n_defeat_sum | n_defeat_mean | n_advanced_sum | n_state_sum |
|---|---|---|---|---|---|---|---|---|---|
| 1 | ARGYLE | ARGYLE | 168 | 159.01 | 0.95 | 867 | 5.16 | 109 | 53 |
| 2 | CLEMENTS | SUGAR LAND | 174 | 149.88 | 0.86 | 936 | 5.38 | 109 | 47 |
| 3 | LINDSAY | LINDSAY | 154 | 134.39 | 0.87 | 791 | 5.14 | 93 | 40 |
| 4 | KLEIN | KLEIN | 152 | 131.13 | 0.86 | 783 | 5.15 | 87 | 30 |
| 5 | DULLES | SUGAR LAND | 155 | 129.02 | 0.83 | 825 | 5.32 | 90 | 37 |
| 6 | WYLIE | ABILENE | 156 | 124.70 | 0.80 | 636 | 4.08 | 91 | 31 |
| 7 | GARDEN CITY | GARDEN CITY | 144 | 122.77 | 0.85 | 823 | 5.72 | 85 | 33 |
| 8 | HIGHLAND PARK | DALLAS | 149 | 121.71 | 0.82 | 655 | 4.40 | 85 | 25 |
| 9 | SALADO | SALADO | 127 | 103.31 | 0.81 | 605 | 4.76 | 73 | 30 |
| 10 | WESTWOOD | AUSTIN | 130 | 102.67 | 0.79 | 546 | 4.20 | 67 | 9 |
| 231 | CLEMENS | SCHERTZ | 77 | 43.35 | 0.56 | 233 | 3.03 | 17 | 0 |

Note: The full table has 1,436 rows.

Admittedly, there’s not a lot of insight to extract from this summary regarding individual schools. Nonetheless, it provides some useful context regarding the magnitude of performance metric values aggregated at the school level.

To better understand this list of top-performing schools, let’s break down school performance by year.

Also, let’s combine the performance metric values with coordinate data to visualize where the best schools are located.

Now, let’s visualize school dominance across years.

We saw elsewhere that there is no significant temporal trend for competition types or competition level, but is there some kind of temporal trend for schools? My intuition says that there should not be any kind of significant relationship between year and performance. Rather, going with the theory that certain schools tend to do well all of the time, I would guess that the school itself should have some non-trivial relationship with performance. (If this is true, it would imply that the top-performing schools have students that are better suited for these academic competitions, perhaps due to a strong support group of teachers, demographics, household income, or some other factor not quantified directly here.) Also, I hypothesize that recent performance is probably the strongest indicator of current performance, as it is in many different contexts. I should note that I think these things may only be shown to be true when also factoring in competition type: it seems more likely that schools are “elite” for certain competition types, as opposed to all competitions in aggregate.

To put these ideas together more plainly, I am curious to know if the success of a school in any given year can be predicted as a function of the school itself, the year, and the school’s performance in the previous year. [1] As before, my preference for quantifying performance is percent rank sum (prnk_sum) of team score (relative to other schools at a given competition level). Also, I think it’s a good idea to “re-scale” the year value to have a first value of 1 (corresponding to the first year in the scraped data, 2004), with subsequent years taking on subsequent integer values. (This variable is named year_idx.)
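The two derived predictors can be sketched as follows. This is an illustrative Python version, not the post's own code. The row layout, the function name, and the column name prnk_sum_prev are hypothetical, and the sketch assumes one row per school, competition type, and year. The lag takes the most recent earlier year in which the school competed, which is not necessarily the immediately preceding calendar year.

```python
def add_year_idx_and_lag(rows, first_year=2004):
    """Add year_idx (first_year -> 1) and the most recent earlier prnk_sum.

    rows: list of dicts with keys school, comp_type, year, prnk_sum.
    prnk_sum_prev is None for a school's first appearance in a comp_type.
    """
    out = []
    last_seen = {}  # (school, comp_type) -> prnk_sum from the latest earlier year
    ordered = sorted(rows, key=lambda r: (r["school"], r["comp_type"], r["year"]))
    for row in ordered:
        key = (row["school"], row["comp_type"])
        out.append({
            **row,
            "year_idx": row["year"] - first_year + 1,
            "prnk_sum_prev": last_seen.get(key),
        })
        last_seen[key] = row["prnk_sum"]
    return out


# Made-up toy rows; note the gap in 2006.
rows = [
    {"school": "SCHOOL_X", "comp_type": "math", "year": 2007, "prnk_sum": 5.0},
    {"school": "SCHOOL_X", "comp_type": "math", "year": 2004, "prnk_sum": 3.0},
    {"school": "SCHOOL_X", "comp_type": "math", "year": 2005, "prnk_sum": 4.0},
]

prepared = add_year_idx_and_lag(rows)
for r in prepared:
    print(r["year"], r["year_idx"], r["prnk_sum_prev"])
```

For the 2007 row, the lagged value comes from 2005 because the toy school has no 2006 entry.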

So, to be explicit, a linear regression model of the following form is calculated for each unique school and competition type. (Accounting for competition type allows us to properly model the reality that a given school may excel in some competition types but not others.)

$$ \text{prnk\_sum} = \text{intercept} + \text{prnk\_sum}_{year-1} \cdot \beta_{1} + \text{year\_idx} \cdot \beta_{2} $$

Note that, because this formula is applied to each school-competition type pair, the intercept term corresponds to the school entity itself.
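As a rough illustration of fitting one such model per school-competition type pair, here is a dependency-free Python sketch that solves the ordinary least squares normal equations directly. It is not the code used for this post. The function names, the row keys, and the min_obs cutoff are assumptions. It returns coefficients only; the p-values discussed next would also need standard errors and a t distribution, which a statistics library would normally supply.

```python
from collections import defaultdict


def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(y, prev, year_idx):
    """Return [intercept, beta1, beta2] for y ~ 1 + prev + year_idx."""
    x = [[1.0, p, t] for p, t in zip(prev, year_idx)]
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(3)] for i in range(3)]
    xty = [sum(row[i] * yi for row, yi in zip(x, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)


def fit_by_group(rows, min_obs=4):
    """Fit one model per (school, comp_type); skip small or degenerate groups."""
    groups = defaultdict(list)
    for r in rows:
        if r["prnk_sum_prev"] is not None:  # first appearances have no lag
            groups[(r["school"], r["comp_type"])].append(r)
    fits = {}
    for key, g in groups.items():
        if len(g) < min_obs:  # hypothetical cutoff: more rows than parameters
            continue
        try:
            fits[key] = fit_ols(
                [r["prnk_sum"] for r in g],
                [r["prnk_sum_prev"] for r in g],
                [r["year_idx"] for r in g],
            )
        except ValueError:
            continue  # predictors were collinear for this group
    return fits


# Made-up rows generated from prnk_sum = 2 + 0.5 * prev + 0.1 * year_idx.
toy_prev = [1.0, 3.0, 2.0, 5.0, 4.0]
toy_idx = [2, 3, 4, 5, 6]
toy_rows = [
    {"school": "SCHOOL_X", "comp_type": "math", "year_idx": t,
     "prnk_sum_prev": p, "prnk_sum": 2 + 0.5 * p + 0.1 * t}
    for p, t in zip(toy_prev, toy_idx)
]

fits = fit_by_group(toy_rows)
print(fits[("SCHOOL_X", "math")])
```

Since the toy data contain no noise, the fit recovers the generating coefficients up to floating-point error.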

The distribution of p-values for each term in the model provides some insight regarding the predictive power of the variables. Visually, it does seem like two of my hypotheses are valid:

  1. Recent performance does seem to be predictive of school performance in a given competition type in any given year.

  2. Year itself is not predictive (meaning that there is no temporal trend indicating that performance improves or worsens over time).

However, my other hypothesis, that the school itself has some kind of predictive value, does not appear to be true. [2]

Perhaps the deduction that, in general, individual schools do not tend to dominate the rest of the competition can be understood in another way. The distribution of the percentage of possible opponent schools defeated at each competition level for each school should reinforce this inference.

Indeed, the histograms do not show any noticeable skew to the right, which supports the notion that, in general, individual schools are not dominating specific competition types. If schools did dominate, we would see some non-trivial right-hand skew. The District level of competition (i.e. the lowest level) comes closest to showing this pattern, albeit not that close. This is not all that surprising: if schools do dominate at some level of competition, it is most likely to be at the lowest level.
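For readers who want to reproduce this kind of check, the following Python sketch computes the share of opponents defeated at a single competition and bins the shares into a histogram. It is illustrative only. In particular, treating "defeated" as "had a strictly lower score" is an assumption, as are the function names and the toy scores.

```python
def pct_opponents_defeated(scores):
    """Map each school to the share of opponents with a strictly lower score.

    scores: dict of school -> score at one competition.
    Returns None for every school if there are no opponents.
    """
    n_opponents = len(scores) - 1
    if n_opponents < 1:
        return {school: None for school in scores}
    return {
        school: sum(
            1 for other, s in scores.items() if other != school and s < score
        ) / n_opponents
        for school, score in scores.items()
    }


def histogram(shares, n_bins=10):
    """Count values in [0, 1] per equal-width bin; 1.0 falls in the last bin."""
    counts = [0] * n_bins
    for v in shares:
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    return counts


# Made-up scores for one competition.
toy_scores = {"SCHOOL_A": 90, "SCHOOL_B": 70, "SCHOOL_C": 70, "SCHOOL_D": 50}

shares = pct_opponents_defeated(toy_scores)
print(shares)
print(histogram(list(shares.values()), n_bins=4))
```

Under this definition, a dominant school would repeatedly land in the last bin, so a pile-up of mass at the high end would be the signature of dominance.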


Certainly, the analysis of schools in these academic UIL competitions deserves more attention than is given here, but I think some of the biggest questions about school performance have been answered.

  1. Actually, I don’t specifically enforce the criterion that the previous year is used. Rather, I use the most recent year’s value, which may or may not be the previous year if the school did not compete in the previous year.
  2. For more information regarding interpretation of p-value distributions, I recommend reading David Robinson’s very helpful blog post on the topic.
