An Analysis of Texas High School Academic Competition Results, Part 4 – Schools

[This article was first published on r on Tony ElHabr, and kindly contributed to R-bloggers.]

Having investigated individuals elsewhere, let’s now take a look at the
schools.


Although I began the examinations of competitions and individuals by
looking at volume of participation (to provide context), I’ll skip an
analogous discussion here because the participation of schools is shown
indirectly through those analyses.

School Scores

Let’s begin by looking at some of the same metrics shown for individual
students, but aggregated across all students for each school. In order
to give the reader some insight into school performance, I’ll rank and
show schools by a single metric of performance. To be consistent, I’ll
use the same metric used for ranking the individuals: summed percentile
rank of scores (prnk_sum).
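
As a rough illustration of how a summed percentile rank could be built up, here is a minimal Python sketch. It is not the code behind this analysis. It assumes that the percentile rank is computed within each competition by rescaling minimum ranks to [0, 1] (the convention used by `dplyr::percent_rank`), and the record layout, the function names, and the scores are all made up for the example.

```python
from collections import defaultdict


def percent_rank(values):
    """Rescale ranks to [0, 1]; tied values share the lowest rank."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    ordered = sorted(values)
    # ordered.index(v) is the number of values strictly smaller than v.
    return [ordered.index(v) / (n - 1) for v in values]


def summarise_schools(records):
    """records: iterable of (competition_id, school, score) tuples."""
    by_competition = defaultdict(list)
    for competition_id, school, score in records:
        by_competition[competition_id].append((school, score))

    # Rank scores within each competition, then collect the ranks per school.
    per_school = defaultdict(list)
    for entries in by_competition.values():
        ranks = percent_rank([score for _, score in entries])
        for (school, _), prnk in zip(entries, ranks):
            per_school[school].append(prnk)

    return {
        school: {
            "n": len(prnks),
            "prnk_sum": sum(prnks),
            "prnk_mean": sum(prnks) / len(prnks),
        }
        for school, prnks in per_school.items()
    }


# Made-up scores for two competitions, purely for illustration.
records = [
    ("math_district", "A", 300),
    ("math_district", "B", 250),
    ("math_district", "C", 100),
    ("science_district", "A", 180),
    ("science_district", "B", 220),
    ("science_district", "C", 90),
]
summary = summarise_schools(records)
```

Because the ranks are summed, a school that competes often accumulates a larger prnk_sum than an equally strong school that competes rarely, which is why a mean is worth reading alongside the sum.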

NOTE: For the same reason stated before for showing my own
scores among the individuals, I’ll include the numbers for my high
school (“CLEMENS”) in applicable contexts.

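| rnk | school | city | n | prnk_sum | prnk_mean | n_defeat_sum | n_defeat_mean | n_advanced_sum | n_state_sum |
|----:|:--------------|:------------|----:|-------:|-----:|----:|-----:|----:|---:|
| 1 | ARGYLE | ARGYLE | 168 | 159.01 | 0.95 | 867 | 5.16 | 109 | 53 |
| 2 | CLEMENTS | SUGAR LAND | 174 | 149.88 | 0.86 | 936 | 5.38 | 109 | 47 |
| 3 | LINDSAY | LINDSAY | 154 | 134.39 | 0.87 | 791 | 5.14 | 93 | 40 |
| 4 | KLEIN | KLEIN | 152 | 131.13 | 0.86 | 783 | 5.15 | 87 | 30 |
| 5 | DULLES | SUGAR LAND | 155 | 129.02 | 0.83 | 825 | 5.32 | 90 | 37 |
| 6 | WYLIE | ABILENE | 156 | 124.70 | 0.80 | 636 | 4.08 | 91 | 31 |
| 7 | GARDEN CITY | GARDEN CITY | 144 | 122.77 | 0.85 | 823 | 5.72 | 85 | 33 |
| 8 | HIGHLAND PARK | DALLAS | 149 | 121.71 | 0.82 | 655 | 4.40 | 85 | 25 |
| 9 | SALADO | SALADO | 127 | 103.31 | 0.81 | 605 | 4.76 | 73 | 30 |
| 10 | WESTWOOD | AUSTIN | 130 | 102.67 | 0.79 | 546 | 4.20 | 67 | 9 |
| 231 | CLEMENS | SCHERTZ | 77 | 43.35 | 0.56 | 233 | 3.03 | 17 | 0 |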

Note: # of total rows: 1,436

Admittedly, there’s not a lot of insight to extract from this summary
regarding individual schools. Nonetheless, it provides some useful
context regarding the magnitude of performance metric values aggregated
at the school level.

To begin gaining a better understanding of this list of top-performing
schools, let’s break down school performance by year.

Also, let’s combine the performance metric values with coordinate data
to visualize where the best schools are located.

Now, let’s visualize school dominance across years.

We saw elsewhere that there is no significant temporal trend for
competition types or competition level, but is there some kind of
temporal trend for schools? My intuition says that there should not
be any kind of significant relationship between year and performance.
Rather, going with the theory that certain schools tend to do well all
of the time, I would guess that the school itself should have some
non-trivial relationship with performance. (If this is true, it would
imply that the top-performing schools have students that are better
suited for these academic competitions, perhaps due to a strong support
group of teachers, demographics, household income, or some other factor
not quantified directly here.) Also, I hypothesize that recent
performance is probably the strongest indicator of current performance,
as it is in many different contexts. I should note that I think these
things may only be shown to be true when also factoring in competition
type: it seems more likely that schools are “elite” for certain
competition types, as opposed to all competitions in aggregate.

To put these ideas together more plainly, I am curious to know if the
success of a school in any given year can be predicted as a function of
the school itself, the year, and the school’s performance in the
previous year. [1] As before, my preference for quantifying performance
is percent rank sum (prnk_sum) of team score (relative to other
schools at a given competition level). Also, I think it’s a good idea to
“re-scale” the year value to have a first value of 1 (corresponding to
2004, the first year in the scraped data), with subsequent years taking
on subsequent integer values. (This variable is named year_idx.)
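
To make the two derived variables concrete, here is a small Python sketch of how they might be constructed. It is an illustration under stated assumptions, not the code used for this analysis: the row layout, the function name, and the column name `prnk_sum_prev` are hypothetical, and one row per school, competition type, and year is assumed. In line with footnote 1, the lagged value is taken from the most recent earlier year in which the school competed, which need not be the calendar year immediately before.

```python
FIRST_YEAR = 2004  # first year in the scraped data, per the text


def add_model_columns(rows):
    """rows: list of dicts with 'school', 'comp_type', 'year', 'prnk_sum'.

    Returns new dicts with 'year_idx' and 'prnk_sum_prev' added. The first
    observed year of each school/competition-type pair has no earlier value
    to lag from, so it is dropped.
    """
    out = []
    last_seen = {}  # (school, comp_type) -> prnk_sum from the latest earlier year
    ordered = sorted(rows, key=lambda r: (r["school"], r["comp_type"], r["year"]))
    for row in ordered:
        key = (row["school"], row["comp_type"])
        if key in last_seen:
            out.append({
                **row,
                "year_idx": row["year"] - FIRST_YEAR + 1,  # 2004 -> 1, 2005 -> 2, ...
                "prnk_sum_prev": last_seen[key],
            })
        last_seen[key] = row["prnk_sum"]
    return out


# Made-up values; note the gap in 2006.
rows = [
    {"school": "X", "comp_type": "math", "year": 2004, "prnk_sum": 1.0},
    {"school": "X", "comp_type": "math", "year": 2005, "prnk_sum": 2.0},
    {"school": "X", "comp_type": "math", "year": 2007, "prnk_sum": 3.0},
]
model_rows = add_model_columns(rows)
```

In this toy input the 2007 row picks up the 2005 value as its lag, because the school has no 2006 entry.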

So, to be explicit, a linear regression of the following form is fit
for each unique pairing of school and competition type. (Accounting for
competition type allows us to properly model the reality that a given
school may excel in some competition types but not others.)

$$
\text{prnk\_sum} = \text{intercept} + \text{prnk\_sum}_{year-1} \cdot \beta_{1} + \text{year\_idx} \cdot \beta_{2}
$$

Note that, because this formula is applied to each school-competition
type pair, the intercept term corresponds to the school entity itself.
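
A bare-bones way to picture "one regression per school-competition type pair" is sketched below in Python, using only the standard library. This is an assumption-laden illustration rather than the actual modeling code: it solves the ordinary least squares normal equations directly, skips pairs with too few observations, and does not compute p-values. The column name `prnk_sum_prev` (the most recent earlier prnk_sum) and the data are hypothetical.

```python
from collections import defaultdict


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(xs, ys):
    """Least-squares coefficients for y ~ 1 + x1 + x2 via the normal equations."""
    design = [[1.0, x1, x2] for x1, x2 in xs]
    k = 3
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)  # [intercept, beta_1, beta_2]


def fit_by_group(rows):
    """Fit one model per (school, comp_type) pair."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["school"], r["comp_type"])].append(r)
    fits = {}
    for key, grp in groups.items():
        if len(grp) < 4:  # need more observations than coefficients
            continue
        xs = [(r["prnk_sum_prev"], r["year_idx"]) for r in grp]
        ys = [r["prnk_sum"] for r in grp]
        try:
            fits[key] = fit_ols(xs, ys)
        except ValueError:
            continue  # skip pairs whose predictors are collinear
    return fits


# Synthetic rows generated from known coefficients (2, 0.5, 0.1).
rows = [
    {"school": "X", "comp_type": "math", "prnk_sum_prev": x1, "year_idx": x2,
     "prnk_sum": 2 + 0.5 * x1 + 0.1 * x2}
    for x1, x2 in [(1, 2), (2, 3), (4, 4), (3, 5), (5, 6)]
]
fits = fit_by_group(rows)
intercept, beta_1, beta_2 = fits[("X", "math")]
```

Since each pair gets its own intercept, that intercept absorbs whatever is constant about the school in that competition type, which is the sense in which it "corresponds to the school entity itself."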

The distribution of p-values for each term in the model provides some
insight regarding the predictive power of the variables. Visually, it
does seem like two of my hypotheses are valid:

  1. Recent performance does seem to be predictive of school performance
    in a given competition type in any given year.

  2. Year itself is not predictive (meaning that there is no temporal
    trend indicating that performance improves or worsens over time).

However, my other thought that school itself has some kind of predictive
value does not appear to be true. [2]
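
For readers who want a numeric companion to the visual check, the following Python sketch shows one way to summarize a set of p-values per model term. The p-values here are invented purely to show the mechanics and are not results from this analysis; the function names are likewise hypothetical. The idea rests on a standard fact: for a term with no real effect, p-values are roughly uniform on [0, 1], so about 5% should fall below 0.05, while a term with predictive value piles up near zero.

```python
def share_below(p_values, alpha=0.05):
    """Fraction of p-values strictly below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


def histogram(p_values, n_bins=10):
    """Counts of p-values in equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for p in p_values:
        # A p-value of exactly 1.0 is placed in the last bin.
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts


# Invented p-values, one list per model term, for illustration only.
p_by_term = {
    "prnk_sum_prev": [0.001, 0.01, 0.02, 0.03, 0.21, 0.63],
    "year_idx": [0.07, 0.23, 0.41, 0.55, 0.72, 0.95],
}
shares = {term: share_below(ps) for term, ps in p_by_term.items()}
bins = {term: histogram(ps) for term, ps in p_by_term.items()}
```

A flat histogram for a term is the pattern one would expect when that term carries no signal, whereas a spike in the first bin points the other way.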

Perhaps the deduction that, in general, individual schools do not
tend to dominate the rest of the competition can be understood in
another way. The distribution of the percentage of possible opponent
schools defeated at each competition level for each school should
reinforce this inference.
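
The quantity being summarized can be sketched in a few lines of Python. This is a hedged illustration, not the code used here: it assumes a single competition with one team score per school, treats ties as "not defeated," and uses made-up scores and a hypothetical function name.

```python
def pct_defeated(scores):
    """scores: dict of school -> team score for ONE competition.

    Returns school -> share of the other schools that it outscored.
    Ties count as not defeated (an assumption of this sketch).
    """
    n_opponents = len(scores) - 1
    if n_opponents < 1:
        return {}
    return {
        school: sum(
            score > other for rival, other in scores.items() if rival != school
        ) / n_opponents
        for school, score in scores.items()
    }


# Made-up team scores for a single competition.
scores = {"A": 300, "B": 250, "C": 250, "D": 100}
result = pct_defeated(scores)
```

Collecting these shares across every competition a school enters, and then grouping by competition level, would give the per-level distributions discussed here.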

Indeed, observing that the histograms do not show any noticeable
skew to the right supports the notion that, in general, individual
schools are not dominating specific competition types. If schools were
dominating, we would see some non-trivial right-hand skew. This
possibility is closest to being true (albeit not that close) at the
District level of competition (i.e. the lowest level of competition).
This observation is not all that surprising: if schools do dominate at
some level of competition, it is most likely to be at the lowest level.


Certainly analysis of schools in these academic UIL competitions
deserves some more attention than that given here, but I think some of
the biggest questions about school performance have been answered.

  1. Actually, I don’t specifically enforce the criterion that the previous year is used. Rather, I use the most recent year’s value, which may or may not be the previous year if the school did not compete in the previous year.
  2. For more information regarding interpretation of p-value distributions, I recommend reading David Robinson’s very helpful blog post on the topic.
