my proposed method for evaluating NFL kickers. To get all the results and
some extra info about the data, check it out.
When you have a hammer, everything looks like a nail, right? Well, I’m a big
fan of multilevel models and especially the ability of MCMC estimation to
fit these models with many, often sparse, groups. Great software implementations
like Stan and my favorite R interface to it,
brms, make doing applied
work pretty straightforward and even fun.
As I spent time lamenting the disappointing season my Chicago Bears have been
having, I tried to think about how seriously I should take the shakiness of
their new kicker, Eddy Pineiro. Just watching casually, it can be hard to really
know whether a kicker has had a very difficult set of kicks or not and what
an acceptable FG% would be. This got me thinking about how I could use what
I know to satisfy my curiosity.
Of course, I’m not exactly the first person to want to do something to account
for those differences. Chase Stuart over
at Football Perspectives
used field goal distance to adjust FG% for difficulty (as well as comparing
kickers across eras by factoring in generational differences in kicking success).
does something similar — adjusting for distance — when grading kickers.
Generally speaking, the evidence suggests that just dealing with kick distance
gets you very far along the path to identifying the best kickers.
Chris Clement at the
Passes and Patterns blog
provides a nice review of the statistical and theoretical approaches to the
issue. There are several things that are clear from previous efforts, besides
the centrality of kick distance. Statistically, methods based
on the logistic regression model are appropriate and the most popular —
logistic regression is a statistical method designed to predict binary events
(e.g., making/missing a field goal) using multiple sources of information. And
besides kick distance, there are certainly elements of the environment that
matter — wind, temperature, elevation among them — although just how much
and how easily they can be measured is a matter of debate.
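To make the idea concrete, here is a minimal Python sketch of what a logistic regression prediction looks like for a single kick. The coefficients are made up purely for illustration (they are not estimates from any of the models discussed here), and the function names are my own.

```python
import math

def inv_logit(x: float) -> float:
    """Map a value on the log-odds scale to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Made-up coefficients for illustration only; not estimates from any real model.
INTERCEPT = 6.0
BETA_DISTANCE = -0.11  # change in log-odds per yard of kick distance
BETA_WIND = -0.03      # change in log-odds per mph of wind

def p_make(distance_yd: float, wind_mph: float = 0.0) -> float:
    """Predicted probability of a made kick under the toy model."""
    log_odds = INTERCEPT + BETA_DISTANCE * distance_yd + BETA_WIND * wind_mph
    return inv_logit(log_odds)

# Compare calm conditions with a 15 mph wind at three distances.
for d in (25, 40, 55):
    print(d, round(p_make(d), 3), round(p_make(d, wind_mph=15), 3))
```

Each predictor shifts the log-odds additively, and the inverse logit squeezes the result into a probability between 0 and 1.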
There has also been a lot of interest in game situations, especially clutch
kicking. Do kickers perform worse under pressure, like when their kicks will
tie the game or give their team the lead late in the game? Does “icing” the
kicker, by calling a timeout before the kick, make the kicker less likely to
be successful? Do they perform worse in playoff games?
On icing, Moskowitz and Wertheim (2011),
Clark, Johnson, and Stimpson (2013),
and LeDoux (2016)
do not find compelling evidence that icing the kicker is effective. On the other
hand, Berry and Wood (2004),
Goldschmied, Nankin, and Cafri (2010),
and Carney (2016)
do find some evidence that icing hurts the kicker. All these have some limitations,
including which situations qualify as “icing” and whether we can identify
those situations in archival data. In general, to the extent there may be an
effect, it looks quite small.
Most important in this prior work is the establishment of a few approaches to
quantification. A useful way to think about comparing kickers is to know what
their expected FG% (eFG%) is. That is, given the difficulty of their kicks,
how would some hypothetical average kicker have fared? Once we have an expected FG%,
we can more sensibly look at the kicker’s actual FG%. If we have two kickers
with an actual FG% of 80%, and kicker A had an eFG% of 75% while
kicker B had an eFG% of 80%, we can say kicker A is doing better because he
clearly had a more difficult set of kicks and still made them at the same rate.
Likewise, once we have eFG%, we can compute points above average (PAA).
This is fairly straightforward since we’re basically just going to take the
gap between FG% and eFG% and weight it by the number of kicks. This allows us to
appreciate the kickers who accumulate the most impressive (or unimpressive)
kicks over the long haul. And since coaches generally won’t try kicks they
expect to be missed, it rewards kickers who win the trust of their coaches and
get more opportunities to kick.
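Here is a small Python sketch of the arithmetic, using the two hypothetical 80% kickers from above. It assumes we already have, for every attempt, the probability that an average kicker would make it; where those probabilities come from is the job of the model. The function names and data are invented for illustration, and the factor of 3 is just the point value of a field goal.

```python
POINTS_PER_FG = 3  # a made field goal is worth three points

def fg_pct(made):
    """Actual FG%: share of attempts that were made (made is a list of 0/1)."""
    return sum(made) / len(made)

def expected_fg_pct(p_avg):
    """eFG%: mean probability that an average kicker makes each attempt."""
    return sum(p_avg) / len(p_avg)

def points_above_average(made, p_avg):
    """Gap between FG% and eFG%, weighted by attempts and converted to points."""
    n = len(made)
    return POINTS_PER_FG * n * (fg_pct(made) - expected_fg_pct(p_avg))

# Hypothetical data: both kickers go 8 for 10, so both have FG% = 80%.
made = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1]
p_avg_a = [0.75] * 10  # kicker A faced harder kicks: eFG% = 75%
p_avg_b = [0.80] * 10  # kicker B faced easier kicks: eFG% = 80%

paa_a = points_above_average(made, p_avg_a)
paa_b = points_above_average(made, p_avg_b)
print(round(paa_a, 2), round(paa_b, 2))
```

Kicker A comes out roughly a point and a half ahead of average over ten kicks while kicker B is essentially average, even though their raw FG% is identical.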
Extensions of these include replacement FG% and points above
replacement, which use replacement level as a reference point rather than
average. This is useful because if you want to know whether a kicker is playing
badly enough to be fired, you need some idea of who the competition is.
PAA and eFG% are more useful when you’re talking about greatness and who
deserves a pay raise.
Pasteur and Cunningham-Rhoads — I’ll refer to them as PC-R for short —
gathered more data than most predecessors, particularly in terms of auxiliary
environmental info. They have wind, temperature, and presence/absence of
precipitation. They show fairly convincingly that while modeling kick distance
is the most important thing, these other factors are important as well. PC-R
also find the cardinal direction of every NFL stadium (i.e., does it run
north-south, east-west, etc.) and use this information along with wind direction
data to assess the presence of cross-winds, which are perhaps the trickiest for
kickers to deal with. They can’t know about headwinds/tailwinds because as far
as they (and I) can tell, nobody bothers to record which end zone teams defend
at the game’s coin toss, so we don’t know without looking at video which
direction the kick is going. They ultimately combine the total wind and the
cross wind, which suggests there is enough measurement error that they cannot
accurately capture all the cross-winds. Using
logistic regressions that account for these factors, they calculate
an eFG% and use it and its derivatives to rank the kickers.
PC-R include some predictors that, while empirically justifiable based on their
results, I don’t care to include. These are especially indicators of defense
quality, because I don’t think this should logically affect the success of a
kick and is probably related to the selection bias inherent to the coach’s
decision to try a kick or not. They also include a “kicker fatigue” variable
that appears to show that kickers attempting 5+ kicks in a game are less
successful than expected. I don’t think this makes sense and so I’m not
going to include it for my purposes.
They put some effort into defining a
“replacement-level” kicker which I think is sensible in spite of some limitations
they acknowledge. In my own efforts, I decided to do something fairly similar
by using circumstantial evidence to classify a given kicker in a given
situation as a replacement or not.
PC-R note that their model seems to overestimate the probability of making very long
kicks, which is not surprising from a statistical standpoint given that there
are rather few such kicks, they are especially likely to only be taken by those
with an above-average likelihood of making them, and the statistical assumption
of linearity is most likely to break down on the fringes like this. They also
mention it would be nice to be able to account for kickers having different
leg strengths and not just differing in their accuracy.
Osborne and Levine (I’ll call them OL) take an important step in trying to
improve upon some of these limitations. Although they don’t use this phrasing,
they are basically proposing to use multilevel models, which treat each kicker
as his own group and thereby account for the possibility (I’d say it’s a
certainty) that kickers differ from one another in skill.
Their model has several positive attributes, especially that it not only adjusts for
the apparent differences in kickers but also that it looks skeptically upon
small sample sizes. A guy who makes a 55-yard kick in his first career attempt
won’t be dubbed the presumptive best kicker of all time because the model will
by design account for the fact that a single kick isn’t very informative. This
means we can improve the prediction accuracy on kicks and also
use the model’s estimates of kicker ability without over-interpreting
small sample sizes. They also attempt to use a quadratic term for kick
distance, which could better capture the extent to which the marginal
difference of a few extra yards of distance is a lot different when you’re at
30 vs. 40 vs. 50 yards. OL are unsure about whether the model justifies
including the quadratic term but I think on theoretical grounds it makes a lot
of sense.
OL also discuss using a clog-log link rather than the logistic link, showing
that it has better predictive accuracy under some conditions. I am going to
ignore that advice for a few reasons, most importantly because the advantage is
small and also because the clog-log link is computationally intractable with
the software I’m using.
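For anyone unfamiliar with the two links, this Python sketch shows how each one turns the same linear predictor into a probability. These are the standard definitions of the inverse logit and inverse complementary log-log functions; the grid of input values is arbitrary.

```python
import math

def inv_logit(eta: float) -> float:
    """Inverse of the logistic link: symmetric around a probability of 0.5."""
    return 1.0 / (1.0 + math.exp(-eta))

def inv_cloglog(eta: float) -> float:
    """Inverse of the complementary log-log link: asymmetric around 0.5."""
    return 1.0 - math.exp(-math.exp(eta))

# Same linear predictor, two different probabilities.
for eta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(eta, round(inv_logit(eta), 4), round(inv_cloglog(eta), 4))
```

The logistic curve approaches 0 and 1 at the same rate, while the clog-log curve approaches 1 much faster than it approaches 0, which is why the two can disagree most at the extremes.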
My tool is a multilevel logistic regression fit via
MCMC using the wonderful
brms R package.
I actually considered several models for model selection.
In all cases, I have random intercepts for kicker and stadium. I also use
random slopes for both kick distance and wind at the stadium level. Using
random wind slopes at the stadium level will hopefully capture the prevailing
winds at that stadium. If they tend to be helpful, it’ll have a larger absolute
slope. Some stadiums may have swirling winds and this helps capture that as
well. The random slope for distance hopefully captures some other things, like
elevation. I also include interaction terms for wind and kick distance as well
as temperature and kick distance, since the elements may only affect longer
kicks.
There are indicators for whether the kick was “clutch” (game-tying or
go-ahead in the 4th quarter), whether the kicker was “iced,” and whether
the kick occurred in the playoffs. There is an interaction term between
clutch kicks and icing to capture the ideal icing situation as well.
I have a binary variable indicating whether the kicker was, at the time, a
replacement. In the main post, I describe the
decision rules involved in that. I have interaction terms for replacement
kickers and kick distance as well as replacement kickers and clutch kicks.
I have two random slopes at the kicker level:
- Kick distance (allowing some kickers to have stronger legs)
- Season (allowing kickers to have a career trajectory)
Season is modeled with a quadratic term so that kickers can decline over
time — it also helps with the over-time ebb and flow of NFL FG%. It would
probably be better to use a GAM for this to be more flexible, but they are a
All I’ve disclosed so far is enough to have one model. But I also explore
the form of kick distance using polynomials. OL used a quadratic term, but I’m
not sure even that is enough. I compare 2nd, 3rd, and 4th degree polynomials
for kick distance to try to improve the prediction of long kicks in particular.
Of course, going down the road of polynomials can put you on a glide path
towards the land of overfitting.
I fit several models, with combinations of the following:
- 2nd, 3rd, or 4th degree polynomial
- brms default improper priors on the fixed and random effects or
weakly informative normal priors on the fixed and random effects
- Interactions with kick distance with either all polynomial terms or just
the first and second degree terms
That last category is one that I suspected — and later confirmed — could
cause some weird results. Allowing all these things to interact with a 3rd
and 4th degree polynomial term made for some odd predictions on the fringes,
like replacement-level kickers having a predicted FG% of 70% at 70 yards out.
I looked at several criteria to compare models.
A major one was
approximate leave-one-out cross-validation.
I will show the LOOIC, which is interpreted like AIC/BIC/DIC/WAIC in terms of
lower numbers being better. This produces the same ordering as the ELPD, which
has the opposite interpretation in that higher numbers are better.
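The two always agree on ordering because LOOIC is simply the ELPD multiplied by -2. A tiny Python sketch with made-up ELPD values (the model names and numbers are hypothetical, not results from this analysis):

```python
# Hypothetical ELPD estimates for three candidate models (higher is better).
elpd = {"model_a": -5210.4, "model_b": -5208.9, "model_c": -5215.0}

# LOOIC puts the same quantity on the deviance scale (lower is better).
looic = {name: -2.0 * value for name, value in elpd.items()}

rank_by_looic = sorted(looic, key=looic.get)             # ascending LOOIC
rank_by_elpd = sorted(elpd, key=elpd.get, reverse=True)  # descending ELPD

print(rank_by_looic)
print(rank_by_elpd)
```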
Another thing I looked at was generating prediction weights for the models via
Bayesian model stacking.
I also calculated Brier scores,
which are a standard tool for looking at prediction accuracy for binary
outcomes and are simply the mean squared prediction error. Last among the
quantitative measures is the AUC
(area under the curve), which is another standard tool in the evaluation
of binary prediction models.
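Both accuracy metrics are simple enough to compute by hand. The Python sketch below uses invented outcomes and predicted probabilities. The AUC is computed in its pairwise form: the share of (made, missed) pairs in which the made kick received the higher predicted probability, with ties counting as half. That is a deliberately simple version for small data, not what a dedicated package would do internally.

```python
def brier_score(y, p):
    """Mean squared difference between outcomes (0/1) and predicted probabilities."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

def auc(y, p):
    """Share of (made, missed) pairs where the made kick got the higher prediction."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical kicks: 1 = made, 0 = missed, with a model's predicted probabilities.
y = [1, 1, 0, 1, 0, 1]
p = [0.95, 0.80, 0.60, 0.70, 0.30, 0.55]

print(round(brier_score(y, p), 4))  # lower is better
print(auc(y, p))                    # higher is better; 0.5 is chance
```

Note the opposite directions: a perfect model has a Brier score of 0 and an AUC of 1.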
Beyond these, I also plotted predictions in areas of interest where I’d like
the model to perform well (like on long kicks) and checked certain cases
where external information not shown directly to the model gives me a relatively
strong prior. Chief among these was whether it separated the strong-legged
Below I’ve summarized the model comparison results. I shade the metrics
darker wherever the number is better — sometimes lower numbers are better,
sometimes higher numbers are. The bolded, red-colored row is the model I chose.
| Polynomial degree | Interaction degree | Priors | LOOIC | Model weight | Brier score | AUC |
So why did I choose that model? The approximate LOO-CV procedure ranks it
third, although there’s enough uncertainty around those estimates
that it could easily be the best — or not as good as some ranked below it.
It’s not clear that the 4th degree polynomial does a lot of good in the
models that have it and it increases the risk of overfitting. It seems to
reduce the predicted probability of very long kicks, but as I’ve thought about
it more I’m not sure it’s a virtue.
Compared to the top two models, which distinguish themselves from the chosen
model by their use of proper priors, the model I chose does better on the
in-sample prediction accuracy metrics without looking much different on the
approximate out-of-sample ones. It doesn’t get much weight because it doesn’t
have much unique information compared to the slightly-preferred version with
proper priors. But as I looked at the models’ predictions, it appeared to me
that the regularization with the normal priors was a little too aggressive
and wasn’t picking up on the differences among kickers in leg strength.
That being said, the choices among these top few models are not very important
at all when it comes to the basics of who are the top and bottom ranked kickers.
Notes on predicted success
I initially resisted including things like replacement status and anything else
that is a fixed characteristic of a kicker (at least within a season) or
kicker-specific slopes in the
model because I planned to extract the random intercepts and use that as my
metric. Adding those things would make the random intercepts less interpretable;
if a kicker is bad and there’s no “replacement” variable, then the intercept
will be negative, but with the “replacement” variable the kicker may not have
a negative intercept after the adjustment for replacement status.
Instead, I decided to focus on model predictions. Generating the expected
FG% and replacement FG% was pretty straightforward. For eFG%, take all kicks
attempted and set
replacement = 0. For rFG%, take all kicks and set
replacement = 1.
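The logic is just counterfactual prediction: keep every kick as it happened, overwrite one variable, and average the predicted probabilities. Here is a stripped-down Python sketch of that idea. It uses a toy model with made-up coefficients and only two predictors, and it ignores the random effects and interaction terms of the real model, so treat it as an illustration of the procedure rather than the actual computation.

```python
import math

# Hypothetical coefficients on the log-odds scale, for illustration only.
COEF = {"intercept": 6.0, "distance": -0.11, "replacement": -0.30}

def p_make(kick):
    """Predicted probability of a made kick under the toy model."""
    eta = (COEF["intercept"]
           + COEF["distance"] * kick["distance"]
           + COEF["replacement"] * kick["replacement"])
    return 1.0 / (1.0 + math.exp(-eta))

def counterfactual_fg_pct(kicks, replacement):
    """Average prediction after setting `replacement` to one value for every kick."""
    preds = [p_make({**k, "replacement": replacement}) for k in kicks]
    return sum(preds) / len(preds)

# Invented attempts; the observed replacement status is overwritten for prediction.
kicks = [
    {"distance": 27, "replacement": 1},
    {"distance": 41, "replacement": 0},
    {"distance": 52, "replacement": 0},
]

efg = counterfactual_fg_pct(kicks, replacement=0)  # expected FG%
rfg = counterfactual_fg_pct(kicks, replacement=1)  # replacement FG%
print(round(efg, 3), round(rfg, 3))
```

Because both numbers are computed on the same set of kicks, any difference between them comes from the replacement variable alone and not from who attempted harder kicks.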
To generate kicker-specific probabilities, though, I had to decide how to
incorporate this information. I’d clearly overrate new, replacement-level
kickers. My solution to this was to, before generating predictions on
hypothetical data, set each kicker’s
replacement variable to his career
season, on the other hand, I could eliminate the kicker-specific
aspect of this by intentionally zeroing these effects out in the
predictions. If I wanted to predict success in a specific season, of course,
I could include this.