While my time at the 2011 Joint Statistical Meetings was short (I unfortunately missed some presentations I would have liked to attend), it was a great experience. The mix of academics and professionals is very different from the other conferences I have attended (like Sport Management and Tourism conferences), and the interest in the methods themselves at JSM really forced me to be on my toes.
While there, I got the chance to put some faces to the names I have seen around the blogosphere. It was a pleasure to meet both Phil Birnbaum, of Sabermetric Research Blog, and David Smith, VP of Revolution Analytics Marketing and author of the Revolutions Blog. David asked about sharing my poster (joint with fellow graduate student Steve Salaga) investigating Hockey Hall of Fame induction using the R package “randomForest”. While ‘machine learning’ can sound intimidating to some, Random Forests are actually quite a simple method: they grow many classification trees on bootstrap samples of the data, consider only a random subset of the variables at each split, and use the observations left out of each tree’s bootstrap sample as a hold-out sample, so that over-fitting is kept to a minimum. And what better way to implement it than with sports data?!
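For readers who want to see those three ingredients in code, here is a minimal sketch in plain Python. It is not the randomForest package and not the code behind the poster: each "tree" is just a one-split stump, and the data, the column meanings (career goals and assists), and the function names are all made up for illustration.

```python
import random

def bootstrap_split(n_rows, rng):
    """Draw n_rows row indices with replacement; also return the rows never drawn."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_rows) if i not in chosen]
    return in_bag, out_of_bag

def fit_stump(X, y, rows, features):
    """Pick the (feature, threshold) with the fewest errors on `rows`.
    The stump predicts induction when the feature value is >= threshold."""
    best = None
    for f in features:
        for t in sorted({X[i][f] for i in rows}):
            errors = sum((X[i][f] >= t) != bool(y[i]) for i in rows)
            if best is None or errors < best[0]:
                best = (errors, f, t)
    return best[1], best[2]

def fit_forest(X, y, n_trees=200, mtry=1, seed=0):
    """Grow n_trees stumps, each on its own bootstrap sample and random feature subset."""
    rng = random.Random(seed)
    n_rows, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        in_bag, oob = bootstrap_split(n_rows, rng)
        features = rng.sample(range(n_features), mtry)  # random variable selection
        f, t = fit_stump(X, y, in_bag, features)
        forest.append((f, t, oob))  # remember which rows this tree never saw
    return forest

def oob_vote_share(forest, X, row):
    """Share of 'induct' votes among the trees that did not train on `row`."""
    votes = [X[row][f] >= t for f, t, oob in forest if row in oob]
    return sum(votes) / len(votes) if votes else None

# Made-up skaters: [career goals, career assists]; 1 = inducted (illustrative only).
X = [[50, 80], [600, 900], [120, 150], [700, 1000],
     [90, 300], [550, 700], [30, 40], [480, 820]]
y = [0, 1, 0, 1, 0, 1, 0, 1]

forest = fit_forest(X, y)
for row in range(len(X)):
    print(row, y[row], oob_vote_share(forest, X, row))
```

Because every player is scored only by the trees that never saw him, the vote share acts as a built-in hold-out prediction. A real forest grows full trees and re-draws the candidate variables at every split; with stumps there is only one split, so the draw happens once per tree.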
As a side note, this is not the first time we have implemented randomForest for sports data. Steve and I have a forthcoming paper in the Journal of Quantitative Analysis in Sports identifying patterns in BBWAA voting for the Baseball Hall of Fame. Our paper is similar to a recent work by Frieman (2011) in the same journal, but we add pitchers and a discussion of exclusions based on race. As a whole, there does not seem to be any negative effect of being a minority when it comes to BBWAA voting, at least according to the method we use.
So back to the Hockey Hall of Fame. For both this poster and the baseball paper, it is important to note that we are not attempting to gauge who should be in the Hall of Fame based on their performance as a player. Rather, we are attempting to gauge how well each player aligns with the views of the Hall of Fame Voting Committee and whether or not they were ‘snubbed’ based on how the committee would be predicted to vote. If the committee is terrible at gauging the best players, then our model will be as well. We are simply interested in the voting behavior and committee preferences, and not who the best players really are. This is an important distinction in attempting to find any exclusions based on qualitative variables like race or language, rather than attempting to rank the best players in the game.
We only include simple statistics, as we expect committee members to focus mostly on these, and goalies are not included in the analysis. Unfortunately, statistics for goalies are few and far between, and the NHL has not kept Save Percentage for long enough to include it in any worthwhile prediction model for goalies. Therefore, only skaters are included. We separate forwards and defensemen, but the only significant difference is the importance of Assists (they matter more for defensemen).
For example, classifying baseball player inductions on WAR or Win Shares tells us who probably should be in the Hall based on their on-field performance. However, BBWAA voters do not necessarily use these metrics when voting. Therefore, we want to train our model on what BBWAA voters do pay attention to. The same goes for hockey. The most important statistics for classifying players are what you would expect, and they are presented using the Random Forest’s “Variable Importance” metric.
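To give a feel for what a variable importance number measures, here is a rough sketch of the permutation idea: shuffle one statistic at a time and see how much classification accuracy falls. This is an illustration only. The randomForest package does its own version of this on each tree's out-of-bag sample (and also reports a Gini-based measure), whereas the sketch below uses a fixed made-up data set and a hypothetical one-variable rule, `toy_rule`, standing in for the fitted forest.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows where the prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each column, one column at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for f in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[f] for row in X]
            rng.shuffle(column)  # break the link between this column and the label
            X_perm = [row[:f] + [v] + row[f + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Made-up skaters: [career points, penalty minutes]; 1 = inducted (illustrative only).
X = [[300, 900], [1400, 500], [450, 1200], [1600, 700],
     [600, 300], [1250, 1100], [200, 400], [1100, 800]]
y = [0, 1, 0, 1, 0, 1, 0, 1]

def toy_rule(row):
    """Hypothetical stand-in for a fitted model: induct at 1000+ career points."""
    return int(row[0] >= 1000)

importance = permutation_importance(toy_rule, X, y)
print(dict(zip(["points", "penalty_minutes"], importance)))
```

A statistic the model never uses cannot hurt accuracy when it is shuffled, so its importance stays at zero, while shuffling a statistic the model leans on scrambles the predictions.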
This also allowed us to qualitatively evaluate the decision rule boundaries built by the forest and assess the possibility of certain players being discriminated against based on language. There is a line of (conflicting) economic literature–mostly in the 1980s and 1990s–that has made claims of language-based discrimination in the labor market for hockey, so we found the Hall of Fame voting to be another good test of this. Long story short, however, there does not seem to be anything systematic going on. But we leave that up to the reader, as we present each of the players near the boundaries of the decision rules from the forest.
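As a rough illustration of what "near the boundaries" can mean in practice, one simple approach is to look at players whose share of tree votes sits close to the 50% cutoff. The sketch below assumes that approach; the poster may define the boundary differently, and the player labels, vote shares, and the `margin` parameter are all hypothetical.

```python
def near_boundary(vote_share, inducted, cutoff=0.5, margin=0.15):
    """List players whose vote share is within `margin` of `cutoff`, closest first.
    Each entry is (player, vote share, predicted induction, actual induction)."""
    rows = []
    for player, share in vote_share.items():
        if abs(share - cutoff) <= margin:
            predicted = share >= cutoff
            rows.append((player, share, predicted, inducted[player]))
    return sorted(rows, key=lambda r: abs(r[1] - cutoff))

# Hypothetical vote shares and induction outcomes, for illustration only.
vote_share = {"player_a": 0.97, "player_b": 0.58, "player_c": 0.46,
              "player_d": 0.03, "player_e": 0.62}
inducted = {"player_a": True, "player_b": False, "player_c": True,
            "player_d": False, "player_e": True}

borderline = near_boundary(vote_share, inducted)
# "Snubbed" here means: the votes say induct, but the player is not in the Hall.
snubbed = [p for p, share, predicted, actual in borderline if predicted and not actual]
for entry in borderline:
    print(entry)
print("possible snubs:", snubbed)
```

A short list like this is what a reader would then inspect by hand, checking whether the borderline cases share some trait such as language.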
For those interested in the full analysis, you’ll have to wait for the paper. As always, there are further considerations for this sort of investigation, not the least of which is testing the RF algorithm against other classification techniques (like neural networks, discriminant analysis, simple classification trees, and others). We’ll have to address those, as well as other great comments from those who stopped by at the conference. However, a detailed summary of the current version is in THIS POSTER that we presented at JSM.
Thanks to all of those who stopped by. The conference was a great experience and I hope to return next year!