I have been sitting on this post for some time now and wanted to get it out there. The goal is to simply show how easy it is to pull live data from the web into R, massage it, and perform some analytics on it. I am not sure how useful this analysis really is in practice, but the larger point is to show you how powerful R is for very quick analysis.
I admit that I am a somewhat sloppy coder, but hopefully my comments will help you out, especially if you are new to R and are interested in things like:
- Sampling data (both rows and columns)
- Recoding values
- Re-ordering factors
- Reducing the data using principal components
- Clustering the data using these components
- Basic plotting, and how you can control everything you want on the plot
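To give a feel for these steps, here is a minimal sketch in base R. It is not the code from this post: the data is simulated so the example runs on its own, the column names (`team`, `pos`, `goals`, `assists`, `shots`, `pim`) are made up, and the choice of k-means with four clusters is an assumption on my part.

```r
# Simulated, hypothetical skater data so the sketch is self-contained.
# Column names and values are assumptions, not the real data set.
set.seed(42)
n <- 200
skaters <- data.frame(
  team    = sample(c("NJD", "TOR", "PHI", "BOS"), n, replace = TRUE),
  pos     = sample(c("C", "L", "R", "D"), n, replace = TRUE),
  goals   = rpois(n, 8),
  assists = rpois(n, 12),
  shots   = rpois(n, 60),
  pim     = rpois(n, 20),
  stringsAsFactors = FALSE
)

# Sample data: 50 random rows and 3 random columns
row_idx <- sample(nrow(skaters), 50)
col_idx <- sample(ncol(skaters), 3)
skaters_sub <- skaters[row_idx, col_idx]

# Recode values: collapse the three forward positions into "F"
skaters$pos2 <- ifelse(skaters$pos %in% c("C", "L", "R"), "F", "D")

# Re-order factor levels so forwards come first
skaters$pos2 <- factor(skaters$pos2, levels = c("F", "D"))

# Reduce the numeric columns with principal components (centered and scaled)
stats  <- skaters[, c("goals", "assists", "shots", "pim")]
pca    <- prcomp(stats, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]   # keep the first two components

# Cluster on the component scores (k-means with 4 clusters is an assumption)
km <- kmeans(scores, centers = 4, nstart = 25)
skaters$cluster <- factor(km$cluster)

# Basic plot with explicit control over symbols, labels, and legend
plot(scores, col = km$cluster, pch = 19, cex = 0.8,
     xlab = "PC1", ylab = "PC2",
     main = "Skaters clustered on principal components")
legend("topright", legend = levels(skaters$cluster),
       col = 1:4, pch = 19, title = "Cluster")
```

Calling `summary(pca)` shows how much variance each component explains, which is a quick way to decide how many components to keep before clustering.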
The code can be found here. The plots below show you some of the output.
As mentioned above, this wasn’t aimed at being an in-depth review of team performance or skater ability, but I think you can see where this analysis could go. The team distribution plot shows how each team's skaters are distributed across the clusters, with reference lines that would break each team into 4 equal-sized groups.
If you follow the NHL, take a look at New Jersey or Toronto. These two teams are not having the best seasons, and according to this plot, more than half of each team consists of skaters who fall into the 2 lowest-performing clusters. In addition, look at Philadelphia, one of the better teams in the league. More than 25% of their team was clustered into the top-performing group.
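A plot like the team distribution one can be put together with a few lines of base R. This is only a sketch under stated assumptions: the data frame below is simulated, the `team` and `cluster` columns are hypothetical names, and the cluster labels are random rather than ordered by performance.

```r
# Hypothetical input: one row per skater with a team and a cluster label.
# Simulated here so the sketch runs on its own.
set.seed(1)
teams <- c("NJD", "TOR", "PHI", "BOS")
skaters <- data.frame(
  team    = sample(teams, 200, replace = TRUE),
  cluster = factor(sample(1:4, 200, replace = TRUE), levels = 1:4)
)

# Share of each team's skaters in each cluster (each column sums to 1)
counts <- table(skaters$cluster, skaters$team)
shares <- prop.table(counts, margin = 2)

# One stacked horizontal bar per team; extra x-range leaves room for the legend
barplot(shares, horiz = TRUE, las = 1, col = gray.colors(4),
        xlim = c(0, 1.35),
        xlab = "Share of skaters",
        main = "Team distribution by skater cluster",
        legend.text = paste("Cluster", rownames(shares)),
        args.legend = list(x = "topright", bty = "n"))

# Reference lines that split each bar into 4 equal-sized groups
abline(v = c(0.25, 0.5, 0.75), lty = 2)
```

For the reading described above to hold, the clusters would first need to be ordered from lowest to highest performing, for example by re-ordering the factor levels before building the table.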