Plot the Scoring Streak of an NHL Player with R
I am a big Boston Bruins fan and have enjoyed the ups and downs over the last few years, despite the catastrophes that have occurred during the playoffs. The team struggled a few weeks ago but has recently seemed to find its stride.
During that stretch, in my opinion, Nathan Horton was a significant factor in those wins. For a long chunk of the season, though, it felt like he was in a rut. It got me thinking about how we could use some data to see how “streaky” a player is, in this case Nathan Horton.
The code below uses R to collect what we need from the web and plot the cumulative goals over the course of a season. Each dot on the line should represent a game, so it is easy to view games played versus production across each season of a career.
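To make the idea of "cumulative goals over the course of a season" concrete, here is a minimal sketch on a made-up game log. The numbers are invented for illustration and are not Horton's real stats; it uses base R's `ave()` rather than the plyr call in the full script, so that it runs without extra packages.

```r
# Toy game log: one row per game (made-up numbers, not real data)
toy <- data.frame(
  year  = c(2010, 2010, 2010, 2010, 2011, 2011, 2011),
  goals = c(1, 0, 0, 2, 0, 1, 1)
)

# Running goal total that restarts at the first game of each season
toy$cumegoals <- ave(toy$goals, toy$year, FUN = cumsum)

print(toy)
```

When this column is plotted against game date, a flat stretch is a run of games without a goal and a steep stretch is a hot streak.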
A big thank you to @Bernd on Stack Overflow for his help and to Hadley for making some great R packages. Also, I am starting to warm up to RStudio and think it's a great tool for those coming from software like SAS or SPSS.
Resulting plot:
    ## basics
    # R 2.12.2
    # windows xp; Yes, I know

    ## libraries
    library(XML)
    library(plyr)
    library(lubridate)
    library(ggplot2)

    # Set the working directory
    setwd("~/My Dropbox/Eclipse/Projects/R/NHL/Blog Posts/Player Streakiness")

    # Set the constants
    BASE <- "http://www.hockey-reference.com/players/h/hortona01/gamelog/"
    SEASON <- c(2004, 2006:2011)

    # Loop and grab the data
    ds <- data.frame()
    for (S in SEASON) {
      URL <- paste(BASE, S, "/", sep="")
      tables <- readHTMLTable(URL)$stats

      # fix factors and names (rename once, after the column loop)
      for (i in 1:ncol(tables)) {
        tables[,i] <- as.character(tables[,i])
      }
      names(tables) <- tolower(colnames(tables))
      names(tables)[6] <- "AwayHome"
      names(tables)[8] <- "WinLoss"
      names(tables)[9] <- "goals"

      # fix the columns - non-numeric entries become NA, with coercion warnings
      for (i in c(1:2, 9:19)) {
        tables[,i] <- as.numeric(tables[, i])
      }
      str(tables)
      tables$year <- S
      ds <- rbind.fill(ds, tables)

      # BE KIND when scraping
      Sys.sleep(10)
    }

    with(ds, table(year))
    head(ds, n=30)
    dim(ds)
    # drop the rows whose rank did not convert to a number
    ds <- ds[!is.na(ds$rk), ]
    dim(ds)
    head(ds, n=30)
    save(ds, file="Horton.Rdata")

    # Need to change the date to an actual date in R
    # ymd() stands in for the older parse_date() call, which current
    # lubridate releases no longer provide
    str(ds)
    ds$date <- ymd(ds$date)
    str(ds)

    # Put every season on the same axis by setting all dates to the same arbitrary year
    # Set the last months of the season to the year plus 1 so the dates are in "order" when plotted
    ds$date <- update(ds$date, year=2010)
    ds$date[month(ds$date) < 10] <- update(ds$date[month(ds$date) < 10], year=2011)
    head(ds, n=40)
    # Help received from
    # http://stackoverflow.com/questions/5494216/extract-date-in-r

    # add cumulative goals by season and make a new dataframe
    gamelog <- ddply(ds, .(year), transform, cumegoals = cumsum(goals))

    # plot the data
    ggplot(gamelog, aes(x=date, y=cumegoals)) + geom_point() + geom_line() +
      facet_wrap(~year, ncol=1)
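The date handling in the script is the least obvious step, so here is a small standalone sketch of the same idea. It assumes a season runs from October into the following spring, and it uses base R instead of lubridate so it runs on its own. The helper name `to_season_axis` and the sample dates are hypothetical, not part of the original script.

```r
# Hypothetical helper: map any game date onto one shared "season" axis.
# Games from October to December go to start_year, games from January to
# September go to start_year + 1, so every season plots in playing order.
to_season_axis <- function(d, start_year = 2010) {
  m  <- as.integer(format(d, "%m"))
  yr <- ifelse(m >= 10, start_year, start_year + 1)
  as.Date(paste(yr, format(d, "%m-%d"), sep = "-"), format = "%Y-%m-%d")
}

# Sample dates from different real-world years (made up for illustration)
dates   <- as.Date(c("2005-10-05", "2006-01-15", "2008-12-31", "2009-04-10"))
aligned <- to_season_axis(dates)
print(aligned)
```

Because every season now shares the same October-to-spring range, the facets line up and an early-season slump in one year can be compared directly with another. A game played on 29 February would need special handling, since 2011 is not a leap year.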
