NHL Statistics – Goals scored by age

June 26, 2011

(This article was first published on Rock 'n' R » R, and kindly contributed to R-bloggers)

NHL Statistics, part 1

Goals scored by age

The Data Twirling blog gave instructions on how to get NHL statistics from the Hockey-Reference website. I saw an opportunity to learn R and statistics using data I already know and understand, which makes it easier to spot when my graphs show something wrong.

In hockey, scoring is one of the most important things, and everybody loves scorers, at least as long as they play for their favourite team. After Teemu Selänne of the Anaheim Ducks played a magnificent season at the age of 40, I wanted to know how much age affects scoring, and why.


The upper chart shows the 2010-2011 season and the lower chart shows seasons 1960-2011. The red dots are the median goals scored at each age, and the red line is the mean of those per-age medians (that is what the code below computes). The blue dots are the maximum goals scored at each age, and the blue line marks 50 goals, which is something of a milestone for scoring in one season.

Before publishing this I went through many different combinations. One interesting finding was that in seasons 2000 to 2011, players aged 26 to 29 seemed to do much worse, by median goals, than players aged 30 to 33.

I was about to conclude that players have their best years after they turn 30 (I did not yet have the maximum goals in the chart). Then I thought about it and realised there is a reason why players under 20 and over 30 have higher medians. An 18-year-old who plays in the NHL has to be good or he would not be there. So the median rises when there are only a few players of a given age and they are all the best of their age group. From age 20 onwards, more and more ordinary players come along, because teams need players in addition to their superstars. The same goes when a player turns 30: if you are not good enough, you are traded away or sent down to the AHL and replaced with younger talent. So only a few good players stay, and the medians keep rising. In their 40s there are players who play because they are good (like our own Teemu Selänne) and players who are loyal franchise players whom nobody will release or trade.
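One way to check this explanation would be to put the number of players at each age next to the median. The sketch below is not part of the original analysis: it uses made-up toy data in place of the scraped data set, and the names `toy` and `check` are only illustrative. If the explanation holds, the ages with the highest medians should also be the ages with the fewest players.

```r
# Toy data standing in for the scraped skater table (values are made up).
# `age` and `g` mirror the column names used in the scripts below.
toy <- data.frame(
  age = c(18, 18, 24, 24, 24, 24, 24, 35, 35),
  g   = c(20, 30,  2,  5, 10, 15, 40, 18, 25)
)

# Number of players and median goals for each age
players  <- tapply(toy$g, toy$age, length)
g_median <- tapply(toy$g, toy$age, median)

# Combine into one table so group size and median can be read side by side
check <- data.frame(
  age      = as.numeric(names(players)),
  players  = as.vector(players),
  g_median = as.vector(g_median)
)
print(check)
```

With the real data, a small `players` count next to a high `g_median` would point to the selection effect described above rather than to players actually getting better with age.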

The basic code is adapted from the Data Twirling blog I mentioned at the beginning; I only added my own parts, so credit goes there.

SkaterStats.R is from the blog and I made no changes there

#######################################################################################
# Function to scrape season skater statistics from Hockey-reference.com
#######################################################################################
GrabSkaters <- function(S) {

# The function takes parameter S which is a string and represents the Season
# Returns: data frame

require(XML)

## create the URL
URL <- paste("http://www.hockey-reference.com/leagues/NHL_",
S, "_skaters.html", sep="")

## grab the page -- the table is parsed nicely

#for reading from Internet
tables <- readHTMLTable(URL)

#for reading local file so the servers won't get extra load
#tables <- read.csv('NHL.csv')

ds.skaters <- tables$stats

## determine if the HTML table was well formed (column names are the first record)
## can either read in directly or need to force column names

## I don't like dealing with factors if I don't have to
## and I prefer lower case
for(i in 1:ncol(ds.skaters)) {
  ds.skaters[,i] <- as.character(ds.skaters[,i])
}
## lower-case the column names once, after the loop
names(ds.skaters) <- tolower(colnames(ds.skaters))

## fix a couple of the column names
colnames(ds.skaters)
## names(ds.skaters)[10] <- "plusmin"
names(ds.skaters)[11] <- "plusmin"
names(ds.skaters)[18] <- "spct"

## finally fix the columns - NAs forced by coercion warnings
for(i in c(1, 3, 6:18)) {
ds.skaters[,i] <- as.numeric(ds.skaters[, i])
}

## convert toi to seconds, and seconds/game
## ds.skaters$seconds <- (ds.skaters$toi*60)/ds.skaters$gp

## remove the header and totals row
ds.skaters <- ds.skaters[!is.na(ds.skaters$rk), ]
## ds.skaters <- ds.skaters[ds.skaters$tm != "TOT", ]

## add the year
ds.skaters$season <- S

## return the dataframe
return(ds.skaters)
}
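A scraper like this stops with an error when a season page is missing or its table cannot be read; the second script below works around this by leaving out 2005, for which no dataset exists. Another option might be to wrap the call in `tryCatch()` and skip seasons that fail. This is only a sketch, not code from either blog: `safe_grab` and `fake_grab` are hypothetical names, and `fake_grab` is a stand-in for `GrabSkaters` so that the example runs without network access.

```r
# Hypothetical stand-in for GrabSkaters(): fails for "2005", otherwise
# returns a one-row data frame. It exists only so the sketch runs offline.
fake_grab <- function(S) {
  if (S == "2005") stop("no table found for season ", S)
  data.frame(player = "Example Player", g = 10, season = S,
             stringsAsFactors = FALSE)
}

# Hypothetical wrapper: returns NULL instead of stopping when a season fails
safe_grab <- function(S, grab = fake_grab) {
  tryCatch(
    grab(S),
    error = function(e) {
      message("Skipping season ", S, ": ", conditionMessage(e))
      NULL
    }
  )
}

seasons <- as.character(2004:2006)
results <- lapply(seasons, safe_grab)

# Drop the failed seasons (NULL entries) and stack the rest
results  <- Filter(Negate(is.null), results)
combined <- do.call(rbind, results)
print(combined$season)
```

With the real function one would pass `grab = GrabSkaters` and keep the pause between requests, so the loop could run over every year without hard-coding the gap.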

median_of_goals.R is partly from the blog and partly my writing

## Creates a plot of goal medians by age of wanted season.
## uses SkaterStats.R by Data Twirling blog (http://www.brocktibert.com/blog/)

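# Load the packages used below and source the file with the GrabSkaters function
library("ggplot2")
library("sqldf")   # provides sqldf(), used further down to select by season
source("SkaterStats.R")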

#-----------------------------------------------------------------------
# Use the function to loop over the seasons and piece together
#-----------------------------------------------------------------------

## define the seasons -- the 2005 dataset doesn't exist
## if I was a good coder I would trap the error, but this works
SEASON <- as.character(c(1960:2004,2006:2011))

## create an empty dataset that we will append to
dataset <- data.frame()

## loop over the seasons, use the function to grab the data
## and build the dataset
for (S in SEASON) {

require(plyr)

temp <- GrabSkaters(S)
dataset <- rbind.fill(dataset, temp)
print(paste("Completed Season ", S, sep=""))

## pause the script so we don't kill their servers
Sys.sleep(10)

}

dataset <- dataset[dataset$tm != 'TOT', ]

## UNTIL HERE, CODE IS FROM DATA TWIRLING BLOG
## STARTING FROM HERE IS MY CODE

## select by season - I have both all time and latest season stats done at once
alltime <- sqldf("SELECT * FROM dataset")
age2011 <- sqldf("SELECT * FROM dataset WHERE season=2011")

## sort both data frames by age
age2011 <- age2011[order(age2011$age), ]
alltime <- alltime[order(alltime$age), ]

## Create a dataframe with unique ages
dfage <- data.frame(unique(age2011$age))
dfage.alltime <- data.frame(unique(alltime$age))

## rename column
names(dfage) <- "age"
names(dfage.alltime) <- "age"

# Count medians of goals by age
gmedians <- tapply(age2011$g, age2011$age, median)
alltime.gmedians <- tapply(alltime$g, alltime$age, median)

# Add gmedians to dfage data frame
dfage$gmedians <- gmedians
dfage.alltime$gmedians <- alltime.gmedians

# Modify the plot theme: black-and-white theme with a plain white panel background
th = theme_bw()
th$panel.background = theme_rect(fill = "white", colour = NA)
theme_set(th)

#Count maximum scored goals by age
gmax <-tapply(age2011$g, age2011$age, max)
dfage$gmax <- gmax

alltime.gmax <-tapply(alltime$g, alltime$age, max)
dfage.alltime$gmax <- alltime.gmax

# Create the plot
plot <- ggplot(dfage, aes(x=age, y=gmedians)) + geom_point(colour="red") +
  geom_point(aes(y=gmax), colour="blue") +
  geom_hline(yintercept = mean(dfage$gmedians), colour="red", size=0.5) +
  geom_hline(yintercept = 50, colour="blue", size=0.5) +
  scale_x_continuous("Age", breaks=c(min(dfage$age):max(dfage$age))) +
  scale_y_continuous("Median of goals", breaks=c(min(dfage$gmedians):max(dfage$gmax))) +
  opts(title="NHL 2011 Season - Median of goals by age")

alltime.plot <- ggplot(dfage.alltime, aes(x=age, y=gmedians)) + geom_point(colour="red") +
  geom_point(aes(y=gmax), colour="blue") +
  geom_hline(yintercept = mean(dfage.alltime$gmedians), colour="red", size=0.5) +
  geom_hline(yintercept = 50, colour="blue", size=0.5) +
  scale_x_continuous("Age", breaks=c(min(dfage.alltime$age):max(dfage.alltime$age))) +
  scale_y_continuous("Median of goals", breaks=c(min(dfage.alltime$gmedians):max(dfage.alltime$gmax))) +
  opts(title="NHL 1960-2011 Seasons - Median of goals by age")
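The script above uses `theme_rect()` and `opts()`, which belonged to ggplot2 when this was written. Later ggplot2 releases replaced them (with `element_rect()` inside `theme()`, and with `labs()` or `ggtitle()` for titles). For readers on a newer version, here is a rough sketch of the same kind of chart. It is an assumption-laden rewrite, not the code that produced the charts in this post: it runs on made-up toy data, the names `toy` and `by_age` are illustrative, and it builds the per-age table with base R `aggregate()` and `merge()` instead of `sqldf` and `tapply`.

```r
# Toy data standing in for the scraped `dataset` (values are made up)
toy <- data.frame(
  age = c(18, 18, 24, 24, 24, 24, 24, 35, 35),
  g   = c(20, 30,  2,  5, 10, 15, 40, 18, 25)
)

# Per-age median and maximum in base R, merged into one data frame
g_median <- aggregate(g ~ age, data = toy, FUN = median)
g_max    <- aggregate(g ~ age, data = toy, FUN = max)
by_age   <- merge(g_median, g_max, by = "age",
                  suffixes = c("_median", "_max"))

# Draw the chart only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(by_age, aes(x = age, y = g_median)) +
    geom_point(colour = "red") +                     # median goals per age
    geom_point(aes(y = g_max), colour = "blue") +    # maximum goals per age
    geom_hline(yintercept = mean(by_age$g_median), colour = "red") +
    geom_hline(yintercept = 50, colour = "blue") +   # 50-goal milestone
    labs(x = "Age", y = "Goals",
         title = "Median and maximum goals by age (toy data)") +
    theme_bw()
  print(p)
}
```

Calling `print(p)` explicitly matters when the code is run as a script, because a ggplot object that is only assigned to a variable is never drawn.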
