(This article was first published on **Fear and Loathing in Data Science**, and kindly contributed to R-bloggers)

Spring is at hand, and it is a time of renewal, March Madness, and settling scores in the NHL. There are many scores to be settled: Flyers vs. Penguins, Blackhawks vs. Red Wings, Leafs vs. Habs and pretty much everyone else vs. the Bruins. Like any fire and brimstone hockey fan, clutching my old testament (a well-worn VHS copy of Slapshot), I will be watching with passionate interest as the NHL unleashes hellish intensity down the home stretch. What does it take to win? I stumbled upon team statistics on nhl.com and decided to have a look with some basic linear regression. I created a simple .csv file of team stats with 24 variables for each of the 30 teams as of the morning of March 16.

[1] "Team"            "gp"              "wins"
[4] "losses"          "ot_losses"       "points"
[7] "reg_ot"          "home_row"        "road_row"
[10] "point_percent"   "g_game"          "ga_game"
[13] "fiveon5_ratio"   "play_percent"    "pk_percent"
[16] "shots_game"      "sa_game"         "sfirst_percent"
[19] "trfirst_percent" "lfirst_percent"  "lsecond_percent"
[22] "win_outshoot"    "win_outshot"     "fo_percent"
[25] "pim"

The dependent variable here will be "point_percent", which is the percentage of available points a team achieves per game played. Since teams have played different numbers of games, and because of the NHL point system (I won't get into that here), it is the best reflection of team performance. I'll discuss other variables in a minute.
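For context, here is a minimal sketch of how "point_percent" falls out of the NHL point system (two points for a win, one point for an overtime or shootout loss, zero for a regulation loss). The team records below are made up purely for illustration:

```r
# NHL point system: 2 points per win, 1 per OT/shootout loss,
# 0 for a regulation loss. Made-up records for two teams:
gp        <- c(68, 67)
wins      <- c(45, 30)
ot_losses <- c(5, 8)

points        <- 2 * wins + ot_losses   # standings points earned
point_percent <- points / (2 * gp)      # share of points available
round(point_percent, 3)
```

Dividing by 2 * gp normalizes for teams having played different numbers of games, which is why it works better than raw points as a response variable.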

Let’s get rid of unneeded variables…

> nhl.sub = nhl.3.16[ ,c(1,10:25)]

> #examine a correlation plot

> require(corrplot)

> nhl.corr = cor(nhl.sub[2:17])

> corrplot(nhl.corr, method="ellipse")

Setting aside goals scored and allowed per game (g_game, ga_game), we see that point_percent is highly correlated with the ratio of goals scored to goals allowed when teams are at full strength (fiveon5_ratio), winning percentage when scoring first (sfirst_percent), and winning percentage both when outshooting an opponent (win_outshoot) and when being outshot (win_outshot). I was surprised that penalty minutes per game (pim) was not correlated with anything.

> #re-make nhl.sub with no goals or goals allowed per game

> nhl.sub = nhl.3.16[ ,c(10,13:25)]

> require(leaps)

> #use the leaps package for best subsets

> #the package defaults to 8 subsets, but I set to 10 for demonstration purposes

> bestsubsets = regsubsets(point_percent~., nhl.sub, nvmax=10)

> #create an object of regsubsets for further evaluation and model selection

> reg.summary = summary(bestsubsets)

> names(reg.summary)

[1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"

> #plot of residual sum of squares

> plot(reg.summary$rss, xlab="Number of Variables", ylab="RSS", type="l")

> #looks like somewhere between 4 and 7 variables would be an optimal model...plot Mallows' Cp next

> plot(reg.summary$cp, xlab="Number of Variables", ylab="Cp", type="l")

> #5 variables? let’s confirm…

> which.min(reg.summary$cp )

[1] 5

> plot(bestsubsets, scale="Cp")

The way to read this plot is to look at the top row (the lowest Cp); a black box means the variable on the x axis is included in that model. So here we see that the 5-on-5 goal ratio, power play percentage, penalty kill percentage, winning percentage when scoring first and winning percentage when trailing after the first period make the cut.
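For anyone curious what regsubsets is minimizing under the hood, Mallows' Cp for a candidate model is its RSS scaled by the full model's residual variance, penalized by model size. A minimal base-R sketch, using the built-in mtcars data as a stand-in since the NHL .csv isn't reproduced here:

```r
# Mallows' Cp = RSS_p / s^2 - n + 2p, where s^2 comes from the
# full model and p counts parameters (including the intercept).
# mtcars is a stand-in dataset; any lm fit works the same way.
full  <- lm(mpg ~ ., data = mtcars)
small <- lm(mpg ~ wt + hp, data = mtcars)

s2 <- summary(full)$sigma^2          # residual variance, full model
n  <- nrow(mtcars)
p  <- length(coef(small))            # parameters incl. intercept
cp <- sum(resid(small)^2) / s2 - n + 2 * p
cp
```

A good model has Cp close to p; regsubsets simply reports this for the best subset of each size so you can pick the minimum, as which.min() does above.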

> #examine the coefficients

> coef(bestsubsets, 5)

   (Intercept)  fiveon5_ratio   play_percent     pk_percent
  -0.220482142    0.180024350    0.004574047    0.002724056
sfirst_percent trfirst_percent
   0.303930780    0.277814196
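A fitted point percentage is just the linear combination of these coefficients with a team's stats. A quick sketch using the coefficients above; the team's stat values here are hypothetical, and the units (percentages on a 0-100 vs. 0-1 scale) are my assumption from the coefficient magnitudes:

```r
# Coefficients reported by coef(bestsubsets, 5) above
b <- c(-0.220482142,  # (Intercept)
        0.180024350,  # fiveon5_ratio
        0.004574047,  # play_percent
        0.002724056,  # pk_percent
        0.303930780,  # sfirst_percent
        0.277814196)  # trfirst_percent

# Hypothetical team: 5-on-5 ratio 1.1, PP 20%, PK 82%,
# wins 70% when scoring first, 30% when trailing after 1st.
# Scales (0-100 for PP/PK, 0-1 for the win percentages) are
# assumed, not stated in the original post.
x <- c(1, 1.1, 20, 82, 0.70, 0.30)

pred <- sum(b * x)   # predicted point_percent, roughly 0.588
```

predict(nhl.lm) further down does exactly this for every team at once.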

Here is some additional analysis I did to look at an ANOVA table and examine the residuals, i.e. to confirm our linear model assumptions; I won't bore you with the details:

> nhl.lm = lm(point_percent ~ fiveon5_ratio + play_percent + pk_percent + sfirst_percent + trfirst_percent, data=nhl.sub)

> anova(nhl.lm)

> summary(nhl.lm)

> plot(nhl.lm)

> #create a plot of actual versus predicted with team names for your viewing pleasure

> nhl.3.16["pred"] = NA

> nhl.3.16$pred = predict(nhl.lm)

> require(ggplot2)

> p = ggplot(nhl.3.16, aes(x=pred, y=point_percent, label=Team))

> p + geom_point() + geom_text(size=3, hjust=-.1, vjust=0)

By the way, the adjusted R-squared was 0.9614. Interesting how we see the elite teams like St. Louis and Anaheim clustering together and the has-been teams clustering as well. Teams like Tampa, Washington and Dallas will all be fighting it out for those playoff spots to the glorious and bitter end.

Mahalo,

Cory
