Using maps and ggplot2 to visualize college hockey championships

[This article was first published on Decisions and R, and kindly contributed to R-bloggers.]

Short:
I plot the frequency of college hockey championships by state using the maps and ggplot2 packages.

Note: this example is based heavily on the example provided at
http://www.dataincolour.com/2011/07/maps-with-ggplot2/

Data reference:
http://en.wikipedia.org/wiki/NCAA_Men%27s_Ice_Hockey_Championship

Question of interest
As a good Minnesotan, I've believed for quite some time that the colder, Northern states enjoy a competitive advantage when it comes to college hockey. Does this advantage exist? How strong is it?

I first downloaded data from Wikipedia on past winners of the hockey championship, and saved the short list as a CSV file.

After saving the file, here's how the data look in R:

# Visualizing College Hockey Champions by State

# Author: Mark T Patterson Date: March 13, 2013


# Libraries:
library(ggplot2)
library(maps)

# Setting the working directory:
rm(list = ls())  # Clearing the workspace
setwd("C:/Users/Mark/Desktop/Blog/Data")

# Loading Data:


# Loading state championships data:
dat.state = read.csv("HockeyChampsByState.csv", header = TRUE)
dat.state$state = tolower(dat.state$state)
head(dat.state)

##           state titles
## 1      michigan     19
## 2 massachusetts     11
## 3      colorado      9
## 4  north dakota      7
## 5     minnesota      6
## 6     wisconsin      6
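If you don't have the CSV on hand, the six rows printed above can be typed in directly. This is only a partial stand-in: the name dat.state.demo is my own, and the full file contains more states than head() shows.

```r
# Partial stand-in for HockeyChampsByState.csv.
# Only the six rows shown in the head() output above are reproduced;
# the real file has additional states.
dat.state.demo <- data.frame(
    state  = c("michigan", "massachusetts", "colorado",
               "north dakota", "minnesota", "wisconsin"),
    titles = c(19, 11, 9, 7, 6, 6),
    stringsAsFactors = FALSE
)

head(dat.state.demo)
```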

Now that we've loaded the championship counts by state, we just need the mapping data. map_data("state") is a ggplot2 function that converts the state outlines from the maps package into a data frame. Here, we'll use the region column, which lists lowercase state names, to match our state championship data.
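Since the match is made on state names, it can be worth checking for names that don't line up before merging. Here's a minimal sketch of that check; the vector champ.names is a hypothetical stand-in for the state column, and "Massachussetts" is a deliberate misspelling added to show what a mismatch looks like.

```r
library(ggplot2)  # provides map_data()
library(maps)     # provides the "state" map database

# Hypothetical input: two correct names and one deliberate typo
champ.names <- tolower(c("Michigan", "North Dakota", "Massachussetts"))

# Region names used by the map data (all lowercase)
map.regions <- unique(map_data("state")$region)

# Names in our data with no matching region in the map
unmatched <- setdiff(champ.names, map.regions)
print(unmatched)
```

Any name that shows up in unmatched would end up as an extra row with no coordinates after the merge, so it's easier to fix the spelling first.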

# Creating mapping dataframe:
us.state = map_data("state")
head(us.state)

##     long   lat group order  region subregion
## 1 -87.46 30.39     1     1 alabama      <NA>
## 2 -87.48 30.37     1     2 alabama      <NA>
## 3 -87.53 30.37     1     3 alabama      <NA>
## 4 -87.53 30.33     1     4 alabama      <NA>
## 5 -87.57 30.33     1     5 alabama      <NA>
## 6 -87.59 30.33     1     6 alabama      <NA>


# Merging the two datasets:

dat.champs = merge(us.state, dat.state, by.x = "region", by.y = "state", 
    all = TRUE)

dat.champs <- dat.champs[order(dat.champs$order), ]
# mapping requires the same order of observations that appear in us.state

head(dat.champs)

##    region   long   lat group order subregion titles
## 1 alabama -87.46 30.39     1     1      <NA>     NA
## 2 alabama -87.48 30.37     1     2      <NA>     NA
## 3 alabama -87.53 30.37     1     3      <NA>     NA
## 4 alabama -87.53 30.33     1     4      <NA>     NA
## 5 alabama -87.57 30.33     1     5      <NA>     NA
## 6 alabama -87.59 30.33     1     6      <NA>     NA
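To see why the re-ordering step matters, here's a small sketch with made-up data (the names shape and counts, and the value 2, are hypothetical). merge() sorts its result on the key column, so the polygon vertices no longer come out in drawing order until we sort by order again.

```r
# Toy outline data: region "b" is listed first, then region "a"
shape <- data.frame(
    region = c("b", "b", "b", "a", "a", "a"),
    long   = c(0, 1, 1, 2, 3, 3),
    lat    = c(0, 0, 1, 0, 0, 1),
    order  = 1:6
)

# Toy counts: only region "a" has a value (hypothetical number)
counts <- data.frame(state = "a", titles = 2)

merged <- merge(shape, counts, by.x = "region", by.y = "state", all = TRUE)
merged$order    # row order after the merge, which sorts on region

# Restore the original drawing order
restored <- merged[order(merged$order), ]
restored
```

Regions without a match (here "b") keep their coordinates and get NA for titles, which is what we see for Alabama above.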

With the dat.champs data frame created, we're ready to plot:

# Plotting

# opts() has been removed from ggplot2; labs() and theme() replace it
ggplot(dat.champs, aes(x = long, y = lat, group = group, fill = titles)) +
    geom_polygon() +
    theme_bw() +
    labs(x = "", y = "", fill = "", title = "College Hockey Championships By State") +
    scale_fill_gradient(low = "#EEEEEE", high = "darkgreen") +
    theme(legend.position = "bottom", legend.direction = "horizontal")

[Figure: map of the United States, with each state shaded by its number of college hockey championships]

Having plotted the data, it's easy to see the effect of the Great Lakes region on hockey championships. With the exception of Colorado, only colder, Northern states have won titles.

Ways to improve this analysis
While we observe that college champions are clustered in the upper Midwest and the Northeast, several variables could explain the distribution. We might consider examining 1) state temperature (we might expect that colder temperatures lead to better performance, since teams in colder states get to practice more), 2) distance from the Great Lakes (this might be a proxy for the availability of ice), 3) distance from Canadian hockey cities (it's possible that hockey culture follows from Canadian or European immigration).

Beyond examining these possible factors, it'd be interesting to try other color schemes. I've adopted the same color scheme presented at http://www.dataincolour.com/2011/07/maps-with-ggplot2/ , but it would be good to have some familiarity with other schemes.
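As a starting point, here's a rough sketch of three alternative fill scales. It's not the original analysis: the data frame champs only contains the six rows shown in the head() output earlier (the full data has more states), the object names are my own, and the palettes and the grey used for missing values are arbitrary choices.

```r
library(ggplot2)
library(maps)

# Six rows copied from the head() output earlier; not the full data set
champs <- data.frame(
    region = c("michigan", "massachusetts", "colorado",
               "north dakota", "minnesota", "wisconsin"),
    titles = c(19, 11, 9, 7, 6, 6)
)

# Attach titles to the state outlines and restore the drawing order
map.df <- merge(map_data("state"), champs, by = "region", all.x = TRUE)
map.df <- map.df[order(map.df$order), ]

base.map <- ggplot(map.df, aes(x = long, y = lat, group = group, fill = titles)) +
    geom_polygon(colour = "white") +
    theme_bw() +
    labs(x = "", y = "", fill = "")

# 1) A different two-color gradient
map.navy <- base.map + scale_fill_gradient(low = "#EEEEEE", high = "navy")

# 2) The viridis continuous scale
map.viridis <- base.map + scale_fill_viridis_c(na.value = "grey90")

# 3) A ColorBrewer palette, light to dark
map.brewer <- base.map +
    scale_fill_distiller(palette = "YlOrRd", direction = 1, na.value = "grey90")

print(map.viridis)
```

Printing map.navy or map.brewer instead shows the other two versions; all three share the same underlying map.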
