Pro Football Data

December 1, 2012

(This article was first published on PirateGrunt » R, and kindly contributed to R-bloggers)

I’ve made the acquaintance of a group of data analysts here in the Triangle and have agreed to arrange an outing to see the Durham Bulls, the local minor league baseball team. Because it’s for stat nerds and because I was curious, I went looking for some baseball data to analyze. I found loads of it here, but soon got distracted by the presence of NFL statistics. The season is already well underway, but I thought it might be fun to try to build a predictive model for the sport.

The first step is to get some data. Here, I use an R function to pull HTML tables from the site.

library(XML)        # readHTMLTable
library(lubridate)  # mdy, month, year

GetGamesHistory = function(FirstYear = 1985, LastYear = 2011)
{
  games.URL.stem = "http://www.pro-football-reference.com/years/"

  for (year in FirstYear:LastYear)
  {
    URL = paste(games.URL.stem, year, "/games.htm", sep="")

    games = readHTMLTable(URL)

    dfThisSeason = games[[1]]

    # Clean up the df
    dfThisSeason = subset(dfThisSeason, Week!="Week")
    dfThisSeason = subset(dfThisSeason, Week!="")
    dfThisSeason$Date = as.character(dfThisSeason$Date)
    dfThisSeason$GameDate = mdy(paste(dfThisSeason$Date, year))

    year(dfThisSeason$GameDate) = with(dfThisSeason, ifelse(month(GameDate) <=6, year(GameDate)+1, year(GameDate)))

    if (year == FirstYear)
    {
      dfAllSeasons = dfThisSeason
    } else {
      dfAllSeasons = rbind(dfAllSeasons, dfThisSeason)
    }

  }

  dfAllSeasons = dfAllSeasons[,c(14, 1, 5, 7, 8, 9)]

  colnames(dfAllSeasons) = c("GameDate", "Week", "Winner", "Loser", "WinnerPoints", "LoserPoints")

  dfAllSeasons$Winner = as.character(dfAllSeasons$Winner)
  dfAllSeasons$Loser = as.character(dfAllSeasons$Loser)
  dfAllSeasons$WinnerPoints = as.integer(as.character(dfAllSeasons$WinnerPoints))
  dfAllSeasons$LoserPoints = as.integer(as.character(dfAllSeasons$LoserPoints))
  dfAllSeasons$ScoreDifference = dfAllSeasons$WinnerPoints - dfAllSeasons$LoserPoints

  dfAllSeasons = subset(dfAllSeasons, !is.na(ScoreDifference))

  return (dfAllSeasons)

}
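The year adjustment inside the loop is the least obvious step: a season that starts in September runs into January and February of the next calendar year, so games in the first half of the year get bumped forward by one. Here is a minimal base R sketch of the same idea. The sample dates and the "Month Day" format are assumptions for illustration, and `%B` parsing assumes an English locale.

```r
# One season label, but games that fall in two calendar years.
season = 2011

# Hypothetical "Month Day" strings, assumed to resemble the scraped Date column.
game_dates = c("September 8", "December 25", "January 1", "February 5")

# First pass: attach the season year to every date, as the function does.
parsed = as.Date(paste(game_dates, season), format = "%B %d %Y")

# Games in January through June belong to the following calendar year.
month_num = as.integer(format(parsed, "%m"))
calendar_year = season + ifelse(month_num <= 6, 1L, 0L)

# Second pass: rebuild the dates with the corrected year.
fixed = as.Date(paste(game_dates, calendar_year), format = "%B %d %Y")

print(data.frame(game_dates, parsed, fixed))
```

The function above gets the same result by assigning to `year()` from lubridate instead of parsing twice.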


So I wrote this code about a week ago and already I can see that I don’t like it. For one, I try to avoid using loops in R unless absolutely necessary. Often I’ll start out with one just to get going, but usually I find that it can be replaced with one of the apply functions or something similarly succinct. Two, I need to better understand the behavior of the readHTMLTable function. I remember having gone a couple of rounds with the points data, which is read in as a factor. This leads to the extremely ugly bit of code where I convert it to a character and then to an integer. If anyone has a better way, I’m all ears. Three, I need to revisit the basic idea of extracting columns by name. Extraction by number is dangerous and confusing. Finally, I’d like to revise the data cleansing so that each game is listed with its home team, visitor and winner. That would make it easier to test whether or not a home field advantage exists.
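As a rough sketch of the first three points, the loop can become an `lapply` over seasons followed by a single `rbind`, with columns picked by name and the points kept as character so they convert to integer in one step. This is not a tested replacement for the function above. `MakeToySeason` is a hypothetical stand-in for the scrape, and the column names `PtsW`, `PtsL` and `Season` are assumptions rather than the site's actual headers.

```r
# Hypothetical stand-in for one scraped season; names and scores are made up.
MakeToySeason = function(year) {
  data.frame(
    Week = c("1", "Week", "2"),               # middle row mimics a repeated header
    Winner = c("Team A", "Winner", "Team B"),
    Loser = c("Team B", "Loser", "Team A"),
    PtsW = c("24", "PtsW", "31"),
    PtsL = c("17", "PtsL", "10"),
    Season = year,
    stringsAsFactors = FALSE                  # keep text as character, not factor
  )
}

CleanSeason = function(df) {
  # Drop repeated header rows and blank separator rows.
  df = df[df$Week != "Week" & df$Week != "", ]
  # Select columns by name rather than by position.
  df = df[, c("Season", "Week", "Winner", "Loser", "PtsW", "PtsL")]
  # Character columns convert directly; no factor detour needed.
  df$PtsW = as.integer(df$PtsW)
  df$PtsL = as.integer(df$PtsL)
  df$ScoreDifference = df$PtsW - df$PtsL
  df
}

# One cleaned data frame per season, then a single rbind at the end.
seasons = lapply(2010:2011, function(y) CleanSeason(MakeToySeason(y)))
dfAllSeasons = do.call(rbind, seasons)
rownames(dfAllSeasons) = NULL

print(dfAllSeasons)
```

With the real scrape, passing `stringsAsFactors = FALSE` to `readHTMLTable` should have the same effect as the toy data frame here, though I have not checked how it behaves on this particular table.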

All that understood, the code works and gives me piles of data. How I look at it will be the subject of the next post.

