If you are following my progress in the Data Science Learning Club, you might know that I am using NFL data for the tasks. For predicting sports events, I think it is important to have statistics not only about the players, teams and previous games but also about the weather. From my time as a soccer player I can tell you that it makes quite a difference whether it is snowing, 30°C or more, or the weather is moderate. One could argue that the weather influences both teams and therefore no one has an advantage, but I think that everyone responds differently to different conditions.

The data source

After only a short search, I found a website called NFLWeather, which provides weather forecasts for every match back to 2009.

Web scraping: rvest

I had looked into web scraping before, but it seemed like a dirty and cumbersome task to me.

In my experience, almost everything related to data has already been implemented nicely by someone in R, so I wanted to give it another try. I found the package rvest by @hadleywickham, and his name is always a very good sign with respect to R package quality.

The code

Checking out their archive, I worked out the structure of their links and saw that they go back to 2009. So I wrote this function to parse the page, find the first table (there is only one), and convert it to a data.frame:


The function got a lot longer than anticipated, but let me explain it:

  • Parameters: year and week
  • start_url is built from base_url, which is always the same, and the two parameters. The only exception is the year 2010, where for no apparent reason “-2” is appended to each link.
  • We have to skip the pro-bowl week in 2013, because that page does not exist.
  • Then we have some error handling because other pages might not exist or might become unavailable.
  • Line 13: I parse the page (html is actually deprecated and read_html should be used, but I am currently running an older version of R).
  • Line 14: I use the magrittr pipe operator as used in the package examples, but this can also be done without it. Just see the code below.
  • Line 15: I create a data.frame by selecting only the columns I need and adding the Year and Week information to each row.
html_table(html_nodes(page, "table")[[1]])
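
The steps in the list above can be sketched roughly as follows. This is a minimal reconstruction under stated assumptions, not the original listing, so the line numbers mentioned in the list do not apply to it. The function names build_weather_url and load_weather, the keep argument, the placeholder base_url and the "pro-bowl" week identifier are all hypothetical. The sketch uses read_html instead of the deprecated html and the pipe-free form of the table extraction.

```r
# Build the page URL for one year/week combination.
# ASSUMPTION: base_url and the path layout are placeholders for illustration,
# not the confirmed link scheme of the website.
build_weather_url <- function(year, week, base_url = "https://example.com/week/") {
  suffix <- if (year == 2010) "-2" else ""  # 2010 links carry an extra "-2"
  paste0(base_url, year, "/", week, suffix, "/")
}

# Load the weather table for one year/week; returns NULL if the page is skipped
# or cannot be loaded.
# ASSUMPTION: "pro-bowl" is a hypothetical identifier for the pro-bowl week.
load_weather <- function(year, week, keep = NULL) {
  # The pro-bowl page for 2013 does not exist, so skip it.
  if (year == 2013 && week == "pro-bowl") return(NULL)

  url <- build_weather_url(year, week)

  tryCatch({
    page <- rvest::read_html(url)
    # There is only one table on the page, so take the first one.
    tbl <- rvest::html_table(rvest::html_nodes(page, "table")[[1]])
    tbl <- as.data.frame(tbl)
    # Optionally keep only the columns of interest (keep = NULL keeps all).
    if (!is.null(keep)) tbl <- tbl[, keep, drop = FALSE]
    # Add Year and Week to every row.
    data.frame(Year = year, Week = week, tbl,
               check.names = FALSE, stringsAsFactors = FALSE)
  }, error = function(e) {
    # Other pages might not exist or might become unavailable.
    message("Skipping ", year, " / ", week, ": ", conditionMessage(e))
    NULL
  })
}
```

Returning NULL for skipped or failed pages keeps the caller simple: missing pages can be filtered out before the individual tables are combined.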

This is how I call the code to build one large data.frame:
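
A minimal sketch of such a call, assuming a loader function that takes a year and a week and returns either a data.frame or NULL. The names collect_weather and fake_loader and the week identifiers are hypothetical; the stand-in loader only exists so that the example runs without network access.

```r
# Call loader(year, week) for every year/week combination and stack the results.
collect_weather <- function(years, weeks, loader) {
  grid <- expand.grid(week = weeks, year = years, stringsAsFactors = FALSE)
  pieces <- Map(loader, grid$year, grid$week)
  pieces <- Filter(Negate(is.null), pieces)  # drop skipped or failed pages
  do.call(rbind, pieces)
}

# Stand-in loader (hypothetical) that mimics one missing page.
fake_loader <- function(year, week) {
  if (year == 2013 && week == "pro-bowl") return(NULL)
  data.frame(Year = year, Week = week, Forecast = "n/a",
             stringsAsFactors = FALSE)
}

weather <- collect_weather(2012:2013, c("week-1", "pro-bowl"), fake_loader)
nrow(weather)
```

In real use, the scraping function would be passed as loader in place of the stand-in, with the full range of years and weeks.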


The output

The output is a data.frame with 2832 rows just like the ones in the screenshot.

[Screenshot of the resulting data.frame]

Download complete code

The complete source can be downloaded below.



