If you are following my progress in the Data Science Learning Club, you might know that I am using NFL data for the tasks. For predicting sports events, I think it is important to have statistics not only about the players, teams and previous games but also about the weather. From my time as a soccer player I can tell you that it makes quite a difference whether it is snowing, 30°C or more, or the weather is moderate. One could argue that the weather influences both teams and therefore no one has an advantage, but I think that everyone responds differently to different conditions.
The data source
After only a short search, I found a website called NFLWeather, which provides weather forecasts for every match back to 2009.
Web scraping: rvest
I had looked into web scraping before, but it seemed like a dirty and cumbersome task to me.
Since, in my experience, almost everything related to data has been implemented nicely by someone in R, I wanted to give it another try. I found the package rvest, written by @hadleywickham; his authorship is always a very good sign with respect to R package quality.
Checking out the site's archive, I worked out the structure of its links and saw that they go back to 2009. So I wrote this function to parse a page, find the first table (there is only one), and convert it to a data.frame:
The function got a lot longer than anticipated, but let me explain it:
- Parameters: year and week
- start_url is built from the base_url that’s always the same and the two parameters. The only difference is for year 2010, where for no apparent reason “-2” is added to each link.
- We have to skip the pro-bowl week in 2013, because that page does not exist.
- Then we have some error handling because other pages might not exist or might become unavailable.
- Line 13: I parse the page (html is actually deprecated and read_html should be used instead, but I currently have an older version of R running).
- Line 14: I use the magrittr pipe operator as used in the package examples, but this can also be done without it. Just see the code below.
- Line 15: I create a data.frame by selecting only the columns I need and adding the Year and Week information to each row.
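The steps in the list can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a reproduction of my original function: the base URL, the week slugs such as "week-1" and "pro-bowl", and the exact position of the "-2" in the 2010 links are all assumed for illustration, and the demo table at the end is invented data. Column selection is left out because it depends on the real column names. The sketch uses read_html and shows the table extraction both with and without the pipe.

```r
library(rvest)  # re-exports read_html and the %>% pipe

# ASSUMPTION: base URL and link layout are illustrative only.
base_url <- "http://nflweather.com/week/"

# Build the page URL from the two parameters.
# ASSUMPTION: `week` is a slug such as "week-1", and the 2010 "-2" goes at the end.
build_url <- function(year, week) {
  suffix <- if (year == 2010) "-2" else ""
  paste0(base_url, year, "/", week, suffix, "/")
}

# Find the first table of a parsed page and convert it (with the pipe).
first_table <- function(page) {
  page %>% html_node("table") %>% html_table()
}

# The same step written without the pipe.
first_table_no_pipe <- function(page) {
  html_table(html_node(page, "table"))
}

load_weather <- function(year, week) {
  # Skip the page that is known not to exist (slug name is an assumption).
  if (year == 2013 && week == "pro-bowl") return(NULL)

  # Return NULL instead of stopping if a page is missing or unavailable.
  tryCatch({
    page <- read_html(build_url(year, week))
    weather <- first_table(page)
    weather$Year <- year   # add Year and Week to each row
    weather$Week <- week
    weather
  }, error = function(e) {
    message("Skipping ", year, " ", week, ": ", conditionMessage(e))
    NULL
  })
}

# Offline demo on an invented one-row table, so no network access is needed.
demo_page <- read_html(paste0(
  "<html><body><table>",
  "<tr><th>Away</th><th>Home</th><th>Forecast</th></tr>",
  "<tr><td>Team A</td><td>Team B</td><td>Snow</td></tr>",
  "</table></body></html>"
))
demo <- first_table(demo_page)
```

Returning NULL for skipped or failed pages keeps the caller simple: it can drop the NULL entries and bind the rest.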
This is how I call the code to build one large data.frame:
The output is a data.frame with 2832 rows just like the ones in the screenshot.
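For readers who want to try something similar, here is a minimal sketch of how per-week results could be combined into one data.frame. The names collect_weather and fake_loader are hypothetical, and the loader is a stand-in that returns invented rows instead of scraping anything; in practice it would be any function that takes a year and a week and returns a data.frame, or NULL for a page that was skipped.

```r
# Call `loader` for every year/week combination and stack the results.
# `loader` must return a data.frame, or NULL for pages that were skipped.
collect_weather <- function(years, weeks, loader) {
  grid <- expand.grid(week = weeks, year = years, stringsAsFactors = FALSE)
  pieces <- Map(loader, grid$year, grid$week)
  pieces <- Filter(Negate(is.null), pieces)  # drop skipped pages
  do.call(rbind, pieces)
}

# Stand-in loader with invented data (hypothetical, for illustration only).
fake_loader <- function(year, week) {
  if (year == 2013 && week == "pro-bowl") return(NULL)
  data.frame(Forecast = "n/a", Year = year, Week = week,
             stringsAsFactors = FALSE)
}

all_weather <- collect_weather(2012:2013, c("week-1", "pro-bowl"), fake_loader)
```

Collecting the pieces in a list and calling rbind once avoids growing the data.frame row by row inside a loop.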
Download complete code
The complete source can be downloaded below.