Use rvest to scrape NFL weather data

January 7, 2016

If you are following my progress in the Data Science Learning Club, you might know that I am using NFL data for the tasks. For predicting sports events, I think it is important to have not only statistics about the players, teams and previous games, but also about the weather. From my time as a soccer player I can tell you that it makes quite a difference whether it is snowing, whether it is 30°C or more, or whether the weather is moderate. One could argue that the weather influences both teams equally and therefore no one has an advantage, but I think everyone responds differently to different conditions.

The data source

After searching for only a short time, I found a website called NFLWeather, which provides weather forecasts for every match back to 2009.

Web scraping: rvest

I had looked into web scraping before, but it seemed like a dirty and cumbersome task to me.

Since my experience is that almost everything related to data has been implemented nicely by someone in R, I wanted to give it another try. I found the package rvest by @hadleywickham, which is always a very good sign with respect to R package quality.
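To give you a quick idea of how rvest feels, here is a minimal sketch (the URL and the CSS selector are just placeholders, not part of my scraper): read a page, select nodes with a CSS selector, and extract their text.

library(rvest)

# parse a page (in newer rvest versions this is read_html())
page <- html("http://example.com")

# select all first-level headings and extract their text
page %>% html_nodes("h1") %>% html_text()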

The code

Checking out their archive, I figured out the structure of their links and that they go back to 2009. So I wrote this function to parse a page, find the first table (there is only one), and convert it to a data.frame:

load_weather <- function(year, week) {
  base_url <- "http://nflweather.com/week/"
  if (year == 2010) { # necessary because of different file naming
    start_url <- paste0(base_url, year, "/", week, "-2/")
  } else {
    start_url <- paste0(base_url, year, "/", week, "/")
  }
  if (year == 2013 && week == "pro-bowl") {
    return(NULL) # this page does not exist
  }
  tryCatch({
    # parse the page (html() is deprecated in newer rvest; use read_html())
    page <- html(start_url, encoding = "ISO-8859-1")
    # grab the first (and only) table and convert it to a data.frame
    table <- page %>% html_nodes("table") %>% .[[1]] %>% html_table()
    # keep only the needed columns and add Year and Week to each row
    table <- cbind("Year" = year, "Week" = week, table[, c("Away", "Home", "Forecast", "Extended Forecast", "Wind")])
    return(table)
  },
  error = function(e) {
    print(paste(e, "Year", year, "Week", week))
    return(NULL)
  })
}

The function got a lot longer than anticipated, but let me explain it:

  • Parameters: year and week.
  • start_url is built from base_url, which is always the same, plus the two parameters. The only difference is for the year 2010, where for no apparent reason “-2” is appended to each link (see the example URLs further below).
  • We have to skip the Pro Bowl week in 2013, because that page does not exist.
  • Then we have some error handling, because other pages might not exist or might become unavailable.
  • The html() call parses the page (html() is actually deprecated and read_html() should be used, but I currently have an older version of R running).
  • The next line uses the magrittr pipe operator, as in the package examples, but this can also be done without it. Just see the code below.
  • The cbind() call creates a data.frame, selecting only the columns I need and adding the Year and Week information to each row.
html_table(html_nodes(page, "table")[[1]])
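To make the URL scheme concrete, here is what paste0() produces for a regular year and for 2010 (“week-3” is just an arbitrary example):

paste0("http://nflweather.com/week/", 2011, "/", "week-3", "/")
# "http://nflweather.com/week/2011/week-3/"
paste0("http://nflweather.com/week/", 2010, "/", "week-3", "-2/")
# "http://nflweather.com/week/2010/week-3-2/"

A single call like load_weather(2014, "superbowl") should then return the table for one page, which is handy for testing before running the full loop.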

This is how I call the code to build one large data.frame:

# check.names = FALSE keeps "Extended Forecast" (with the space) as a column name,
# so that rbind() matches the columns returned by load_weather()
weather_data <- data.frame("Year" = integer(0), "Week" = character(0), "Away" = character(0), "Home" = character(0), "Forecast" = character(0), "Extended Forecast" = character(0), "Wind" = character(0), check.names = FALSE)
# years and weeks are defined in the complete code at the end of the post
for (y in years) {
  for (w in weeks) {
    weather_data <- rbind(weather_data, load_weather(y, w))
  }
}
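After the loop finishes, a quick sanity check shows whether everything was collected (just a suggested way to inspect the result):

nrow(weather_data)  # total number of scraped games
head(weather_data)  # first rows with Year, Week, Away, Home, Forecast, ...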

The output

The output is a data.frame with 2832 rows, like the ones shown in the screenshot below.

[Screenshot: the first rows of the resulting weather_data data.frame]
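Since scraping all seven years takes a while, it makes sense to save the result so it does not have to be scraped again; the file name is just a suggestion:

# write the scraped data to disk for later reuse
write.csv(weather_data, "nfl_weather.csv", row.names = FALSE)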

Complete code

The complete source is listed below.

library(rvest)
 
years <- 2009:2015
weeks <- c(paste0("pre-season-week-", 1:4), paste0("week-", 1:17), "wildcard-weekend", "divisional-playoffs", "conf-championships", "pro-bowl", "superbowl")
 
load_weather <- function(year, week) {
  base_url <- "http://nflweather.com/week/"
  if (year == 2010) { # necessary because of different file naming
    start_url <- paste0(base_url, year, "/", week, "-2/")
  } else {
    start_url <- paste0(base_url, year, "/", week, "/")
  }
  if (year == 2013 && week == "pro-bowl") {
    return(NULL) # this page does not exist
  }
  tryCatch({
    # parse the page (html() is deprecated in newer rvest; use read_html())
    page <- html(start_url, encoding = "ISO-8859-1")
    # grab the first (and only) table and convert it to a data.frame
    table <- page %>% html_nodes("table") %>% .[[1]] %>% html_table()
    # keep only the needed columns and add Year and Week to each row
    table <- cbind("Year" = year, "Week" = week, table[, c("Away", "Home", "Forecast", "Extended Forecast", "Wind")])
    return(table)
  },
  error = function(e) {
    print(paste(e, "Year", year, "Week", week))
    return(NULL)
  })
}
 
# check.names = FALSE keeps "Extended Forecast" as a column name so rbind() matches
weather_data <- data.frame("Year" = integer(0), "Week" = character(0), "Away" = character(0), "Home" = character(0), "Forecast" = character(0), "Extended Forecast" = character(0), "Wind" = character(0), check.names = FALSE)
for (y in years) {
  for (w in weeks) {
    weather_data <- rbind(weather_data, load_weather(y, w))
  }
}
 
 
#### code without pipe ####
 
html_table(html_nodes(page, "table")[[1]])
