Use rvest to scrape NFL weather data

Posted on January 7, 2016 by Verena in R bloggers

[This article was first published on r-bloggers – verenahaunschmid, and kindly contributed to R-bloggers].

If you are following my progress in the Data Science Learning Club you might know that I am using NFL data for the tasks. For predicting sports events I think it is important to have statistics not only about the players, teams and previous games but also about the weather. From my time as a soccer player I can tell you that it makes quite a difference whether it is snowing, it is 30°C or more, or the weather is moderate. One could argue that the weather influences both teams and therefore no one has an advantage, but I think that everyone responds differently to different conditions.

The data source

After only searching for a short time, I found a website called NFLWeather which provides weather forecasts for every match back to 2009.

Web scraping: rvest

I have been looking into web scraping before, but it seemed like a dirty and cumbersome task to me.

Since, in my experience, almost everything related to data has already been implemented nicely by someone in R, I wanted to give it another try. I found the package rvest by @hadleywickham, which is always a very good sign with respect to R package quality.

The code

Checking out their archive, I worked out the structure of their links and saw that they go back to 2009. So I wrote this function to parse the page, find the first table (there is only one), and convert it to a data.frame:

load_weather<-function(year, week) {
  base_url<-"http://nflweather.com/week/"
  if (year == 2010) { # necessary because of different file naming
    start_url<-paste0(base_url, year, "/", week, "-2/")
  } else {
    start_url<-paste0(base_url, year, "/", week, "/")
  }
  if (year == 2013 && week == "pro-bowl") {
    return (NULL)
  }
  tryCatch ({  
    page<-html(start_url, encoding="ISO-8859-1") 
    table<-page %>% html_nodes("table")  %>% .[[1]] %>% html_table()
    table<-cbind("Year"=year, "Week"=week, table[,c("Away", "Home", "Forecast", "Extended Forecast", "Wind")])
    return(table)
  }, 
  
  error = function(e) { 
    print(paste(conditionMessage(e), "Year", year, "Week", week)) # use the function's own arguments, not the loop variables
    return(NULL)
  })
}

The function got a lot longer than anticipated, but let me explain it:

Parameters: year and week
start_url is built from the base_url that's always the same and the two parameters. The only difference is for year 2010, where for no apparent reason "-2" is added to each link.
We have to skip the pro-bowl week in 2013, because that page does not exist.
Then we have some error handling because other pages might not exist or might become unavailable.
The html() call: I parse the page (actually html is deprecated and read_html should be used, but I currently have an older version of R running).
The html_nodes() line: I use the magrittr pipe operator as used in the package examples, but this can also be done without it. Just see the code below.
The cbind() line: I create a data.frame, selecting only the columns I need and adding the Year and Week information to each row.

html_table(html_nodes(page, "table")[[1]])
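
The parsing step can be tried without touching the website. The following is only a minimal sketch: it assumes a recent rvest in which read_html() replaces html() and html_nodes() is still available, and the table contents (team names, forecast text, the extra "Details" column) are invented for illustration, not taken from NFLWeather.

```r
library(rvest)

# Invented one-row table standing in for a weekly page (an assumption).
sample_page <- paste0(
  "<html><body><table>",
  "<tr><th>Away</th><th>Home</th><th>Forecast</th>",
  "<th>Extended Forecast</th><th>Wind</th><th>Details</th></tr>",
  "<tr><td>Team A</td><td>Team B</td><td>45f Cloudy</td>",
  "<td>Overcast all day</td><td>10m NW</td><td>link</td></tr>",
  "</table></body></html>"
)

# read_html() accepts a literal HTML string as well as a URL.
page <- read_html(sample_page)

# Same expression as above: take the first table, without the pipe.
first_table <- html_table(html_nodes(page, "table")[[1]])

# Keep only the needed columns and prepend Year and Week to each row.
wanted <- c("Away", "Home", "Forecast", "Extended Forecast", "Wind")
weather <- cbind(Year = 2015, Week = "week-1", first_table[, wanted])
print(weather)
```

Against the real site, the string would be replaced by the URL built inside load_weather; everything after read_html() stays the same.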

This is how I call the code to build one large data.frame:

weather_data<-data.frame("Year"=integer(0), "Week"=character(0), "Away"=character(0), "Home"=character(0), "Forecast"=character(0), "Extended Forecast"=character(0), "Wind"=character(0))
for (y in years) {
  for (w in weeks) {
    weather_data<-rbind(weather_data, load_weather(y, w))
  }
}
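
Growing a data.frame with rbind inside a loop works fine for a few hundred pages, but one alternative is to collect the pieces in a list and bind them once at the end. The sketch below is an assumption-laden illustration, not the code used for this post: load_weather_stub is a hypothetical stand-in for load_weather that returns made-up rows so the example runs without network access.

```r
# Hypothetical stand-in for load_weather(): returns NULL for a "missing" page
# and a one-row data.frame otherwise (the row content is invented).
load_weather_stub <- function(year, week) {
  if (year == 2013 && week == "pro-bowl") {
    return(NULL)
  }
  data.frame(Year = year, Week = week, Home = "Team B", stringsAsFactors = FALSE)
}

# Small illustrative ranges, not the full 2009:2015 set.
years <- 2013:2014
weeks <- c("week-1", "pro-bowl")

# One row per (year, week) combination.
combos <- expand.grid(year = years, week = weeks, stringsAsFactors = FALSE)

# Load every combination into a list, then drop the NULL results.
pieces <- Map(load_weather_stub, combos$year, combos$week)
pieces <- Filter(Negate(is.null), pieces)

# Bind once at the end instead of growing the data.frame step by step.
weather_data <- do.call(rbind, pieces)
rownames(weather_data) <- NULL
print(weather_data)
```

Swapping load_weather_stub for load_weather and restoring the full years and weeks vectors should give the same result as the nested loop.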

The output

The output is a data.frame with 2832 rows just like the ones in the screenshot.

Download complete code

The complete source can be downloaded below.

library(rvest)

years<-2009:2015
weeks<-c(paste0("pre-season-week-", 1:4), paste0("week-", 1:17), "wildcard-weekend", "divisional-playoffs", "conf-championships", "pro-bowl", "superbowl")

load_weather<-function(year, week) {
  base_url<-"http://nflweather.com/week/"
  if (year == 2010) { # necessary because of different file naming
    start_url<-paste0(base_url, year, "/", week, "-2/")
  } else {
    start_url<-paste0(base_url, year, "/", week, "/")
  }
  if (year == 2013 && week == "pro-bowl") {
    return (NULL)
  }
  tryCatch ({  
    page<-html(start_url, encoding="ISO-8859-1") 
    table<-page %>% html_nodes("table")  %>% .[[1]] %>% html_table()
    table<-cbind("Year"=year, "Week"=week, table[,c("Away", "Home", "Forecast", "Extended Forecast", "Wind")])
    return(table)
  }, 
  
  error = function(e) { 
    print(paste(conditionMessage(e), "Year", year, "Week", week)) # use the function's own arguments, not the loop variables
    return(NULL)
  })
}

weather_data<-data.frame("Year"=integer(0), "Week"=character(0), "Away"=character(0), "Home"=character(0), "Forecast"=character(0), "Extended Forecast"=character(0), "Wind"=character(0))
for (y in years) {
  for (w in weeks) {
    weather_data<-rbind(weather_data, load_weather(y, w))
  }
}


#### code without pipe (replaces the piped line inside load_weather) #### 

# table<-html_table(html_nodes(page, "table")[[1]])
