Scrape Web data using R

August 13, 2010

(This article was first published on Brock's Data Adventure » R, and kindly contributed to R-bloggers)

Plenty of people have been scraping data from the web using R for a while now, but I just completed my first project and wanted to share the code with you. It was a little hard to work through some of the “issues”, but I had some great help from @DataJunkie on Twitter.

As an aside, if you are learning R and coming from another package like SPSS or SAS, I highly recommend following the #rstats hashtag on Twitter to be amazed by the kinds of data analysis going on right now.

One note: when I read in my table, it contained a weird set of characters. I suspect it is some sort of encoding issue, but luckily I was able to work around it by recoding the data from character factors to numbers using the stringr package and some basic regular expressions.
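
As a minimal sketch of the idea (the non-breaking spaces below are just a guess at what the junk bytes might be), a regular expression can pull the numeric part out of a messy factor:

library(stringr)

# hypothetical example: a factor whose levels carry leading junk characters
x <- factor(c("\u00A0\u00A012.5", "\u00A0\u00A0-3"))
as.numeric(str_match(as.character(x), "-?\\d{1,3}\\.?[0-9]*"))
# returns 12.5 and -3.0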

Bring on fantasy football!

################################################################
## Help from the following sources:
## @DataJunkie on twitter
## http://www.regular-expressions.info/reference.html
## http://stackoverflow.com/questions/1395528/scraping-html-tables-into-r-data-frames-using-the-xml-package
## http://stackoverflow.com/questions/2443127/how-can-i-use-r-rcurl-xml-packages-to-scrape-this-webpage
################################################################

library(XML)
library(stringr)

# build the URL
url <- paste("http://sports.yahoo.com/nfl/stats/byposition?pos=QB",
		"&conference=NFL&year=season_2009",
		"&timeframe=Week1", sep="")

# read all the tables on the page and find the one with the most rows
tables <- readHTMLTable(url)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
tables[[which.max(n.rows)]]
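# note: tables[[which.max(n.rows)]] would grab the biggest table directly;
# the hardcoded index below assumes the stats table is number 7 on this page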

# select the table we need - read as a dataframe
my.table <- tables[[7]]

# delete extra columns and keep data rows
View(head(my.table, n=20))
my.table <- my.table[3:nrow(my.table), c(1:3, 5:12, 14:18, 20:21, 23:24) ]

# rename every column
c.names <- c("Name", "Team", "G", "QBRat", "P_Comp", "P_Att", "P_Yds", "P_YpA", "P_Lng", "P_Int", "P_TD", "R_Att",
		"R_Yds", "R_YpA", "R_Lng", "R_TD", "S_Sack", "S_SackYa", "F_Fum", "F_FumL")
names(my.table) <- c.names

# the data get read in with weird symbols that need to be removed - values start out as character factors
# for the loops below, I am manually telling the code which regex to use - this assumes consistent behavior
# depending on where the weird characters sit -- is this an encoding issue?
front <- c(1)                  # the Name column has junk characters at the front
back <- c(4:ncol(my.table))    # the numeric stat columns

for(f in front) {
	test.front <- as.character(my.table[, f])
	tt.front <- str_sub(test.front, start=3)	# drop the first two junk characters
	my.table[, f] <- tt.front
}

for(b in back) {
	test <- as.character(my.table[, b])
	# keep only the numeric part: optional minus sign, digits, optional decimal
	tt.back <- as.numeric(str_match(test, "-?\\d{1,3}\\.?[0-9]*"))
	my.table[, b] <- tt.back
}

str(my.table)
View(my.table)
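
# just for fun - a quick, hypothetical peek at the week's passing leaders,
# using the column names assigned above
head(my.table[order(my.table$P_Yds, decreasing=TRUE),
		c("Name", "Team", "P_Yds", "P_TD", "QBRat")], n=5)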

# clear the workspace and quit R without saving
rm(list=ls())
q(save = "no")
