Scrape Web data using R

August 13, 2010

(This article was first published on Brock's Data Adventure » R, and kindly contributed to R-bloggers)

Plenty of people have been scraping data from the web using R for a while now, but I just completed my first project and wanted to share the code with you. It was a little hard to work through some of the “issues”, but I had some great help from @DataJunkie on Twitter.

As an aside, if you are learning R and coming from another package like SPSS or SAS, I highly recommend following the #rstats hashtag on Twitter to be amazed by the kinds of data analysis going on right now.

One note: when I read in my table, it contained a weird set of characters. I suspect it is some sort of encoding issue, but luckily I was able to work around it by recoding the data from character factors to numbers using the stringr package and some basic regular expressions.
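
As a minimal sketch of the idea (the non-breaking spaces below are just a guess at what the junk bytes might be), a regular expression can pull the numeric part out of a messy factor:

library(stringr)

# hypothetical example: a factor whose levels carry leading junk characters
x <- factor(c("\u00A0\u00A012.5", "\u00A0\u00A0-3"))
as.numeric(str_match(as.character(x), "-?\\d{1,3}\\.?[0-9]*"))
# returns 12.5 and -3.0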

Bring on fantasy football!

################################################################
## Help from the following sources:
## @DataJunkie on twitter
## http://www.regular-expressions.info/reference.html
## http://stackoverflow.com/questions/1395528/scraping-html-tables-into-r-data-frames-using-the-xml-package
## http://stackoverflow.com/questions/2443127/how-can-i-use-r-rcurl-xml-packages-to-scrape-this-webpage
################################################################

library(XML)
library(stringr)

# build the URL
url <- paste("http://sports.yahoo.com/nfl/stats/byposition?pos=QB",
		"&conference=NFL&year=season_2009",
		"&timeframe=Week1", sep="")

# read all the tables on the page and find the one with the most rows
tables <- readHTMLTable(url)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
tables[[which.max(n.rows)]]
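# note: tables[[which.max(n.rows)]] would grab the biggest table directly;
# the hardcoded index below assumes the stats table is number 7 on this page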

# select the table we need - read as a dataframe
my.table <- tables[[7]]

# delete extra columns and keep data rows
View(head(my.table, n=20))
my.table <- my.table[3:nrow(my.table), c(1:3, 5:12, 14:18, 20:21, 23:24) ]

# rename every column
c.names <- c("Name", "Team", "G", "QBRat", "P_Comp", "P_Att", "P_Yds", "P_YpA", "P_Lng", "P_Int", "P_TD", "R_Att",
		"R_Yds", "R_YpA", "R_Lng", "R_TD", "S_Sack", "S_SackYa", "F_Fum", "F_FumL")
names(my.table) <- c.names

# the data get read in with weird symbols that need to be removed - values start out as character factors
# for the loops below, I am manually telling the code which regex to use - this assumes consistent behavior
# depending on where the weird characters sit -- is this an encoding issue?
front <- c(1)                  # the Name column has junk characters at the front
back <- c(4:ncol(my.table))    # the numeric stat columns

for(f in front) {
	test.front <- as.character(my.table[, f])
	tt.front <- str_sub(test.front, start=3)	# drop the first two junk characters
	my.table[, f] <- tt.front
}

for(b in back) {
	test <- as.character(my.table[, b])
	# keep only the numeric part: optional minus sign, digits, optional decimal
	tt.back <- as.numeric(str_match(test, "-?\\d{1,3}\\.?[0-9]*"))
	my.table[, b] <- tt.back
}

str(my.table)
View(my.table)
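
# just for fun - a quick, hypothetical peek at the week's passing leaders,
# using the column names assigned above
head(my.table[order(my.table$P_Yds, decreasing=TRUE),
		c("Name", "Team", "P_Yds", "P_TD", "QBRat")], n=5)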

# clear the workspace and quit R without saving
rm(list=ls())
q(save = "no")
