Scraping XML Tables with R

May 15, 2014
By

(This article was first published on Analyst At Large » R, and kindly contributed to R-bloggers)

A couple of my good friends also recently started a sports analytics blog. We’ve decided to collaborate on a couple of studies revolving around NBA data found at www.basketball-reference.com. This will be the first part of that project!

Data scientists need data. The internet has lots of data. How can I get that data into R? Scrape it!

People have been scraping websites for as long as there have been websites. It’s gotten pretty easy using R/Python/whatever other tool you want to use. This post shows how to use R to scrape the demographic information for all NBA and ABA players listed at www.basketball-reference.com.

Here’s the code:

###### Settings
library(XML)
 
###### URLs
url<-paste0("http://www.basketball-reference.com/players/",letters,"/")
len<-length(url)
 
###### Reading data
tbl<-readHTMLTable(url[1])[[1]]
 
for (i in 2:len)
	{tbl<-rbind(tbl,readHTMLTable(url[i])[[1]])}
 
###### Formatting data
colnames(tbl)<-c("Name","StartYear","EndYear","Position","Height","Weight","BirthDate","College")
tbl$BirthDate<-as.Date(tbl$BirthDate[1],format="%B %d, %Y")

Created by Pretty R at inside-R.org

And here’s the result:Result

 


To leave a comment for the author, please follow the link and comment on his blog: Analyst At Large » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.