Stealing from the internet: Part 1


Well, not stealing, but rather some handy tools for data mining… About a year ago I came across the XML package as I was struggling to get some data from various web pages. The purpose of this post is to describe how the package can be used to quickly gather data from the internet. I’ll first describe how the functions are used and then show how they can be included in a custom function to quickly ‘steal’ what we need.

I realize that data mining from internet sources is not a new concept by any means (there are more thorough introductions elsewhere), but the practice is new to me and I have found it very useful in the applications I’ve attempted. For example, I routinely collect supporting data for large datasets that describe lakes in Minnesota. A really useful website that contains a lot of this information is LakeFinder, hosted by the Minnesota Department of Natural Resources. The website can be used to access all sorts of information about a lake just by punching in a lake name or unique 8-digit ID number. Check out Lake Minnetonka (remember Purple Rain??). The page has lots of information… lake characteristics, fish surveys, consumption warnings, etc. Also note the URL. The last 8 digits are the unique lake ID for Minnetonka assigned by the MNDNR. This part of the URL comes into play later.

What if I want information for a couple dozen lakes, or even several hundred (yeah, or over 10,000)? We have a few options. Obviously we could go through lake by lake and copy the information by hand, but this causes headaches and eventually a desire to harm oneself. We could also contact the site administrator and request the data directly as a batch, but this may take some time or may require repeated requests depending on our needs. As you’ll probably soon realize, I hate doing things inefficiently, and the XML package provides some useful tools to quickly gather information from the internet.

I’ll walk through an example that shows how we can get the maximum depth of a lake by accessing the HTML directly from the website. As a disclaimer, I have very little experience with HTML or any other markup languages (other than LaTeX), so I encourage feedback if the approach below can be implemented more efficiently.

# install and load packages
install.packages('XML')
library(XML)

The XML package has tons of functions and I’m not going to pretend like I understand them all. However, the htmlTreeParse function (or xmlTreeParse) can import raw HTML code from web pages and will be useful for our purposes. Let’s import the HTML code for Lake Minnetonka (remember the last 8 digits of the URL describe the unique lake ID).

html.raw <- htmlTreeParse(
  'http://www.dnr.state.mn.us/lakefind/showreport.html?downum=27013300',
  useInternalNodes = TRUE
)
html.parse <- xpathApply(html.raw, "//p", xmlValue)

The html.raw object is not immediately useful because it literally contains all of the raw HTML for the entire webpage. We can parse the raw code using the xpathApply function, which extracts the pieces of the document matching the path argument; here the path "//p" grabs the text of every paragraph tag.
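It can help to take a quick look at what xpathApply returned before parsing any further; the exact output depends on the live page, so treat this as just a spot check.

# spot check the parsed paragraphs (output depends on the current page)
length(html.parse)
head(unlist(html.parse))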

We now have a list of R objects once we use xpathApply, so we don’t have to mess with HTML/XML anymore. The trick is to parse the text in this list even further to find what we want. If we go back to the Lake Minnetonka page, we can see that ‘Maximum Depth (ft)’ precedes the listed depth for Lake Minnetonka. We can search for this text with the grep function from the base package to find the list element in html.parse that contains the depth data.

robj.parse <- grep('*Depth*', unlist(html.parse), value = TRUE)

It’s not a huge surprise that grep returns more than one element containing ‘Depth’. We’ll need to select the element that has the depth information we want (in this case, the first element) and parse the string further using the strsplit function.

robj.parse <- robj.parse[[1]] # select the first element
depth.parse <- as.numeric(
  strsplit(strsplit(robj.parse, 'ft): ')[[1]][2], 'Water')[[1]][1]
)

The code for depth.parse is really messy, but all it does is make two calls to strsplit to grab the depth value based on the text directly before and after the info we need ('ft): ' and 'Water', respectively). The final value is converted from text to a numeric object. Seasoned programmers will probably cringe at this code since it will not return the correct value if the website changes in any way. Yeah, this isn't the most orthodox way of coding, but it works for what we need. Undoubtedly there are more robust ways of getting this information, but this works just fine for static websites.
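For what it's worth, here is a sketch of a slightly sturdier alternative that pulls the number with a single regular expression instead of two nested strsplit calls. It assumes the paragraph still contains the literal text 'Maximum Depth (ft): ' followed by the value, so treat it as a sketch rather than a drop-in replacement.

# regex-based alternative (a sketch; assumes the paragraph still reads
# 'Maximum Depth (ft): <number>')
depth.alt <- as.numeric(
  sub('.*Maximum Depth \\(ft\\):[[:space:]]*([0-9.]+).*', '\\1', robj.parse)
)
depth.alt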

Additionally, we can combine all the code from above into a function that parses everything at once.

depth.fun <- function(lake){

  # build the URL for the requested lake id
  url.in <- paste(
    'http://www.dnr.state.mn.us/lakefind/showreport.html?downum',
    lake,
    sep = '='
  )

  # pull the raw HTML and parse out the paragraph tags
  html.raw <- htmlTreeParse(url.in, useInternalNodes = TRUE)
  html.parse <- xpathApply(html.raw, path = "//p", fun = xmlValue)

  # find the paragraph with the depth info and pull out the value
  robj.parse <- grep('*Depth*', unlist(html.parse), value = TRUE)
  depth.parse <- as.numeric(
    strsplit(strsplit(robj.parse, 'ft): ')[[1]][2], 'Water')[[1]][1]
  )

  return(depth.parse)

}

depth.fun('27013300')

All we do now is put in the 8-digit lake identifier (as a character string) and out comes the depth. We can make repeated calls to the function to get data for any lake we want, so long as we know the 8-digit identifier. The lake id number is critical because this defines where the raw HTML comes from (i.e., what URL is accessed). Notice that the first part of depth.fun pastes the input id text with the URL, which is then passed to later functions.
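Before looping over many lakes, I'd also suggest a thin safety wrapper (this is my own addition, and depth.safe is just an illustrative name): tryCatch returns NA instead of an error if a page is missing or has no depth information, so one bad lake id doesn't stop a whole batch.

# optional safety wrapper (depth.safe is an illustrative name, not part of XML)
depth.safe <- function(lake){
  tryCatch(depth.fun(lake), error = function(e) NA)
}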

Here's an example of getting the information for several lakes, using sapply to make repeated calls to depth.fun.

lake.ids <- c('27013700', '82004600', '82010400')
sapply(lake.ids, depth.fun)
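sapply returns a named numeric vector with the lake ids as names; if you'd rather keep the ids and depths together, a small data frame does the trick (depths and lake.depths are just illustrative names, not part of the original code).

# keep the ids and their depths together (illustrative names)
depths <- sapply(lake.ids, depth.fun)
lake.depths <- data.frame(downum = names(depths), max.depth.ft = depths, row.names = NULL)
lake.depths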

It's easy to see how we can use the function to get data for lots of lakes at the same time, although this example is trivial if you don't care about lakes. However, I think the code is useful if you ever encounter a situation where data are stored online at predictable URLs with predictable text strings surrounding the data you need. I've already used variants of this code to get other data (part 2 on the way). The only modification required to make the function useful for gathering other data is changing the URL and whatever text needs to be parsed to get what you need (a rough sketch of that kind of generalization is below). So there you have it.
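To make that concrete, here is a rough sketch of one way the function could be generalized; scrape.fun and its arguments (url.pre, key, before, after) are names I'm making up for illustration, and the parsing is only as reliable as the text you anchor on.

# a rough generalization sketch: url.pre is the URL up to the id, key finds the
# right paragraph, and the value is assumed to sit between the literal strings
# 'before' and 'after' (all names here are illustrative)
scrape.fun <- function(id, url.pre, key, before, after){
  html.raw <- htmlTreeParse(paste(url.pre, id, sep = '='), useInternalNodes = TRUE)
  html.parse <- unlist(xpathApply(html.raw, path = "//p", fun = xmlValue))
  robj.parse <- grep(key, html.parse, value = TRUE, fixed = TRUE)[1]
  as.numeric(
    strsplit(strsplit(robj.parse, before, fixed = TRUE)[[1]][2], after, fixed = TRUE)[[1]][1]
  )
}

# should give the same answer as depth.fun for Lake Minnetonka
scrape.fun('27013300', 'http://www.dnr.state.mn.us/lakefind/showreport.html?downum',
           'Depth', 'ft): ', 'Water')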

Get all the code here:

