Webscraping using readLines and RCurl

April 14, 2009

(This article was first published on ProgrammingR, and kindly contributed to R-bloggers)

There is a massive amount of data available on the web. Some of it is in the form of precompiled, downloadable datasets which are easy to access. But the majority of online data exists as web content such as blogs, news stories and cooking recipes. With precompiled files, accessing the data is fairly straightforward: just download the file, unzip if necessary, and import into R. For “wild” data, however, getting the data into an analyzable format is more difficult. Accessing online data of this sort is sometimes referred to as “webscraping”. Two R facilities, readLines() from the base package and getURL() from the RCurl package, make this task possible.


For basic webscraping tasks the readLines() function will usually suffice. readLines() allows simple access to webpage source data on non-secure servers. In its simplest form, readLines() takes a single argument – the URL of the web page to be read:

web_page <- readLines("http://www.interestingwebsite.com")
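Whatever the source, readLines() returns a character vector with one element per line of text. The sketch below illustrates this with a temporary local file standing in for a web page (the file and its contents are made up for illustration), so it runs without a network connection.

```r
# A temporary file stands in for a remote page (hypothetical content)
tmp <- tempfile(fileext = ".html")
writeLines(c("<html>", "<body>", "<p>Hello</p>", "</body>", "</html>"), tmp)

# readLines() accepts a file path, a URL, or a connection
page <- readLines(tmp)

length(page)   # number of lines read
page[3]        # the third line of the source

# Clean up the temporary file
unlink(tmp)
```

Because the result is an ordinary character vector, the usual subsetting and pattern-matching tools apply to it directly.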

As an example of a (somewhat) practical use of webscraping, imagine a scenario in which we wanted to know the 10 most frequent posters to the R-help listserv for January 2009. Because the listserv is on a secure site (i.e. it has https:// rather than http:// in the URL), we can't easily access the live version with readLines(). So for this example, I've posted a local copy of the list archives on this site.

One note: by itself, readLines() can only acquire the data. You'll need to use grep(), gsub() or equivalents to parse the data and keep what you need.

# Get the page's source
web_page <- readLines("http://www.programmingr.com/jan09rlist.html")

# Pull out the lines that contain author names
author_lines <- web_page[grep("<I>", web_page)]

# Delete unwanted characters in the lines we pulled out
authors <- gsub("<I>", "", author_lines, fixed = TRUE)

# Count posts per author and present only the ten most frequent posters
author_counts <- sort(table(authors), decreasing = TRUE)
head(author_counts, 10)

We can see that Gabor Grothendieck was the most frequent poster to R-help in January 2009.
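Since the archive copy may not always be reachable, here is a minimal offline sketch of the same grep()/gsub()/table() steps. The sample lines and author names are invented placeholders that only imitate the layout assumed above (an opening <I> tag followed by the author's name); they are not taken from the real archive.

```r
# Made-up lines imitating the archive's layout (names are placeholders)
sample_page <- c(
  "<LI><A HREF=\"0001.html\">[R] question one</A>",
  "<I>Alice Example",
  "</I>",
  "<LI><A HREF=\"0002.html\">[R] question two</A>",
  "<I>Bob Example",
  "</I>",
  "<LI><A HREF=\"0003.html\">[R] re: question one</A>",
  "<I>Alice Example",
  "</I>"
)

# Keep only lines containing the opening tag; fixed = TRUE matches it literally
author_lines <- grep("<I>", sample_page, fixed = TRUE, value = TRUE)

# Strip the tag and any surrounding whitespace
authors <- trimws(gsub("<I>", "", author_lines, fixed = TRUE))

# Tabulate posts per author and rank from most to least frequent
author_counts <- sort(table(authors), decreasing = TRUE)
head(author_counts, 10)
```

Using value = TRUE makes grep() return the matching lines themselves, which saves the separate indexing step.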

The RCurl package

To get more advanced HTTP features such as POST capabilities and HTTPS access, you'll need to use the RCurl package. To do webscraping tasks with the RCurl package, use the getURL() function. After the data has been acquired via getURL(), it needs to be restructured and parsed. The htmlTreeParse() function from the XML package is tailored for just this task. Because getURL() can access a secure site, we can use the live site as an example this time.

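# Install the RCurl package if necessary
install.packages("RCurl", dependencies = TRUE)
library(RCurl)

# Install the XML package if necessary
install.packages("XML", dependencies = TRUE)
library(XML)

# Get the January 2009 archive
# (ssl.verifypeer = FALSE skips SSL certificate verification)
jan09 <- getURL("https://stat.ethz.ch/pipermail/r-help/2009-January/date.html", ssl.verifypeer = FALSE)

# Parse the downloaded text; asText = TRUE marks the input as HTML content, not a file name
jan09_parsed <- htmlTreeParse(jan09, asText = TRUE)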

# Continue on similar to above
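One possible way to continue is to query the parsed document with XPath instead of pattern matching on raw lines. The sketch below is an assumption about how that might look, not the author's method: it swaps in htmlParse() (the XML package's variant of htmlTreeParse() that returns an internal document suitable for XPath) and runs on a made-up HTML string with placeholder names, so it needs the XML package but no network access.

```r
library(XML)

# Made-up HTML standing in for the downloaded page (names are placeholders)
html_text <- paste(
  "<html><body><ul>",
  "<li><a href='0001.html'>[R] question one</a><i>Alice Example</i></li>",
  "<li><a href='0002.html'>[R] question two</a><i>Bob Example</i></li>",
  "<li><a href='0003.html'>[R] re: question one</a><i>Alice Example</i></li>",
  "</ul></body></html>",
  sep = "\n"
)

# Build an internal document that XPath queries can search
doc <- htmlParse(html_text, asText = TRUE)

# Extract the text of every <i> element and trim whitespace
authors <- trimws(xpathSApply(doc, "//i", xmlValue))

# Tabulate posts per author and rank from most to least frequent
author_counts <- sort(table(authors), decreasing = TRUE)
head(author_counts, 10)
```

With a real page, the string returned by getURL() would take the place of html_text, and the XPath expression would need to match that page's actual markup.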

For basic webscraping tasks, readLines() will be enough and avoids overcomplicating the task. For more difficult procedures, or for tasks requiring other HTTP features, getURL() or other functions from the RCurl package may be required. For more information on cURL visit the project page here.

