ScraperWiki in R

July 29, 2012
By

(This article was first published on factbased, and kindly contributed to R-bloggers)

ScraperWiki describes itself as an online tool for gathering, cleaning and analysing data from the web. It is a programming oriented approach, users can implement ETL processes in Python, PHP or Ruby, share these processes among the community (or pay for privacy) and schedule automated runs. The software behind the service is open source, and there is also an offline version of the program library.

As far as I know, ScraperWiki has no R support yet. This is where the scrape method of my datamart package chimes in. It provides an S4 generic for storing scraped data into an offline database. It is not a wiki (however collaboration can take place using r-forge or CRAN), has no scheduling support (scheduling the start of R scripts should take place outside R) and is not intended as a web application.

Example use case — Berlin’s bathing water quality

Here is an example use case. It is (finally) summer in Berlin and the warm weather invites to go swimming. For the health-aware and/or data-affine people (or those with strong imagination) the public authorities provide the latest water quality assessments online for around 30 bathing areas as open data.

Only the last measurement is available. Now if we were interested in collecting the quality measurements, maybe to find a trend or to search for contributing factors, we would need a mechanism to regularly (in this case, weekly) scrape the data via the online API and save the data locally.

With the framework proposed in datamart, once the details are defined, the mechanism would be executed with just three lines

> ds <- datastore("path/to/local.db")
> bq <- be_bathing_water_quality()
> scrape(bq, ds) # get & save the data

The rest of this blog post is on defining the process, which boils down to defining the be_bathing_water_quality function.

ETL process — extract, transform, load

The task the be_bathing_water_quality function has to accomplish is to define an three-step process by passing the process details to the urldata function. In the database or data warehouse context this process is often refered to as ETL process. In our context, the steps are:

• After mapping a resource name to an URL the data is extracted, i.e. downloaded. If the network is not available, an authentication failed, or similar, the process ends here.
• The extracted data is then transformed into an object R can work with. Currently, only data.frame objects are supported, xts and igraph are envisioned. If the extracted data was not as expected and the tranformation failed, the process ends here.
• The tidy data is then stored (or loaded) int a local sqlite database.

The urldata function provides parameters for each of these steps.

In the bathing water quality example, there is on URL for all bathing areas. We map the resource BathingWaterQuality to this URL. the data is returned as JSON, which is why we use fromJSON as extraction function. The data is then transformed into a data.frame. (We do not go into into every detail of that transformation step.) Hence, the call to urldata is:

> be_bathing_water_quality <- function() urldata(
+   map.lst=list(BathingWaterQuality=""),
+   extract.fct=fromJSON,
+   transform.fct=function(x) {
+     tbl <- x[["index"]]
+     nm <- names(tbl[[1]])
+     dat <- as.data.frame(
+       matrix(NA, length(tbl), length(nm))
+     )
+     colnames(dat) <- nm
+     for(i in 1:length(tbl))
+        dat[i,nm] <- tbl[[i]][nm]
+     #some steps omitted...
+     return(dat)
+   }
+ )

Now, we can create an data object, inspect it using the queries method, and access the latest measurements form the web using query:

> bq <- be_bathing_water_quality()
> queries(bq)
[1] "BathingWaterQuality"
> nrow(query(bq, "BathingWaterQuality"))
[1] 38

The urldata function as part of the datamart package has been described in more detail in an earlier post on the Gapminder datasets.

In order to save the scraped data, one more step is necessary. The urldata provides a parameter scrape.lst for specifying which resources to save locally, and how:

> be_bathing_water_quality <- function() urldata(
+   template="...", # see above
+   map.lst=list(BathingWaterQuality=""),
+   extract.fct=fromJSON,
+   #transform.fct= # see above
+   scrape.lst=list(
+     BathingWaterQuality=list(
+       saveMode="u",
+       dates="dat"
+     )
+   )
+ )

The name(s) of the scrape.lst argument must be a subset of the names of map.lst. The entries of the list are passed as arguments to dbWriteTable: - saveMode="u" indicates that existing observations should be replaced and new observations appended, uniq defines the columns that determine if an observation exists or not. In the example, an observation is identified by the name of the bathing area (id, badname) and the measurement date (dat). If a place/date combination already exists both in the database and in the newly scraped data, the row in the database gets overwritten. - There are other options to dbWriteTable such as specification of columns with dates or timestamps to enforce data type conversion.

Now with this extended function definition we can decide to use local or web data:

> bq <- be_bathing_water_quality()
> ds <- datastore(":memory:") # not persistent
> scrape(bq, ds)
> #
> #local:
> system.time(query(bq, "BathingWaterQuality", dbconn=ds))
user  system elapsed
0       0       0
> #
> #web:
> system.time(query(bq, "BathingWaterQuality"))
user  system elapsed
0.04    0.00    0.23 

Usually, the local data is accessed faster. Additionally, it is possible to keep observations as long as we want to.

Conclusion

This blog post belongs to a series of articles on a S4 data concept I am currently prototyping. Currently, the framework allows to access and query “conventional” internal datasets, SPARQL endpoints and datasets on the web. Topic of the next blog post is combining data objects using the Mashup class.

On the other side, I am thinking of building on top of this infrastructure. Combining data objects with text templates similar to brew is one idea, interactive views using rpanelor similiar is another.

The datamart package is in its early stage, changes to the classes are likely to happen. I would appreciate feedback on the concept. Please leave a comment or send an email.