A handful of open Linked Data posts have appeared through my feeds in the last couple of days, including (via RBloggers) SPARQL with R in less than 5 minutes, which shows how to query US data.gov Linked Data, and then Leigh Dodds’ Brief Review of the Land Registry Linked Data.
I was going to post a couple of examples merging those two posts – showing how to access Land Registry data via Leigh’s example queries in R, then plotting some of the results using ggplot2 – but another post of Leigh’s today, SPARQL-doc, a simple convention for documenting individual SPARQL queries, has sparked another thought…
For some time I’ve been intrigued by the idea of a marketplace in queries over public datasets, as well as the public sharing of generally useful queries. A good query is like a good gold pan, or a good interview question – it can get a dataset to reveal something valuable that may otherwise have lain hidden. Coming up with a good query in part requires having a good understanding of the structure of a dataset, in part having an eye for what sorts of secrets the data may contain: the next step is crafting a well phrased query that can tease that secret out. Creating the query might take some time, some effort, and some degree of expertise in query optimisation to make it actually runnable in reasonable time (which is why I figure there may be a market for such things*) but once written, the query is there. And if it can be appropriately parameterised, it may generalise.
(*There are actually a couple of models I can think of: 1) I keep the query secret, but run it and give you the results; 2) I license the “query source code” to you and let you run it yourself. Hmm, I wonder: do folk license queries they share? How, and to what extent, might derived queries/query modifications be accommodated in such a licensing scheme?)
Pondering Leigh’s SPARQL-doc post, along with: another post via R-bloggers, Building a package in RStudio is actually very easy (which describes how to package a set of R files for distribution via github); asdfree (analyze survey data for free), a site that “announces obsessively-detailed instructions to analyze us government survey data with free tools” (and which includes R bundles to get you started quickly…); the resource listing Documentation for package ‘datasets’ version 2.15.2, which describes a bundled package of datasets for R; and the Linked Data API, which sought to provide a simple RESTful API over SPARQL endpoints, I wondered the following:
How about developing and sharing commented query libraries around Linked Data endpoints that could be used in arbitrary Linked Data clients?
(By “Linked Data clients”, I mean different user agent contexts. So for example, calling a query from Python, or R, or Google Spreadsheets.) That’s it… Simple.
One approach (the simplest?) might be to put each separate query into a separate file, with a filename that could be used to spawn a function name that could be used to call that query. Putting all the queries into a directory and zipping them up would provide a minimal packaging format. An additional manifest file might minimally document the filename along with the parameters that can be passed into and returned from the query. Helper libraries in arbitrary languages would open the query package and “compile” a programme library/set of “API” calling functions for that language (so for example, in R it would create a set of R functions, in Python a set of Python functions).
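By way of a very rough sketch of what that might look like in Python: the snippet below zips up a set of query files with a manifest, then “compiles” the package into a set of functions named after the query files. Everything here is an assumption made for illustration: the manifest.json filename and its layout, the .rq file extension, and the use of $name placeholders (via Python’s string.Template) to mark query parameters. The generated functions just return the filled-in query text; actually sending it to an endpoint is left to whatever client you prefer.

```python
import io
import json
import zipfile
from string import Template


def build_package(queries, manifest):
    """Zip {filename: query text} plus a manifest into an in-memory package."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Hypothetical manifest layout: {filename: {"params": [...]}}
        zf.writestr("manifest.json", json.dumps(manifest))
        for filename, text in queries.items():
            zf.writestr(filename, text)
    return buf.getvalue()


def _make_query_function(name, template, params):
    """Create one function that fills the query template with its parameters."""
    def query(**kwargs):
        missing = [p for p in params if p not in kwargs]
        if missing:
            raise TypeError(f"{name}() missing parameters: {missing}")
        return template.substitute(kwargs)
    query.__name__ = name
    return query


def load_package(data):
    """'Compile' a query package into a dict of functions keyed by query name."""
    functions = {}
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        manifest = json.loads(zf.read("manifest.json"))
        for filename, meta in manifest.items():
            template = Template(zf.read(filename).decode("utf-8"))
            name = filename.rsplit(".", 1)[0]  # filename spawns the function name
            functions[name] = _make_query_function(name, template, meta["params"])
    return functions


# Example: a one-query package (the query and its name are made up)
package = build_package(
    {"sample_triples.rq": "SELECT * WHERE { ?s ?p ?o } LIMIT $limit"},
    {"sample_triples.rq": {"params": ["limit"]}},
)
api = load_package(package)
print(api["sample_triples"](limit=5))
```

Note that naive string substitution like this does nothing to escape parameter values, so a real helper library would want to be rather more careful about what gets pasted into a query.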
(This reminds me of a Twitter exchange with Nick a couple of days ago around “self-assembling” API programme libraries that could be compiled in an arbitrary language from a JSON API, cf. Swagger (presentation), which I haven’t had time to look at yet.)
The idea, then, is this:
- Define a simple file format for declaring documented SPARQL queries
- Define a simple packaging format for bundling separate SPARQL queries
- The simply packaged set of queries defines a simple “raw query” API over a Linked Data dataset
- Describe a simple protocol for creating programming language specific library wrappers around the API from the query bundle package.
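As a rough illustration of the first item in that list, here is a minimal sketch of reading a documented query file in Python. The convention it parses is an assumption on my part, loosely in the spirit of documenting queries with leading comments: the file opens with # comment lines, free text becomes the description, and @tag lines (the @title and @param tag names are hypothetical) become metadata. Everything after the comment header is treated as the query itself.

```python
def parse_documented_query(text):
    """Split a query file into (metadata, query) using its leading '#' comments."""
    metadata = {"description": [], "params": []}
    lines = text.splitlines()
    body_start = len(lines)  # default: the file is all comments, no query body
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped.startswith("#"):
            body_start = i  # first non-comment line ends the header
            break
        content = stripped.lstrip("#").strip()
        if content.startswith("@"):
            # Hypothetical tag syntax: "@tag value"
            tag, _, value = content[1:].partition(" ")
            if tag == "param":
                metadata["params"].append(value.strip())
            else:
                metadata[tag] = value.strip()
        elif content:
            metadata["description"].append(content)
    metadata["description"] = " ".join(metadata["description"])
    return metadata, "\n".join(lines[body_start:]).strip()


# Example query file contents (made up for illustration)
example = """# Sample some triples from the dataset
# @title Sample triples
# @param limit
SELECT * WHERE { ?s ?p ?o } LIMIT $limit
"""
meta, query = parse_documented_query(example)
print(meta)
print(query)
```

The parameter names pulled out of the header are exactly the sort of thing a manifest file, or a language-specific wrapper generator, could be built from.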
So… I guess two questions arise: 1) would this be useful? 2) how hard could it be?
[See also: @ldodds again, on Publishing SPARQL queries and-documentation using github]