Google BigQuery and the GitHub Data Challenge

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

GitHub has made data on its code repositories — developer updates, forks, and more from the public GitHub timeline — available for analysis, and is offering prizes for the most interesting visualizations of the data. It sounds like a great challenge for R programmers! R is currently the 26th most popular language on GitHub (up from #29 in December), and it would be interesting, for example, to visualize the usage of R compared to other languages. The deadline for contest submissions is May 21.

Interestingly, GitHub has made this data available on the Google BigQuery service, which opens to the public today. BigQuery was free to use while it was in beta test, but Google now charges for storage of the data: $0.12 per gigabyte per month, up to $240/month (the service is limited to 2TB of storage, although there is a Premier offering that supports larger data sizes at a price to be negotiated). While members of the public can run SQL-like queries on the GitHub data for free, Google charges subscribers to the service 3.5 cents per GB of data processed in a query. This is measured by the source data accessed (columns that a query doesn't reference aren't counted); the size of the result set doesn't matter.
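As a rough sketch of how this pricing works out in practice (the figures below simply apply the rates quoted above; the example dataset sizes are hypothetical, for illustration only):

```python
# Back-of-the-envelope BigQuery costs, using the rates quoted above.
# The example dataset sizes are hypothetical.

STORAGE_RATE = 0.12   # dollars per GB stored, per month
STORAGE_CAP = 240.00  # monthly storage cost at the 2 TB limit
QUERY_RATE = 0.035    # dollars per GB of source data processed

def storage_cost(gb_stored):
    """Monthly storage cost, capped at $240 (the 2 TB limit)."""
    return min(gb_stored * STORAGE_RATE, STORAGE_CAP)

def query_cost(gb_in_referenced_columns):
    """Cost of one query: billed on the source data in the columns the
    query actually references; the result-set size doesn't matter."""
    return gb_in_referenced_columns * QUERY_RATE

# Storing 500 GB costs $60/month; the full 2 TB hits the $240 cap.
print(storage_cost(500))         # 60.0
print(storage_cost(2000))        # 240.0

# A query that touches 10 GB of referenced columns costs 35 cents.
print(round(query_cost(10), 2))  # 0.35
```

Note that because billing is per column actually referenced, a `SELECT *` over a wide table can cost many times more than a query that names only the columns it needs.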

With analysis limited to simple queries — "select" statements, by-row aggregations and the like — it's hard to see how this will have a big impact on companies that need to do even moderately advanced analytics on their data. It may prove useful as a store for relatively static data (like the GitHub example above), but given that it takes 20 minutes to transfer 200GB of data into BigQuery, it doesn't seem well suited to frequently-changing data. Like Amazon S3, Google charges for data transfer in and out of the cloud, but at least with Amazon you can transfer data for free from S3 to other Amazon AWS services. And with Amazon AWS you have a limitless range of AMIs (machine images) at your disposal, and can use advanced analytic tools like Revolution R Enterprise for predictive modeling and the like. If you're already storing your data in the cloud (as more companies are doing; Netflix is a great example), this makes Amazon the more compelling cloud platform for Big Data Analytics. But if Google opens up more flexible data analysis options (or even the Google Prediction API) to BigQuery, this might change.
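For a sense of scale, the 20-minute figure above implies a sustained ingest rate along these lines (a quick sanity check, assuming "200GB" means decimal gigabytes):

```python
# Implied ingest throughput for the 200GB-in-20-minutes figure above.
gb_transferred = 200
minutes = 20

gb_per_minute = gb_transferred / minutes
mb_per_second = gb_per_minute * 1000 / 60     # decimal megabytes per second
gigabits_per_second = gb_per_minute * 8 / 60  # 8 bits per byte

print(gb_per_minute)                  # 10.0
print(round(mb_per_second, 1))        # 166.7
print(round(gigabits_per_second, 2))  # 1.33
```

That is a respectable rate for a one-time bulk load, but it makes clear why repeatedly re-uploading a dataset that changes often would be impractical.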

