Michael Kellen, Director of Technology at Sage Bionetworks, is trying to build a GitHub for science. It’s called Synapse and Kellen described it in a talk at the Sage Bionetworks Commons Congress 2012, this past weekend: ‘Synapse’ Pilot for Building an ‘Information Commons’.
To paraphrase Kellen’s intro:
Science works better when people build on each other’s work. Every great advance is preceded by a multitude of smaller advances. It’s no accident that the invention of the printing press and the emergence of the first scientific journals coincide with the many great scientific discoveries of the Age of Enlightenment. But scientific journals are stuck in a paradigm revolving around the printing press. In other domains, namely open source software, people are more radically reinventing systems for sharing information with each other. GitHub is a collaborative environment for the domain of software. Synapse aims to be a similar environment for medical and genomic research.
The Synapse concept of a project packages together data and the code to process it. I tried to download the R script shown in the contents and couldn’t, either because I’m a knucklehead or because Synapse is a work in progress. On the plus side, they give you a helpful cut-and-paste snippet of R code in the lower-right corner for accessing the project through their R API. When this is fully implemented, it could provide a key piece of computing infrastructure for reproducible data-driven science.
Sage intends to explore ways of connecting to traditional scientific journals. Picture figures that link to interactive visualizations or computational methods that link to code. I’m a big fan of the “live document” concept and it would be great to see journal articles evolve in that direction.
An unintended consequence of NGS, Robert Gentleman points out, is that the data is too big for existing pipes. Any concept of a GitHub for science will have to incorporate processing biological data in the cloud. I could imagine a Synapse project containing data sets, code and a recipe for standing up an EC2 instance (or several). At a click, a scripted process would run, bootstrapping the machines, installing software and dependencies, running a processing pipeline, and visualizing the results in a browser. How would that be for reproducible science?
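The bootstrapping recipe imagined above could be as simple as a cloud-init user-data file handed to an EC2 instance at launch. This is purely a hypothetical sketch, not anything Synapse actually ships; the repository URL and script names are placeholders standing in for a project’s real code:

```yaml
#cloud-config
# Hypothetical bootstrap recipe: install R and git, fetch the
# project's code, then run its processing pipeline on first boot.
packages:
  - r-base
  - git
runcmd:
  # Placeholder repository URL and script names
  - git clone https://example.org/project-code.git /opt/pipeline
  - Rscript /opt/pipeline/install_dependencies.R
  - Rscript /opt/pipeline/run_analysis.R
```

Passing a file like this as user data when the instance is launched would let cloud-init handle the installation and kick off the pipeline unattended, leaving the "visualize in a browser" step to whatever the pipeline itself serves up.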