When I joined Joyent last year, I jumped on the opportunity to make R work with Joyent Manta. If you are new to Joyent, we are the High-Performance Cloud Infrastructure Company. If you are new to data analytics, the R language is a rich open-source environment for data exploration and complex analytics.
Joyent Manta is a durable cloud object store with a built-in high-performance compute engine. The power of Manta rests in its ability to compute on data without moving it from cloud object storage to compute nodes.
For R data exploration, I knew Manta was a game-changer.
Manta provides the big transient compute cluster that, as a scientist, I always craved. No cluster setup time deficit, no cluster maintenance struggle, and no waiting to move data from storage to compute nodes. Best of all, I could make Manta look the way an R user would expect. Ready for data exploration. Ready for scale up.
Don't Ya Know That Spheres Are Not Enough?
Like many scientists, I have used R (and, way back, its predecessor S) for data exploration and complex analytics. Two years ago I was chewing on a classic 'spherical cow' problem: trying to cluster 3D objects (protein molecules) based on the radius, r, of the smallest sphere that contained them. Eventually, I found some fancier math (principal components analysis) to compute the axes a, b and c of their ellipsoidal containers from the enveloping wireframe coordinates.
Sure enough the blimpy-shaped prolate ellipsoids worked much better, as you might expect they would for the spherical cow.
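To make the sphere-versus-ellipsoid idea concrete, here is a minimal base R sketch. It is not the original analysis code. The point cloud is synthetic, the sphere is centered on the centroid rather than being the true smallest enclosing sphere, and the ellipsoid is merely an enclosing one aligned with the principal axes, not necessarily the tightest fit. The names xyz, half and axes are made up for illustration.

```r
set.seed(1)
n <- 200

# Hypothetical elongated point cloud standing in for wireframe coordinates.
xyz <- cbind(x = rnorm(n, sd = 6), y = rnorm(n, sd = 2), z = rnorm(n, sd = 1))

# Enclosing sphere centered on the centroid (an assumption; this is
# generally larger than the true smallest enclosing sphere).
centered <- sweep(xyz, 2, colMeans(xyz))
r <- max(sqrt(rowSums(centered^2)))

# PCA rotates the centered points onto their principal axes.
pca <- prcomp(xyz, center = TRUE, scale. = FALSE)
scores <- pca$x

# Start from the half-extent along each principal axis, then inflate all
# three uniformly until every point satisfies
# (u/a)^2 + (v/b)^2 + (w/c)^2 <= 1.
half <- apply(abs(scores), 2, max)
k <- max(rowSums(sweep(scores, 2, half, "/")^2))
axes <- half * sqrt(k)
names(axes) <- c("a", "b", "c")

# Normalized squared distance of each point; values <= 1 are inside.
inside <- rowSums(sweep(scores, 2, axes, "/")^2)

print(r)
print(axes)
```

One number, r, describes the sphere, while a, b and c capture how stretched the shape is along each principal direction, which is the extra information that made the clustering work better.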
But there was a catch: performance. I hacked C code, and pushed single-threaded R to the breaking point. There were, of course, ways to parallelize the R part of the analysis:
- I could have spent a few weeks setting up a compute cluster and installing software.
- I could have lost touch with the math, C and R code for a few weeks and re-loaded it all into my brain afterward.
- I could have undertaken the longer-term maintenance time suck of keeping the cluster beast running, backing it up and so on.
But at the time I didn't. It wasn't worth my time. So I managed with single-threaded R performance and waited for results. Fortunately, the waits were small-ish (on the order of weeks).
Manta The R Way.
Aside from the obvious open source benefits, pretty graphs and the usual things you can do in almost any language, the R command prompt has a long history as a workhorse REPL (Read-Evaluate-Print Loop).
For R users, its REPL fulfills the need to explore data interactively without leaving the R prompt. It retains memory state, saving and loading your data objects from session to session. And R package management just works, on Linux, Unix, and Windows.
For many, the R command line may be the only command line they know.
So, rather than simply port Manta commands from Python or Node.js to R, I undertook to re-imagine the Manta interface as R functions which:
- remember your current working Manta subdirectory, saving you from absolute path typing
- support native transfer of R memory objects, R text files, and the R workspace itself
- integrate R pattern matching and sorting for hierarchical filesystem operations
- provide for easy interactivity, but have powerful optional parameters for scripting
- support multiple Manta accounts
- come with full, self-contained documentation like any good R package does
- log HTTP transactions in Trent Mick's bunyan JSON format and provide HTTP traceback
- support vectorized and recursive filesystem operations where appropriate
- provide an audit trail of R workspaces for each machine on which you run the client
- and the biggie - support UNIX Map/Reduce without the compute cluster setup penalty
The GitHub repo for mantaRSDK is ready at v0.8.0 for client use with R 3.0.0 and up on Unix, Linux and Windows, with installation instructions on the GitHub README.md page. No Java, Scala, PuTTY or even Node.js required (although the latter helps!). Straight R code, with some authentication help from OpenSSL.
And I have a webinar scheduled for January 23 at 10:00am PST. Signup here.
As a quick overview, the scope of the R package is like bringing a whole operating system into your R environment:
Object Store Operations
mantaExists() mantaPut() mantaGet() mantaCat() mantaRm() mantaSnapln() mantaDump() mantaSource() mantaSave() mantaLoad() mantaSave.ws() mantaLoad.ws()
Hierarchical Directory Operations
mantaGetwd() mantaSetwd() mantaSetwd.jobs() mantaSetwd.public() mantaSetwd.reports() mantaSetwd.stor() mantaSetwd.ws() mantaMkdir() mantaRmdir() mantaLs() mantaLs.du() mantaLs.l() mantaLs.n() mantaLs.paths() mantaLs.url() mantaFind() mantaFind.du() mantaFind.l() mantaFind.n() mantaFind.sizepath() mantaFind.sizes() mantaFind.url()
Compute Job Operations
mantaJob.setup() mantaMap() mantaReduce() mantaJob.launch() mantaJob.status() mantaJob.done() mantaJob.cancel() mantaJob.errors() mantaJob.errors.stderr() mantaJob.failures() mantaJob.inputs() mantaJob.outputs() mantaJob.outputs.cat() mantaJobs() mantaJobs.running() mantaJobs.tail()
mantaAccount() mantaWhoami() mantaGetLimits() mantaSetLimits()
The Final Antagonist, Slain.
Moving from R on your notebook to R with Map/Reduce capabilities, if you follow Big Data dogma, means you are going to need to invest substantial effort provisioning, installing, configuring and maintaining a Hadoop or Spark cluster system on your own hardware or in the cloud.
In the time it takes to read the Hadoop or Spark cluster requirements documentation, you can be up and running Map/Reduce jobs on Manta.
After you get the client installed and working, start with the example code in the R help and try the examples in this order to get an overview of the full capabilities:
?mantaSetwd ?mantaMkdir ?mantaPut ?mantaGet ?mantaSave.ws ?mantaJob.launch
That last example, by the way, runs two compute jobs. First it moves a copy of Shakespeare's works into your storage, then it performs a UNIX word count Map/Reduce on all of Shakespeare's writing, on Manta's compute nodes directly. That is exactly what I mean by no compute node setup. But this simplification doesn't mean you are getting a system that is confining. Manta opens up Map/Reduce to span the entire UNIX command environment. The Manta Service allows you to specify shell scripts, runtime languages, and compiled executables for Map and Reduce tasks.
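The map and reduce split in that word count can be mimicked locally in plain R. This is only an analogue under stated assumptions: the three short quotes stand in for stored objects, count_words() is a hypothetical helper playing the role of a per-file word count, and nothing here touches Manta or mantaRSDK.

```r
# Three tiny stand-in "objects"; on Manta each would be a stored file.
docs <- list(
  hamlet  = "To be, or not to be, that is the question",
  macbeth = "Out, damned spot! Out, I say!",
  tempest = "We are such stuff as dreams are made on"
)

# Map phase: runs once per object, independently of the others.
# count_words() is a hypothetical helper that counts
# whitespace-separated tokens, as a UNIX word count would.
count_words <- function(text) {
  words <- strsplit(text, "[[:space:]]+")[[1]]
  sum(nzchar(words))  # ignore empty strings from leading whitespace
}
per_doc <- Map(count_words, docs)

# Reduce phase: combines every map output into a single total.
total_words <- Reduce(`+`, per_doc)

print(unlist(per_doc))
print(total_words)
```

Because each map call sees only one object, the map phase can run wherever the object lives. Only the small per-object counts need to travel to the reduce step.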
The Free Manta storage tier is good for 10GB of cloud storage for a year, and $130 usage credit for compute time, which is billed by the second. The Free Trial link is at the top right side of the blog, so give it a spin.
Over the next few weeks I will post more blog entries with Map/Reduce examples, including image analysis and other UNIX compute tasks, and some insight into what it took to build this package.
The current release of mantaRSDK is pegged at 0.8, as we have a bit more plumbing work to do on the back end to enable native R functions for Map/Reduce. Rest assured, that is coming up after a few more compute node image iterations. Once that is in place we will increment the mantaRSDK client version, move the repo over to Joyent's GitHub and let you know how to work Manta Map/Reduce compute jobs directly with native R functions.