As I mentioned recently, the new, greatly extended version of my partools package is now on CRAN. (The current version on CRAN is 1.1.3, whereas at the time of my previous announcement it was only 1.1.1. Note that Unix is NOT required.)
It is my contention that for most R users who work with large data, partools — or methods like it — is a better, simpler, far more convenient approach than Hadoop and Spark. If you are an R user and, like most Hadoop/Spark users, don’t have a mega cluster (thousands of nodes), partools is a sensible alternative to Hadoop and Spark.
I’ll introduce partools usage in this post. I encourage comments (pro or con, here or in private). In particular, for those of you attending the JSM next week, I’d be happy to discuss the package in person, and hear your comments, feature requests and so on.
Why do I refer to partools as “sensible”? Consider:
- Hadoop and Spark are quite difficult to install and configure, especially for those who are not computer systems experts. By contrast, partools just uses ordinary R; there is nothing to set up.
- Spark, currently much favored by many over Hadoop, involves new, complex and abstract programming paradigms, even under the R interface, SparkR. By contrast, again, partools just uses ordinary R.
- Hadoop and Spark, especially the latter, have excellent fault tolerance features. If you have a cluster consisting of thousands of nodes, the possibility of disk failure must be considered. But otherwise, the fault tolerance of Hadoop and Spark is just slowing down your computation, often radically so. (You could also do your own fault tolerance, ranging from simple backup to sophisticated systems such as XtreemFS.)
What Hadoop and Spark get right is to base computation on distributed files. Instead of storing data in a monolithic file x, it is stored in chunks, say x.01, x.02,…, which can greatly reduce network overhead in the computation. The partools package also adopts this philosophy.
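To make the chunked-file idea concrete, here is a minimal sketch in base R. It is not partools’ filesplit(); the toy data, the temporary directory and the chunk count are all assumptions made for illustration.

```r
# Hypothetical illustration in base R; this is NOT partools' filesplit().
# Write a small "monolithic" file x, then split it into x.01, x.02, ...
tmpdir <- tempfile("chunks")
dir.create(tmpdir)
big <- file.path(tmpdir, "x")
write.table(data.frame(id = 1:10, val = (1:10)^2), big,
            row.names = FALSE, col.names = FALSE)

nchunks <- 4  # assumed number of chunks, typically one per worker
lines <- readLines(big)

# Assign each line to one of nchunks contiguous groups of similar size
grp <- cut(seq_along(lines), breaks = nchunks, labels = FALSE)

# Write chunk i to x.01, x.02, ... alongside the original file
for (i in seq_len(nchunks)) {
  writeLines(lines[grp == i], sprintf("%s.%02d", big, i))
}

chunkfiles <- list.files(tmpdir, pattern = "^x\\.[0-9]+$")
print(chunkfiles)
```

In a real setting each worker would then read only its own chunk, so the full file never has to pass through a single process.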
Overview of partools:
- There is no “magic.” The package merely consists of short, simple utilities that make use of R’s parallel package.
- The key philosophy is Keep It Distributed (KID). Under KID, one does many distributed operations, with a collective operation being done only occasionally, when needed.
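The KID pattern can be sketched with nothing but R’s parallel package. This is only an illustration of the idea, not partools code: the 2-worker cluster, the toy data frame and the variable name chunk are assumptions.

```r
library(parallel)

# Assumption: a small 2-worker local cluster
cls <- makeCluster(2)

# Distribute a toy data frame: each worker stores its own piece under the
# same name ("chunk") in its global environment
d <- data.frame(x = 1:100, y = (1:100) %% 7)
pieces <- clusterSplit(cls, seq_len(nrow(d)))
for (i in seq_along(cls)) {
  piece <- d[pieces[[i]], ]
  clusterCall(cls[i],
              function(p) { assign("chunk", p, envir = .GlobalEnv); NULL },
              piece)
}

# Distributed operations: results stay on the workers (KID);
# returning NULL avoids shipping the data back to the manager
invisible(clusterEvalQ(cls, { chunk <- chunk[chunk$y > 2, ]; NULL }))
invisible(clusterEvalQ(cls, { chunk$z <- chunk$x * 2; NULL }))

# Collective operation, done only when needed: combine per-worker maxima
localmax <- clusterEvalQ(cls, max(chunk$z))
globalmax <- max(unlist(localmax))
print(globalmax)

stopCluster(cls)
```

The filtering and the new column never leave the workers; only the final per-worker maxima, one number each, travel back to the manager.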
Sample partools (PT) session (see package vignette for details, including code, output):
- 16-core machine.
- Flight delay data, 2008. Distributed file created previously from monolithic one via PT’s filesplit().
- Called PT’s fileread(), causing each cluster node to read its chunk of the big file.
- Called PT’s distribagg() to find max values of DepDelay, ArrDelay, Airtime. 15.952 seconds, vs. 249.634 for R’s serial aggregate().
- Interested in Sunday evening flights. Each node performs that filtering op, assigning the result to data frame sundayeve. Note that sundayeve is a distributed data frame, in keeping with KID.
- Continue with KID, but if later we want to un-distribute that data frame, we could call PT’s distribgetrows().
- Performed a linear regression analysis, predicting ArrDelay from DepDelay and Distance, using Software Alchemy, via PT’s calm() function. Took 18.396 seconds, vs. 76.225 for ordinary lm(). (See my new book, Parallel Computation for Data Science, for details on Software Alchemy.)
- Applied a distributed na.omit() to each chunk, using parallel’s clusterEvalQ(). Took 2.352 seconds, compared to the 9.907 it would have needed if not distributed.
- Performed PCA. Took 8.949 seconds for PT’s caprcomp(), vs. 58.444 for the non-distributed case.
- Calculated interquartile range for each of 12 variables, taking 2.587 seconds, compared to 29.584 for the non-distributed case.
- Performed a more elaborate distributed na.omit(), taking 9.293 seconds, compared to 55.032 in the serial case.
Again, see the vignette for details on the above, and on related matters such as how to deal with files that don’t fit into memory.
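For readers curious about the chunk-averaging idea behind the regression step, here is a minimal sketch using only base R and parallel. It is not the calm() implementation: the simulated data, the 2-worker cluster and the plain averaging of coefficients are assumptions chosen to illustrate the technique.

```r
library(parallel)

# Simulated data (assumption): y depends linearly on x1 and x2
set.seed(1)
n <- 10000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 3 * d$x2 + rnorm(n)

cls <- makeCluster(2)

# One chunk of rows per worker. The manager ships the chunks here for
# brevity; with a distributed file each worker would read its own chunk.
chunks <- lapply(clusterSplit(cls, seq_len(n)), function(idx) d[idx, ])

# Each worker fits lm() to its chunk and returns only the coefficients
fits <- clusterApply(cls, chunks,
                     function(ch) coef(lm(y ~ x1 + x2, data = ch)))
stopCluster(cls)

# Chunk averaging: the overall estimate is the mean of the chunk estimates
chunkavg <- Reduce(`+`, fits) / length(fits)

# Compare with the ordinary full-data fit
fullfit <- coef(lm(y ~ x1 + x2, data = d))
print(rbind(chunkavg, fullfit))
```

On data like this the averaged coefficients should land very close to the full-data fit, while each worker only has to handle its own share of the rows.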