My new release of partools is now on CRAN.
The package is aimed at doing parallel data science in what I call an “un-MapReduce” manner. It takes the point of view that MapReduce-based frameworks such as Hadoop and Spark are fine for the types of applications their designers had in mind, namely rather simple SQL actions, but have fundamental handicaps that prevent them from performing well on many, if not most, of the types of computation that typical users need for large data sets and/or highly compute-bound applications. The distributed file/object nature of those MapReduce systems is retained, but the confining MapReduce computational paradigm is avoided.
The package now contains about 30 functions, ranging from infrastructure support to summary and aggregation to statistical/machine learning applications. See the vignette for a fairly detailed introduction. Two new capabilities that I wish to highlight are:
- Aggregation and related operations on objects of class “data.table”.
- Parallel computation for some modern statistical/machine learning algorithms (they are statistics to me, but you may call them machine learning if you prefer).
The core of that second highlighted set of functions makes use of what I call Software Alchemy, which I have explained in previous blog posts. See for instance the example on random forests in the vignette.