Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Along with all the hoopla on Big Data in recent years came a lot of hype on Hadoop.  This eventually spread to the R world, with sophisticated packages being developed such as rmr to run on top of Hadoop.

Hadoop made it convenient to process data in very large distributed databases, and also convenient to create them, using the Hadoop Distributed File System.  But eventually word got out that Hadoop is slow, and very limited in available data operations.

Both of those shortcomings are addressed to a large extent by the new kid on the block, Spark, which has an R interface package,sparkr.  Spark is much faster than Hadoop, sometimes dramatically so, due to strong caching ability and a wider variety of available operations.  Recently distributedR has also been released, again with the goal of using R on voluminous data sets, and there is also the more established pbdR.

However, I’d like to raise a question here:  Do we really need all that complicated machinery?  I’ll propose a much simpler alternative here, and am curious to see what people think.  (Disclaimer:  I have only limited experience with Hadoop, and only a bit with SparkR.   I’ll present a proposal below, and very much want to see what others think.)

These packages ARE complicated.  There is a considerable amount of configuration to do, worsened by dependence on infrastructure software such as Java or MPI, and in some cases by interface software such as rJava.  Some of this requires systems knowledge that many R users may lack.  And once they do get these systems set up, they may be required to design algorithms with world views quite different from R, even though they are coding in R.

Here is a possible alternative:  Simply use the familiar cluster-oriented portion of R’s parallel package, an adaptation of snow; I’ll refer to that portion of parallel as Snow, and just for fun, call the proposed package Snowdoop.  I’ll illustrate it with the “Hello world” of Hadoop, word count in a text file (slightly different from the usual example, as I’m just counting total words here, rather than the number of times each distinct word appears.)

(It’s assumed here that the reader is familiar with the basics of Snow.  If not, see the first chapter of the partial rough draft of my forthcoming book.)

Say we have a data set that we have partitioned into two files,words.1 and words.2.  In my example here, they will contain the R sign-on message, with words.1 consisting of

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

and words.2 containing.

R is a collaborative project with many contributors.
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.


Here is our code:


# give each node in the cluster cls an ID number
assignids <- function(cls) {
clusterApply(cls,1:length(cls),
function(i) myid <<- i)
}

# each node executes this function
getwords <- function(basename) {
fname <- paste(basename,".",myid,sep="")
words <- scan(fname,what="")
length(words)
}

# manager
wordcount <- function(cls,basename) {
assignids(cls)
clusterExport(cls,"getwords")
counts <- clusterCall(cls,getwords,basename)
sum(unlist(counts))
}

# call example:
> library(parallel)
> c2 <- makeCluster(2)
> wordcount(c2,"words")
[1] 83



This couldn’t be simpler.  Yet it does what we want:

• parallel computation on chunks of a distributed file, on independently-running nodes
• automated “caching” (use the R <<- operator with the output ofscan() above)
• no configuration or platform worries
• ordinary R programming, no “foreign” concepts

Indeed, it’s so simple that Snowdoop would hardly be worthy of being called a package.  It could include some routines for creating a chunked file, general file read/write routines, parallel load/save and so on, but it would still be a very small package in the end.

Granted, there is no data redundancy built in here, and we possibly lose pipelining effects, but otherwise, it seems fine.  What do you think?