As mentioned in the previous article, a possibility for dealing with some Big Data problems is to integrate R within the Hadoop ecosystem. Therefore, it's necessary to have a bridge between the two environments. It means that R should be capable of handling data the are stored through the Hadoop Distributed File System (HDFS). In order to process the distributed data, all the algorithms must follow the MapReduce model. This allows to handle the data and to parallelize the jobs. Another requirement is to have an unique analysis procedure, so there must be a connection between in-memory and HDFS places.
A possibility for dealing with those problems is to use “rmr2” package provided by Revolution Analytics. It allows to write MapReduce jobs within the R environment and to copy data to and from Hadoop. The main function is “mapreduce” and these are its main arguments.
mapreduce(input, output, input.format, output.format, map, reduce)
The input and output arguments are the paths of input and output HDFS data. They are strings such as '/data/test/input.csv' and '/data/test/output.csv'. The input.format and output.format arguments are the extension of the files, for instance csv. I will talk about them later in this article.
The Map and Reduce arguments are the two steps of the MapReduce algorithms written through R code. As mentioned in the previous article, HDFS and MapReduce follow the key/value pairs logic. For this reason, both the input and the output of any MapReduce step must have that structure. Rmr2 provides “keyval” function that generates a key/value pair list starting from a matrix and an array. This function must be used for the output.
Regarding the HDFS files, rmr2 supports different formats such as “csv” and “json”. However, in order to avoid any drawback, the best choice is to convert all the files into the “native” format. For instance, the csv format does not have any key, so it uses the first column of the table as the key and it's necessary to be careful in writing the algorithms.
Regarding in-memory data, rmr2 automatically sends all of them along with the job instructions, so it's possible to use them during the Map or Reduce tasks if it's necessary.
The connection between in-memory and HDFS data is possible thanks to two other functions that are from.dfs and to.dfs. However, they're are useful only when the file is small, since it has to occupy the RAM. For example, the final output of a model may be a small set of data and it may be loaded.
Rmr2 fits very well some Big Data problems since it makes possible to develop some custom algorithms. In this way, it's possible to develop some solutions that fit the problem as much as possible. The disadvantage is the cost of writing new algorithms, since there isn't any in the MapReduce form. Anyway, it's possible to integrate other tools too and to use in-memory R algorithms.
In the next article I will present some examples in order to make everything more concrete. Stay tuned.