Teaser: Running R as a map/reduce job from Riak

August 17, 2011

(This article was first published on Cartesian Faith » R, and kindly contributed to R-bloggers)

Alliterations aside, here is a preview of something I’ve been tinkering with. My goal is to run R code as a phase within a Riak map/reduce job. In a multi-cultural world filled with distinct languages, it should be obvious that one size does not fit all. Statistics is not Erlang’s strong suit: writing a sparse matrix class is bad enough, but imagine implementing regression or random matrix theory. For its part, and despite many honorable attempts, R isn’t great at distributed processing. So, waving the banner of bringing the processing to the data, why not use R to process portions of a map/reduce job?

This actually isn’t as hard as it sounds. Below are a few snippets of running R code via an Erlang RPC. This means that R is available and running as an Erlang node!

First, we call the R function ‘mean’ to calculate the arithmetic mean of a list of numbers:

```
([email protected])57> rpc:call('[email protected]', rchimedes, eval, {mean, [[10,12,13,25,20]]}).
{ok,{16.0}}
```

Next we’ll get samples from a random normal distribution. To me, calling rnorm is analogous to Hello, World for R.

```
([email protected])58> rpc:call('[email protected]', rchimedes, eval, {rnorm, [10]}).
{ok,{-1.3440940467953522,1.0346333094171907,
-2.7704297093573698,0.32721935800723084,1.6406162089066918,
-0.480623709693892,-1.4687159958435285,-0.4415948361775166,
-1.2729869815762578,0.8369905573667532}}
```
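Both replies so far come back as `{ok, Tuple}`, with the R vector packed into an Erlang tuple. If a list is more convenient on the Erlang side, a minimal sketch of unpacking the reply might look like the following. The module name `r_reply` and the `{error, Other}` wrapper are assumptions made for illustration; they are not part of rchimedes.

```erlang
-module(r_reply).
-export([to_list/1]).

%% Convert a reply shaped like {ok, {V1, V2, ...}} into {ok, [V1, V2, ...]}.
%% Any other term (for example the {badrpc, Reason} that rpc:call returns
%% on failure) is wrapped as {error, Other}. This wrapper is an assumption
%% of the sketch, not something rchimedes defines.
to_list({ok, Values}) when is_tuple(Values) ->
    {ok, tuple_to_list(Values)};
to_list(Other) ->
    {error, Other}.
```

Applied to the first reply, `r_reply:to_list({ok,{16.0}})` gives `{ok,[16.0]}`.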

Currently the syntax is structured to use atoms as function references (i.e. the function must already exist in R space) and binary strings as function definitions. Notice that the arguments passed to the function are sent in a list. This is standard Erlang practice and allows additional arguments to be passed to the remote function. For example, let’s say we want to draw from a normal distribution with mean 5:

```
([email protected])60> rpc:call('[email protected]', rchimedes, eval, {rnorm, [10,5]}).
{ok,{4.939374253203547,5.2481766179207545,6.413720221228998,
5.679098487985773,6.371656468561924,5.572533109697437,
4.196247547549403,5.36443397342678,3.7423040151803044,
6.979719956460093}}

```
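To make the argument convention concrete, here is a small sketch that renders a `{Function, Args}` request as the R expression it appears to correspond to, judging from the examples above. It assumes that each element of the argument list is one R argument and that a nested Erlang list stands for an R vector. The module `r_request` is hypothetical, and it handles only the atom form, not binary function definitions.

```erlang
-module(r_request).
-export([to_r/1]).

%% Render a {Fun, Args} request as the R call it appears to stand for.
%% Assumptions of this sketch: each element of Args is one R argument,
%% and a nested Erlang list maps to an R vector built with c().
to_r({Fun, Args}) when is_atom(Fun), is_list(Args) ->
    lists:flatten([atom_to_list(Fun), "(", join([arg(A) || A <- Args]), ")"]).

%% Render a single argument: lists become c(...), numbers become literals.
arg(L) when is_list(L) -> ["c(", join([arg(X) || X <- L]), ")"];
arg(N) when is_integer(N) -> integer_to_list(N);
arg(F) when is_float(F) -> float_to_list(F, [{decimals, 10}, compact]).

%% Put a comma between the rendered parts.
join(Parts) -> lists:join(",", Parts).
```

With this, `{mean, [[10,12,13,25,20]]}` renders as `mean(c(10,12,13,25,20))`, a single vector argument, while `{rnorm, [10,5]}` renders as `rnorm(10,5)`, two separate arguments.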

The above examples will hopefully whet your appetite for what is possible here. The next step in the exercise is to invoke this from Riak and pull it all together in a complete map/reduce job. Any ideas on case studies are welcome. Otherwise, brace yourself for something finance related.
