(This article was first published on **Memo's Island**, and kindly contributed to R-bloggers)

In many data science applications and in academic research, techniques involving Bayesian inference are now commonly used. One of the basic operations in Bayesian inference techniques is drawing instances from a given statistical distribution. This is, of course, the well-known problem of pseudo-random number sampling. The most commonly used methods first generate uniform random numbers and then map them onto the distribution of interest, either by inverting its cumulative distribution function (CDF), known as inverse transform sampling, or by a dedicated transform such as the Box-Muller transform for Gaussian variates.
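To make the two mappings concrete, here is a minimal sketch in plain Python (standard library only, not Spark). The function names, the seed, the sample counts, and the exponential distribution used for the inverse-CDF case are illustrative assumptions, not part of the original post.

```python
import math
import random


def box_muller(rng):
    """Return one pair of independent standard Gaussian variates."""
    # 1 - random() lies in (0, 1], so the logarithm is always defined.
    u1 = 1.0 - rng.random()
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def exponential_by_inverse_cdf(rng, rate):
    """Draw from Exp(rate) by inverting its CDF, F(x) = 1 - exp(-rate * x)."""
    u = rng.random()  # u in [0, 1)
    return -math.log(1.0 - u) / rate


rng = random.Random(42)  # fixed seed, chosen only for reproducibility
gaussians = [z for _ in range(50_000) for z in box_muller(rng)]
exponentials = [exponential_by_inverse_cdf(rng, rate=2.0) for _ in range(100_000)]

# Sample moments should be close to the theoretical ones (0, 1 and 1/rate).
gauss_mean = sum(gaussians) / len(gaussians)
gauss_var = sum((z - gauss_mean) ** 2 for z in gaussians) / len(gaussians)
exp_mean = sum(exponentials) / len(exponentials)
```

The Gaussian CDF has no closed-form inverse, which is one reason a dedicated transform such as Box-Muller is convenient for Gaussian variates, while the exponential case can be inverted directly.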

Large-scale simulations are now possible thanks to highly stable computational frameworks that scale well. One such framework is Apache Spark, which stands out for its fault-tolerant distributed data structure, the Resilient Distributed Dataset (RDD). Here is a simple way to generate one million Gaussian random numbers and create an RDD from them:

// Generate 1 million Gaussian random numbers

One unrealistic part of the above code example is that you may want to generate a huge number of samples that won’t fit in the memory of a single machine, as the *ngauss* variable above has to. Luckily, MLlib provides a set of library functions for generating random data directly as an RDD; see RandomRDDs. But for the remainder of this post, we will use our home-made random RDD.
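The idea of never materialising the whole sample in one place can be sketched without Spark. The plain-Python generator below produces the numbers one partition-sized chunk at a time; `gaussian_partitions`, its parameters, and the seeding scheme are hypothetical names and choices for illustration, not the MLlib API.

```python
import random


def gaussian_partitions(total, num_partitions, seed=0):
    """Lazily yield (partition_index, samples) chunks of Gaussian variates."""
    base, extra = divmod(total, num_partitions)
    for index in range(num_partitions):
        size = base + (1 if index < extra else 0)
        rng = random.Random(seed + index)  # a distinct seed per partition
        yield index, [rng.gauss(0.0, 1.0) for _ in range(size)]


# Only one chunk is held in memory at a time while statistics accumulate.
count = 0
total_sum = 0.0
for index, chunk in gaussian_partitions(total=1_000_000, num_partitions=8, seed=7):
    count += len(chunk)
    total_sum += sum(chunk)
sample_mean = total_sum / count
```

In a distributed setting each chunk would be generated on the worker that owns the partition, so the driver never needs to hold all the samples.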

Figure: Scaling of execution time with increasing size, with or without re-partitioning.

**Concept of Partitions**

As RDDs are distributed data structures, the concept of partitions comes into play (link). You therefore need to be careful about the size of the partitions in your RDDs. Recently I posed a question about this on the Apache Spark mailing list (link)(gist). If you reduce the data size, take good care that your partitioning reflects this, so as to avoid a huge performance reduction. Unfortunately, Spark does not provide an automated, out-of-the-box solution for optimising partition size. The number of data items may shrink during your analysis pipeline, yet a reduced RDD inherits the number of partitions of its parent, so each partition may end up holding very little data, and this may be a limiting issue.
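A toy model in plain Python may help to picture the problem. Here a "partitioned dataset" is just a list of lists; the helper names `parallelize`, `filter_partitions` and `coalesce` are borrowed loosely from Spark vocabulary but are hypothetical stand-ins, not the Spark API, and the sizes are arbitrary.

```python
def parallelize(items, num_partitions):
    """Split a list into num_partitions contiguous chunks (a toy RDD)."""
    base, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for index in range(num_partitions):
        size = base + (1 if index < extra else 0)
        partitions.append(items[start:start + size])
        start += size
    return partitions


def filter_partitions(partitions, predicate):
    """Filter inside each partition; the number of partitions is unchanged."""
    return [[x for x in part if predicate(x)] for part in partitions]


def coalesce(partitions, num_partitions):
    """Merge neighbouring partitions down to num_partitions, keeping order."""
    merged = [[] for _ in range(num_partitions)]
    for index, part in enumerate(partitions):
        merged[index * num_partitions // len(partitions)].extend(part)
    return merged


parent = parallelize(list(range(1000)), num_partitions=100)
child = filter_partitions(parent, lambda x: x % 100 == 0)  # keeps 10 items
compact = coalesce(child, num_partitions=2)

# After the filter, most of the inherited partitions hold no data at all.
empty_in_child = sum(1 for part in child if not part)
```

In this toy run the filtered dataset still has 100 partitions, 90 of them empty; every one of them would still cost scheduling overhead in a real cluster, which is what shrinking the partition count after a reduction is meant to avoid.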

As you might have already guessed, RDDs are a great tool for large-scale analysis, but they won’t give you a free lunch. Let’s do a small experiment.

**Hands on Experiment**

Going back to our original problem of using Spark in Bayesian inference algorithms, it is common to operate on samples via a certain procedure. Such a procedure, let’s call it an operator, will very likely reduce the number of elements in the sample. One example is applying a cut-off, or a constraint on the CDF: keeping only the samples with $x > x_{0}$, whose fraction estimates the tail probability $P(x > x_{0}) = 1 - F(x_{0})$, where $F$ is the CDF. As seen in the Figure, we have generated random RDDs of up to 10 million numbers and measured the wall-clock time of the cut-off operation, which is simply a *filter* operation. See the code in the Appendix. As a result, in the Figure we have identified three different regions, depending on data size:

- Small Data: Re-partitioning does not play a role in performance.
- Medium Size: Re-partitioning gives up to an order of magnitude better performance.
- Big Data: Re-partitioning gives a constant performance improvement, up to 3 times better, and the improvement is drifting, meaning it becomes more significant the larger the data size.
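The cut-off operation itself is easy to try locally. The following plain-Python sketch (not the Spark benchmark) filters Gaussian samples at a threshold and compares the surviving fraction with the exact tail probability; the threshold `x0 = 1.0`, the seed, and the sample size are assumptions made only for this illustration.

```python
import math
import random


def gaussian_tail_probability(x0):
    """Exact P(X > x0) for a standard Gaussian, via the complementary error function."""
    return 0.5 * math.erfc(x0 / math.sqrt(2.0))


rng = random.Random(2015)  # arbitrary fixed seed for reproducibility
samples = [rng.gauss(0.0, 1.0) for _ in range(200_000)]

x0 = 1.0  # hypothetical cut-off, not taken from the original benchmark
survivors = [x for x in samples if x > x0]  # the cut-off is a plain filter

estimated = len(survivors) / len(samples)  # Monte Carlo estimate of P(X > x0)
exact = gaussian_tail_probability(x0)
```

With this threshold only a minority of the samples survives, which is exactly the situation in which the filtered data set is much smaller than the partitioning it inherited.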

**Conclusion**

Spark provides a superb API for developing high-quality data science solutions. However, programming with Spark and designing algorithms for it requires optimising different aspects of the RDD workflow. Here, we have demonstrated only the dramatic effect that re-partitioning after a simple operation has on the wall-clock time. Hence, it is advisable to run a benchmark that identifies under which circumstances your data pipeline produces different wall-clock behaviour before going into production.
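A benchmark of this kind does not need to be elaborate. As a rough sketch in plain Python, the helper below times a job several times and reports the median wall-clock time at a few data sizes; `wall_clock` and `cutoff_job` are hypothetical names, and the job is only a stand-in for a real pipeline step.

```python
import statistics
import time


def wall_clock(func, repeats=5):
    """Return the median wall-clock time in seconds of calling func()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def cutoff_job(size):
    """Hypothetical stand-in for a pipeline step: filter a range of numbers."""
    return [x for x in range(size) if x % 10 == 0]


# Measure the same job at several data sizes to expose scaling regimes.
results = {size: wall_clock(lambda: cutoff_job(size)) for size in (10**3, 10**4, 10**5)}
```

Taking the median over a few repeats damps one-off delays; running the same measurement with and without re-partitioning, over a range of sizes, is what separates the regions described above.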

**Appendix**

The entire code base can be cloned from GitHub (here).

Spark Benchmark


Plotting code

