Parallel Computing for Data Science

[This article was first published on Econometrics Beat: Dave Giles' Blog, and kindly contributed to R-bloggers.]

Hot off the press, Norman Matloff’s book, Parallel Computing for Data Science: With Examples in R, C++ and CUDA (Chapman and Hall/CRC Press, 2015), should appeal to a lot of the readers of this blog.

The book’s coverage is clear from the following chapter titles:

1. Introduction to Parallel Processing in R
2. Performance Issues: General
3. Principles of Parallel Loop Scheduling
4. The Message Passing Paradigm
5. The Shared Memory Paradigm
6. Parallelism through Accelerator Chips
7. An Inherently Statistical Approach to Parallelization: Subset Methods
8. Distributed Computation
9. Parallel Sorting, Filtering and Prefix Scan
10. Parallel Linear Algebra
Appendix – Review of Matrix Algebra 

The Preface makes it perfectly clear what this book is intended to be, and what it is not intended to be. Consider these passages:

“Unlike almost every other book I’m aware of on parallel computing, you will not find a single example here dealing with solving partial differential equations and other applications of physics...”
That pretty much says it all!

“While the book is chock full of examples, it aims to emphasize general principles. Accordingly, after presenting an introductory code example in Chapter 1 (general principles are meaningless without real examples to tie them to), I devote Chapter 2 not so much to how to write parallel code as to explaining what the general factors are that can rob a parallel program of speed. Indeed, one can regard the entire book as addressing the plight of the poor guy described at the beginning of Chapter 2:
Here is an all-too-common scenario:
An analyst acquires a brand new multicore machine, capable of wondrous things. With great excitement, he codes up his favorite large problem on the new machine—only to find that the parallel version runs more slowly than the serial one. What a disappointment! Let’s see what factors can lead to such a situation…
One thing this book is not, is a user manual. Though it uses specific tools throughout, such as R’s parallel and Rmpi libraries, OpenMP, CUDA and so on, this is for the sake of concreteness. The book will give the reader a solid introduction to these tools, but is not a compendium of all the different function arguments, environment options and so on. The intent is that the reader, upon completing this book, will be well-poised to learn more about these tools, and most importantly, to write effective parallel code in various other languages, be it Python, Julia or whatever.”
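The “parallel slower than serial” scenario quoted above is easy to reproduce for yourself. As a minimal sketch (written in Python rather than the book’s R, and not taken from the book), the following times a serial loop against a thread pool on a task far too small to amortize the per-task scheduling overhead; on typical machines the “parallel” run is no faster, and often slower:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def tiny_task(x):
    # A task too small for parallel speedup: the scheduling and
    # communication overhead dwarfs the actual computation.
    return x * x

data = list(range(50000))

t0 = time.perf_counter()
serial = [tiny_task(x) for x in data]
t_serial = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as ex:
    parallel = list(ex.map(tiny_task, data))
t_parallel = time.perf_counter() - t0

# Both versions compute identical results; only the timings differ.
print(f"serial: {t_serial:.4f}s  parallel: {t_parallel:.4f}s")
```

The same experiment in R (e.g., `lapply` versus `parallel::parLapply` on a trivial function) shows the same effect, for the reasons Matloff catalogues in Chapter 2: task granularity, communication cost, and contention can all erase the gains from extra cores.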
From my reading of the book, Matloff achieves his goals, and in doing so he has provided a volume that will be immensely useful to a very wide audience. I can see it being used as a reference by data analysts, statisticians, engineers, econometricians, biometricians, etc. This applies to both established researchers and graduate students. The book provides exactly the sort of information that this audience is looking for, and it is presented in a very accessible and friendly manner.

© 2015, David E. Giles
