"The R-Files" is an occasional series from Revolution Analytics, where we profile prominent members of the R Community.
Name: Saptarshi Guha
Background: Ph.D. in Statistics, Purdue University
Years Using R: 6
Known for: Developing RHIPE package for R + Hadoop integration
At just 31 years old, Saptarshi Guha has emerged as a cutting-edge contributor to the R community. Saptarshi earned a Ph.D. in Statistics from Purdue University in June 2010 (his advisor was William S. Cleveland, one of the pioneers of modern data visualization) and his current research focus is the analysis of large data sets. He is working to develop innovative approaches to the visualization and computing of statistical analyses. He has also worked on modeling network traffic for security and developing algorithms to detect human presence in SSH connections.
Saptarshi is best known for developing the popular RHIPE package that integrates the R statistical environment with the Hadoop framework. RHIPE allows R users to compute on terabyte-sized data sets across a cluster using the MapReduce framework, thus offering the best of both worlds to users seeking to leverage the strengths of R and Hadoop. People with very large data sets stored in the Hadoop Distributed File System can now easily process the data on hundreds or even thousands of nodes in parallel, using only the R language (no need to learn Java). They can even apply the statistical algorithms in R to boot.
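For readers new to MapReduce, the idea can be sketched in a few lines. The example below is a minimal single-machine illustration in plain Python, not RHIPE code and not its API: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample records are hypothetical, chosen only to show the map, group-by-key, and reduce steps that a framework like Hadoop distributes across many nodes.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Run the mapper over every record; each call yields (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical input: (call_id, packet_delay_ms) records.
records = [("call-a", 2.0), ("call-b", -1.0), ("call-a", 4.0), ("call-b", 3.0)]


def mapper(record):
    # Emit the call id as the key so packets from one call end up together.
    call_id, delay = record
    yield call_id, delay


def reducer(key, values):
    # Summarize each call by its mean delay.
    return sum(values) / len(values)


result = reduce_phase(shuffle(map_phase(records, mapper)), reducer)
print(result)
```

On a real cluster the map and reduce steps run in parallel on different machines, and the grouping step moves data between them; the sketch keeps everything in one process to make the data flow visible.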
While Saptarshi has studied the intersection of computer science and statistics for well over a decade — in the U.S. as well as his native India — it was not until he began his doctoral research at Purdue that he learned R. Given his statistical background, he quickly took to the language. "R is one of the most expressive languages I’ve ever encountered, and it’s perfectly geared towards comprehensive data analysis," he says.
In addition to RHIPE, Saptarshi has worked on packages that serialize R objects for interoperability with other languages, including Python, Java and C. He also wrote a package that saves R objects in a flexible data format so that individual objects can be lazy-loaded. He has used R to perform statistical analysis on a wide array of topics, from network security modeling to generating reports and graphs for monthly expenditures and weight loss programs.
“The real beauty with R is that it’s constantly evolving,” Saptarshi says. “Is it perfect? No. But it’s being constantly refined by some of the most brilliant statistical minds today.” To that end, Saptarshi is working on several packages that will bring features of distributed computing to users working within the R environment. He’s also working on a package that integrates R with HBase, giving users a distributed data store that supports fast queries and can run MapReduce computations across the data using RHIPE.
On October 12, Saptarshi will be delivering a talk at Hadoop World, where he’ll demonstrate the use of RHIPE to analyze 190 Gb of VoIP network data. (This project was joint work with Jin Xia and William S. Cleveland.) When you make a call over Skype, Google Voice, or even a landline routed over the internet, the call quality is of primary importance. One of the major factors influencing call quality is timing: when you speak, your voice is sampled every 20 milliseconds, but on the receiver’s end it might not arrive quite so regularly – perhaps 5 milliseconds too early or too late.
This “jitter” in packet arrival times degrades the audio quality. To investigate it, Saptarshi used R code to identify which packets corresponded to a single call, then used R’s robust regression algorithm to remove the effect of the gateway. In this way, he was able to process the data, stored across a cluster of eight computers, in a matter of minutes and assess the overall call quality metrics of the system.
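To make the notion of jitter concrete, here is a small sketch in Python. It is not the analysis from the talk: it assumes one simple definition of jitter (how far each gap between consecutive packets strays from the nominal 20-millisecond interval), and the arrival times and the `jitter_ms` helper are made up for illustration.

```python
from statistics import mean

# Nominal spacing between voice packets, per the 20 ms sampling described above.
SAMPLE_INTERVAL_MS = 20.0


def jitter_ms(arrival_times_ms):
    """Return how much each inter-arrival gap deviates from the nominal interval.

    Positive values mean the packet arrived late relative to the previous one,
    negative values mean it arrived early.
    """
    gaps = [later - earlier
            for earlier, later in zip(arrival_times_ms, arrival_times_ms[1:])]
    return [gap - SAMPLE_INTERVAL_MS for gap in gaps]


# Hypothetical arrival times (ms) for the packets of a single call.
arrivals = [0.0, 20.0, 45.0, 60.0, 80.0]

deviations = jitter_ms(arrivals)
mean_abs_jitter = mean(abs(d) for d in deviations)

print(deviations)
print(mean_abs_jitter)
```

A per-call summary such as the mean absolute deviation is only one possible quality metric; the project described above additionally used robust regression to remove the gateway's effect before assessing call quality.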