R Tackles Big Garbage

April 1, 2013
By

(This article was first published on r4stats.com » R, and kindly contributed to R-bloggers)

April 1, 2013 – Although the capabilities of the R system for data analytics have been expanding with impressive speed, it has heretofore been missing important fundamental methods. A new function works with the popular plyr package to provide these missing algorithms. Function names in plyr begin with two letters which indicate their input and output. For example, with the ddply function, the first “d” in its name indicates that a data frame will be read in, and the second “d” indicates that a data frame of results will be written out. Those two letters could also be “a” for array and “l” for list, in any combination.

While the vast array of functions in R cover most data analysis situations, they have been completely unable to handle data that bears no actual relationship to the research questions at hand. Robert A. Muenchen, author of R for SAS and SPSS Users, has written a new ggply function, which can adroitly handle the all too popular “garbage in, garbage out” research situation. The function has only one argument, the garbage to analyze. It automatically performs the analysis strongly preferred by “gg” researchers by splitting numeric variables at the median and performing all possible cross tabulations and chi-square tests, repeated for the levels of all factors. The integration of functions from the new pbdR package allows ggply to handle even Big Garbage using 12,000 cores.

While the median split approach offers the benefit of decreasing power by 33%, further precautions are taken by applying Muenchen’s new Triple Bonferroni with Backpropagation correction. This algorithm controls the garbage-wise error rate by multiplying the p-values by 3k, where k is the number of tests performed. While most experiment-wise adjustment calculations set the worst case p-value to the theoretical upper limit of 1.0, simulations run by Muenchen indicate that this is far too liberal for this type analysis. “By removing this artificial constraint, I have already found cases where the final p-value was as high as 3,287 indicating a very, very, very non-significant result” reported Muenchen. The “backpropogation” part of the method re-scales any p-values that might have survived the initial correction by setting them automatically to 0.06. As Muenchen states, “this level was chosen to protect the researcher from believing an actual useful result was found, while offering hope that achieving tenure might still be possible.”

Reaction from the R community was swift and enthusiastic. Bill Venables, co-author the popular book Modern Applied Statistics in S said, “Muenchen’s new approach for calculating Type III Sums of Squares from chi-squared tests finally puts my mind at ease about using R for statistical analysis.” R programmer extraordinaire Patrick Burns said, “The ggply function is good, but what really excites me is the VBA plugin Bob wrote for Excel. Now I can fully integrate ggply into my workflow.” Graphics guru Hadley Wickham, author of ggplot2: Elegant Graphics for Data Analysis grumbled, “After writing ggplot and ddply, I’m stunned that I didn’t think of ggply myself. That Muenchen fellow is constantly bugging me to add irritating new features to my packages. I have to admit though that this is breakthrough of epic proportions. As they say in Muenchen’s neck of the woods, even a blind squirrel finds a nut now and then.”

The SAS Institute, already concerned with competition from R, reacted swiftly. SAS CEO Jim Goodnight said, “SAS is the leader in Big Data, and we’ll soon catch up to R and become the leader in Big Garbage as well. PROC GGPLY, is already in development. It will be included in SAS/GG, which is, of course, an additional cost product.”


To leave a comment for the author, please follow the link and comment on his blog: r4stats.com » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.