by Thomas Dinsmore
On April 26, SAS published on its website an undated Technical Paper entitled Big Data Analytics: Benchmarking SAS, R and Mahout. In the paper, the authors (Allison J. Ames, Ralph Abbey and Wayne Thompson) describe a recent project to compare model quality, product completeness and ease of use for two SAS products together with open source R and Apache Mahout.
This is the second post in a two-part series. Last week, I covered simple mistakes and errors. This week's post will cover the authors' methodology and findings.
I have contacted the authors and asked if they plan to release their data to the community. To date, I have received no response.
The methodological problems in the SAS benchmark paper are too numerous to cite comprehensively, but I'll review the most glaring issues.
Choice of Products to Test: Although the authors compiled the list of algorithms to be tested through a survey of SAS users, no single SAS product supports all three algorithms. R supports all three, as does Mahout (with a workaround).
The choice of products for comparison is strangely "apples and oranges". Mahout is an early-stage analytics library; R is a mature analytic programming environment; SAS Rapid Predictive Modeler is a semi-automated ensemble modeling framework, and SAS High Performance Analytics Server is a high-end analytics platform designed for use in an MPP computing environment. Comparing the two open source products with two high-end SAS applications on “ease of use” is a straw man comparison. One would expect the two high-end SAS applications to be easier to use; that’s why SAS charges a premium for them.
Oddly enough, SAS chose to omit its own product SAS/STAT from the benchmark. As an analytic programming language, SAS/STAT offers a more apt comparison to R, and it is far more widely used than either of the two high-end SAS products.
Analytic Workflow: In the real world, data resides in external sources, and must be loaded into each of the respective platforms; analysts have choices when marshalling and loading data, and can do so in a way that best suits the application. The authors used SAS to prepare the data for use in all four products, starting with structured data sets loaded into Hadoop. This biases the "ease of use" assessment toward SAS.
Structured data stored in Hadoop is not "native" to Hadoop but structured externally and loaded into the Hadoop file system; the authors did not assess the effort needed to do this, which biases the "ease-of-use" assessment towards SAS High Performance Analytics Server, which cannot operate directly on raw data. It also biases the assessment against Mahout, since the principal advantage of Mahout is its ability to work with data stored natively in Hadoop.
The authors' decision to split the data into training and validation sets in SAS rather than using native splitting for each tool is puzzling. The stated reason for using this approach — that random seeds may differ among the tools — is fallacious; random samples of sufficient size are generalizable to one another even when produced with different random seeds.
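The point about random seeds can be checked with a small simulation. The following is a minimal sketch on synthetic data, not the benchmark data (which has not been released); the 10% event rate, the 70/30 split and the function name `split_event_rate` are all assumptions made for illustration. It splits the same population twice with different seeds and compares the event rate in each partition.

```python
import random


def split_event_rate(labels, train_fraction, seed):
    """Shuffle with the given seed, split, and return (train, validation) event rates."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    cut = int(len(indices) * train_fraction)
    train = [labels[i] for i in indices[:cut]]
    valid = [labels[i] for i in indices[cut:]]
    return sum(train) / len(train), sum(valid) / len(valid)


# Synthetic population: 100,000 records with roughly a 10% event rate (assumed).
population_rng = random.Random(0)
labels = [1 if population_rng.random() < 0.10 else 0 for _ in range(100_000)]

# Two 70/30 splits of the same data, as two tools with different seeds would produce.
train_a, valid_a = split_event_rate(labels, 0.7, seed=1)
train_b, valid_b = split_event_rate(labels, 0.7, seed=2)

print(f"seed 1: train={train_a:.4f} validation={valid_a:.4f}")
print(f"seed 2: train={train_b:.4f} validation={valid_b:.4f}")
```

With samples of this size the two splits differ only in the third or fourth decimal place of the event rate, which is the sense in which samples drawn with different seeds are interchangeable.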
Computing Environments: The computing environments used in this benchmark are vastly different. This is attributable in part to the widely different requirements of the products compared, but nevertheless there are some unexplained anomalies. There is no justification for using machines with very different memory profiles to host open source R and SAS Enterprise Miner; the two machines can and should be identical. By the same token, a five-node Hadoop cluster is not equivalent to a sixty-node MPP Greenplum appliance.
Assumptions About User Skill: The authors tested the open source tools and the SAS high-end applications using a naïve "hands-off" approach that depends on "out of the box" functionality. This is a trivial comparison that simply proves the obvious. When Deep Blue beats Kasparov at chess, it's news; when Deep Blue beats a three-year-old at chess, it's not news.
The authors propose five main findings:
(1) The types and depth of classification models that could be run in SAS products and R outnumbered the options for Mahout.
(2) The effort required by individual modelers to prepare a representative table of results was much greater in both R and Mahout than it was in SAS High-Performance Analytics Server and SAS Rapid Predictive Modeler for SAS Enterprise Miner.
(3) The object-oriented programming in R led to memory-management problems that did not occur in SAS products and Mahout; thus, the size of customer data that could be analyzed in R was considerably limited.
(4) Overall accuracy was "comparable" across software in "most instances".
(5) When the customer modeling scenarios were evaluated based on event precision and rank ordering, SAS High-Performance Analytics Server models were more accurate.
The first two points are true but trivial. Anyone with a smattering of knowledge understands that Mahout is an early-stage project with limited functionality. The two SAS tools are high-end applications with a number of automated features. There's nothing wrong with that for users willing to pay the license fees. A reasonably competent R user can produce a comparable table of results with little effort; moreover, the table can be easily customized, which is not so easy to do with "out-of-the-box" tools.
The third point is silly. The authors attribute memory management issues encountered with R to its object-oriented nature. A more obvious explanation is that the authors deployed R on a machine with much less memory than the machine they used for SAS.
Points four and five relate to model quality, and deserve some analysis.
According to the authors, the overall misclassification rate (the complement of overall accuracy) achieved by the four products is "comparable in most instances." They do not disclose the overall accuracy statistics, which leaves the reader in the dark about what they mean by "comparable" and in which instances the overall accuracy was not comparable.
However, the SAS products produced models with higher precision and lift in the first decile. The authors choose to highlight this finding, reporting only the precision and lift statistics in the summary tables, while failing to report overall accuracy statistics.
If the models produced by SAS are more precise but not more accurate, it means that they were produced using techniques that minimize false positives at the expense of false negatives (since the overall error rate is the same). There are situations where this is a good thing — for example, in marketing campaigns where the goal is to optimize campaign ROI, the analytic goal is to target a list as precisely as possible. On the other hand, there are circumstances where the analyst must also be concerned about false negatives — for example, in a medical diagnostic situation where the cost of a false positive is "patient receives unnecessary dose" and the cost of a false negative is "patient dies."
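A small worked example makes the trade-off concrete. The confusion-matrix counts below are hypothetical and chosen only for illustration; they do not come from the SAS paper, which does not report them. Both models make the same number of errors on 1,000 cases with 100 actual events, but distribute those errors differently.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Hypothetical model A: 80 errors, mostly false positives.
model_a = metrics(tp=70, fp=50, fn=30, tn=850)

# Hypothetical model B: also 80 errors, but mostly false negatives.
model_b = metrics(tp=30, fp=10, fn=70, tn=890)

for name, (acc, prec, rec) in (("A", model_a), ("B", model_b)):
    print(f"model {name}: accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

Model B is the more precise of the two (0.75 versus roughly 0.58) at identical accuracy (0.92), but it finds only 30 of the 100 actual events where model A finds 70.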
While it is generally best to evaluate models using a range of measures and over the entire range of possible predictions (through a gains chart or ROC analysis), in the absence of information about the cost of errors, accuracy is the best single measure. And, as the authors admit, all of the products produced models of comparable accuracy.
But let's stipulate that increased precision is a good thing where models are equally accurate. How did the SAS tools produce the improved precision? As noted in the Discussion section on page 11, the SAS products accomplished this through use of well-known techniques:
- Oversampling the target event;
- Modification of model priors;
- Missing value imputation;
- Variable selection.
All of these methods are available in R; the authors simply chose not to use them. Which brings us back to a basic flaw in the authors' approach: it's news if SAS' automated tooling can beat a trained and reasonably competent analyst using R; it's not news if SAS' automated tooling can beat a passive and untrained analyst.
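None of these techniques is exotic. As a rough illustration, the first two items on the list can be sketched in a few lines of plain Python (used here only for brevity; this is not the authors' procedure, nor the SAS or R implementation). The function names, the 5% event rate and the 50% target rate are assumptions. The prior correction shown is the standard rescaling of a probability estimated on an oversampled sample back to the original event rate.

```python
import random


def oversample_events(rows, target_rate, seed=0):
    """Resample event rows (label 1) with replacement until they make up
    roughly target_rate of the returned sample. Each row is (id, label)."""
    rng = random.Random(seed)
    events = [r for r in rows if r[1] == 1]
    non_events = [r for r in rows if r[1] == 0]
    # Events needed so that events / (events + non_events) == target_rate.
    needed = round(target_rate * len(non_events) / (1 - target_rate))
    extra = [rng.choice(events) for _ in range(max(0, needed - len(events)))]
    return events + extra + non_events


def correct_for_priors(p_sample, sample_rate, true_rate):
    """Rescale a probability estimated on oversampled data to the true event rate."""
    event = p_sample * true_rate / sample_rate
    non_event = (1 - p_sample) * (1 - true_rate) / (1 - sample_rate)
    return event / (event + non_event)


# Hypothetical data: 1,000 rows with a 5% event rate.
rows = [(i, 1 if i % 20 == 0 else 0) for i in range(1000)]
balanced = oversample_events(rows, target_rate=0.5)
balanced_rate = sum(label for _, label in balanced) / len(balanced)

# A score of 0.5 on the balanced sample corresponds to the base rate on the original data.
corrected = correct_for_priors(0.5, sample_rate=balanced_rate, true_rate=0.05)
print(f"balanced event rate={balanced_rate:.2f} corrected score={corrected:.3f}")
```

An analyst who applies steps like these by hand is doing what the automated tools do by default, which is why a comparison against untuned models says little about the underlying software.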
The reader should take note that SAS' automated tooling would also outperform a passive and untrained analyst using SAS/STAT. While all of the techniques detailed above can be implemented in SAS/STAT through the SAS Programming Language — a point recognized by the authors — they are not "automatic."
While there is a role for semi-automated model building tools in analytics, evidence from the marketplace suggests that working analysts in the real world prefer the flexibility and control offered by an analytic programming environment over "black-boxy" applications. This is true in the SAS user community as well as the broader community of analysts; users of SAS/STAT and the SAS Programming Language far outnumber users of the high-end SAS applications included in this "benchmark."
Derek Norton, Andrie de Vries, Joseph Rickert, Bill Jacobs, Mario Inchiosa, Lee Edlefsen and David Smith all contributed to this post.