by Joseph Rickert
One great beauty of the R ecosystem, and perhaps the primary reason for R’s phenomenal growth, is the system for contributing new packages. This, coupled with the rock-solid stability of CRAN, R’s primary package repository, gives R a great advantage. However, anyone with enough technical know-how to formulate a proper submission can contribute a package to CRAN. Just being on CRAN is no great indicator of merit, a fact that newcomers to R, and to open source, often find troubling. It takes some time and effort working with R in a disciplined way to appreciate how the organic meritocracy of the package system leads to high-quality, integrated software. Nevertheless, even for relative newcomers it is not difficult to discover the bedrock packages that support the growth of the R language. These packages reliably add value to the R language, and they are readily apparent in plots of CRAN’s package dependency network.
Finding new packages that may ultimately prove to be useful is another matter. In the spirit of discovery, here are five relatively new packages that I think may ultimately prove to be interesting to data scientists. None of these has been on CRAN long enough to be battle-tested, so please explore them with cooperation in mind.
Cloud computing is, or will be, important to every practicing data scientist. Microsoft’s Azure ML is a particularly rich machine learning environment for R (and Python) programmers. If you are not yet an Azure user, this new package goes a long way toward overcoming the inertia involved in getting started. It provides functions to push R code from your local environment up to the Azure cloud and to publish functions and models as web services. The vignette walks you step by step from getting a trial account and the necessary credentials to publishing your first simple examples.
Distributed computing with large data sets is always tricky, especially in environments where it is difficult or impossible to share data among collaborators. A clever partial likelihood algorithm implemented in the distcomp package (see the paper by Narasimhan et al.) makes it possible to build sophisticated statistical models on unaggregated data sets. Have a look at this previous blog post for more detail.
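To make the idea concrete, here is a minimal Python sketch of how a model can be fit without pooling rows. It is not distcomp's API. It assumes a Cox model with a single covariate, stratified by site, so that the partial log-likelihood is a sum of per-site terms; each site returns only the gradient and curvature of its own term, and a coordinator runs Newton's method on the totals. The function names and the toy data are hypothetical.

```python
import math

def site_summary(beta, times, events, x):
    """Runs locally at one site. Returns the gradient and curvature of that
    site's Cox partial log-likelihood at beta; only these two numbers are shared."""
    grad, info = 0.0, 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects contribute only through risk sets
        # Risk set: subjects at this site still under observation at time t_i.
        risk = [x[j] for j, t_j in enumerate(times) if t_j >= t_i]
        w = [math.exp(beta * xj) for xj in risk]
        s0 = sum(w)
        mean = sum(wj * xj for wj, xj in zip(w, risk)) / s0
        second = sum(wj * xj * xj for wj, xj in zip(w, risk)) / s0
        grad += x[i] - mean
        info += second - mean * mean
    return grad, info

def fit_distributed(sites, n_iter=25, tol=1e-10):
    """Coordinator: Newton's method using only the summed site summaries."""
    beta = 0.0
    for _ in range(n_iter):
        summaries = [site_summary(beta, *site) for site in sites]
        grad = sum(g for g, _ in summaries)
        info = sum(h for _, h in summaries)
        step = grad / info
        beta += step
        if abs(step) < tol:
            break
    return beta

# Hypothetical toy data per site: (event times, event indicators, covariate).
site_a = ([1, 2, 3, 4, 5], [1, 1, 1, 1, 0], [1, 0, 1, 0, 0])
site_b = ([2, 3, 5, 7, 8, 9], [1, 0, 1, 1, 1, 1], [0, 1, 1, 0, 1, 0])
beta_hat = fit_distributed([site_a, site_b])
print(beta_hat)
```

The coordinator never sees an individual record, only two numbers per site per iteration. The sketch ignores tied event times and any security layer a real deployment would need.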
The random forests algorithm is the “go to” ensemble method for many data scientists, as it consistently performs well on diverse data sets. This new variation, based on performing Principal Component Analysis on random subsets of the feature space, shows great promise. See the paper by Rodriguez et al. for an explanation of how the PCA amounts to rotating the feature space and for a comparison of the rotation forest algorithm with standard random forests and the AdaBoost algorithm.
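The rotation step can be sketched in a few lines of Python with NumPy. This is an illustration of the idea, not the package's implementation: it splits the features into random disjoint subsets, runs PCA on a bootstrap sample for each subset, and places the loadings into one block-structured rotation matrix. The function name, the 75% sample fraction, and the omission of class subsampling are assumptions of this sketch.

```python
import numpy as np

def rotation_matrix(X, n_subsets, sample_frac=0.75, rng=None):
    """Build one rotation matrix from PCA on random disjoint feature subsets."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    perm = rng.permutation(d)               # random assignment of features to subsets
    R = np.zeros((d, d))
    for cols in np.array_split(perm, n_subsets):
        # Bootstrap sample of rows so each subset sees slightly different data.
        rows = rng.choice(n, size=max(2, int(sample_frac * n)), replace=True)
        block = X[np.ix_(rows, cols)]
        cov = np.atleast_2d(np.cov(block, rowvar=False))
        _, vecs = np.linalg.eigh(cov)       # columns are principal axes
        R[np.ix_(cols, cols)] = vecs        # keep all components: a pure rotation
    return R

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
R = rotation_matrix(X, n_subsets=3, rng=1)
X_rot = X @ R                               # a base tree would be trained on X_rot
print(np.allclose(R.T @ R, np.eye(6)))
```

Because every principal component is kept, the matrix is orthogonal and no information is discarded. An ensemble would draw a fresh rotation for each tree, which is where the diversity among trees comes from.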
Given a matrix that is a superposition of a low-rank component and a sparse component, rpca uses a robust PCA method to recover these components. Netflix data scientists publicized this algorithm, which is based on the paper “Robust Principal Component Analysis” by Candes et al., earlier this year when they reported spectacular success using robust PCA in an outlier detection problem.
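For readers who want to see the decomposition at work, below is a minimal NumPy sketch of principal component pursuit solved with an augmented Lagrangian iteration, in the spirit of the Candes et al. paper. It is not the package's code. The parameter choices (`lam`, `mu`, the tolerance) follow commonly used defaults and should be treated as assumptions, as should the synthetic test matrix.

```python
import numpy as np

def shrink(A, tau):
    """Soft-thresholding: pull every entry toward zero by tau."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def svd_threshold(A, tau):
    """Soft-threshold the singular values, which lowers the rank."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def robust_pca(M, max_iter=1000, tol=1e-7):
    """Split M into low-rank L and sparse S with M approximately L + S."""
    n1, n2 = M.shape
    lam = 1.0 / np.sqrt(max(n1, n2))          # weight on the sparse term
    mu = n1 * n2 / (4.0 * np.abs(M).sum())    # penalty parameter
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                      # Lagrange multipliers
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        resid = M - L - S
        Y = Y + mu * resid
        if np.linalg.norm(resid) <= tol * norm_M:
            break
    return L, S

# Synthetic example: rank-2 matrix plus about 5% large, sparse corruptions.
rng = np.random.default_rng(0)
n, r = 60, 2
L0 = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))
S0 = np.zeros((n, n))
mask = rng.random((n, n)) < 0.05
S0[mask] = rng.choice([-10.0, 10.0], size=mask.sum())
L_hat, S_hat = robust_pca(L0 + S0)
print(np.linalg.norm(L_hat - L0) / np.linalg.norm(L0))
```

In an outlier detection setting, the large entries of the sparse part are the candidate anomalies, while the low-rank part captures the regular structure.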
The support vector machine is also a mainstay machine learning algorithm. SwarmSVM, which is based on a clustering approach described in a paper by Gu and Han, provides three ensemble methods for training support vector machines. The vignette that accompanies the package provides a practical introduction to the method.
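The clustering idea can be illustrated with a simplified NumPy sketch: partition the data with k-means, fit an independent linear SVM inside each cluster, and route each new point to the model of its nearest centroid. This is a deliberately reduced version and not SwarmSVM's API; in particular it leaves out any coupling between the local models, and every function name and hyperparameter here is an assumption.

```python
import numpy as np

def nearest(X, centers):
    """Index of the closest center for every row of X."""
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def kmeans(X, k, n_iter=50, rng=None):
    """Plain Lloyd iterations, started from k random data points."""
    rng = np.random.default_rng(rng)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest(X, centers)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return centers

def linear_svm(X, y, lam=0.01, n_epochs=200, lr=0.1):
    """Subgradient descent on the L2-regularized hinge loss; y in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_epochs):
        viol = y * (X @ w + b) < 1            # points inside the margin
        grad_w = lam * w - (y[viol][:, None] * X[viol]).sum(axis=0) / len(X)
        grad_b = -y[viol].sum() / len(X)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def fit_clustered_svm(X, y, k, rng=None):
    centers = kmeans(X, k, rng=rng)
    centers = centers[np.unique(nearest(X, centers))]   # drop empty clusters
    labels = nearest(X, centers)
    models = [linear_svm(X[labels == c], y[labels == c])
              for c in range(len(centers))]
    return centers, models

def predict_clustered_svm(X, centers, models):
    labels = nearest(X, centers)
    out = np.empty(len(X))
    for c, (w, b) in enumerate(models):
        idx = labels == c
        out[idx] = np.where(X[idx] @ w + b >= 0, 1.0, -1.0)
    return out

# XOR-style blobs: no single linear boundary separates the two classes.
rng = np.random.default_rng(0)
blob_centers = np.array([[2, 2], [2, -2], [-2, 2], [-2, -2]], dtype=float)
X = np.vstack([c + 0.5 * rng.normal(size=(50, 2)) for c in blob_centers])
y = np.sign(X[:, 0] * X[:, 1])
centers, models = fit_clustered_svm(X, y, k=4, rng=1)
accuracy = (predict_clustered_svm(X, centers, models) == y).mean()
print(accuracy)
```

The appeal of the approach is that each local problem is small and linear, yet the combined decision boundary is nonlinear.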