I was genuinely chuffed to get a shout-out in the most recent episode of Not So Standard Deviations, the awesome statistics-and-R themed podcast hosted by Hilary Parker and Roger Peng. In that episode, Roger recounts his recent discovery of the Microsoft ecosystem of tools for R, which he (jokingly) dubbed the “Microsoft-verse”.
While we're flattered by the allusion to the tidyverse, in general Microsoft's developments with R are designed to work with the entire R ecosystem rather than be distinct from it. Here's a quick overview of what Microsoft has developed around R. It's in three sections: the first two don't require any special version of R, and only the third section requires a Microsoft-specific R distribution.
Thanks again to Hilary and Roger for another entertaining episode of Not So Standard Deviations and for giving me the impetus to write all of this down. (This started as an email, but I quickly realized it was getting too long, so it became this blog post instead.) If you have any questions or feedback, let me know in the comments section of this post.
R available from within Microsoft products
You can call R from within some data-oriented Microsoft products, and apply R functions (from base R, from packages, or R functions you've written) to the data they contain.
- SQL Server (the database) allows you to call R from SQL, or publish R functions to a SQL Server for database administrators to use from SQL.
- Power BI (the reporting and visualization tool) allows you to call R functions to process data, create graphics, or apply statistical models to data.
- Visual Studio (the integrated development environment) includes R as a fully-supported language with syntax highlighting, debugging, etc.
- R is supported in various cloud-based services in Azure, including the Data Science Virtual Machine and Azure Machine Learning Studio. You can also publish R functions to Azure with the AzureML package, and then call those R functions from applications like Excel or apps you write yourself.
Open source R tools and packages from Microsoft
Microsoft provides various open source tools to help people use R. This includes R packages published on CRAN and on GitHub.
- Microsoft R Open, Microsoft's distribution of open-source R. The only differences from CRAN R are that it comes bundled with the Intel Math Kernel Library (which makes vector and matrix operations faster on multi-core machines), and that it uses a static mirror of CRAN so packages don’t change from day to day (but you can always use the “checkpoint” package described below to get the latest-and-greatest, if you want).
- MRAN, which is the download repository for Microsoft R Open, and also hosts daily archive snapshots of the entire CRAN system (from 2015 to the present). These snapshots are used for reproducibility by Microsoft R Open, the checkpoint package (see below), and anyone who wants a non-changing CRAN image. (The Rocker docker images are configured to use these static snapshots, for example.)
- The checkpoint package, which provides a simple interface to those static CRAN snapshots, for reproducibility. (In short: add checkpoint("2018-02-13") to make R install and use packages from that date for your project, now and in the future, including when you share scripts with someone else.)
- The foreach and iterators packages, for parallel programming. Microsoft also provides “backends” to use different parallel programming systems for a foreach loop, like doParallel (use a local machine or cluster), and doAzureParallel (spin up a cluster in Azure and do the parallel iterations there).
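To make the foreach idea a little more concrete, here is a minimal sketch of a foreach loop running on the doParallel backend. It assumes the foreach and doParallel packages are installed from CRAN; the two-worker cluster and the toy computation are arbitrary choices for illustration, not a recommended setup.

```r
library(foreach)
library(doParallel)

# Start a small local cluster; 2 workers is an arbitrary choice for illustration.
cl <- parallel::makeCluster(2)
registerDoParallel(cl)

# Each iteration is sent to a worker; .combine = c collects the results
# into a single vector, in iteration order.
squares <- foreach(i = 1:4, .combine = c) %dopar% {
  i^2
}

# Always shut the workers down when you are finished.
parallel::stopCluster(cl)

print(squares)
```

The loop body stays the same whichever backend is registered, so in principle only the registration lines would change when moving from a local machine to a cluster.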
Parallel and distributed algorithm implementations of statistics and machine learning algorithms
Microsoft has implemented a suite of algorithms for statistics and machine learning. These either serve as replacements for existing R functions or packages, or add new capabilities. They are designed for performance and to work without data size limitations. (In general, these algorithms are closed-source and only available within Microsoft R products, and not on CRAN or GitHub.)
- The RevoScaleR package provides new implementations of some of R's statistical functions (for example: rxGlm is the equivalent of R's glm), but these are designed to work with data sizes much larger than available memory. It also uses parallel computing to speed things up when running on a multi-core server, in a Hadoop or Spark cluster, or in a SQL Server database.
- The MicrosoftML package provides high-performance implementations of various machine-learning algorithms (like neural networks and random forests), and also some pre-trained machine learning models like sentiment analysis for text.
- Those two packages can optionally make use of the XDF file format, a binary on-disk data format designed for performance and parallel processing. Within R, XDF objects behave much like data frames, and you can also apply tidyverse data functions with the dplyrXdf package.
- The mrsdeploy package allows you to publish custom R scripts and functions (including ones using the packages above) to a server as an API that can be called from other applications.
- Microsoft Machine Learning Server includes R and the packages above. So do SQL Server, and HDInsight (the big-data platform on Azure). Microsoft also provides paid support for R to customers of these products.
- Microsoft R Client also includes those packages. It's free to use for developing applications, but it's performance-limited to non-production workloads.
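As a rough illustration of the rxGlm and glm correspondence mentioned above, here is a small sketch that fits the same logistic regression both ways. The mtcars data and the am ~ wt model are just an assumed toy example, and because RevoScaleR is only available within Microsoft R products, the rxGlm call is guarded so the script still runs under plain CRAN R.

```r
# Ordinary in-memory fit with base R's glm().
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial())
print(coef(fit_glm))

# RevoScaleR ships only with Microsoft R products (not CRAN), so only
# attempt the equivalent rxGlm() fit when the package is available.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  fit_rx <- RevoScaleR::rxGlm(am ~ wt, data = mtcars, family = binomial())
  print(fit_rx$coefficients)
}
```

On a data set this small there is no reason to prefer rxGlm; the point is only that the formula interface carries over, with the benefit showing up on data too large for memory.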