By Andy Nicholls, Head of Consulting (UK)
Why do it?
Mango has been involved in an increasing number of engagements where customers are seeking to migrate from SAS to R. There are a number of different business drivers for these migrations, but the general areas of interest are as follows:
Graduates – Most universities have adopted R as their primary analytical programming language for mathematics and statistics courses. At Mango we also see R being used in departments such as biology and social science. Companies recruiting a new graduate can expect them to be highly skilled in R and keen to deploy this skill.
Cutting Edge Analytics – Since academia has taken to R, many new algorithms that are subsequently adopted by industry are first produced in R. Businesses looking to get one step ahead of their competition by utilising the latest algorithms don’t have the time to wait for the next SAS release.
Cost – Many people assume that this would always be the primary driver, though it is surprising how often this is not the case. However, IT budgets have come under increasing pressure over the last few years and software licensing is an obvious area where cost reductions can be achieved.
Support – One of the key factors in deploying enterprise analytics is the provision of support and assistance. Typically “free” products such as R come with no guarantees or inherent support network. However, today there is a range of support options available that allow analytics teams to deploy R in production environments with a sense of safety and comfort, and arguably with a more flexible support environment than SAS offers.
Changing Nature of Analytics – The analytic landscape has altered considerably in recent years, and more complex algorithms are being used more often and by more people. One factor feeding into this is the ability for teams to access very large data sources across different areas of the organisation. Increasingly, technologies such as Hadoop are utilised to access many different types of data, creating fuller, richer and faster analytics. Database vendors are also embedding R into their offerings and providing analytics within their environments, increasing both access and flexibility. Because R is free and open, it can be used in all of these environments easily and quickly.
SAS teams have typically invested years of effort building large libraries of SAS macros and code. The move to R is not something which happens overnight. In choosing a strategic move away from SAS, companies must consider a variety of factors, including re-training and, crucially, the migration of what is typically a very large code base developed over many years by a variety of analysts. So how do you migrate 20 years’ worth of SAS code to R? The approach Mango takes is outlined in the following section.
Migrating SAS code to R can be laborious but when carried out sensibly and efficiently there is no reason why business activities should be adversely affected. For Mango, the following basic approach has proved successful time and again with both large and small customers.
- Assessment of analytical routines
- Migration infrastructure
- Development of unit test structure
- Development of standardised R functions
This approach is supplemented with a version control repository and continuous integration.
Assessment of Analytical Routines
SAS and R are fundamentally very different languages. This is particularly apparent when it comes to statistical routines. It should come as no surprise to any statistician, programmer or data scientist that SAS and R do not always give the same answer to a given question. It would be easy, and overly simplistic, to assume that all we need to do is agree some level of precision: “As long as it’s the same to X decimal places then we’re happy”. However, it is rarely that simple. A given level of precision might be acceptable for parameter estimates, but when it comes to p-values the expectations can be quite different, particularly when SAS generates a p-value of 0.051 and R generates 0.049. Expectations need to be clearly defined prior to the migration.
At Mango we begin all SAS to R migration engagements with a discussion about the statistical routines used and an impact analysis. The result of these discussions is a list of success criteria and corresponding degrees of tolerance for those areas in which we know SAS and R will provide different results.
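As an illustration of why a single precision threshold is not enough, the following minimal R sketch checks the p-value pair mentioned above in two ways. The tolerance values and the helper names `within_tolerance` and `same_decision` are hypothetical, chosen only for this example; real success criteria would come out of the discussions described above.

```r
# Hypothetical success criteria: an absolute tolerance per type of statistic.
tolerances <- c(estimate = 1e-6, p_value = 0.005)

# TRUE when the R result matches the SAS reference within the agreed tolerance.
within_tolerance <- function(sas_value, r_value, tol) {
  abs(sas_value - r_value) <= tol
}

# TRUE when both p-values lead to the same decision at significance level alpha.
same_decision <- function(sas_p, r_p, alpha = 0.05) {
  (sas_p < alpha) == (r_p < alpha)
}

# The two p-values agree numerically within the assumed tolerance...
within_tolerance(0.051, 0.049, tol = tolerances[["p_value"]])

# ...yet they fall on opposite sides of the 5% threshold.
same_decision(0.051, 0.049)
```

The first check passes and the second fails, which is exactly the situation that needs an agreed rule before the migration starts.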
Migration Infrastructure
Once the scope of work has been defined and agreed, the first step when performing a SAS to R code migration (or any code migration) is to baseline the existing code by placing it within a version-controlled repository. The test framework and migrated R code described in the following sections are also placed within this repository. Our continuous integration server, Jenkins, is set up to poll the repository and re-run the tests on the migrated code following code commits. We use an in-house reporting tool that we developed for Jenkins which integrates testthat output with Jenkins’ XML report structure. As the project progresses, this reporting framework allows us to track and feed back migration progress.
Development of Unit Test Structure
At Mango we utilise a unit-test-based approach to code migration. This can take time to set up initially but offers a more complete environment and ultimately leads to more successful migrations. Mango’s unit-testing framework has been built up over many migrations. We have seen common SAS procedures (UNIVARIATE, MIXED, GENMOD, etc.) appear time and again within our customers’ centralised macro libraries. Our unit test framework focusses on elements such as degrees of freedom and p-values. It is important that any framework and tests are relevant, so at Mango we supplement our own test cases with customer data, adding unit tests as necessary.
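A unit test of this kind might look like the following sketch, which uses the testthat package. It is not Mango’s in-house framework. The data are illustrative, and the `reference` list stands in for values exported from SAS: here it is derived from the textbook pooled-variance formulas so that the example is self-contained.

```r
library(testthat)

# Illustrative data; a real migration would use a customer or reference dataset.
x <- c(5.1, 4.9, 6.2, 5.8, 6.0)
y <- c(4.2, 4.8, 5.0, 4.4, 4.9)

# Stand-in for the SAS reference output (assumption: computed by hand here).
n1 <- length(x)
n2 <- length(y)
pooled_var <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)
t_stat <- (mean(x) - mean(y)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
reference <- list(
  df      = n1 + n2 - 2,
  p_value = 2 * pt(-abs(t_stat), df = n1 + n2 - 2)
)

test_that("pooled t-test matches the reference df and p-value", {
  fit <- t.test(x, y, var.equal = TRUE)
  # Degrees of freedom should match exactly.
  expect_equal(unname(fit$parameter), reference$df)
  # The p-value is compared within an assumed tolerance.
  expect_equal(fit$p.value, reference$p_value, tolerance = 1e-8)
})
```

In practice the tolerance for each comparison would be taken from the success criteria agreed at the start of the engagement.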
Development of Standardised R Functions
Once the success criteria have been defined and a unit-testing framework has been developed, the next task is to migrate the centralised SAS macro libraries. Once again, the initial set-up can take some time and experience of code migration goes a long way.
The biggest mistake that companies can make in a SAS to R migration is to try to perform a ‘like-for-like’ migration: migrating SAS code PROC by PROC into R functions. We often see this when SAS to R migrations are conducted by those without extensive knowledge of both languages. Whilst the like-for-like approach facilitates a ‘successful’ migration, in that each PROC has a corresponding R function to compare to via unit tests, it misses fundamental differences between the structures of the two languages. There is a good reason why a data frame is not the only type of data structure in R! Consider a simple merge of two datasets in SAS. The nature of SAS dictates that this consists of two calls to PROC SORT followed by a DATA step containing a MERGE statement. In R, however, sorting the data first is an unnecessary step. A like-for-like migration has therefore generated three function calls where one would have sufficed. Such conversions usually result in verbose, inefficient R code which is extremely difficult to manage in the long term.
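The merge example can be sketched in base R as follows. The data frames and column names are made up for illustration, and the comparison with MERGE assumes keys that match one-to-one or one-to-many; SAS and R handle many-to-many matches differently.

```r
# Two small illustrative data frames, deliberately not sorted by the key.
demog <- data.frame(id = c(3, 1, 2), age = c(41, 35, 52))
labs  <- data.frame(id = c(2, 3, 1), result = c(7.1, 6.4, 5.9))

# One call replaces PROC SORT + PROC SORT + a DATA step with MERGE ... BY id.
# all = TRUE keeps unmatched rows from both inputs.
merged <- merge(demog, labs, by = "id", all = TRUE)
merged
```

No explicit sort of either input is needed: `merge()` matches rows on the key itself and, by default, returns the result ordered by `id`.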
The migration of centralised reporting macros presents an opportunity to improve and streamline existing functionality. By definition, improvement of functionality also means change in functionality. This presents a challenge with respect to testing, since we do not have like-for-like output for comparison. However, within the analysis workflow many of the elements will require a more direct form of migration and can be directly compared via unit tests. For example, at some level one might consider ‘PROC MEANS’ to be similar to the ‘summary’ function in R. Neither the output structures nor the elements within these structures are directly comparable. However, both contain, for example, the mean of some variable, and that value can be directly compared via a unit test. Mango’s in-house test structure allows for code improvements to be made whilst retaining the ability to compare key values via unit tests.
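The idea of comparing a key value rather than the whole output can be sketched in base R as follows. The data are illustrative, and `sas_mean` is a stand-in for the mean reported by PROC MEANS; here it is simply the arithmetic mean worked out by hand (44 / 4).

```r
# Illustrative data; a real test would use the dataset that was analysed in SAS.
values <- c(12.1, 9.8, 11.4, 10.7)

# Stand-in for the reference mean from PROC MEANS (assumption: hand-computed).
sas_mean <- 11.0

# summary() returns several statistics in its own structure;
# only the element of interest is pulled out for comparison.
r_summary <- summary(values)
r_mean <- r_summary[["Mean"]]

# Compare the shared key value within an assumed tolerance.
stopifnot(isTRUE(all.equal(r_mean, sas_mean, tolerance = 1e-8)))
```

Because only the extracted value is tested, the surrounding R function is free to return a different and better-suited output structure than the SAS macro it replaces.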
Appropriate training is important to the success of any software project. When that software is to become part of an analyst’s everyday workflow, training is vital. Few companies embark on a code migration from SAS to R without any in-house knowledge of R, so the skill base is often quite diverse. It is important to ensure that everyone who will use R has a base understanding of the language along with the specific training required to perform their everyday tasks. To ensure our customers are able to transition successfully from SAS to R, training is provided by consultants with experience in both technologies so that the cultural implications of the move are understood and acknowledged. Additionally, developer training focussed on areas such as package building and code optimisation helps customers to develop, maintain and extend their R code in the long term.
Migrating from SAS to R is a big decision for an organisation and usually has one or more key drivers. It doesn’t happen overnight, and careful planning and experience are required to ensure business-as-usual activities are not affected. Mango’s approach has successfully de-risked migrations and ensured successful projects with satisfied end users and management.