By Lewis Rendell, R Consultant Intern, Mango Solutions
As a mathematics and statistics student, I have found R invaluable during my studies. With the number of packages available for data analysis and the ease with which simple and complex functions can be written, R has essentially become my ‘mother tongue’ when it comes to programming. Having only ever used it in academic contexts, however, I was eager to learn how R might be used in the commercial world, and so I have been fortunate enough to spend eight weeks this summer as an intern in Mango’s Consultancy team.
The main project that I was involved with while at Mango was the ValidR project. The open-source nature of R is one of its greatest assets – users across the globe can freely access its functionality and contribute their own code to online repositories. However, this can lead to some reluctance among those in regulated industries to adopt R as their main language for modelling and analytics. If anyone can build a package and upload it to CRAN, how can the user be sure that the code does what it claims to do?
ValidR looks to provide a solution to this issue, as a validated version of R that complies with the FDA’s guidelines for software validation. In practice, this means that each R package to be used is carefully analysed to assess its risk and determine its key requirements. A set of unit tests is then written by Mango to test this core functionality, with the results and findings of the process formally documented in a report for the end user of the package.
My main role during the project therefore involved the validation of individual packages. This work ranged from reading through code and documentation to learn what packages were intended to do, to writing, running and reviewing unit test scripts. This exposed me to a great number of incredibly useful packages that I hadn’t previously encountered during my degree: RSQLite, for example, gave me my first glimpse of how R can be used as an interface for database management systems. Another seemingly simple package, xtable, prints R objects as LaTeX tables that can simply be copied and pasted into a TeX file – something I wish I’d known about when writing my dissertation!
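For readers who have not met these two packages, here is a minimal sketch of the kind of thing they do. It assumes DBI, RSQLite and xtable are installed; the in-memory database, the query and the small data frame are made up purely for illustration and are not taken from the validation work.

```r
library(DBI)
library(xtable)

# RSQLite: use R as an interface to an SQLite database.
# ":memory:" creates a temporary in-memory database (illustrative only).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)   # copy a built-in data frame into a table
counts <- dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")
dbDisconnect(con)

# xtable: turn an R object into LaTeX table markup.
df <- data.frame(group = c("A", "B"), mean = c(1.25, 3.5))  # made-up values
tab <- xtable(df, caption = "Group means", digits = 2)
print(tab, include.rownames = FALSE)  # writes LaTeX code to the console
```

The output of the final `print()` call is a `table` environment containing a `tabular`, which can be pasted directly into a TeX file.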
Beyond this, working on ValidR has introduced me to the world of unit testing. Until this summer, I’d only ever tested my code informally, sporadically trying out my functions on a couple of example inputs but never really keeping track of the output each time. Within Mango’s validation process, tests are written using Hadley Wickham’s testthat package. Such a framework allows tests to be run automatically whenever a package is built (as might be done by continuous integration software), which makes it much easier to identify when adjustments to code create problems and so simplifies debugging. This is of particular use alongside a version control system, something that I can now hardly imagine working without.
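To give a flavour of what a testthat test looks like, here is a minimal sketch. The function `standardise()` is a hypothetical example written for this post; it is not part of ValidR or of any package validated by Mango, and the real validation tests are considerably more thorough.

```r
library(testthat)

# Hypothetical function under test: centre and scale a numeric vector.
standardise <- function(x) {
  stopifnot(is.numeric(x))
  (x - mean(x)) / sd(x)
}

test_that("standardise() returns zero mean and unit standard deviation", {
  z <- standardise(c(2, 4, 6, 8))
  expect_equal(mean(z), 0)  # expect_equal() compares with a numerical tolerance
  expect_equal(sd(z), 1)
})

test_that("standardise() rejects non-numeric input", {
  expect_error(standardise("a"))
})
```

Because the expectations are written down once and rerun on every build, a later change that breaks `standardise()` shows up as a failing test rather than going unnoticed.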
Beyond all of these technological ideas, however, perhaps the most important thing I’ve learned at Mango is just how much R really can do. It has been incredible to see how R is used outside of university campuses, by so many organisations as their main analytic tool.
As well as hearing about Mango’s ongoing projects in discussions around the office – many of which are absolutely fascinating – I was fortunate enough to attend this year’s EARL Conference in London, organised by Mango. Being able to speak to and hear from some of R’s major users and developers was a great experience, and really opened my eyes to the potential of R, both now and in the future. Of particular note were Joe Cheng’s keynote speech on Shiny gadgets, which introduce an interactive ‘codeless’ approach to data analysis within R, and a talk by Tom Liptrot of the Christie NHS Foundation Trust, whose use of text mining to identify cancers among patients must be one of the most novel and unexpected applications of R I have seen.
I thoroughly enjoyed my time at Mango and will certainly continue using all of my new R discoveries, hopefully attending some local R user groups and meet-ups for data scientists. This is only the staRt!