On his Psychology and Statistics blog, Jeromy Anglim tells how he was analyzing some data from a skill acquisition experiment. Needing to run a custom R function across 1.3 million data points, Jeromy estimated it would take several hours for the computation to complete. So, Jeromy set out to optimise the code.
First, he used the Rprof function, a sampling profiler that records which functions are on the call stack at regular intervals while your code runs, giving an estimate of the time spent in each function. This is a useful tool to identify the parts of your functions that are ripe for optimisation, and in this case (with some help from the system.time function to time a specific section of the code) he learned that most of the time wasn’t spent performing actual calculations: it was spent selecting the subset of the data to analyze!
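To make the profiling step concrete, here is a minimal sketch of how Rprof and system.time might be used together. The data frame `dat`, its columns `subject` and `rt`, and the function `slow_means` are hypothetical stand-ins, not Jeromy's actual data or code; the function is deliberately written to re-subset the full data frame on every iteration.

```r
# Hypothetical data: 1,000 subjects with 100 trials each (not Jeromy's data)
set.seed(1)
dat <- data.frame(
  subject = rep(seq_len(1000), each = 100),
  rt      = rexp(1000 * 100)
)

# Deliberately naive: select rows from the full data frame for every subject
slow_means <- function(d) {
  ids <- unique(d$subject)
  out <- numeric(length(ids))
  for (i in seq_along(ids)) {
    out[i] <- mean(d[d$subject == ids[i], "rt"])
  }
  out
}

# system.time reports user, system and elapsed time for one expression
timing <- system.time(res <- slow_means(dat))
print(timing)

# Rprof samples the call stack while the code runs;
# summaryRprof then tabulates the samples by function
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)
res <- slow_means(dat)
Rprof(NULL)  # stop profiling
print(head(summaryRprof(prof_file)$by.self))
```

In a profile like this, you would look for whether the row-selection machinery or the call to `mean` accounts for most of the samples. Exact timings depend on the machine and the R version.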
And thus a solution was born: rather than repeatedly selecting from the large data frame in an iterative loop, he instead split the data frame into its constituent parts once, and then looped over the parts. This reduced the analysis time from hours down to just a couple of minutes. At the end of his case study, Jeromy shares some valuable lessons learned about optimising R functions:
- R is very fast most of the time.
- A single slow command can be the cause of a slow analysis.
- system.time is a very useful function.
- Optimisation can proceed from theory or from experimentation.
- Optimisation proceeds from diagnosing the cause of the problem to exploring solutions.
- Optimisation is about orders of magnitude. Focus on saving hours before tackling saving seconds.
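The split-once approach described above can be sketched as follows. This is an illustration under assumptions rather than Jeromy's code: the data frame `dat` and the functions `loop_subset` and `loop_split` are hypothetical, and a simple mean stands in for his custom function.

```r
# Hypothetical data: 1,000 subjects with 100 trials each (not Jeromy's data)
set.seed(1)
dat <- data.frame(
  subject = rep(seq_len(1000), each = 100),
  rt      = rexp(1000 * 100)
)

# Repeated selection: scan the whole data frame once per subject
loop_subset <- function(d) {
  ids <- unique(d$subject)
  vapply(ids, function(id) mean(d[d$subject == id, "rt"]), numeric(1))
}

# Split once: break the data frame into per-subject pieces, then loop
loop_split <- function(d) {
  parts <- split(d, d$subject)
  vapply(parts, function(p) mean(p$rt), numeric(1))
}

t_subset <- system.time(a <- loop_subset(dat))
t_split  <- system.time(b <- loop_split(dat))
print(rbind(subset = t_subset, split = t_split))

# Both approaches should give the same per-subject results
stopifnot(isTRUE(all.equal(unname(a), unname(b))))
```

Comparing the two timings with system.time is the experimental route to optimisation that the lessons mention. How large the gap is will depend on the size of the data and the number of groups.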
Jeromy Anglim’s Blog: Psychology & Statistics: A Case Study in Optimising Code in R