Inaccurate results from microbenchmark

[This article was first published on Radford Neal's blog » R Programming, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The microbenchmark package is a popular way of comparing the time it takes to evaluate different R expressions — perhaps more popular than the alternative of just using system.time to see how long it takes to execute a loop that evaluates an expression many times. Unfortunately, when used in the usual way, microbenchmark can give inaccurate results.

The inaccuracy of microbenchmark has two main sources — first, it does not correctly allocate the time for garbage collection to the expression that is responsible for it, and second, its summarizes the results by the median time for many repetitions, when the mean is what is needed. The median and mean can differ drastically, because just a few of the repetitions will include time for a garbage collection. These flaws can result in comparisons being reversed, with the expression that is actually faster looking slower in the output of microbenchmark.

Here’s an example of this, using R-3.0.2 on an Ubuntu Linux 12.04 system with an Intel X5680 processor. First some setup, defining a vector v to be operated on, a vector of indexes, m, for an initial part of v, and variables used to repeat expressions k times:

> library(microbenchmark)
> set.seed(1)
> v <- seq(0,1,length=100000)
> m <- 1:42000
> k <- 1000
> n <- 1:k

We can now compare two ways of computing the same thing using microbenchmark:

> (m1a <- microbenchmark (u <- v[m]^2+v[m], u <- (v^2+v)[m],
+           times=k))
Unit: microseconds
               expr     min       lq   median       uq      max neval
 u <- v[m]^2 + v[m] 549.607 570.2220 816.0170 902.6045 11381.40  1000
  u <- (v^2 + v)[m] 472.859 682.3785 749.4225 971.9475 11715.08  1000

Looking at the median time, it seems that u <- (v^2+v)[m] is 816.0/749.4 = 1.089 times faster than u <- v[m]^2+v[m].

Alternatively, we could use system.time to see how long a “for” loop evaluating one of these expressions k times takes (k was set to 1000 and n was set to 1:k above):

> system.time(for (i in n) u <- v[m]^2+v[m])
   user  system elapsed
  0.808   0.060   0.869
> system.time(for (i in n) u <- (v^2+v)[m])
   user  system elapsed
  0.912   0.092   1.006

This gives the opposite result! From this output, it seems that u <- v[m]^2+v[m] is 1.006/0.869 = 1.158 times faster than u <- (v^2+v)[m]. Multiplying these factors, we see that the comparisons using microbenchmark and system.time differ by a factor of 1.089*1.158 = 1.26.

Maybe one could argue that a 26% error is often not that bad — I did fiddle the lengths of v and m for this example so that the error would actually make a difference in which expression seems faster — but the main reason people use microbenchmark is apparently that they think it is more accurate than system.time. So which gives the right answer here?

To find out, the first thing to do is to repeat the same commands again. As seen in the full output of this example, the repetition produces much the same results. (This isn’t always the case, though — appreciably different results in repetitions can occur when, for example, R decides to change the level of memory usage that triggers a garbage collection.)

We can also try the order="block" control option for microbenchmark. This tells it to do all 1000 repetitions of the first expression, and then all 1000 repetitions of the second expression, rather than randomly shuffling them, as is the default. The results aren’t much different:

> (m2a <- microbenchmark (u <- v[m]^2+v[m], u <- (v^2+v)[m],
+           times=k, control=list(order="block")))
Unit: microseconds
               expr     min      lq   median       uq      max neval
 u <- v[m]^2 + v[m] 548.558 566.785 816.6940 830.7535 11706.04  1000
  u <- (v^2 + v)[m] 471.602 543.331 743.3735 836.7810 12162.34  1000

However, we can see what’s going on if we plot the individual times that microbenchmark records for the 2000 repetitions (1000 of each expression). Here are the plots for the second repetition with both random and block order:


The red dots are times for u <- v[m]^2+v[m], the green dots for u <- (v^2+v)[m]. Note the log scale for time.

Some of the repetitions take about 20 times longer than others. The long ones clearly must be for evaluations in which a memory allocation request triggers a full garbage collection. We can now see that the results using the median from microbenchmark are going to be wrong.

First, notice that with block ordering, 1000 evaluations of the first expression result in 14 full garbage collections, whereas 1000 evaluations of the second expression result in 24 full garbage collections. With random order, however, there are 20 full garbage collections during evaluations of the first expression, and 25 during evaluations of the second expression. Random ordering has obscured the fact that the second expression is responsible for almost twice as much garbage collection time as the first, presumably because it allocates larger vectors for intermediate results such as v^2 (versus v[m]^2).

But why, then, does block ordering give almost the same result as random ordering? Because we are looking at microbenchmark’s output of the median time to evaluate each expression. The median will be almost totally insensitive to the time taken by the small number (about 2%) of evaluations that trigger full garbage collections. So an expression whose evaluation requires more garbage collections will not be penalized for the time they take, even assuming that this time is correctly assigned to it (which happens only with block ordering).

This explanation suggests that using microbenchmark with block ordering, and then looking at the mean evaluation time for each expression would give a more accurate result, agreeing with the result found with system.time. And indeed it does in this example — in the two repetitions, system.time gives ratios of times for the two expressions of 1.158 and 1.136, while microbenchmark with block ordering gives ratios of mean times for the two expressions of 1.131 and 1.137.  (These ratios of mean times can be computed from the data saved by microbenchmark, as seen in the R program I used, though the package provides no convenience function for doing so.)

So microbenchmark’s problem with this example can be fixed by using block ordering and taking the mean time for repetitions. But there’s no reason to think the time it then produces is any more accurate than that from system.time. It’s probably less accurate. The overhead of the “for” loop used with system.time is very small; in particular, since a change I suggested was incorporated in R-2.12.0, there is no memory allocation from the “for” loop iteration itself. The overhead of measuring the time for every repetition in microbenchmark is greater, and though microbenchmark attempts to subtract this overhead, one might wonder how well this works. Executing the time measurement code between each repetition will also disturb the memory cache and other aspects of processor state, which may affect the time to evaluate the expression.

More fundamentally, nanosecond timing precision is just not useful for this task, since to correctly account for garbage collection time, the number of repetitions must be large enough for many garbage collections to have occurred, which will lead to run times that are at least a substantial fraction of a second. Millisecond timing precision is good enough.

Furthermore, there’s no point in measuring the relative speeds of different expressions to more than about 5% accuracy, since recompiling the R interpreter after any slight change (even in code that is never executed) can change relative timings of different expressions by this much — presumably because changes in exact memory addresses affect the behaviour of memory caching, branch prediction, or other arcane processor tricks.

The data gathered by microbenchmark could be useful. For instance, it would be interesting to investigate the variation in evaluation time seen in the plots above beyond the obvious full garbage collections. Some is presumably due to less-than-full garbage collections; there could be other effects as well. But the microbenchmark package isn’t useful for the purpose for which it was designed.

To leave a comment for the author, please follow the link and comment on their blog: Radford Neal's blog » R Programming. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)