A while ago, I came across a mention of the Python math.fsum function, which sums a set of floating-point values exactly, then rounds to the closest floating point value. This seemed useful. In particular, I thought that if it’s fast enough it could be used instead of R’s rather primitive two-pass approach to trying to compute the sample mean more accurately (but still not exactly). My initial thought was to just implement the algorithm Python uses in pqR. But I soon discovered that there were newer (and faster) algorithms. And then I thought that I might be able to do even better…
The result is a new paper of mine on Fast exact summation using small and large superaccumulators (also available from arxiv.org).
A superaccumulator is a giant fixed-point number that can exactly represent any floating-point value (and then some, to allow for bigger numbers arising from doing sums). This concept has been used before to implement exact summation methods. But if done in software in the most obvious way, it would be pretty slow. In my paper, I introduce two new variations on this method. The “small” superaccumulator method uses a superaccumulator composed of 64-bit “chunks” that overlap by 32 bits, allowing carry propagation to be done infrequently. The “large” superaccumulator method has a separate chunk for every possible combination of the sign bit and exponent bits in a floating-point number (4096 chunks in all). It has higher overhead for initialization than the small superaccumulator method, but takes less time per term added, so it turns out to be faster when summing more than about 1000 terms.
Here is a graph of performance on a Dell Precision T7500 workstation, with a 64-bit Intel Xeon X5680 processor:
The horizontal axis is the number of terms summed, the vertical axis the time per term in nanoseconds, both on logarithmic scales. The time is obtained by repeating the same summation many times, so the terms summed will be in cache memory if it is large enough (vertical lines give sizes of the three cache levels).
The red lines are for the new small (solid) and large (dashed) superaccumulator methods. The blue lines are for the iFastSum (solid) and OnlineExact (dashed) methods of Zhu and Hayes (2010), which appear to be the fastest previous exact summation methods. The black lines are for the obvious (inexact) simple summation method (solid) and a simple out-of-order summation method, that adds terms with even and odd indexes separately, then adds together these two partial sums. Out-of-order summation provides more potential for instruction-level parallelism, but may not produce the same result as simple ordered summation, illustrating the reproducibility problems with trying to speed up non-exact summation.
One can see that my new superaccumulator methods are about twice as fast as the best previous methods, except for sums of less than 100 terms. For large sums (10000 or more terms), the large superaccumulator method is about 1.5 times slower than the obvious simple summation method, and about three times slower than out-of-order summation.
These results are all for serial implementations. One advantage of exact summation is that it can easily be parallelized without affecting the result, since the exact sum is the same for any summation order. I haven’t tried a parallel implementation yet, but it should be straightforward. For large summations, it should be possible to perform exact summation at the maximum rate possible given limited memory bandwidth, using only a few processor cores.
For small sums (eg, 10 terms), the exact methods are about ten times slower than simple summation. I think it should be possible to reduce this inefficiency, using a method specialized to such small sums.
However, even without such an improvement, the new superaccumulator methods should be good enough for replacing R’s “mean” function with one that computes the exact sample mean, since for small vectors the overhead of calling “mean” will be greater than the overhead of exactly summing the vector. Summing all the data points exactly, then rounding to 64-bit floating point, and then dividing by the number of data points wouldn’t actually produce the exactly-rounded mean (due to the effect of two rounding operations). However, it should be straightforward to combine the final division with the rounding, to produce the exactly-correct rounding of the sample mean. This should also be faster than the current inexact two-pass method.
Modifying “sum” to have an “exact=TRUE” option also seems like a good idea. I plan to implement these modifications to both “sum” and “mean” in a future version of pqR, though perhaps not the next version, which may be devoted to other improvements.