Design Flaws in R #1 — Reversing Sequences

Posted on August 6, 2008 by Radford Neal in R bloggers | 0 Comments

[This article was first published on Radford Neal's blog » R Programming, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The R language for statistical computing has become the standard for academic statistical research, for the very good reason that it’s better than the alternatives. It’s far from perfect however. I could come up with a long “wish list” of desired features it lacks, but that’s not what I’ll do in this series of posts. Instead, I’ll concentrate on real design flaws — aspects of the language that make it all too easy to write unreliable programs. Many of these are inherited from R’s predecessors, S and S-Plus, and may even originate in the very earliest version of S, which was seen more as a calculator than as a programming language.

By far the most commonly-encountered design flaw in R is in the “:” operator for producing sequences, whose most common use is in “for” statements. Here’s an example:

   # TRAPEZOIDAL INTEGRATION FUNCTION.  Integrates
   # f(x,...) from x=a to x=b, using the trapezoidal
   # rule with n+1 points.

   trapezoidal.rule <- function(f,a,b,n,...)
   {
     h <- (b-a) / n
     s <- 0

     for (j in 1 : (n-1))
     { s <- s + f(a+j*h,...)
     }

     s*h + (f(a,...)+f(b,...))*(h/2)
   }

This approximates a definite integral by summing up the areas of n trapezoids defined using a grid of n+1 points from a to b. The area of one trapezoid is the width of its base, (b-a)/n, times the average of the heights of its two side, (f(left)+f(right))/2, where left and right are adjacent grid points. The function f is used twice at every grid point except for the very first and very last. To avoid wasting time, the program evaluates f only once at these n-1 middle points. This is handled by the “for” loop.

I’ve avoided a minor design flaw in R here, by using 1 : (n-1) to define the range of values for j in the loop, which should be 1, 2, …, n-1. Beginning R programmers usually write 1:n-1, and only find out after hours of debugging that this is interpreted as (1:n)-1, and so produces the sequence 0, 1, …, n-1. But at least they usually realize they have a bug.

The main problem is far more insidious. The program above works fine, as long as n is greater than 1. And it usually is, since you usually want to approximate the integral using more than one trapezoid, if you’re interested in getting an accurate result. But n=1 is an entirely valid value, and sooner or later will come up, perhaps when a user cares more about speed than accuracy, or perhaps when another program automatically chooses this value of n for some reason. Here’s an illustration of what happens:

   > trapezoidal.rule (function (x) x^2, 3, 6, 100)
   [1] 63.00045
   > trapezoidal.rule (function (x) x^2, 3, 6, 2)
   [1] 64.125
   > trapezoidal.rule (function (x) x^2, 3, 6, 1)
   [1] 202.5

The exact answer is 63. Even using n=2, one gets pretty close. But 202.5 when using n=1? The correct value for the approximation using one trapezoid is 3 x (3²+6²)/2 = 67.5.

What’s happened? Well, when n=1, the sequence 1 : (n-1) defining the range in the “for” statement is 1:0. Rather than interpreting this as the empty sequence, R decides that you must have wanted a sequence of decreasing values, namely 1 and 0. So the body of the “for” loop is done twice, rather than not at all, with the unfortunate result seen above.

Automatically reversing the direction of the sequence p:q when p>q must have seemed like a convenience to the original designers, who probably had in mind the situation where this expression is typed in directly by a user, who knows whether p>q or p. It’s a terrible idea in a programming language, however. Almost always, the logic of the program will require either an increasing sequence or a decreasing sequence, not one direction some times and the other direction other times. And in many programs, the empty sequence will be needed as a valid limiting case, and of course is never generated with automatic reversal.

Now, it clearly is possible to write working programs despite this design flaw. For a sequence starting at 1, seq(length=n) will work. You might think that seq(1,n,by=1) would also work, but instead it produces an error message when n=0. You can also use an “if” statement to bypass the “for” loop when it shouldn’t be done, as below:

if (n>1) { for (j in 1 : (n-1)) { s <- s + f(a+j*h,...) } }

Unfortunately, all these solutions are more work to write, and harder to read. It’s much easier to write programs that don’t guard against problems of reversing sequences (which crop up in places other than “for” loops too). The result is programs that mostly work, most of the time.

Can this flaw in R be fixed? At this point, changing the specification of the “:” operator is impossible — it would invalidate too many existing programs. New operators could be introduced, however. For example, “:>:” could generate an increasing sequence, with a:>:b producing the empty sequence if a>b. Similarly, “:<:” could generate a decreasing sequence. (Some might think these operators should be reversed, but I think they’re the right way around if you view “>” and “<” as arrows, not less than and greater than signs.) With these new operators, it would become almost as easy to write reliable programs as to write unreliable ones.

Related

Share Tweet

To leave a comment for the author, please follow the link and comment on their blog: Radford Neal's blog » R Programming.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.