UseR!2008 aftermath

August 30, 2008 · Posted in R bloggers · Comments Off 

Earlier in August I was on the R user conference which this year took place in Dortmund. It was a quite an exciting event gathering around 500 people from around the globe and featuring 170 presentations and talks. Topics varied from new developments in the R system, newly implemented statistical methods as well as plethora of very interesting, and often visually nice, applications. You can find the list of abstracts and also slides for majority of the talks on the conference’s website.

I left the conference, I’m sure similarly to others, with a wealthy bag of new R “tricks”. The next UseR is planned to take place next summer in Rennes (France). The website is already there.

Among the available slides are the slides from my talk which was derived from my recent paper on coordination in dynamic social networks in heterogeneous groups. This talk focused on a computer simulation I performed for the paper. You can find the slides here. Additionally here (QuickTime, 37Mb) is a movie based on the simulation I describe there. More on the paper itself soon.


R Design Flaws #1 and #2: A Solution to Both?

August 25, 2008 · Posted in R bloggers · Comments Off 

I’ve previously posted about two design flaws in R. The first post was about how R produces reversed sequences from a:b when a>b, with bad consequences in “for” statements (and elsewhere). The second post was about how R by default drops dimensions in expressions like M[i:j,] when i:j is a sequence only one long (ie, when i equals j).

In both posts, I suggested ways of extending R to try to solve these problems. I now think there is a better way, however, which solves both problems with one simple extension to R. This extension would also make R programs run faster and use less memory.

Recall the problems, and the solutions I proposed previously… To stop sequences from reversing, we need a new operator to use rather than 1:n, which suffers from the reversal problem when n is zero (which can’t be changed for compatibility reasons). I suggested 1:>:n. To stop dimenions from being dropped, I suggested using semicolons rather than commas to separate subscripts. My new suggestion is that in the most common case where the indexing vector is a sequence, we could use the same new operator introduced to solve the reversing problem — ie, we write M[1:>:n,] to get the first n rows, and define it so that the result isn’t converted to a vector when n is one.

Now, this may seem like it won’t work. If 1:>:n returns a vector, then when n is one, it’s a vector of length one, which has to (by default) lead to the dimension being dropped if ordinary subscripting by scalars is to work (since scalars in R are really vectors of length one).

The solution is for 1:>:n to not return a vector, but instead return a new data type that just records its two operands. This new data type — perhaps it could be called an indexing pair — would have to be recognized by “for” statements and by the subscripting operator. A dimension indexed by such an indexing pair would never be dropped. One could add an operator :<: as well, for iterating or subscripting in descending sequence, though this is much less common. If this descending operator is omitted, I think .. (two periods) would be a better name than :>: for the ascending indexing pair operator (but it unfortunately has no obviously mnemonic descending counterpart).

One disadvantage of this solution is that it doesn’t address the dropped dimension problem in the general case where the vector index may not be a sequence. But indexing by an ascending sequence is by far the most common case, so having to write drop=FALSE for the others may be OK. (However, perhaps knowledge of the drop=FALSE option would be less common if it is needed less often.) Similary, this solution doesn’t address the problem of reversing sequences outside the context of for loops and subscripting, but those are the most common uses.

One extra advantage of introducing indexing pairs is that they will take up a trivial amount of storage. In contrast, if you use 1:1000000 to iterate a million times in a for loop, R will allocate 4 Megabyes of storage to hold this sequence. Producing this sequence also takes time, of course. It would be possible for R to avoid this cost when using 1:1000000 in a for loop, by treating this combination specially, but it doesn’t (in version 2.4.1 at least):

   > gc()
            used (Mb) gc trigger (Mb) max used (Mb)
   Ncells 234834  6.3     467875 12.5   407500 10.9
   Vcells 104319  0.8     786432  6.0   690698  5.3
   > for (i in 1:1000000) if (i==1000000) print(gc())
            used (Mb) gc trigger (Mb) max used (Mb)
   Ncells 234855  6.3     467875 12.5   467875 12.5
   Vcells 604321  4.7     905753  7.0   720903  5.5
   > gc()
            used (Mb) gc trigger (Mb) max used (Mb)
   Ncells 234904  6.3     467875 12.5   467875 12.5
   Vcells 104328  0.8     786432  6.0   720903  5.5

Notice how the memory usage goes up by 3.9 Megabytes during the for loop.

Another extra advantage of introducing :>: (or ..) is that the precedence of this operator could be made lower than that of any other operator (with no loss, since an indexing pair wouldn’t be a valid operand for any other operator). Expressions like 1:>:n-1 would then do what the programmer meant (unlike 1:n-1).

So, does anyone see any flaws in this solution?


Design Flaws in R #2 — Dropped Dimensions

August 19, 2008 · Posted in R bloggers · Comments Off 

In a comment on my first post on design flaws in the R language, Longhai remarked that he has encountered problems as a result of R’s default behaviour of dropping a dimension of a matrix when you select only one row/column from that dimension. This was indeed the design flaw that I was going to get to next! I think it also points to what is perhaps a deeper design flaw.

I’ll give a toy example of the problem, which shows up regularly in real (but more complicated) contexts. Consider the following function, whose arguments are a matrix X and a vector s of column indexes, and which returns a vector containing the Euclidean norm of each row of X, looking only at columns in s:

   subset.norms <- function (X, s)
   { sqrt(apply(X[,s]^2,1,sum))
   }

Here’s an example of its use:

   > M
        [,1] [,2] [,3] [,4]
   [1,]    1    4    7   10
   [2,]    2    5    8   11
   [3,]    3    6    9   12
   > subset.norms(M,c(1,3))
   [1] 7.071068 8.246211 9.486833

Perhaps it would be interesting to see how the norms get smaller as we drop leading dimensions:

   > for (k in 1:4) print(subset.norms(M,k:4))
   [1] 12.88410 14.62874 16.43168
   [1] 12.84523 14.49138 16.15549
   [1] 12.20656 13.60147 15.00000
   Error in apply(X[, s]^2, 1, sum) : dim(X) must have a positive length

Oops… When the loops gets to where k has the value 4, the subset k:4 contains just one dimension. When R comes to evaluating X[,s], it doesn’t return a matrix with one column, but rather a vector. The apply function doesn’t work on vectors, so we get an error message rather than the answer.

There’s a fix. The “[" operator takes a "drop"argument, which defaults to TRUE, giving the behaviour above, but which can be explicitly set to FALSE to disable conversion from a matrix to a vector in these circumstances. If we modify the function as follows:

   subset.norms2 <- function (X, s)
   { sqrt(apply(X[,s,drop=FALSE]^2,1,sum))
   }

We get the right answers:

   > for (k in 1:4) print(subset.norms2(M,k:4))
   [1] 12.88410 14.62874 16.43168
   [1] 12.84523 14.49138 16.15549
   [1] 12.20656 13.60147 15.00000
   [1] 10 11 12

There are two problems with this fix. One is that in some programs, one needs to put drop=FALSE in numerous places, making the code rather hard to read. The more serious problem, and what makes this a real design flaw, is that writing code without drop=FALSE is so much easier that it's often left out even when it is needed. Indeed, since the errors typically occur only for extreme cases, it's easy for the programmer to not realize there's a problem. (As an aside, this is another context where the reversing sequences problem arises — if n is 0, subset.norms2(M,1:n) should produce all zeros, but doesn't, not because of a bug in subset.norms2 but because 1:0 doesn't produce the empty sequence.)

Can we solve the dimenion dropping problem by just changing the default for drop from TRUE to FALSE? Of course not, since this would break too many existing programs. But even if backward compatibility weren't a problem, this wouldn't be a good solution, since the times when we want drop to be TRUE are even more numerous than when we want it to be FALSE. Look at the following, for instance:

   > M[2,3,drop=FALSE]
        [,1]
   [1,]    8

If drop defaults to FALSE, ordinary subscripting with single indexes for both dimensions will give a 1x1 matrix! Now, R will treat matrices as vectors (and scalars) as needed in many contexts, but propagating a dim attribute (what marks something as a matrix) all over the place will sooner or later cause problems.

What can be done at this point? I'm not sure. The best idea I can come up with is to let subscripts be separated by semicolons as well as commas, with use of a semicolon signaling that drop will be FALSE. In other words, M[i;j] would be equivalent to M[i,j,drop=FALSE], but much more concise. People could start using the semicolon whenever they're not trying to extract single elements, without breaking old programs.

The root cause of the dropped dimension problem seems to be that R does not have scalars. Instead, what looks like a scalar is actually a vector of length one. To R, there is no difference between M[2,3] and M[2:2,3:3]. So there's no way for the first of these to drop dimensions and the second to retain them.

It's probably hopeless to try to change now, but I think it would have been better for scalars to be treated as vectors or converted to vectors as necessary, without pretending that they're always vectors. Not having real scalars may seem like a unification or simplification, but it just creates problems. These start with the first thing a new user sees on trying R out, which is likely to be something like this:

   > 2+2
   [1] 4

Good. R knows how to add 2 and 2. But what's the "[1]" doing there?


High-performance computing in R at useR! 2008

August 12, 2008 · Posted in R bloggers · Comments Off 

The useR! 2008 meeting is about to commence. Although I am not able to go this year I will be keeping a close eye on the talks and slides that (I assume) will be posted.  Last years useR! meeting (which I attended) was a great experience and considering the list of participants for this years meeting, it is bound to be a very interesting meeting. Dirk Eddelbuettel has already posted the slides to his talk entitled “Introduction to High-Performance R” on his web site. Clearly, anyone using R on, say, computer clusters would find this highly relevant.

On a side note, it would be interesting to have some idea of how commonly R is used on computer clusters. I have been using R on our in-house cluster for a few years now, and although it may sound a bit counterintuitive to run a high-level interpreted language such as R in a high-performance computing environment such as a cluster (most people would use a low-level compiled language like C for example), using R has the big advantage of allowing very fast prototyping. In other words, although the R code runs slower on the cluster than native C code would do it takes much less time to code it. Of course, for certain classes of models there is no other option that using C (or a similar low-level language), but if you can get away with R, a cluster allows you to run your serial code (or parallelized R code) in parallel with potentially huge time savings.

This is from the “Mario’s Entangled Bank” blog ( http://pineda-krch.com ) of Mario Pineda-Krch, a theoretical biologist at the University of California, Davis.


Updated version of GGobi: 2.1.8

August 12, 2008 · Posted in R bloggers · Comments Off 
We've just posted an updated version of GGobi - version 2.1.8. This version fixes a number of small bugs, particularly with the
DescribeDisplay plugin. If you are on windows, make sure to also update to the latest version of gtk linked from the downloads page. Unfortunately we have been having problems with the mac package, so only linux and windows versions are currently available.

All files are available from http://ggobi.org/downloads

Please us know if you run into any problems.

Link C with R

August 7, 2008 · Posted in R bloggers · Comments Off 
I should have posted this earlier. In my previous post, I have discussed the speed issue in R briefly. The speed issue unfortunately are two of the shortcomings of R. The other one is that R does not handle big data set very well. So instead of waiting for R to improve, for those who cannot wait (me included), we have to go back to the source. By this I mean that we should program some

Design Flaws in R #1 — Reversing Sequences

August 6, 2008 · Posted in R bloggers · Comments Off 

The R language for statistical computing has become the standard for academic statistical research, for the very good reason that it’s better than the alternatives. It’s far from perfect however. I could come up with a long “wish list” of desired features it lacks, but that’s not what I’ll do in this series of posts. Instead, I’ll concentrate on real design flaws — aspects of the language that make it all too easy to write unreliable programs. Many of these are inherited from R’s predecessors, S and S-Plus, and may even originate in the very earliest version of S, which was seen more as a calculator than as a programming language.

By far the most commonly-encountered design flaw in R is in the “:” operator for producing sequences, whose most common use is in “for” statements. Here’s an example:

   # TRAPEZOIDAL INTEGRATION FUNCTION.  Integrates
   # f(x,...) from x=a to x=b, using the trapezoidal
   # rule with n+1 points.

   trapezoidal.rule <- function(f,a,b,n,...)
   {
     h <- (b-a) / n
     s <- 0

     for (j in 1 : (n-1))
     { s <- s + f(a+j*h,...)
     }

     s*h + (f(a,...)+f(b,...))*(h/2)
   }

This approximates a definite integral by summing up the areas of n trapezoids defined using a grid of n+1 points from a to b. The area of one trapezoid is the width of its base, (b-a)/n, times the average of the heights of its two side, (f(left)+f(right))/2, where left and right are adjacent grid points. The function f is used twice at every grid point except for the very first and very last. To avoid wasting time, the program evaluates f only once at these n-1 middle points. This is handled by the “for” loop.

I’ve avoided a minor design flaw in R here, by using 1 : (n-1) to define the range of values for j in the loop, which should be 1, 2, …, n-1. Beginning R programmers usually write 1:n-1, and only find out after hours of debugging that this is interpreted as (1:n)-1, and so produces the sequence 0, 1, …, n-1. But at least they usually realize they have a bug.

The main problem is far more insidious. The program above works fine, as long as n is greater than 1. And it usually is, since you usually want to approximate the integral using more than one trapezoid, if you’re interested in getting an accurate result. But n=1 is an entirely valid value, and sooner or later will come up, perhaps when a user cares more about speed than accuracy, or perhaps when another program automatically chooses this value of n for some reason. Here’s an illustration of what happens:

   > trapezoidal.rule (function (x) x^2, 3, 6, 100)
   [1] 63.00045
   > trapezoidal.rule (function (x) x^2, 3, 6, 2)
   [1] 64.125
   > trapezoidal.rule (function (x) x^2, 3, 6, 1)
   [1] 202.5

The exact answer is 63. Even using n=2, one gets pretty close. But 202.5 when using n=1? The correct value for the approximation using one trapezoid is 3 x (32+62)/2 = 67.5.

What’s happened? Well, when n=1, the sequence 1 : (n-1) defining the range in the “for” statement is 1:0. Rather than interpreting this as the empty sequence, R decides that you must have wanted a sequence of decreasing values, namely 1 and 0. So the body of the “for” loop is done twice, rather than not at all, with the unfortunate result seen above.

Automatically reversing the direction of the sequence p:q when p>q must have seemed like a convenience to the original designers, who probably had in mind the situation where this expression is typed in directly by a user, who knows whether p>q or p<q. It’s a terrible idea in a programming language, however. Almost always, the logic of the program will require either an increasing sequence or a decreasing sequence, not one direction some times and the other direction other times. And in many programs, the empty sequence will be needed as a valid limiting case, and of course is never generated with automatic reversal.

Now, it clearly is possible to write working programs despite this design flaw. For a sequence starting at 1, seq(length=n) will work. You might think that seq(1,n,by=1) would also work, but instead it produces an error message when n=0. You can also use an “if” statement to bypass the “for” loop when it shouldn’t be done, as below:

   if (n>1)
   { for (j in 1 : (n-1))
     { s <- s + f(a+j*h,...)
     }
   }

Unfortunately, all these solutions are more work to write, and harder to read. It’s much easier to write programs that don’t guard against problems of reversing sequences (which crop up in places other than “for” loops too). The result is programs that mostly work, most of the time.

Can this flaw in R be fixed? At this point, changing the specification of the “:” operator is impossible — it would invalidate too many existing programs. New operators could be introduced, however. For example, “:>:” could generate an increasing sequence, with a:>:b producing the empty sequence if a>b. Similarly, “:<:” could generate a decreasing sequence. (Some might think these operators should be reversed, but I think they’re the right way around if you view “>” and “<” as arrows, not less than and greater than signs.) With these new operators, it would become almost as easy to write reliable programs as to write unreliable ones.


Diag| Memory: Current usage: 36291 KB
Diag| Memory: Peak usage: 37061 KB