Code Optimization: One R Problem, Eleven Solutions – Now Thirteen!

Posted on November 7, 2011 by Tony Breyal in R bloggers | 0 Comments

[This article was first published on Consistently Infrequent » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Following up from my previous post “Code Optimisation: One R Problem, Ten Solutions – Now Eleven!” I figured out a twelfth solution after writing that blog post. Furthermore, half way through writing this blog post I figured out a thirteenth solution too. As a recap, the problem is taken from rwiki where the goal is to find the most efficient code to produce strings in the following pattern, given a single integer input n:

# n = 4
[1] "i001.002" "i001.003" "i001.004" "i002.003" "i002.004" "i003.004"
# n = 5
 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i002.003" "i002.004" "i002.005" "i003.004"
 [9] "i003.005" "i004.005"
# n = 6
 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i001.006" "i002.003" "i002.004" "i002.005"
 [9] "i002.006" "i003.004" "i003.005" "i003.006" "i004.005" "i004.006" "i005.006"

The solution I had come up with was as follows:

generateIndex11 <- function(n) {
  # initialise vectors
  len <- (n-1)*n/2
  s <- vector(mode = "character", length = n)
  ind.1 <- vector(mode = "integer", length = len)
  ind.2 <- vector(mode = "integer", length = len)

  # set up strings
  s <- sprintf("%03d", seq_len(n))

  # calculate indicies
  ind.1 <- rep.int(1:(n-1), (n-1):1)
  ind.2 <- unlist(lapply(2:n, ":", n), use.names = FALSE)

  # paste strings together
  return(paste("i", s[ind.1], ".", s[ind.2], sep = ""))
}

However, once I had finished writing that blog post, I realised that a further improvement was possible by not having to calculate ind.1 at all. This is because ind.2 has all the information we need:

n=5; lapply(2:n, ":", n)
[[1]]
[1] 2 3 4 5

[[2]]
[1] 3 4 5

[[3]]
[1] 4 5

[[4]]
[1] 5

Did you notice it? The pattern we need is effectively “1 with 2,3,4,5 “, “2 with 3,4,5″, “3 with 4, 5″ and “4 with 5″. All of that information is available in the one list above. Therefore, simplifying, a twelfth solution is:

generateIndex12 <- function(n) {
  s <- sprintf("%03d", seq_len(n))
  unlist(lapply(1:(n-1),
         function(i) {
             paste("i", s[i], ".", s[seq.int(i+1, n)], sep = "")
         }), use.names = FALSE)
}

# check
identical(generateIndex11(500), generateIndex12(500)
[1] TRUE

I would love to use the Rcpp package to produce a C++ version of the code above but unfortuantly that is beyound my current skill set but I’m sure it would result is a significant speed up. Perhaps.

Furthermore, a slightly faster thirteenth solution if we remember that initialising vectors and calculating indices independently before using lapply adds some speed too:


generateIndex13 <- function(n) {
  # initialise vectors
  s <- vector(mode = "character", length = n)
  ind <- vector(mode = "integer", length = (n-1)*n/2)
  # set up n unique strings
  s <- sprintf("%03d", seq_len(n))

  # calculate sequence of indicies for second part of strings e.g. "002" "003" etc in "i001.002" "i001.003"...
  ind <- lapply(2:n, ":", n)

  # paste strings together
  unlist(lapply(1:(n-1), function(i) paste("i", s[i], ".", s[ind[[i]]], sep = "") ), use.names = FALSE)
}
# check
identical(generateIndex11(500), generateIndex13(500)
[1] TRUE

Benchmarking the above (without compiled versions because of constraints on time)

# load packages
library(ggplot2)
library(rbenchmark)

get_bm <- function(n) {
  print(n)
  gc()
  df = benchmark(generateIndex10(n), generateIndex11(n), generateIndex12(n), generateIndex13(n),
                 columns = c("test", "elapsed", "relative"),
                 order = "elapsed",
                 replications = 100)
  df$n = n
  return(df)
}

# get benchmarks for various values of n
df.list <- lapply(seq(1000, 10000, by = 200), get_bm)
df <- do.call("rbind", df.list)

# plot results
ggplot(df.compiled, aes(x=n, y=elapsed, colour=test)) + geom_point() + stat_smooth() + xlab("N") + ylab("Time (seconds)")

Bearing in mind that each algorithm is replicated 100 times at each value of n, the thirteenth solution produces a speed up at larger values of n.

Benchmarks performed on R version 2.13.0 on a Windows Server 2008 R2 x64 Workstation PC, 12GB RAM, Intel Xeon CPU X5650 @ 2.66GHz.

So, anyone have a Fouteenth solution? I’d love to see a solution that is paralleled into independent threads as I think that is where the next speed up will come from. Or how about a version which makes use of the inline and Rcpp packages with the thirteenth solution re-written C++? I have to admit that evolving an algorithm in R is much easier than in C++ (not that I’ve used the latter in nearly 8 years).

Happy coding!