# Large search spaces using R

February 3, 2012
By

(This article was first published on alexfarquhar's posterous, and kindly contributed to R-bloggers)

I’m working on some really interesting stuff at the moment, the details of which I can’t discuss for reasons of national security (not really). However, one of the things I’ve been doing a lot of is searching though lots of different combinations of parameters to find an optimal solution.

Combinations of things are annoyingly…combinatorial. Let’s say I have a machine that generates gold coins. The machine has 6 dials, each with 10 different positions, and the machine creates different amounts of gold for each setting. How many settings do I have to search through to find the best one? The answer is 10 ^ 6 , or 1,000,000. That’s not too bad, you might think, we’ve got those computer thingies these days. But if each combination takes 100ms to measure, thats over 27 hours to test all combinations.

I have a very similar situation, where I’m working through millions of combinations. Each combination then needs to be tested (via some R code) to determine how “good” it is. If each test run takes over a day to complete then it’s a problem, especially if you want to keep iterating with different scoring algorithms.

Another issue is in the generation of each combination. Initially I relied on the lovely `expand.grid` function in R. This will create all combinations of it’s arguments (which are lists), and create a matrix, each row representing a different combination. This works fine for reasonable numbers, but once your combination count starts going into the millions, your memory starts running out…bah!

The answer to the memory issue was surprisingly simple – if there are n different parameters, each with 10 possible values, then think of each parameter occupying a position in an n-length integer e.g. 2314 would represent a combination with “2” as the value for parameter 1, “3” for parameter 2 etc. So if I wanted to search all combinations, I just go from 0000 to 9999, each time incrementing the number to get a new combination.

But what if don’t have 10 values for each parameter? What if we have 2 for param 1, 5 for param 2, 8 for param 3 and 2 for param 4? In this case I needed to implement a “special” kind of number, which had different rules for incrementing. Here’s the code:

` next.comb = function(p){`

```  if(all(sapply(p, function(d) d\$x == d\$max))){   return(NULL)  }  for(i in length(p):1){   val = p[[i]]\$x   new.val = val + 1   if(new.val <= p[[i]]\$max){    p[[i]]\$x = new.val    return(p)   } else {    p[[i]]\$x = p[[i]]\$min   }  }  return(p) } ```

` # start number p = list(list(x = 1, min = 1, max = 2),     list(x = 1, min = 1, max = 5),     list(x = 1, min = 1, max = 8),     list(x = 1, min = 1, max = 2)) # start combination: 1 => 1, 2 => 1, 3 => 1, 4 => 1 p = next.comb(p) # combination is now: 1 => 1, 2 => 1, 3 => 1, 4 => 2 p = next.comb(p) # combination is now: 1 => 1, 2 => 1, 3 => 1, 4 => 3 `

Ugly, huh? But it works and takes up almost no memory, allowing the traversal of millions (or even billions) of different combinations. Next time I’ll go into how I solved the other problem I outlined – the length of time to score every combination. Oh alright, I’ll tell you now – I just used the multicore package and mclapply to split the problem up – simple!

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...