# Automatic Simulation Queueing in R

December 28, 2010

(This article was first published on johnramey » r, and kindly contributed to R-bloggers)

I spend much of my time writing R code for simulations that compare the supervised classification methods I have developed with similar classifiers from the literature. A major challenge is determining which datasets (whether artificial/simulated or real) make interesting comparisons. Even if we restrict ourselves to multivariate Gaussian data, there are a large number of covariance matrix configurations that we could use to simulate the data. In other words, there are too many possibilities to consider all of them. However, it is often desirable to consider as many as possible.

Parallel processing has certainly reduced the runtime of my simulations. In fact, most of them are embarrassingly parallel, so I can run multiple simulations side by side.

I have been searching for ways to automate a lot of what I do, so that I can spend less time on the mundane portions of simulation and focus on classification improvement. As a first attempt, I have written some R code that generates a Bash script that can be queued on my university’s high-performance computer. The function that creates the Bash script is create.shell.file(), which is given here:

    # Arguments:
    #   shell.file: The name of the shell file (usually ends in '.sh').
    #   r.file: The name of the R file that contains the actual R simulation.
    #   output.file: The name of the file where all output will be echoed.
    #   r.options: The options used when R is called.
    #   sim.args: The simulation arguments that will be passed to the R file.
    #   chmod: Should the created shell file be chmod'ed?
    #   chmod.permissions: The permissions applied when chmod is TRUE.
    #
    create.shell.file <- function(shell.file, r.file, output.file,
                                  r.options = "--no-save --slave",
                                  sim.args = NULL, chmod = TRUE,
                                  chmod.permissions = "750") {
      args.string <- ''
      if(!is.null(sim.args)) args.string <- paste('--args', sim.args)
      r.command <- paste('R', r.options, args.string, '<', r.file, '>', output.file)

      # Divert the output of cat() into the shell file until sink() is called again.
      sink(shell.file)
      cat('#!/bin/bash\n')
      cat('#PBS -S /bin/bash\n')
      cat('echo "Starting R at `date`"\n')
      cat(r.command, '\n')
      cat('echo "R run completed at `date`"\n')
      sink()

      # If the chmod flag is TRUE, then we will chmod the created file
      # to have the appropriate chmod.permissions.
      if(chmod) {
        chmod.command <- paste("chmod", chmod.permissions, shell.file)
        system(chmod.command)
      }
    }

To actually queue the simulation, we make a call to queue.sim():

    # Arguments:
    #   sim.config.df: a dataframe that contains the current simulation configuration.
    #   sim.type: The type of the simulation. It is prepended to the configuration
    #     to form the name of the queued job, and the R file is '<sim.type>.r'.
    #   np: The number of processors to use for this simulation.
    #   npn: The number of processors to use per node for this simulation.
    #   email: The email address that will be notified upon completion or an error.
    #   cleanup: Delete all of the shell files after the simulations are queued?
    #   verbose: Echo the status of the current task?
    #
    queue.sim <- function(sim.config.df, sim.type = "rlda-duin", np = 1, npn = 1,
                          email = "[email protected]", cleanup = FALSE,
                          verbose = TRUE) {
      sim.config <- paste(names(sim.config.df), sim.config.df, collapse = "-", sep = "")
      sim.name <- paste(sim.type, "-", sim.config, sep = "")
      shell.file <- paste(sim.name, ".sh", sep = "")
      r.file <- paste(sim.type, '.r', sep = '')
      out.file <- paste(sim.name, '.out', sep = '')
      sim.args <- paste(sim.config.df, collapse = " ")

      if(verbose) {
        cat("sim.config:", sim.config, "\n")
        cat("sim.name:", sim.name, "\n")
        cat("shell.file:", shell.file, "\n")
        cat("r.file:", r.file, "\n")
        cat("out.file:", out.file, "\n")
        cat("sim.args:", sim.args, "\n")
      }

      if(verbose) cat("Creating shell file\n")
      create.shell.file(shell.file, r.file, out.file, sim.args = sim.args)
      if(verbose) cat("Creating shell file...done!\n")

      # Example
      # scasub -np 8 -npn 8 -N "rlda-prostate" -m "[email protected]" ./rlda-prostate.sh
      if(verbose) cat("Queueing simulation\n")
      queue.command <- paste("scasub -np ", np, " -npn ", npn, " -N '", sim.name,
                             "' -m '", email, "' ./", shell.file, sep = "")
      if(verbose) cat("Queue command:\t", queue.command, "\n")
      system(queue.command)
      if(verbose) cat("Queueing simulation...done!\n")

      if(cleanup) {
        if(verbose) cat("Cleaning up shell files\n")
        file.remove(shell.file)
        if(verbose) cat("Cleaning up shell files...done\n")
      }
    }

Let’s look at an example to see what is actually happening. Suppose that we have a simulation file called “gaussian-sim.r” that generates N observations from each of two different p-dimensional Gaussian distributions, both having the identity covariance matrix. Of course, this is a boring example, but it’s a start. One interesting question that always arises is: “Does classification performance degrade for small values of N and (extremely) large values of p?” We may wish to answer this question with a simulation study that looks at many values of N and many values of p to see if we can find a cutoff where classification performance declines. Let’s further suppose that we will repeat the experiment B times for each configuration. (As a note, I’m not going to actually examine the gaussian-sim.r file or its contents here. I may return to this example later and extend it, but for now I’m going to focus on the automated queueing.) We can queue the simulation for each of several configurations with the following code:

    library('plyr')

    sim.type <- "gaussian-sim"
    np <- 8
    npn <- 8
    verbose <- TRUE
    cleanup <- TRUE

    N <- seq.int(10, 50, by = 10)
    p <- seq.int(250, 1000, by = 250)
    B <- 1000

    sim.configurations <- expand.grid(N = N, p = p, B = B)

    # Queue a simulation for each simulation configuration
    d_ply(sim.configurations, .(N, p, B), queue.sim, sim.type = sim.type,
          np = np, npn = npn, cleanup = cleanup, verbose = verbose)

This will create a Bash script with a descriptive name. For example, with the above code, a file called “gaussian-sim-N10-p1000-B1000.sh” is created. Here are its contents:

    #!/bin/bash
    #PBS -S /bin/bash
    echo "Starting R at `date`"
    R --no-save --slave --args 10 1000 1000 < gaussian-sim.r > gaussian-sim-N10-p1000-B1000.out
    echo "R run completed at `date`"
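The descriptive name comes from gluing each column name of the configuration to its value. The sketch below reproduces that step outside of queue.sim() for a single configuration row, and also counts how many jobs the example grid would queue. One caveat, stated as an assumption rather than something tested on the cluster: very large values (for example 100000) may be converted to scientific notation by paste(), which would change the file name.

```r
# One configuration row, in the form d_ply() passes to queue.sim()
sim.config.df <- data.frame(N = 10, p = 1000, B = 1000)
sim.type <- "gaussian-sim"

# Glue each column name to its value, then join the pieces with "-"
sim.config <- paste(names(sim.config.df), sim.config.df, collapse = "-", sep = "")
sim.name <- paste(sim.type, "-", sim.config, sep = "")
shell.file <- paste(sim.name, ".sh", sep = "")

# The example grid has 5 values of N, 4 values of p, and 1 value of B
sim.configurations <- expand.grid(N = seq.int(10, 50, by = 10),
                                  p = seq.int(250, 1000, by = 250),
                                  B = 1000)
n.jobs <- nrow(sim.configurations)

cat("shell.file:", shell.file, "\n")
cat("number of queued jobs:", n.jobs, "\n")
```

Each of the rows in sim.configurations becomes one shell file and one queued job.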

A note about the generated shell file: the actual call to R can be customized, but this call has worked well for me. I could certainly call R in batch mode instead, but I never have, for no particular reason. Perhaps one is more efficient than the other? I’m not sure about this.
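For readers who want to customize the call, here is a minimal sketch of how the command string could be built for three common ways of running an R script non-interactively. The helper build.r.command() and its mode argument are hypothetical and not part of the code above; the "redirect" mode mirrors what create.shell.file() already generates, while the Rscript and R CMD BATCH variants are alternatives I have not benchmarked against it.

```r
# Hypothetical helper (not part of the original code): build the command line
# for one of three ways to run an R script with command-line arguments.
build.r.command <- function(r.file, output.file, sim.args,
                            mode = c("redirect", "rscript", "batch")) {
  mode <- match.arg(mode)
  args.string <- paste(sim.args, collapse = " ")
  switch(mode,
    # Same form as the command generated by create.shell.file()
    redirect = paste("R --no-save --slave --args", args.string,
                     "<", r.file, ">", output.file),
    # Rscript takes the script first and the arguments after it
    rscript = paste("Rscript --vanilla", r.file, args.string,
                    ">", output.file),
    # R CMD BATCH takes the arguments as one quoted '--args ...' option
    batch = paste0("R CMD BATCH --no-save '--args ", args.string, "' ",
                   r.file, " ", output.file)
  )
}

redirect.cmd <- build.r.command("gaussian-sim.r", "sim.out", c(10, 1000, 1000))
rscript.cmd <- build.r.command("gaussian-sim.r", "sim.out", c(10, 1000, 1000),
                               mode = "rscript")
batch.cmd <- build.r.command("gaussian-sim.r", "sim.out", c(10, 1000, 1000),
                             mode = "batch")
cat(redirect.cmd, rscript.cmd, batch.cmd, sep = "\n")
```

In all three cases the simulation script can read its arguments with commandArgs(trailingOnly = TRUE).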

Next, for each *.sh file created, the following command is executed to queue the R script using the above configuration.

    scasub -np 8 -npn 8 -N 'gaussian-sim-N10-p1000-B1000' -m '[email protected]' ./gaussian-sim-N10-p1000-B1000.sh

The scasub command is used for my university’s HPC. I know that there are other systems out there, but you can always alter my code to suit your needs. Of course, your R script needs to take advantage of the commandArgs() function in R to use the above code.
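As an illustration of the commandArgs() side, the sketch below shows one way the top of a simulation script might read its arguments. The helper parse.sim.args() is hypothetical, and the argument order (N, p, B) is an assumption that has to match the column order of the configuration data frame; it is not taken from the actual gaussian-sim.r file.

```r
# Hypothetical helper: convert the trailing command-line arguments into a
# named list of simulation settings. Assumes the order N, p, B.
parse.sim.args <- function(args = commandArgs(trailingOnly = TRUE)) {
  if (length(args) != 3) {
    stop("Expected three arguments: N p B")
  }
  values <- suppressWarnings(as.integer(args))
  if (any(is.na(values))) {
    stop("All arguments must be integers")
  }
  list(N = values[1], p = values[2], B = values[3])
}

# Inside the simulation script one would call parse.sim.args() with no
# arguments. Here we pass the strings that 'R ... --args 10 1000 1000'
# would supply, so the example can be run interactively.
config <- parse.sim.args(c("10", "1000", "1000"))
cat("N =", config$N, " p =", config$p, " B =", config$B, "\n")
```

Failing early on a wrong number of arguments is a deliberate choice here: a misconfigured job then stops immediately instead of occupying a node.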
