Automatic Simulation Queueing in R

I spend much of my time writing R code for simulations that compare the supervised classification methods I have developed with similar classifiers from the literature.  A major challenge is determining which datasets (whether artificial/simulated or real) make for interesting comparisons.  Even if we restricted ourselves to multivariate Gaussian data, there is a large number of covariance-matrix configurations we could use to simulate the data.  In other words, there are too many possibilities to consider all of them, yet it is often desirable to consider as many as possible.

Parallel processing has certainly reduced the runtime of my simulations.  In fact, most of my simulations are ridiculously parallelizable, so I can run multiple simulations side by side.

I have been searching for ways to automate much of what I do, so that I can spend less time on the mundane portions of simulation and focus on improving classification.  As a first attempt, I have written some R code that generates a Bash script that can be queued on my university’s high-performance computing cluster. The function that creates the Bash script is create.shell.file(), given here:

# Arguments:
#	shell.file: The name of the shell file (usually ends in '.sh').
#	r.file: The name of the R file that contains the actual R simulation.
#	output.file: The name of the file where all output will be echoed.
#	r.options: The command-line options used when R is invoked.
#	sim.args: The simulation arguments passed to the R file via '--args'.
#	chmod: Should the generated shell file be made executable via 'chmod'?
#	chmod.permissions: The permissions passed to 'chmod' (default "750").
#
create.shell.file <- function(shell.file, r.file, output.file, r.options = "--no-save --slave", sim.args = NULL, chmod = TRUE, chmod.permissions = "750") {
	args.string <- ''
	if(!is.null(sim.args)) args.string <- paste('--args', sim.args)
	r.command <- paste('R', r.options, args.string, '<', r.file, '>', output.file)
	sink(shell.file)
		cat('#!/bin/bash\n')
		cat('#PBS -S /bin/bash\n')
		cat('echo "Starting R at `date`"\n')
		cat(r.command, '\n')
		cat('echo "R run completed at `date`"\n')
	sink()
 
	# If the chmod flag is TRUE, then we will chmod the created file to have the appropriate chmod.permissions.
	if(chmod) {
		chmod.command <- paste("chmod", chmod.permissions, shell.file)
		system(chmod.command)
	}
}
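
As a quick illustration, a direct call to create.shell.file() might look like the following (the file names and arguments here are made up). Because chmod defaults to TRUE, the generated script is written to the working directory and marked with permissions 750.

# Hypothetical example: generate and chmod a shell script for one simulation.
create.shell.file(shell.file = "my-sim.sh",
                  r.file = "my-sim.r",
                  output.file = "my-sim.out",
                  sim.args = "25 500 1000")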

To actually queue the simulation, we make a call to queue.sim():

# Arguments:
#	sim.config.df: a dataframe that contains the current simulation configuration.
#	sim.type: The type of simulation; it is combined with the configuration to form the job name and the file names.
#	np: The number of processors to use for this simulation.
#	npn: The number of processors to use per node for this simulation.
#	email: The email address that will be notified upon completion or an error.
#	cleanup: Delete all of the shell files after the simulations are queued?
#	verbose: Echo the status of the current task?
#
queue.sim <- function(sim.config.df, sim.type = "rlda-duin", np = 1, npn = 1, email = "[email protected]", cleanup = FALSE, verbose = TRUE) {
	sim.config <- paste(names(sim.config.df), sim.config.df, collapse = "-", sep = "")
	sim.name <- paste(sim.type, "-", sim.config, sep = "")
	shell.file <- paste(sim.name, ".sh", sep = "")
	r.file <- paste(sim.type, '.r', sep = '')
	out.file <- paste(sim.name, '.out', sep = '')
	sim.args <- paste(sim.config.df, collapse = " ")
 
	if(verbose) {
		cat("sim.config:", sim.config, "\n")
		cat("sim.name:", sim.name, "\n")
		cat("shell.file:", shell.file, "\n")
		cat("r.file:", r.file, "\n")
		cat("out.file:", out.file, "\n")
		cat("sim.args:", sim.args, "\n")
	}
 
	if(verbose) cat("Creating shell file\n")
	create.shell.file(shell.file, r.file, out.file, sim.args = sim.args)
	if(verbose) cat("Creating shell file...done!\n")
 
	# Example
	# scasub -np 8  -npn 8 -N "rlda-prostate" -m "[email protected]" ./rlda-prostate.sh
	if(verbose) cat("Queueing simulation\n")
	queue.command <- paste("scasub -np ", np, " -npn ", npn, " -N '", sim.name, "' -m '", email, "' ./", shell.file, sep = "")
	if(verbose) cat("Queue command:\t", queue.command, "\n")
	system(queue.command)
	if(verbose) cat("Queueing simulation...done!\n")
 
	if(cleanup) {
		if(verbose) cat("Cleaning up shell files\n")
		file.remove(shell.file)
		if(verbose) cat("Cleaning up shell files...done\n")
	}
}

Let’s look at an example to see what is actually happening. Suppose that we have a simulation file called “gaussian-sim.r” that generates N observations from two different p-dimensional Gaussian distributions, each having the identity covariance matrix. Of course, this is a boring example, but it’s a start. One interesting question that always arises is: “Does classification performance degrade for small values of N and (extremely) large values of p?” We may wish to answer this question with a simulation study by looking at many values of N and many values of p to see if we can find a cutoff where classification performance declines. Let’s further suppose that for each configuration we will repeat the experiment B times. (As a note, I’m not going to examine the gaussian-sim.r file or its contents here. I may return to this example later and extend it, but for now I’m going to focus on the automated queueing.) We can queue a simulation for each of several configurations with the following code:

library('plyr')
sim.type <- "gaussian-sim"
np <- 8
npn <- 8
verbose <- TRUE
cleanup <- TRUE
 
N <- seq.int(10, 50, by = 10)
p <- seq.int(250, 1000, by = 250)
B <- 1000
 
sim.configurations <- expand.grid(N = N, p = p, B = B)
 
# Queue a simulation for each simulation configuration
d_ply(sim.configurations, .(N, p, B), queue.sim, sim.type = sim.type, np = np, npn = npn, cleanup = cleanup, verbose = verbose)

This will create a Bash script with a descriptive name for each configuration. For example, with the above code, a file called “gaussian-sim-N10-p1000-B1000.sh” is created. Here are its contents:

#!/bin/bash
#PBS -S /bin/bash
echo "Starting R at `date`"
R --no-save --slave --args 10 1000 1000 < gaussian-sim.r > gaussian-sim-N10-p1000-B1000.out
echo "R run completed at `date`"

A note about the generated shell file: the actual call to R can be customized, but this invocation has worked well for me. I certainly could call R in batch mode instead (via R CMD BATCH), but I have not had a specific reason to. Perhaps one is more efficient than the other? I’m not sure.
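
For reference, a rough batch-mode counterpart of the generated call, using the same example arguments, would look something like this (I have not compared the two):

R CMD BATCH --no-save '--args 10 1000 1000' gaussian-sim.r gaussian-sim-N10-p1000-B1000.out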

Next, for each *.sh file created, the following command is executed to queue the R script using the above configuration.

scasub -np 8 -npn 8 -N 'gaussian-sim-N10-p1000-B1000' -m '[email protected]' ./gaussian-sim-N10-p1000-B1000.sh

The scasub command is specific to my university’s HPC system. I know that there are other queueing systems out there (qsub, for example), but you can always alter my code to suit your needs. Of course, your R script needs to read the passed arguments with R’s commandArgs() function for the above code to be useful.
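
For completeness, here is a minimal sketch (not the actual contents of gaussian-sim.r, just an illustration of the commandArgs() idiom) showing how the script can pick up the N, p, and B values passed after --args:

# Grab only the arguments that follow '--args' on the command line.
args <- commandArgs(trailingOnly = TRUE)
N <- as.integer(args[1])   # number of observations
p <- as.integer(args[2])   # dimension of each observation
B <- as.integer(args[3])   # number of repetitions
# ... the simulation itself would use N, p, and B from here on ...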
