HPC for biological research

August 28, 2011

(This article was first published on Inundata » R, and kindly contributed to R-bloggers)

In early May I had the opportunity to attend a workshop on using high-performance computing (HPC) in R, hosted at NIMBioS. I’ve been meaning to write a summary of the meeting ever since but got sidetracked by various other projects. Since a collaborator recently asked for meeting notes, I finally took the time to write this post.

The meeting was jointly organized by folks from NIMBioS and the remote data analysis and visualization group (rDAV). The idea behind the workshop was to introduce biologists dealing with big-data problems to a variety of analytical tools (mostly R) and visualization tools (R and a few other open-source tools). The presentations were either technical (HPC resources, tools, demos) or application-oriented.

Of the technical talks (HPC intro, utilities), the one I found most valuable was by Pragnesh Patel from rDAV, who did an excellent job outlining the ins and outs of running R on a cluster. Slides from his talk are available here. A more recent summary from his UseR! 2011 presentation is available here.

On the application side, there were a couple of talks from NIMBioS scientists: one by Michael Gilchrist on evolutionary bioinformatics (pdf of slides), and the other by a NIMBioS postdoc, William Godsoe, on using HPC to build species distribution models (doi:10.1093/sysbio/syq005).

In addition to R, we also discussed other open-source tools for visualizing large datasets.

Although I wrote a detailed post on how to use Amazon’s EC2 cloud for HPC, this workshop convinced me to use resources that NSF already provides. TeraGrid is a portal that provides access to numerous cluster resources funded by NSF or one of its partners. Using their XSEDE portal (which has replaced POPS), academics can request an allocation of computing time. For new and exploratory projects, there are ‘starter grants’ where one can get a rather generous allocation within a fairly short time. Larger allocations involve a review process. If the work you are seeking time for is already actively funded, the review is likely to move faster since the project has already been favorably reviewed. Amazon’s computing cluster is still a useful service, but there is no need to spend grant money elsewhere when NSF already provides these resources. As more scientists use and acknowledge TeraGrid’s resources in their publications, that will provide the incentive and justification for organizations like rDAV to continue seeking funding, especially in today’s budgetary climate.
