Managing a statistical analysis project – guidelines and best practices

September 30, 2010

(This article was first published on R-statistics blog, and kindly contributed to R-bloggers)

In the past two years, a growing community of R users (and statisticians in general) has been participating in two major Question-and-Answer websites:

  1. The R tag page on Stackoverflow, and
  2. Stat over flow (which will soon move to a new domain; no worries, I’ll write about it once it happens)

In that time, several long (and fascinating) discussion threads were started, reflecting on tips and best practices for managing a statistical analysis project.  They are:

On the last thread in the list, the user chl started compiling all the tips and suggestions together.  With his permission, I am now republishing that compilation here.  I encourage you to contribute from your own experience (either in the comments, or by answering any of the threads I’ve linked to).

From here on is what “chl” wrote:

These guidelines were compiled from SO (as suggested by @Shane), Biostar (hereafter, BS), and SE. I tried my best to acknowledge ownership for each item, and to select the first or most highly upvoted answers. I also added things of my own, and flagged items that are specific to the [R] environment.

Data management

  • create a project structure for keeping everything in the right place (data, code, figures, etc., giovanni/BS)
  • never modify raw data files (ideally, they should be read-only); copy or rename them to new files when making transformations, cleaning, etc.
  • check data consistency (whuber /SE)
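As a rough illustration of the first two items, here is a minimal Python sketch (the same idea carries over to R or a shell script). The directory names in `SUBDIRS` and the helper names `create_project` and `protect_raw_data` are assumptions made up for this example, not a layout recommended in the threads.

```python
import stat
from pathlib import Path

# Hypothetical layout; these directory names are illustrative only.
SUBDIRS = ("data/raw", "data/processed", "code", "figures", "reports")


def create_project(root):
    """Create the project skeleton under `root` and return it as a Path."""
    root = Path(root)
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root


def protect_raw_data(root):
    """Clear the write bits on every file under data/raw.

    Returns the list of files that were made read-only.
    """
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    protected = []
    for path in sorted((Path(root) / "data" / "raw").rglob("*")):
        if path.is_file():
            path.chmod(path.stat().st_mode & ~write_bits)
            protected.append(path)
    return protected
```

Cleaned or transformed copies would then be written to `data/processed`, so the files under `data/raw` are never touched again.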

Coding

  • organize source code in logical units or building blocks (Josh Reich/hadley/ars /SO; giovanni/Khader Shameer /BS)
  • separate source code from editing stuff, especially for large projects (partly overlapping with the previous item and with reporting)
  • document everything, with e.g. [R]oxygen (Shane /SO) or consistent self-annotation in the source file
  • [R] custom functions can be put in a dedicated file (that can be sourced when necessary), in a new environment (so as to avoid populating the top-level namespace, Brendan OConnor /SO), or a package (Dirk Eddelbuettel/Shane /SO)

Analysis

  • don’t forget to set/record the seed you used when calling an RNG or stochastic algorithms (e.g. k-means)
  • for Monte Carlo studies, it may be interesting to store specs/parameters in a separate file (Sumatra may be a good candidate, giovanni /BS)
  • don’t limit yourself to one plot per variable, use multivariate (Trellis) displays and interactive visualization tools (e.g. GGobi)
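The first two items can be combined: keep the seed and the simulation parameters in one spec file, and write them back out next to the results. The sketch below is a toy example in plain Python. It does not use Sumatra, and the function name `run_study` and the spec keys (`seed`, `n_replicates`, `n_obs`, `mu`, `sigma`) are assumptions chosen for illustration.

```python
import json
import random
from pathlib import Path


def run_study(spec_path, log_path):
    """Run a toy Monte Carlo study described by a JSON spec file.

    The spec (including the seed) is stored alongside the result so the
    run can be reproduced later from the log alone.
    """
    spec = json.loads(Path(spec_path).read_text())
    # Dedicated generator seeded from the spec, instead of global state.
    rng = random.Random(spec["seed"])

    means = []
    for _ in range(spec["n_replicates"]):
        sample = [rng.gauss(spec["mu"], spec["sigma"])
                  for _ in range(spec["n_obs"])]
        means.append(sum(sample) / len(sample))

    result = {"spec": spec, "mean_of_means": sum(means) / len(means)}
    Path(log_path).write_text(json.dumps(result, indent=2))
    return result
```

Running it twice with the same spec file should give identical results, which is a quick check that the seed is really being honoured.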

Versioning

  • use some kind of version control system (VCS) for easy tracking/export, e.g. Git (Sharpie/VonC/JD Long /SO); this follows from nice questions asked by @Jeromy and @Tal
  • backup everything, on a regular basis (Sharpie/JD Long /SO)
  • keep a log of your ideas, or rely on an issue tracker, like ditz (giovanni /BS) — partly redundant with the previous item since it is available in Git
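For the backup item, even a small script run on a schedule is better than nothing. Here is a minimal Python sketch that writes a timestamped zip archive of a project folder. The function name `backup_project` and the archive naming scheme are assumptions for this example, and it complements rather than replaces version control.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_project(project_dir, backup_dir):
    """Write a timestamped zip archive of `project_dir` into `backup_dir`.

    `backup_dir` should live outside `project_dir`, otherwise older
    archives would be swept into each new backup.
    """
    project_dir = Path(project_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    base_name = backup_dir / f"{project_dir.name}-{stamp}"
    # make_archive appends the ".zip" extension and returns the full path.
    archive = shutil.make_archive(str(base_name), "zip",
                                  root_dir=str(project_dir))
    return Path(archive)
```

Pointing `backup_dir` at a different disk or a synced folder keeps the archives safe if the working machine fails.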

Editing/Reporting
