A course in statistical programming

May 25, 2012

(This article was first published on The stupidest thing... » R, and kindly contributed to R-bloggers)

Graduate students in statistics often take (or at least have the opportunity to take) a statistical computing course, but such courses tend to focus on methods (like numerical linear algebra, the EM algorithm, and MCMC) rather than on actual coding.

For example, here’s a course in “advanced statistical computing” that I taught at Johns Hopkins back in 2001.

Many (perhaps most) good programmers learned to code outside of formal courses. But many statisticians are terrible programmers and would benefit from a formal course.

Moreover, applied statisticians spend the vast majority of their time interacting with a computer and would likely benefit from more formal presentations of how to do it well. And I think this sort of training is particularly important for ensuring that research is reproducible.

One really learns to code in private, struggling over problems, but I benefited enormously from a statistical computing course I took from Phil Spector at Berkeley.

Brian Caffo, Ingo Ruczinski, Roger Peng, Rafael Irizarry, and I developed a statistical programming course at Hopkins that (I think) really did the job.

I would like to develop a similar course at Wisconsin: on statistical programming, in the most general sense.

I have in mind several basic principles:

  • be self-sufficient
  • get the right answer
  • document what you did (so that you will understand what you did 6 months later)
  • if primary data change, be able to re-run the analysis without a lot of work
  • make simulation results reproducible
  • reuse code (others’ and your own) rather than starting from scratch every time
  • make methods accessible to (and used by) others
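
As a rough illustration of the reproducible-simulation principle above, here is a minimal sketch in Python. The function name `simulate_mean` and the seed value are hypothetical, chosen only for this example; the same idea applies in R by calling `set.seed()` before the simulation. The point is that a recorded seed and a local random generator let you regenerate exactly the same results later.

```python
import random
import statistics


def simulate_mean(n_draws, seed):
    """Return the mean of n_draws standard-normal draws from a seeded generator."""
    # A local generator avoids depending on hidden global random state.
    rng = random.Random(seed)
    draws = [rng.gauss(0.0, 1.0) for _ in range(n_draws)]
    return statistics.fmean(draws)


# Hypothetical seed; the idea is to record it alongside the results.
SEED = 20120525

first_run = simulate_mean(1000, SEED)
second_run = simulate_mean(1000, SEED)
different_seed = simulate_mean(1000, SEED + 1)

# Same seed gives identical results; a different seed gives a different draw.
print("same seed reproduces result:", first_run == second_run)
print("different seed changes result:", first_run != different_seed)
```

Passing the seed as an argument, rather than setting it once at the top of a script, keeps each simulation reproducible even when the surrounding code is reordered.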

Here are my current thoughts about the topics to include in such a course. The key aim would be to make students aware of the basic principles and issues: to give them a good base from which to learn on their own. Homework would include interesting and realistic programming assignments, plus creating a Sweave-type document and an R package.

  • Basic Unix tools (find; df; top; ps ux; grep); Unix on Mac and Windows
  • Emacs/vim/other editors (RStudio/Eclipse)
  • LaTeX (for papers; for presentations)
  • slides for talks; posters; figures/tables
  • Advanced R (fancy data structures; functions; object-oriented stuff)
  • Advanced R graphics
  • R packages
  • Sweave/asciidoc/knitr
  • minimal Perl (or Python or Ruby); example of data manipulation
  • Minimal C (or C++); examples of speed-up
  • version control (e.g., git or mercurial); backups
  • reproducible research ideas
  • data management
  • managing projects: data, analyses, results, papers
  • programming style (readable, modular); general but not too general
  • debugging/profiling/testing
  • high-throughput computing; parallel computing; managing big jobs
  • finding answers to questions: man pages; documentation; web
  • more on visualization; dynamic graphics
  • making a web page; HTML & CSS; simple CGI-type web forms?
  • writing and managing email
  • managing references to journal articles
