A course in statistical programming

Posted on May 25, 2012 by Karl Broman in R bloggers

[This article was first published on The stupidest thing... » R, and kindly contributed to R-bloggers].

Graduate students in statistics often take (or at least have the opportunity to take) a statistical computing course, but such courses tend to focus on methods (like numerical linear algebra, the EM algorithm, and MCMC) rather than on actual coding.

For example, here’s a course in “advanced statistical computing” that I taught at Johns Hopkins back in 2001.

Many (perhaps most) good programmers learned to code outside of formal courses. But many statisticians are terrible programmers and would benefit from a formal course.

Moreover, applied statisticians spend the vast majority of their time interacting with a computer and would likely benefit from more formal presentations of how to do it well. And I think this sort of training is particularly important for ensuring that research is reproducible.

One really learns to code in private, struggling over problems, but I benefited enormously from a statistical computing course I took from Phil Spector at Berkeley.

Brian Caffo, Ingo Ruczinski, Roger Peng, Rafael Irizarry, and I developed a statistical programming course at Hopkins that (I think) really did the job.

I would like to develop a similar course at Wisconsin: on statistical programming, in the most general sense.

I have in mind several basic principles:

be self-sufficient
get the right answer
document what you did (so that you will understand what you did 6 months later)
if primary data change, be able to re-run the analysis without a lot of work
make your simulation results reproducible
reuse code (others’ and your own) rather than starting from scratch every time
make methods accessible to (and used by) others
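
A minimal sketch of the simulation principle above, written in Python purely for illustration (in R, `set.seed` plays the same role). The function name `simulate_mean` and the seed value are hypothetical; the point is only that a recorded seed lets you regenerate exactly the same result later.

```python
import random


def simulate_mean(n_draws, seed):
    """Return the mean of n_draws standard-normal draws from a seeded generator."""
    rng = random.Random(seed)  # private generator; leaves global random state alone
    draws = [rng.gauss(0.0, 1.0) for _ in range(n_draws)]
    return sum(draws) / n_draws


# Re-running with the same seed reproduces the result exactly,
# so the seed should be stored alongside the output.
first = simulate_mean(1000, seed=20120525)
second = simulate_mean(1000, seed=20120525)
print(first == second)
```

Passing the seed as an explicit argument, rather than seeding once at the top of a script, makes it harder to lose track of which seed produced which result.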

Here are my current thoughts about the topics to include in such a course. The key aim would be to make students aware of the basic principles and issues: to give them a good base from which to learn on their own. Homework would include interesting and realistic programming assignments, plus creating a Sweave-type document and an R package.

Basic Unix tools (find; df; top; ps ux; grep); Unix on Mac and Windows
Emacs/vim/other editors (RStudio/Eclipse)
LaTeX (for papers; for presentations)
slides for talks; posters; figures/tables
Advanced R (fancy data structures; functions; object-oriented stuff)
Advanced R graphics
R packages
Sweave/AsciiDoc/knitr
minimal Perl (or Python or Ruby); example of data manipulation
Minimal C (or C++); examples of speed-up
version control (e.g., git or mercurial); backups
reproducible research ideas
data management
managing projects: data, analyses, results, papers
programming style (readable, modular); general but not too general
debugging/profiling/testing
high-throughput computing; parallel computing; managing big jobs
finding answers to questions: man pages; documentation; web
more on visualization; dynamic graphics
making a web page; HTML & CSS; simple CGI-type web forms?
writing and managing email
managing references to journal articles
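
To give a flavor of the "example of data manipulation" item in the list above, here is a small sketch of the kind of task I have in mind, using Python (one of the scripting languages mentioned). The data, the column names, and the helper `group_means` are all made up for illustration; a real assignment would read from a file rather than an inline string.

```python
import csv
import io
from collections import defaultdict

# Hypothetical data; in practice this would come from a file on disk.
RAW = """strain,weight
A,21.5
B,19.0
A,22.5
B,20.0
"""


def group_means(text, key, value):
    """Return {group: mean of value} for CSV text that has a header row."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        groups[row[key]].append(float(row[value]))
    return {name: sum(vals) / len(vals) for name, vals in groups.items()}


print(group_means(RAW, "strain", "weight"))  # {'A': 22.0, 'B': 19.5}
```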
