Graduate students in statistics often take (or at least have the opportunity to take) a statistical computing course, but such courses usually focus on methods (like numerical linear algebra, the EM algorithm, and MCMC) rather than on actual coding.
For example, here’s a course in “advanced statistical computing” that I taught at Johns Hopkins back in 2001.
Many (perhaps most) good programmers learned to code outside of formal courses. But many statisticians are terrible programmers and would benefit from a formal course.
Moreover, applied statisticians spend the vast majority of their time interacting with a computer and would likely benefit from more formal presentations of how to do it well. And I think this sort of training is particularly important for ensuring that research is reproducible.
I would like to develop such a course at Wisconsin: one on statistical programming, in the most general sense.
I have in mind several basic principles:
- be self-sufficient
- get the right answer
- document what you did (so that you will understand what you did 6 months later)
- if primary data change, be able to re-run the analysis without a lot of work
- make your simulation results reproducible
- reuse code (others’ and your own) rather than starting from scratch every time
- make methods accessible to (and used by) others
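The point about reproducible simulations can be made concrete with a minimal sketch. It is written in Python purely for illustration; the function name `simulate_mean` and the seed value are hypothetical. The idea is to give each simulation its own seeded random number generator, so that re-running it yields exactly the same numbers.

```python
import random
import statistics


def simulate_mean(n, seed):
    """Draw n standard-normal values from a private, seeded generator
    and return their mean. (Hypothetical helper, for illustration.)"""
    # A dedicated generator means other code that uses the global
    # random state cannot disturb this simulation's stream.
    rng = random.Random(seed)
    draws = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return statistics.fmean(draws)


if __name__ == "__main__":
    # The seed is arbitrary; what matters is that it is recorded.
    first = simulate_mean(1000, seed=20240101)
    second = simulate_mean(1000, seed=20240101)
    print(first == second)  # same seed, same draws, same result
```

Recording the seed alongside the results (in the script itself or in a log) is what lets someone else, or you six months later, regenerate the same output.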
Here are my current thoughts about the topics to include in such a course. The key aim would be to make students aware of the basic principles and issues: to give them a good base from which to learn on their own. Homework would include interesting and realistic programming assignments, plus creating a Sweave-type document and an R package.
- Basic Unix tools (find; df; top; ps ux; grep); Unix on Mac and Windows
- Emacs/Vim/other editors (RStudio/Eclipse)
- LaTeX (for papers; for presentations)
- slides for talks; posters; figures/tables
- Advanced R (fancy data structures; functions; object-oriented stuff)
- Advanced R graphics
- R packages
- minimal Perl (or Python or Ruby); example of data manipulation
- Minimal C (or C++); examples of speed-up
- version control (e.g., Git or Mercurial); backups
- reproducible research ideas
- data management
- managing projects: data, analyses, results, papers
- programming style (readable, modular); general but not too general
- high-throughput computing; parallel computing; managing big jobs
- finding answers to questions: man pages; documentation; web
- more on visualization; dynamic graphics
- making a web page; HTML & CSS; simple CGI-type web forms?
- writing and managing email
- managing references to journal articles
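To give a flavor of the "example of data manipulation" item in the list above, here is a small sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the helper `mean_by_group`, the column names, and the toy data are made up. It reads a CSV table, skips missing values, and computes a per-group mean.

```python
import csv
import io
from collections import defaultdict


def mean_by_group(csv_text, group_col, value_col):
    """Return {group: mean of value_col} for the rows of a CSV string.
    (Hypothetical helper, for illustration.)"""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row[value_col].strip()
        if value in ("", "NA"):
            # Skip missing values rather than guessing at them.
            continue
        totals[row[group_col]] += float(value)
        counts[row[group_col]] += 1
    return {group: totals[group] / counts[group] for group in totals}


if __name__ == "__main__":
    # Made-up toy data, not from any real study.
    toy = "strain,weight\nA,20.5\nA,21.5\nB,NA\nB,30.0\n"
    print(mean_by_group(toy, "strain", "weight"))
```

Wrapping the logic in a function rather than a one-off script is a small instance of the reuse and readability principles: the same code can be applied to the next data file without being rewritten.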