[This post was written between 2012 and 2015]
I started using statistics for my research sometime in 1999 or 2000. I was a student at Ohio State, Linguistics, and I had just gotten interested in psycholinguistics. I knew almost nothing about statistics at that time. I did one Intro to Stats course in my department with Mike Broe (4 weeks), and that was it. In 1999 I developed repetitive strain injury, partly from using Excel and SPSS, and started googling for better statistical software. Someone pointed me to |stat, but eventually I found R. That was a transformative moment.
The next stage in my education came in 2000, when I went to the Statistical Consulting department at OSU and showed them my repeated measures ANOVA analyses. The response I got was: why are you fitting ANOVAs? You need linear mixed models. The statisticians showed me what I had to do code-wise, and I went ahead and finished my dissertation work using the nlme package. The Pinheiro and Bates book had just come out then and I got myself a copy, understanding almost nothing in the book beyond the first few chapters.
After that, I published a few more papers on sentence processing using nlme and then lmer, and in 2011 I co-wrote a book with Mike Broe (the basic template of the book was based on his lecture notes at OSU; he had used Mathematica or something like that, but I used R and expanded on his excellent simulation-based approach). This book revealed the incompleteness of my understanding, as spelled out in the scathing (and well-deserved) critique by Christian Robert. Even before this review came out, I had already realized in early 2011 that I didn’t really understand what I was doing. My sabbatical was coming up in winter 2011, and I enrolled in the graduate certificate in statistics at Sheffield to get a better understanding of statistical theory. Here is my review of the distance-based graduate certificate in statistics taught at Sheffield.
At the end of that graduate certificate, I felt that I still didn’t really understand much that was of practical relevance to my life as a researcher. That led me to do the MSc in Statistics at Sheffield, which I have been doing over three years (2012-15). This is a review of the MSc program. I haven’t actually finished the program yet, but I think I know enough to write the review. My hope is that this overview will provide others a guide-map on one possible route one can take to achieving better understanding of data analysis, and what to expect if one takes this route.
Short version of this review: The three year distance MSc program at Sheffield is outstanding. I highly recommend it to anyone wanting to acquire a good, basic understanding of statistical theory and inference. You can alternatively do the course over two years (probably impossible or very hard if you are also working full time, like me), or over one year full time (I don’t know how people can do the degree in one year and still enjoy it). Be prepared to work hard and to find your own answers.
Cost: For EU citizens, the three-year part time program costs about 2000 British pounds a year, not including the travel costs to get to Sheffield for the annual exams and presentations. For non-EU citizens, it’s about 5000 pounds a year, still cheaper than most US programs.
Summary notes of the MSc program: I made summary notes for the exams during the three years. These are still in progress and are available from:
The courses I found most interesting and practically useful for my own research were Linear Modelling, Inference (Bayesian Statistics and Computational Inference), Medical Statistics, and Dependent data (Multivariate Analysis).
Course structure: Over three years, one does two courses each year, plus a dissertation. One has to commit about 15-20 hours a week in the 3-year program, although I think I did not do that much work, more like 12 hours a week on average (I had a lot of other work to do and just didn’t have enough time to devote to statistics). There are four 3 hour sort-of open book exams that one has to go to Sheffield for, plus a group oral presentation, a simulated consultation, and project submissions. Every course has regular assignments/projects; all are graded, but only a subset count towards the final mark (15% of the final grade). The minimum you have to get to pass is 50%.
The MSc program is taught to residential students and to distance students in parallel: the residentials are there in Sheffield, attending lectures etc. The distance students follow the course over a mailing list. So, someone like me, who’s doing the course over three years, is going to overlap with three batches of the MSc residential students. This has the effect that one has no classmates one knows, except maybe others who are doing the same three-year sequence with you.
The exams, which are the most stressful part of the program, are open book in that one can bring lecture notes and one’s own notes, but no textbooks. However, the exams are designed in such a way that if you don’t already know the material inside out, there is almost no point in taking lecture notes in with you, because there won’t be enough time to look anything up. I did take the official lecture notes with me for the first three exams, but I never once opened them. Instead, I relied only on my own summary sheets. Also, the exams are designed so that most people can’t finish the required questions (any 5 out of 6) in the three hours. At least I never managed to finish all the questions to my satisfaction in any exam.
The first year (2012-13)
The first year courses were 6002 (Stats Lab) and 6003 (Linear Modelling). There was a project-based assessment for the first, and a 3 hour exam for the second.
6002 (Stats Lab): most of the course was about learning R, which anyone who had done the grad certificate did not need. It was only in the last weeks that things got interesting, with optimization. I didn’t like the notes on optimization and MLE much, though. There wasn’t enough detail, and I had to go searching in books and on the internet to find comprehensive discussions. Here I would recommend Ben Bolker’s chapters 6-8, which are on his web page, complete with .Rnw files. Also, I just found a neat looking book (not read yet) which I wish I had had in 2012: Modern Optimization with R.
Overall the Stats Lab course had the feel of an intro to R, which is what it should have been called. It should have been possible to test out of such a course. I did not need to read the first 12 of 13 chapters over 9 months; I could have done it in a week or less, and I’m sure that’s true for those of my classmates who did the graduate certificate. However, I do see the point of the course for non-R users. I guess this is the perennial problem of teaching: students come in with different levels, and you have to cater to the lowest common denominator. Also, the introduction to R is pretty dated and needs a major overhaul. Much has happened since Hadley Wickham arrived on the scene, and it’s a shame not to use his packages. Finally, the absence of literate programming tools was surprising to me. I expected it to be standard operating procedure in statistics to use Sweave or the like.
6003 (Linear Modelling): this course was absolutely amazing. The lecture notes were very well-written and very detailed (with some exceptions, noted below). Linear mixed models didn’t get a particularly detailed treatment; I would have preferred a matrix presentation of LMM theory, and would have liked to learn how to implement these models myself.
Some problems I faced in year 1:
One issue in the course was the slow return of corrected assignments. By the time the assignment comes back graded (well, we just get general feedback and a grade), you’ve forgotten the details. Another strange aspect is that the grades for assignments were sometimes sent by regular air-mail. This was surprising in an online course.
One frustrating aspect of the courses was that a number of statements were made without any justification, proof, or further explanation. Example: “In R the default choice is the corner-point constraints given above, but in SPlus the default is the Helmert form, which is more convenient computationally, though more difficult to interpret.” Wow, I want to know more! But this point is never discussed again. One consequence is a feeling that one must simply take certain facts as given (or work it out yourself). I think it would have been helpful to point the interested student to a reference.
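For readers who hit the same wall, here is a minimal sketch of what the two constraint systems mean for a three-level factor. It is not taken from the lecture notes. The contrast matrices are meant to mirror what R's `contr.treatment(3)` and `contr.helmert(3)` return; the helper names (`treatment_contrasts`, `helmert_contrasts`, `coefficients`) and the group means 10, 14, 21 are made up for illustration. With only group means as input, both codings reproduce the same fitted means; only the meaning of the coefficients changes.

```python
from fractions import Fraction


def treatment_contrasts(k):
    """Corner-point (treatment) coding: level 1 is the baseline."""
    return [[1 if i == j + 1 else 0 for j in range(k - 1)] for i in range(k)]


def helmert_contrasts(k):
    """Helmert coding: column j compares level j+2 with the earlier levels."""
    rows = []
    for i in range(k):
        row = []
        for j in range(k - 1):
            if i <= j:
                row.append(-1)
            elif i == j + 1:
                row.append(j + 1)
            else:
                row.append(0)
        rows.append(row)
    return rows


def solve(A, y):
    """Solve the square system A b = y exactly by Gauss-Jordan elimination."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(t)] for row, t in zip(A, y)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        pivot_value = M[col][col]
        M[col] = [v / pivot_value for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [row[-1] for row in M]


def coefficients(contrasts, group_means):
    """Intercept plus contrast coefficients that reproduce the group means."""
    design = [[1] + row for row in contrasts]
    return solve(design, group_means)


means = [10, 14, 21]  # illustrative group means, not real data
b_treat = coefficients(treatment_contrasts(3), means)
b_helm = coefficients(helmert_contrasts(3), means)
print("treatment:", [str(b) for b in b_treat])
print("helmert:  ", [str(b) for b in b_helm])
```

Under corner-point coding the intercept is the mean of the baseline level and each coefficient is a difference from that baseline. Under Helmert coding the intercept is the grand mean of the group means, and each coefficient is a scaled comparison of one level with the average of the levels before it, which is presumably why the notes call it harder to interpret.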
The responses to questions on the mailing list are sometimes slow to come. Answers to questions asked online sometimes didn’t really address the question, and one was left in the same state of uncertainty as earlier (a familiar feeling when you talk to a statistician!).
Where the graduate certificate shone was in the excruciatingly detailed feedback; this was where I learnt the most in that course. By contrast, the feedback to some of the assignments was pretty sketchy. I never really knew what a perfect solution would have looked like.
Of course, I can see why all this happens: professors are busy, and not always able to respond quickly to questions. I myself am sometimes just as slow to respond as a teacher; I guess I need to work on that aspect of my own teaching.
My final marks in these first-year courses were 63 per cent in each course.
The second year (2013-14)
The second year courses were 6001 (Data Analysis) and 6004 (Inference: Bayesian Statistics and Computational Inference). There was a project-based assessment for the first, and a 3 hour exam for the second.
In Data Analysis we did several projects which simulated real-life consulting, or involved doing actual experiments (e.g., building aeroplanes). There was one project where one had to choose a news media article about a piece of scientific work, and then compare it with the actual scientific work. The consulting project didn’t work so well for me, because we were teamed up in fives and we didn’t know each other. It was very hard to coordinate a project when all your colleagues are unknown to you, and email is the only way to communicate.
For the news media article, I chose the article Gelman attacked on his blog, about women wearing red to signal sexual availability. It was interesting because the claims in the Psych Science paper didn’t really pan out. I reanalyzed the original data, and found that the effect was driven by pink, not red; the authors had recoded red and pink as “red or pink”, presumably in order to make the claim that women wear reddish hues. It’s hard to believe that this was not a post-hoc step after seeing the data (although I think the authors claim it was not; I suppose it’s possible that it wasn’t). After all, if they had originally intended to treat red and pink as one unit color type, then why did they have two columns, one for red and one for pink?
The Data Analysis course was definitely not challenging; it was rather below the level of data analysis I have to do in my own research. However, I was thankful not to be overloaded in this course because the Bayesian analysis course took up all my energy in my second year.
The course on Bayesian statistics was a whole other animal. I read a lot of books that were not assigned as required readings (mostly, Gelman et al’s BDA3, and Lunn et al, but also Lynch’s excellent textbook). I did all the three exercises that were assigned (these are graded but do not count for the final grade). My scores were 20/20, 22/30, 23/30. I never really understood what exactly led to those points being lost; not much detailed explanation was provided. One doesn’t know how many marks one loses for making a figure too small, for example (I was following Gelman’s example of showing lots of figures, which requires making them smaller, but evidently this was frowned upon). As is typical for this degree program, the grading is pretty harsh and tight-lipped (the harsh grading is not a bad thing; but the lack of information on what to improve in the answer was frustrating).
The Bayesian lecture notes could be improved. They seem to have a disjointed feel; perhaps they were written by different people. The Bayesian lecture notes were very different than, say, the linear modeling notes, which really drilled the student on practical details of model fitting. In the Bayesian course, there were sudden transitions to topics that fizzled out quickly and were never resurrected. An example is decision theory; one section starts out defining some basic concepts, and then quickly ends. Inference and decision theory was never discussed. There were sections that were in the notes but not needed for the exams; for an MSc level program I would have wanted to read that material (and did). I had some questions on these non-examinable sections, but never could get an answer, which was pretty frustrating.
The biggest thing that could be improved in these lecture notes is to provide more contact with code. Unfortunately, WinBUGS was introduced only very late in the course, and then a fairly major project (which counts for the final grade) was assigned that was based entirely on modeling in WinBUGS. Apart from the fact that WinBUGS is just not well-designed software (JAGS and Stan are much better), not much practice was given in fitting models, certainly not as much as was given for linear modelling. Model fitting should be an integral part of the course from the outset, and WinBUGS should be abandoned in favor of JAGS.
If I had not done a lot of reading on my own, and not learnt JAGS and Stan, I would have really suffered in this course. Maybe that’s what the lecture notes are intending to do: it’s a graduate-level course, and maybe the expectation is that one looks up the details on one’s own.
As it was, I enjoyed doing the Bayesian exercises, which were very neat problems—just hard enough to make you think, but not so hard that you can’t solve them if you think hard and do your own research.
One thing that was never discussed in the Bayesian data analysis course was how to do statistical inference, for example in factorial $2\times 2$ repeated measures designs. Textbooks on Bayesian methods don’t discuss this either; perhaps they consider it enough that you get the posterior; you can draw your own conclusions from that.
I got scores in the mid 60s for each course. I think I had 63 in Data Analysis and 67 in Inference.
The third year
The third year courses were MAS6011 (Dependent data) and MAS6012 (Sampling, Design, Medical Statistics). There is a 3 hour exam for each course.
The dependent data course was/is truly amazing. It was here that I finally got to grips with multivariate analysis, and with some interesting data mining type of tools such as PCA. The lecture notes could have been a lot more detailed for a graduate program; the lack of detail was due to the fact that undergrads and grad students were mixed in the same class.
The Medical Statistics course was fascinating because it was here that one finally saw issues being dealt with where people’s lives would be at stake depending on the answer we obtain. One amazing fact I discovered is that Pocock 1983 considers power below 70% in an experiment to be unethical. Psycholinguists and psychologists routinely run low power studies and publish their null results in prestigious journals. Luckily nobody will die as a result of these studies!
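To make the 70% figure concrete, here is a rough sketch of a power calculation. It assumes a two-sided, two-sample comparison of means with known variance (a normal approximation, so it will be slightly optimistic compared to a t-test), a standardized effect size d, and alpha = 0.05. The function names `power_two_sample` and `n_for_power` are made up for this illustration; nothing here comes from the course notes or from Pocock.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    d is the standardized effect size (difference in means divided by the
    common standard deviation); n_per_group is the sample size in each group.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region under the alternative.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def n_for_power(d, target=0.70, alpha=0.05):
    """Smallest per-group sample size whose approximate power reaches target."""
    if d == 0 or not 0 < target < 1:
        raise ValueError("need a nonzero effect size and a target in (0, 1)")
    n = 2
    while power_two_sample(d, n, alpha) < target:
        n += 1
    return n


if __name__ == "__main__":
    for n in (10, 20, 40, 80):
        print(n, round(power_two_sample(0.5, n), 2))
    print("n per group for 70% power at d = 0.5:", n_for_power(0.5))
```

Running this for a medium effect size shows how quickly power drops at the sample sizes that are common in small experiments, which is the point of the comparison with psycholinguistics above.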
The medstats lecture notes were not that well written, with not much detail, full of typos and bullet point type presentations. These lecture notes need a major overhaul in my opinion. I didn’t get any detailed feedback on the first two exercises I submitted, and the feedback I did get I could not read as it was handwritten with one of those ball-point pens that don’t steadily deliver ink.
There’s also a thesis to be written as part of the MSc; that counts for 50% of the MSc. I would have preferred to do more coursework than do the thesis, but I can see why a thesis is required (all our programs in Potsdam require them too).
General comments/suggestions for improvement:
1. The MSc currently has three specializations: Statistics, Medical Statistics, and Financial Statistics. Each has slightly different requirements (e.g., for Financial, you need to demonstrate specific math ability). I would add a fourth specialization, to reflect the needs of statisticians today. This could be called Computational Statistics or something like that.
In this specialization, one could require a background in R programming, just as Financial Stats requires advanced math. One could replace Stats lab and Data Analysis with a course on Statistical Computing (following some subset of the contents of textbooks like Eubank et al, Eddelbuettel, Cortez, Hadley Wickham), and Statistical Learning (aka Data Mining), following a textbook like James et al. I am sure that such a specialization is badly needed; see, for example, the puzzled question asked by a statistician not so long ago in AMSTAT news: Aren’t we data science? One can’t prepare statisticians as data scientists if they don’t have serious computing ability.
Some of the data mining related material turns up in Dependent Data in year 3, and that’s fine, but there is much more that one needs exposure to today. For me, the Stats Lab and Data Analysis courses did not have enough bang for the buck. I can see that such courses could be useful to newcomers to R and data analysis (but at the grad level, I find it hard to believe that the student would have never seen R; I guess it’s possible).
But these courses didn’t really challenge me to deal with real-life problems one might be likely to encounter as a future statistician (writing one’s own packages, solving large-scale data mining problems). If there had been a more computationally oriented stream which assumed R, I would have taken that route.
Some MS(c) programs with the kind of focus I am suggesting:
a. St Andrews: http://www.creem.st-and.ac.uk/datamining/structure.html
b. Another one in Sweden: http://www.liu.se/utbildning/pabyggnad/F7MSM/courses?l=en
c. Stanford: https://statistics.stanford.edu/academics/ms-statistics-data-science
2. The lectures could easily have been recorded; this would have greatly enhanced the quality of the MSc. All you need is slides and screen capture software with audio recording capability.
3. The real value added in the MSc is the exercises, and the feedback after the exercises have been submitted. This is the only way that one learns new things in this course (apart from reading the lecture notes). The written exams are of course a crucial part of the program, but the solutions and one’s own attempt are never released so one has only a limited opportunity to learn from one’s mistakes in the exam. For 2000 pounds a year, this is quite a bargain. Basically this is equivalent to hiring a statistician for 33 hours at 60 pounds an hour each year, with the big difference that you leave the table knowing much more than when you arrived.
4. Some ideas that were difficult for me:
– Expectation of a function of random variables was taught in the grad cert in 2011, but I needed it for the first time in 2014, when studying the EM algorithm. It would have been helpful to see a practical application early.
– The exponential distribution is a key distribution and needs much more study, esp. in connection with modeling survival. Perhaps more time should be spent studying distributions and their interrelationships.
– The derivation of full conditional distributions could have been tightly linked to DAGs, as is done in the Lunn et al book. It was only after I read the Lunn et al book that I really understood how to work out the full conditional distribution in any (within reason) given Bayesian model.
– I learnt how to compute eigenvalues and eigenvectors in the graduate certificate, but didn’t use this knowledge until 2014, when I did Multivariate Analysis. I didn’t even understand the relevance of eigenvalues etc. until I saw the discussion on Principal Components Analysis. A tighter linkage between mathematical concepts and their application in statistics would be useful.
– Similarly, Lagrangian multipliers became extremely useful when we started looking at PCA and Linear Discriminant Analysis; I saw them in 2011 and forgot all about them. There must be some way to show the applications of mathematical ideas in statistics. After much searching, I found this useful book that does part of the job:
5. The entire MSc program basically provides the technical background needed to understand major topics in statistics; there is not enough time to go into much detail. Each chapter in each course could have been a full course (e.g., the EM algorithm). I think that the real learning will not begin until I start to apply these ideas to new problems (as opposed to, say, using already known routines like linear mixed models). So, what I can say is that after four years of hard work, I know enough to actually start learning statistics. I don’t feel like I really know anything; I just know the lay of the land.
6. The MSc is heavily dependent on R. Not having a Python component to the course limits the student greatly, especially if they are going to go out there into the world as a “data scientist”. The Enthought on-demand courses are a fantastic supplement to the MSc coursework. It would be a good idea to have a Python course of that type in the MSc coursework as well.
7. One mistake I made from the perspective of exam-taking was not to spend enough time during the year using the hand-calculator (actually, I spent no time on this). In the exam, the difference between a distinction and an upper second can be the speed with which you can compute (correctly!) on a calculator. I am terrible at this, rarely even able to do simple calculations correctly on a hand-held (I’m talking about really basic operations), simply because I don’t use calculators in real life; who does? I would have much preferred exams that test analytical ability rather than ability to do calculations quickly on a calculator. In the real world one uses computers to do calculations anyway. I was also hindered by the fact that I am half-blind (a side effect of kidney failure when I was 20) and can’t even see the hand-calculator’s screen properly.
8. One peculiar aspect, and this permeated the MSc program, was the fairly antiquated instructions to students for using LaTeX etc. I think that statisticians should lead the way and use tools like Sweave and knitr.
9. The textbook recommendations are out of date and should be regularly revised. The best textbooks I found for each course that had exams associated with it:
Linear modelling: An Introduction to Generalized Linear Models, Dobson et al
Dobson et al is the best textbook I have ever read on generalized linear models, bar maybe McCullagh and Nelder. Dobson et al was a recommended book in the linear modeling course, a very good choice.
Bayesian Statistics: Lynch, Lunn et al, BDA3, Box and Tiao
Lynch is the best first book to read for Bayes (if you know calculus), and Lunn et al is very useful indeed, and beautifully written. It prepares you well for doing practical data analysis. Unfortunately, it’s oriented towards WinBUGS, but one can translate the code easily to JAGS. In my opinion, WinBUGS was a great first attempt, but it should be retired now, because it is just so painful to use. People should go straight to JAGS (thanks to Martyn Plummer for doing just a fantastic job with JAGS) and then (or alternatively) Stan (thanks to Bob Carpenter, Andrew Gelman and the Stan team for making it possible to use Bayes for really complex problems). You really need both JAGS and Stan in order to read and understand books, especially if you are just starting out.
I recommend reading Box and Tiao at the very end, to get a taste of (a) outstanding writing quality, and (b) what it was like to do Bayes in the pre-historic era (i.e., the 1970s).
Computational Inference: Statistical Computing with R, Rizzo
This book covers pretty much all of computational inference in a very user-friendly way.
Multivariate Analysis: Mathematical Tools for Applied Multivariate Analysis, by Carroll et al.
This book is very heavy going and not an after-five kind of book; it needs serious and slow study. I used it mostly as a reference book.
Medical Statistics (Survival Analysis): Regression Modeling Strategies by Harrell, and Dobson et al. I found the presentation of Survival Analysis in Harrell’s book particularly helpful.
This MSc program is very valuable for someone willing to work hard on their own, with rather variable amounts of guidance from the instructors. It provides a lot of good-quality structure, and it allows you to check your understanding objectively by way of exams.
Doing this MSc changed a lot of things for me professionally:
– I rewrote my lecture notes, abandoning the statistics textbook I had written in 2011. The Sheffield coursework played a huge role in helping me clean up my notes. I think these notes still need a lot of work, and I plan to work on them during my coming sabbatical.
– I started teaching undergrad Math as a prerequisite to my more technically oriented stats courses.
– I started teaching Bayesian statistics as a standard part of the graduate linguistics coursework. There doesn’t seem to be much interest among most linguistics students in this stuff, but I do attract a very special type of student in these classes and that makes teaching more fun.
– I started teaching linear (mixed) modeling in a way that aligns much more with standard presentations in the Sheffield MSc program.
– At least one of my students has taken advantage of Bayesian methods in their research, so it’s starting to have an impact.
– One thing that became clear (if it wasn’t obvious already) is that becoming a professional statistician or at least acquiring professional training in statistics is a necessary condition to doing analyses correctly, but it isn’t a sufficient condition. Statisticians usually are unable to address concerns from people in specific areas of research because they have no domain knowledge. It seems that without domain knowledge, statistical knowledge is basically useless. One should not go to statisticians seeking “recommendations” on what to do in particular situations. Depending on which statistician you talk to, you can get a very variable answer. Coupled with knowledge of your research area and knowledge of statistical theory (which of course you have to acquire, just as you acquired your domain knowledge), you have to work out the answer to your particular problem.
– I have essentially abandoned null hypothesis significance testing and just use Bayesian methods. The linear modeling and Bayesian statistics plus computational inference courses were instrumental in making this transition possible. I still report p-values, but only because reviewers and editors of journals insist on them.
– I run high-powered studies whenever possible (e.g., it’s not possible to run high power studies with aphasic populations, at least not at Potsdam). Everything else is a waste of time and money.
– I started posting all data and code online as soon as the associated paper is published.
– I spend a lot of time visualizing the data and checking model assumptions before settling on a model.
– I use bootstrapping a lot more to check whether my results hold up compared to more conventional methods.
– I try to replicate my results, and try to publish replications both of my own work and of others (much more difficult than I anticipated; people think replication is irrelevant and uninformative once someone has published a result with p less than 0.05).
– I can understand books like BDA3. This was not true in 2011. That was the biggest gain of putting myself through this thing; it made me literate enough to read technical introductions.
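As a rough illustration of the bootstrap check mentioned in the list above, here is a minimal sketch of a percentile bootstrap confidence interval for a mean. It assumes the data are per-subject differences between two conditions, which is a simplification of a real repeated measures design. The function name `bootstrap_ci_mean` is made up for this example, and the data are simulated, not taken from any of my studies.

```python
import random
import statistics


def bootstrap_ci_mean(values, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    boot_means = []
    for _ in range(n_boot):
        # Resample the observations with replacement, same sample size.
        resample = rng.choices(values, k=len(values))
        boot_means.append(statistics.fmean(resample))
    boot_means.sort()
    alpha = 1 - level
    lo_index = max(0, round(alpha / 2 * n_boot))
    hi_index = min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1)
    return boot_means[lo_index], boot_means[hi_index]


# Simulated per-subject differences in milliseconds (illustrative only).
sim = random.Random(42)
differences = [sim.gauss(30, 50) for _ in range(40)]

lower, upper = bootstrap_ci_mean(differences)
print("sample mean:", round(statistics.fmean(differences), 1))
print("95% bootstrap interval:", round(lower, 1), round(upper, 1))
```

If the interval from a check like this tells a very different story from the interval produced by the conventional model, that is a signal to look harder at the model assumptions.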