This was the kind of problem that motivated me to look for a full course of study that would give me the background that’s missing.
In September 2011, I started a graduate certificate in statistics at Sheffield (my wife’s alma mater, coincidentally). It is a preparatory course, which gets you ready to start an MSc in statistics. It’s a nine-month affair with three courses (Math, Probability, Statistics), which are taught more or less in parallel (they cleverly stagger the assignment submissions so that one is only working on one of the three at any one time). The course guide says it’s a time commitment of 15 hours a week, which seems amazingly little (imagine that you devote the last three hours of your day to working on something like this; you could easily exceed 15 hours a week). But it’s a realistic number; I think I spent about that much time every week on it. Occasionally I spent more (when I ran into trouble).
Short version: the course is *excellent*. I wish I had done this years ago. But I would recommend it only to people who are really willing to sweat it out. You have to be able to do your own research when you need more detail on a particular topic. The instructors can be helpful, but the interaction is through a mailing list, and response lag can be several days. Quite often, though, the instructors’ responses did not help me much and I had to do my own research. I think the main positive aspect of this course is that it spells out what you need to know and in what order. If you look at the vast space that is calculus, linear algebra, probability theory, and statistics, such guidance is extremely helpful.
Preparation for the course: I wish I had reviewed permutations and combinations, and trigonometry before starting the course. I found it the hardest to recall these things while working through the course. I would recommend the following sites for trig:
For permutations and combinations, this is badly formatted but it could be useful: http://www.bmlc.ca/PureMath30/Pure%20Math%2030%20-%20Permutations%20and%20Combinations.html
There are three modules:
1. Math: http://maths.dept.shef.ac.uk/maths/module_info.php?id=985
2. Probability: http://maths.dept.shef.ac.uk/maths/module_info.php?id=987
3. Statistics: http://maths.dept.shef.ac.uk/maths/module_info.php?id=990
Each assignment I did (a total of 18 to be assigned by the time the course ends; only the last 15 are graded) took a lot of time, and if you are typesetting the submission in LaTeX, you should expect a full day to be devoted to that (at least). I gave up typesetting—to my great regret—in the math and probability courses towards the end of the course because it takes too long.
Most of the questions directly use knowledge that was recently covered in the course, but some of the questions require thought and insight (which takes time).
I wish the assignments had been prepared with greater attention to precision in writing. A couple of times I had a hard time trying to figure out what precisely was meant, and judging from the others’ questions, I was not the only one. I should say that it’s inherently hard to write good assignments (my own students sometimes suffer due to lack of clarity in assignments I set for them).
The grading is tough, but I kind of like it that the graders are so reluctant to give a point. You have to respect them for that. When I saw how they grade, I felt a bit ashamed that I am so soft on my students in Potsdam. These guys are really hard-nosed. I guess I’ve gone soft after eight years in a soft-skills (no pun intended) oriented linguistics department. Computer science and math at Ohio State was about as strict, although the math guys at OSU were a bit more relaxed, but only a bit.
My grades so far are as below (percentages). This excludes the first assignments; they do not count for the final grade. In general, these assignments count for 20% of the final grade. You have to get a minimum of 65% overall in the final grade (after the exams in June 2012) to be allowed to continue on to the MSc.
Average score of all homework assignments: 84.87%
Final exam scores: unknown.
Mathematics: mean score 93.4%
## Notes: I was pleasantly surprised to get this grade, but I do admit the problems were not too hard. The hardest part of this course is not the math but the probability theory.
## Made a stupid mistake. Don’t ask.
## Here, I lost a lot of points for not generalizing my solution for a particular proof, and I made a stupid mistake in a partial derivative computation (forgot to treat y as a constant when differentiating with respect to x) that caused a snowball-effect error in the final answer, and I thought that level curves are three-dimensional and plotted them as such. All errors could have been avoided if I had carefully re-read what I’d done, but there was no time (this was also the only submission so far which was not typeset, perhaps another reason why I didn’t notice the errors—when typesetting I usually discover a lot of mistakes in my solutions, esp. since I check my solutions at that stage with R or a computer algebra system like yacas).
## Several people got 100% I believe. It was relatively easy because it was about really basic linear algebra mostly.
## The first quartile was 83% it seems, and the third was 97%, so I clearly did not do well. I made some amazingly stupid arithmetic errors (like putting in a negative sign where there was supposed to be none), and it just adds up (or subtracts off). I had a hard time with one problem involving the computation of eigenvectors, and with a change of variables problem. I also made a stupid mistake in a double integral (miscopying one line onto the next, how dumb is that). All costly mistakes.
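The partial-derivative slip mentioned in the notes above (forgetting to hold y constant when differentiating with respect to x) is the kind of error a quick numerical check catches. Here is a minimal sketch in Python. The function f(x, y) = x^2 y + sin(xy) is a made-up example, not one from the course, and the helper names are hypothetical:

```python
import math

def f(x, y):
    # Hypothetical example function, not taken from the course.
    return x**2 * y + math.sin(x * y)

def df_dx_by_hand(x, y):
    # Differentiate with respect to x while treating y as a constant.
    return 2 * x * y + y * math.cos(x * y)

def df_dx_numeric(x, y, h=1e-6):
    # Central difference in x only; y is held fixed.
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def agrees(x, y, tol=1e-6):
    # True when the hand-derived formula matches the numerical estimate.
    return abs(df_dx_by_hand(x, y) - df_dx_numeric(x, y)) < tol

print(agrees(1.3, 0.7))
```

If the hand-derived formula had accidentally differentiated y as well, the two values would disagree at most test points, which is the same kind of check the R or yacas comparison performs.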
Probability: mean score 78.8%
## Notes: I did relatively badly in the first assignment because I reversed a couple of signs accidentally, and because I didn’t leave enough time for working on the assignment. I lost a lot of points on a stupid and mindless exercise involving reading numbers off of a binomial probability table — shame on me.
## Notes: The instructor cut 4 percentage points when I said that a particular distribution (I discovered later that this was the Cauchy distribution) had expectation infinity. He said I should have said “expectation does not exist.” He’s right, of course, but it was a painful loss, given that I didn’t know at that time that infinity cannot be considered to exist in this particular context. That’s a painful lesson.
## Notes: I lost a lot of points for small things, like not defining a random variable X before I mention it (I thought it was clear from context, but I see that technically one has to define everything—God, I am out of touch with formal proofs and such like). The moral was that I have to write a lot more precisely. Perhaps there will be an improvement in the next submission’s grades.
## I made lots of small errors, and they add up.
## Very hard assignment, done in a hurry because exams were imminent. Everyone’s grades suffered. 64% was above average.
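Why the Cauchy mean “does not exist” rather than being infinite can be seen in a few lines. This is a sketch assuming the standard Cauchy density f(x) = 1/(\pi(1+x^2)); the exact parametrization used in the assignment is not given here. Split X into its positive and negative parts:

```latex
\begin{align*}
E[X^{+}] &= \int_{0}^{\infty} \frac{x}{\pi(1+x^{2})}\,dx
          = \lim_{T\to\infty} \frac{1}{2\pi}\ln(1+T^{2}) = \infty, \\
E[X^{-}] &= \int_{-\infty}^{0} \frac{-x}{\pi(1+x^{2})}\,dx = \infty
          \quad \text{(by symmetry)}, \\
E[X] &= E[X^{+}] - E[X^{-}] = \infty - \infty \quad \text{(undefined)}.
\end{align*}
```

Both halves diverge, so there is no value, finite or infinite, to assign to the expectation.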
Statistics: mean score 82.4%
## It is amazing that the one thing I thought I could do (data analysis) is giving me the most trouble in this course. The grading is pretty harsh; for example, if you report a p=0.30 and don’t state explicitly whether you reject the null hypothesis, you lose points. I did make some horrible mistakes in this assignment though, so I do deserve to lose some marks.
## I got almost everything right, but most of the 12 percentage points I lost were cut for a repeated failure to define \mu when setting up the null hypothesis. I was a bit crestfallen (I felt it was clear from context what \mu was in each case…).
## I’m actually surprised I do so badly in the stats segment, compared to the other ones (math and probability).
## Also done in a hurry. I don’t know how the others did.
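The two grading points above (define \mu when stating the null hypothesis, and state the decision explicitly next to the p-value) can be folded into a small reporting helper. This is only a sketch in Python. It uses a two-sided one-sample z-test with known standard deviation, an assumption chosen for simplicity and not necessarily what the assignments used. The function name, the wording of the report, and the numbers in the call are all made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_test_report(sample_mean, mu0, sigma, n, alpha=0.05,
                  what="the population mean"):
    # Two-sided one-sample z-test with known sigma (an illustrative assumption).
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    # Define mu in words before using it, then state the decision explicitly.
    report = (
        f"Let mu denote {what}. H0: mu = {mu0} vs H1: mu != {mu0}. "
        f"z = {z:.3f}, p = {p_value:.3f}; at alpha = {alpha} we {decision}."
    )
    return p_value, decision, report

# Hypothetical numbers, purely for illustration.
p, decision, report = z_test_report(sample_mean=10.4, mu0=10.0, sigma=2.0, n=25)
print(report)
```

Returning the decision as part of the report makes it hard to quote a p-value without also saying what was concluded from it.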
One minor gripe I have with the course is some of the textbooks assigned. I think the course designers should probably invest some time in building complete course-specific lecture notes. Sometimes they do provide lecture notes, and these are great (my only complaint is that the authors don’t believe in page numbers, which can be a real hassle if you print out the material and accidentally drop the pages on the floor). But that said, I realize that producing customized lecture notes is a major undertaking and I don’t blame them for relying on existing books.
The math textbook is by Gilbert and Jordan, Guide$^2$ Mathematical Methods. The title is a bit strange, I mean the 2 instead of “to” (although the authors cannot be blamed for this naming decision; apparently Macmillan has a Guide 2… series). The book has one positive aspect: it covers the relevant material in the sense that it goes through the topics. So, for someone like me, who doesn’t know exactly what I need to know as background to read more advanced textbooks on statistics, this is a good extended listing of the things I should know (or recall from high school). This is all good. What makes the reader’s life hard are the super-terse proofs/solutions to exercises, and the large number of typos (especially in the solutions). The course organizers released a list of typos for the book at the start of the course, but there are even more typos than the errata list. The notation can also be sloppy and the reader has to be careful (e.g., at one point they write F-p, where F and p are the *names* of functions; what they meant was F(x)-p(x)).
An example of the terseness is the proof that the limit of sin(x)/x when x approaches 0 is 1. I had no idea where that proof was going until I watched Strang’s online lecture (MIT open courseware). After watching Strang’s lecture, I was able to unpack the proof myself, but I doubt that it could be done easily by just working through it (are proofs meant to be hard work to unpack? Read the Salas et al book on Calculus to see that the answer may be no). Gilbert and Jordan should read Knuth’s book on writing mathematics. I would recommend Strang’s lectures, and Calculus by Salas, Hille and Etgen (I have the 9th Edition); this last book really nailed it for me. This book has the smoothness and feel of Cormen, Rivest et al on Algorithms.
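For reference, the usual squeeze argument for this limit fits in a few lines once the geometric step is spelled out. This is the standard textbook route and not necessarily the one Gilbert and Jordan take. For 0 < x < \pi/2 on the unit circle, compare the areas of the inscribed triangle, the circular sector, and the circumscribed triangle:

```latex
\begin{align*}
\tfrac12 \sin x \;\le\; \tfrac12 x \;\le\; \tfrac12 \tan x
  &\;\Longrightarrow\; 1 \le \frac{x}{\sin x} \le \frac{1}{\cos x}
   \;\Longrightarrow\; \cos x \le \frac{\sin x}{x} \le 1, \\
\lim_{x\to 0}\cos x = 1
  &\;\Longrightarrow\; \lim_{x\to 0}\frac{\sin x}{x} = 1 .
\end{align*}
```

Since sin(x)/x is an even function, the same bounds hold for -\pi/2 < x < 0, so the two-sided limit follows.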
The probability theory book (Ross, A First Course in Probability) is OK in that it covers all the points. But it has the irritating property that each definition is followed by half a dozen totally unrelated examples. This in itself is not bad, but the examples are SCARY. I’m not sure I need to see the toughest applications of the latest idea learnt right away. After a bit of reading this sort of teaching-by-example at least this reader just gets depressed (there is no way I would have worked out those example answers myself after just reading the definitions provided immediately earlier). I found the online book by Jay Kerns much, much more useful for the present course—the course organizers should consider switching to that or at least assigning it alongside Ross, with some warnings to not get intimidated by Ross’ style. Here is an excellent review of the Ross book on amazon that pretty much summarizes the main problems in the book.
There is an accompanying book on probability (it carries Freund’s name, but the authors are Miller and Miller), which is a straightforward and formal introduction to mathematical statistics; I like that one more. It’s part of the statistics course, not the probability course.
The statistics textbook, by Moore, McCabe and someone else, is absolutely terrible (apparently I’m not the only one complaining; see here). The book seems to be written for first year undergrads in statistics (nothing wrong with that of course, but this graduate certificate has a different audience). The large number of disconnected and silly examples (for example, for planned experiments vs observational studies) following every new concept leads to a feeling of total disorientation. There is also a painful attempt to make the book relevant to the modern user: examples about cell phones and iPhone Apps abound, presumably to draw the young reader’s eyes away from the cell phone as they read. The book could be a lot slimmer if all the extraneous junk were removed and they just stuck to the facts presented.
There are so many really good books on introductory statistics using R (e.g., in the Use R! series); I wish they had used one of them. As it is, you have to be either really good at R, or be able to quickly get on board with R, if you want to do this course. Since the course is completely based on R, it is absolutely wonderful for me, but several of the students went into a state of blind panic (for example, a beginner often cannot easily figure out how to change a directory within R; in my own courses at Potsdam, I think we spend about 90 minutes just getting them used to the interface). Using an R-based book for introductory statistics would have been much better than Moore et al’s. I have to admit though that I cannot name an alternative to Moore et al’s right away that covers exactly the same material. I would like the book a lot more if there were a version for grown-up people: no photos of people holding cell phones, no extended and long-drawn out examples, just the facts.
I have concluded that the linear algebra books assigned in this course suck. They never tell you why one should care about such and such fact, and they overload the book with proofs. Also, linear algebraists apparently think they have a great sense of humor. Both Allenby and Lay, the former a bit more, deliver lots of intended-to-be-funny comments followed by an exclamation point. Lay can be very clear, though.
If anyone wants their first contact with linear algebra to be not painful, they should read Leonard Evens’ excellent online book, which he has the generosity to release for free: http://www.math.northwestern.edu/~len/LinAlg/index.html. (I find it embarrassing that he wrote such a beautiful book and released it for free, whereas I have Springer charge for mine. Never again.) On the other hand, all these people writing expensive linear algebra texts should also be feeling a bit embarrassed to be out-performed by a free textbook (my super-expensive Lay textbook doesn’t even have all the pages, some 10-20 pages are simply not there; the publishers apparently screwed up, and the table of contents page numbering bears no relationship to the actual numbers, so the table of contents is not only useless but actually misleading). Evens released his book with essentially no restrictions, with the source code. I really admire that.
I also found Gilbert Strang’s book and online lectures on matrix algebra very cool. Denis Auroux’s lectures are also quite amazing.
One thing worth noting about this course (I was not prepared for this) is that the last five lectures (out of 40) in each module (math, probability, and statistics) are the most demanding. I really had to sweat over this part of the course in a way I never had to for the earlier parts. Part of the reason is that these last five lectures deploy a lot of the material you learnt in the preceding 35 lectures, and obviously I don’t have everything in my head and easily retrievable, so it was hard going to try to recall, for example, how to differentiate this or that function.
The concepts involving the method of moments, maximum likelihood estimates, likelihood ratio tests (in statistics); linear algebra, double integrals using polar coordinates and change of variables (math), and such like things were really overwhelming as they came in all at once towards the end. It doesn’t help that the textbooks give pretty obscure discussions about this; in this part of the course I really had to google my way through these topics, by watching MIT Opencourseware lectures and reading dumbed down versions of these topics. What I missed the most in the textbooks was the why: why are we doing this, in the sense that where is this going? These online resources explain this mysterious aspect very well. Normally, it’s almost anti-intellectual to ask a question like “why are we learning this?”, but here, I really needed to know where I am going when I do an LRT or use the Neyman-Pearson Lemma, for example, or why a null space is called a null space (Evens answers that question). What’s most remarkable though is the sheer range of resources available on the web to answer these questions. A lot of people spent a lot of their valuable time helping out poor suckers like me, who just don’t get it. MIT Opencourseware is to be congratulated for releasing (for free) videos of so many important lectures on math and linear algebra.
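To make the “where is this going?” question concrete for maximum likelihood and likelihood ratio tests, here is a minimal sketch in Python. It assumes an exponential model with rate λ, chosen purely for illustration and not taken from the course. The data are made up and the function names are hypothetical.

```python
from math import log

def log_likelihood(rate, data):
    # Exponential model: density rate * exp(-rate * x) for x >= 0.
    return len(data) * log(rate) - rate * sum(data)

def mle_rate(data):
    # Setting the derivative of the log-likelihood to zero gives 1 / sample mean.
    return len(data) / sum(data)

def lrt_statistic(rate0, data):
    # 2 * (log-likelihood at the MLE minus log-likelihood under H0: rate = rate0).
    # It is never negative, and large values count as evidence against H0.
    return 2 * (log_likelihood(mle_rate(data), data) - log_likelihood(rate0, data))

# Made-up observations, for illustration only.
data = [0.5, 1.0, 1.5, 2.0]
rate_hat = mle_rate(data)
print(rate_hat, lrt_statistic(1.0, data))
```

For this model the method-of-moments estimate (match the sample mean to 1/λ) coincides with the MLE. Under the usual regularity conditions the statistic is compared with a chi-squared distribution with one degree of freedom, which is where the test’s cutoff comes from.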
Some other minor gripes about the course
1. In one of the pages of Ross’ book (our assigned text), he writes “…if \sigma=\infty….”. Now, in one of the probability theory assignments (Assignment 3) I lost four percentage points for saying that E[X]=\infty. I lost marks because I should have written, “the mean does not exist.” This is correct, you do have to write exactly that. But when I pointed out that Ross makes the same mistake, I was told that that was a slight abuse of notation but it’s fine. Seems like if a statistician writes something incorrectly, it’s OK, but not if a student writes the same thing (maybe this makes sense, if you think about it). If I had seen Ross’ statement before writing my assignment solution, would I still deserve to lose 4 percentage points? I found this double standard irritating. I have to admit that I am just whining about losing those four points (and these four points contribute almost nothing to the final grade, as homework assignments count for only 20 percent of the final grade), and so I’m perhaps just being a sore loser.
2. The statement of some of the probability theory problems in the homework assignments was very unclear. It’s hard to write unambiguously, that’s understandable; but I would have expected clearly worded problems. Even worse, the clarifications one got after asking for more detail led to so much confusion among us students (at one point the lecturer was contradicting an earlier statement) that in one case we just gave up and went with our interpretation, which seemed according to the messages from lecturer to be wrong (I will report later if our interpretation got full marks or not; it turns out my own interpretation was correct, I got full marks). I was not the only one facing this problem; there was a flurry of unhappiness about the question. This is true for occasional other assignments (I would say in maybe 2% of all cases there was some ambiguity, so it’s definitely not a serious issue). For example, in one of the stats assignments we have a garden-path that fooled a student:
“Compare this with what happens if you first include the interaction with the residual…”
The author of this assignment intended a non-local attachment of the prepositional phrase “with the residual” to “compare”, not to “interaction”. The student wanted to know how to compute “the interaction with the residual”. A more common class of error is scope ambiguities (usually fatal ones). Maybe mathematicians and statisticians need to study formal semantics and syntax!
3. Too often, the response to clarification questions can take up to a week or more; this is simply too long a period for a course moving as fast as this one. In some cases, I just had to go with what I understood. This isn’t too serious; one often leaves a course with many questions unanswered, and after all this course is just a prep course for the real thing, the MSc. But it could be optimized by asking the lecturers to check the message board at least once a day or every other day.
The final exams were designed with the following assumptions in mind: (a) you are not expected to finish them on time unless you can immediately solve every problem without thinking much about it, or by thinking extremely fast, (b) you can compute on the calculator extremely fast, (c) you have everything at your fingertips (the exam is open book, but one does not really have time to look things up). I think I’ll be lucky if I make the 65% “passing” grade (passing in the sense that one can proceed to the MSc).
One mistake I made was that I should have regularly reviewed every single topic incrementally as the course progressed, and I should have kept doing exercises on old topics (e.g., integration techniques), so that I would not forget details from a few months ago. Next time I will be a lot more systematic in doing revision. Mathematicians are not kidding when they say you have to practice every single day; it’s no different from playing the violin.
Overall, a thumbs-up. This is a course every non-statistician who needs to work with data should take. Even in these few months I learnt a lot of interesting and even downright cool things (mostly in the math segment, but also in probability theory).
The big advantages of doing this kind of structured course are that:
– you have to solve problems on a daily basis in order to get the assignments done on time, and someone carefully checks your work. If you try to read books on topics that are specifically relevant to you, as Gelman recommends, you are not going to get that quality of feedback (no, not even with a solutions manual).
– you can ask a statistician questions that come up as you read or work on real problems that affect your own life, and they will often take the time to answer them fully. This is virtually impossible if you just try to talk to a random statistician (generally, they either heap scorn on you, or give a rambling answer that doesn’t really answer the question, because they just don’t want to pay attention long enough to try to understand the problem—not that I blame them for that; why should they care what your problem is?).
Some of the material I consulted while doing this course (incomplete):
1. On writing math
2. Kerns’ book on probability
3. Grinstead and Snell on probability
4. Salas et al Calculus
5. Spivak Calculus
6. MIT Open courseware (Strang, Auroux on calculus and linear algebra)
I consulted many other books; I will put a list online one of these days.
Software I used in this course:
2. Yacas (with Ryacas and without), to check answers I had arrived at analytically.
3. Mathematica to check my solve-for-theta type of solutions.
4. Matlab (I forgot why I used Matlab instead of Mathematica, but I did).
So, now I know almost everything they taught in this course. This is supposed to be (almost?) equivalent to an undergrad degree in statistics, but I doubt that, because one cannot learn in nine months what others have spent three years learning. But I do know enough to move on to more advanced texts. After some asking around (Sheffield instructors) and doing some research, I concluded that I need to work through the following material completely:
1. Salas et al on Calculus; Strang and Auroux’s lectures on calculus.
2. Shayle Searle’s “Matrix Algebra useful for Statistics” (the graduate certificate teaches you almost everything in this book, but it’s a nice review nonetheless, and talks about some details not covered in the course)
3. James Gentle on Matrix Algebra. Gentle’s book seems to be a classic, but it’s hard going so be prepared to read slowly.
4. Strang’s lectures on calculus and matrix algebra (Strang’s book on Matrix Algebra is also a pleasure to read, you can feel his personality shine through his words).
This material will cover pretty much all the math I would need for a good understanding of statistical theory. This is an incomplete list, of course, and it’s based on my own conclusions about what is needed, so it may not even be the right list.