The foundations of Statistics: a simulation-based approach

Posted on July 11, 2011 by xi'an in R bloggers, Uncategorized | 0 Comments

[This article was first published on Xi'an's Og » R, and kindly contributed to R-bloggers].

“We have seen that a perfect correlation is perfectly linear, so an imperfect correlation will be `imperfectly linear’.” page 128

This book has been written by two linguists, Shravan Vasishth and Michael Broe, in order to teach statistics “in areas that are traditionally not mathematically demanding” at a deeper level than traditional textbooks “without using too much mathematics”, towards building “the confidence necessary for carrying more sophisticated analyses” through R simulation. This is a praiseworthy goal, bound to produce a great book. However, and most sadly, I find the book does not live up to expectations. As in Radford Neal’s recent coverage of introductory probability books with R, there are statements there that show a deep misunderstanding of the topic… (This post has also been published on the Statistics Forum.)

“The least that you need to know about is LaTeX, Emacs, and Emacs Speaks Statistics. Other tools that will further enhance your working experience with LaTeX are AucTeX, RefTeX, preview-latex, and Python.” page 1

The above recommendation is cool (along with the point that these tools are “already pre-installed in Linux”, while, for Windows or Macintosh, users “will need to read the manual”!) but eventually rather daunting when considering the intended audience. While I am using LaTeX and only LaTeX in my everyday work, the recommendation to learn LaTeX before trying to “understand the principles behind inferential statistics” sounds inappropriate. The book clearly does not require an understanding of LaTeX to be read, understood, and practiced. (Same thing for Python!)

The authors advertise a blog about the book that contains very little information. (The last entry is from December 2010: “The book is out”.) This was a neat idea, had it been implemented.

“Let us convince ourselves of the observation that the sum of the deviations from the mean always equals zero.” page 5

What I dislike the most about this book is the waste of space dedicated to expository developments that aim at bypassing mathematical formulae, only to provide the very same formula at the end of the argument. Then there is the permanent confusion between the distribution and the sample, and between the true parameters and their estimates. (Plus the many foundational mistakes, such as those reported below.) If a reader has had some earlier exposure to statistics, the style and pace are likely to unsettle/infuriate her. If not, she will be left with gaping holes in her statistical foundations: no proper definition of unbiasedness (hence a murky justification of the degrees of freedom whenever they appear), of the Central Limit theorem, or of the t distribution, and no mention of the Law of Large Numbers (although a connection is made in the summary, page 63). This does not seem sufficient preparation for reading Gelman and Hill (2007), as suggested at the end of the book… Having the normal density defined as the “somewhat intimidating-looking function” (page 39)

$f(x) = \dfrac{1}{(\sigma\sqrt{2\pi})}\,E^{-((x-\mu)^2/2\sigma^2)}$

certainly does not help! (Nor does the call to integrate rather than pnorm to compute normal tail probabilities, pages 69-70.)
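To make the last point concrete, here is a minimal sketch, written in Python as a stand-in for R and not taken from the book, that computes the same upper tail probability twice: once from the normal distribution function (the analogue of pnorm) and once by numerical quadrature of the density (the analogue of integrate). The helper names `normal_pdf` and `simpson`, the cut-off 1.96, and the truncation of the integral at 10 are my own illustrative choices.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


def simpson(f, a, b, n=1000):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


# Upper tail P(X > 1.96) for a standard normal, computed two ways.
tail_cdf = 1 - NormalDist().cdf(1.96)        # direct, in the spirit of pnorm
tail_quad = simpson(normal_pdf, 1.96, 10.0)  # quadrature, truncated at 10

print(tail_cdf, tail_quad)
```

Both routes agree to many decimal places, so the quadrature detour buys nothing here except extra code and a truncation point to justify.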

“The key idea for inferential statistics is as follows: If we know what a `random’ distribution looks like, we can tell random variation from non-random variation.” page 9

The above quote gives a rather obscure and confusing entry to statistical inference, especially as it appears at the beginning of a chapter (Chapter 2) centred on the binomial distribution. As the authors seem reluctant to introduce the binomial probability function from the start, they resort to an intuitive discourse based on (rather repetitive) graphs (with an additional potential confusion induced by the choice of a binomial probability of p=0.5, since $p^k(1-p)^{n-k}$ is then constant in k…). In Section 2.3, the distinction between binomial and hypergeometric sampling is not mentioned, i.e. the binomial approximation is used without any warning that it is an approximation. The fact that the mean of the binomial distribution B(n,p) is np is not established and the variance being np(1-p) is not stated (except in the appendix). (However, the book spends four pages [36-39] showing through an R experiment that “the sum of squared deviations from the mean are [sic!] smaller than from any other number”.)
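The binomial facts mentioned above take only a few lines to check once the probability function is written down. The following is a minimal sketch in Python (my own illustration, not the book's R code); the values n=10 and p=0.3, and the population of 50 items with 15 successes, are hypothetical numbers chosen for the example.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(K = k) for K ~ B(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def hypergeom_pmf(k, N, K, n):
    """P(k successes in n draws without replacement from N items, K of them successes)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


n, p = 10, 0.3
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed directly from the probability function.
mean = sum(k * q for k, q in enumerate(pmf))
var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
print(mean, n * p)           # mean against n*p
print(var, n * p * (1 - p))  # variance against n*p*(1-p)

# With p = 0.5 the factor p^k (1-p)^(n-k) is the same for every k,
# so only the binomial coefficient varies with k.
factors = {0.5 ** k * 0.5 ** (n - k) for k in range(n + 1)}
print(factors)

# Sampling without replacement from a small (hypothetical) population:
# the exact hypergeometric probabilities differ from the binomial ones.
N, K = 50, 15
max_gap = max(
    abs(hypergeom_pmf(k, N, K, n) - binom_pmf(k, n, K / N))
    for k in range(n + 1)
)
print(max_gap)
```

Similarly, the four-page experiment on squared deviations follows from the one-line identity $\sum_i (x_i-c)^2 = \sum_i (x_i-\bar x)^2 + n(\bar x-c)^2$, which is smallest when $c=\bar x$.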

“The mean of a sample is more likely to be close to the population mean than not.” page 49

The above is the conclusive summary about the Central Limit theorem, after a histogram with 8 bins showing that “the distribution of the means is normal!”… It is then followed by a section on “s is an Unbiased Estimator of σ”, nothing less!!! This completely false result (s is the standard estimator of the standard deviation σ, but it is s² that is unbiased for σ², not s for σ) is again based on the “fact” that it is “more likely than not to get close to the right value”. The introduction of the t distribution is motivated by the “fact that the sampling distribution of the sample mean is no longer be modeled by the normal distribution” (page 55). With such flaws in the presentation, it is difficult to recommend the book at any level. Especially the most introductory level.
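Since the book relies on simulation, the bias of s is easy to exhibit in the same spirit. Below is a minimal sketch in Python (again my own illustration, not the book's code), assuming normal samples; the sample size n=5, the 20,000 replications, and the seed are arbitrary choices. For normal data the exact ratio E[s]/σ has a closed form in terms of Gamma functions, computed here as `c4`.

```python
import random
from math import gamma, sqrt
from statistics import mean, stdev, variance

random.seed(1)
sigma, n, reps = 1.0, 5, 20000

# Draw many normal samples of size n; compute s and s^2 (n - 1 divisor) for each.
samples = [[random.gauss(0.0, sigma) for _ in range(n)] for _ in range(reps)]
mean_s = mean([stdev(x) for x in samples])
mean_s2 = mean([variance(x) for x in samples])

# Exact value of E[s] / sigma for normal samples of size n.
c4 = sqrt(2 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)

print(mean_s, c4 * sigma)   # average of s sits below sigma
print(mean_s2, sigma ** 2)  # average of s^2 sits close to sigma^2
```

The average of s settles visibly below σ (about 6% too low for n=5), while the average of s² stays close to σ², which is the statement the section title should have made.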

“We know that the value is within 6 of 20, 95% of the time.” page 27

I am also dissatisfied with the way confidence and testing are handled (and not only because of my Bayesian inclinations!). The above quote, which replicates the usual fallacy about the interpretation of confidence intervals, is found a few lines away from a warning about the inversion of confidence statements! A warning only repeated later: “it’s a statement about the probability that the hypothetical confidence intervals (that would be computed from the hypothetical repeated samples) will contain the population mean” (page 59). The book spends a large number of pages on hypothesis testing, presumably because of the authors’ own interests; however, it is unclear whether a neophyte could gain enough expertise from those pages to conduct his own tests. Worse, statements like (page 75)

$H_0: \bar x = \mu_0$

show a deep misunderstanding of the nature of both testing and random variables. How can one test a property about the observed sample mean?! A similar confusion appears in the ANOVA chapter (e.g. (5.51) on page 112).
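For contrast, a well-posed version (a sketch in my own notation, not the book's) states the hypothesis about the unknown population mean $\mu$ and lets the observed $\bar x$ enter only through the test statistic:

$H_0: \mu = \mu_0 \quad\text{versus}\quad H_1: \mu \neq \mu_0, \qquad T = \dfrac{\bar x - \mu_0}{s/\sqrt{n}}$

Under $H_0$, and assuming a normal sample of size $n$, $T$ follows a t distribution with $n-1$ degrees of freedom. The hypothesis concerns a parameter, which is unknown, whereas $\bar x$ is known as soon as the data is collected and so leaves nothing to test.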

“The research goal is to find out if the treatment is effective or not; if it is not, the difference between the means should be `essentially’ equivalent.” page 92

The following chapters are about analysis of variance (5), linear models (6), and linear mixed models (7), all of which face fatal deficiencies similar to the ones noted above. The book would have greatly benefited from a statistician’s review before being published. (I cannot judge whether or not the book belongs to a particular series.) As is, it cannot deliver the expected outcome to its readers and train them towards more sophisticated statistical analyses. As a non-expert on linguistics, I cannot judge the requirements of the field or the complexity of the statistical models it involves. However, even the most standard models and procedures should be treated with the appropriate statistical rigour. While the goals of the book were quite commendable, it seems to me it cannot endow its intended readers with the proper perspective on statistics…