Two textbooks on probability using R

June 18, 2011
(This article was first published on Radford Neal's blog » R Programming, and kindly contributed to R-bloggers)

This fall, I’ll be teaching a second-year course on Probability with Computer Applications, which is required for Computer Science majors.  I’ve taught this before, but that was five years ago, so I’ve been looking to see what new textbooks would be suitable.  The course aims not just to use computer science applications as examples, but also to reinforce concepts of probability with programs, and to show how simulation can be used to solve problems that aren’t easily solved analytically. I’ve used R for the programming part, and plan to again, so I was naturally interested in two recent textbooks that seemed to have similar aims:

Introduction to Probability with R, Kenneth Baclawski, Chapman & Hall / CRC.

Probability with R: An Introduction with Computer Science Applications, Jane M. Horgan, Wiley.

I’ve now had a look at both of these textbooks.  Unfortunately, they are both seriously flawed.  Even more unfortunately, although some of the flaws in these books are particularly striking, I’ve seen similar, if usually less serious, problems in many other textbooks.

Introduction to Probability with R seemed from the title and blurb to be quite promising.  I began looking at it by just flipping to a few random pages to see what it read like.  That wasn’t supposed to be a serious evaluation yet, but in only a couple of minutes, I’d found two passages that pretty much eliminated it from further consideration.  Here is the introduction to parametric families of distributions on pages 56 and 57:

… The distributions within a family are distinguished from one another by “parameters”… Because of the dependence on parameters, a family of distributions is also called a random or stochastic function… Be careful not to think of a random function as a “randomly chosen function” any more than a random variable is a “randomly chosen variable.”

Now, I’ve never before seen a parametric family of distributions called a “random function” or “stochastic function”.  And I’ve quite frequently seen “random function” used to mean exactly a “randomly chosen function”.  Where the author got his terminology, I’ve no idea.  For good measure, this same passage refers to the p parameter of a binomial distribution as the “bias”, and has a totally pointless illustration of a bunch of gears grinding a value for n and a “bias” into a distribution.

So, maybe he uses non-standard terminology, but could the content be good?  Here’s what’s on page 131:

Main Rule of Statistics.  In any statistical measurement we may assume that the individual measurements are distributed according to the normal distribution, $N(m, \sigma^2)$.

To use this rule, we first find the mean $m$ and variance $\sigma^2$ from information given in our problem or by using the sample mean… and/or sample variance… defined below. We then compute using either the pnorm or the qnorm function.

As stated, the main rule says only that our results will be “reasonable” if we assume that the measurements are normally distributed. We can actually assert more. In the absence of better information, we must assume that a measurement is normally distributed. In other words, if several models are possible, we must use the normal model unless there is a significant reason for rejecting it.

(The author continues his avoidance of standard terminology in the passage above by denoting the sample mean of $x_1, \ldots, x_n$ by $m$ with a bar over it and the sample variance by $\sigma^2$ with a bar over it, which I haven’t tried to reproduce.)

An occasional incorrect passage in a textbook does no harm if corrected, and can even make for an interesting example in lecture, but when a textbook emphatically states completely erroneous nonsense, it would be a disservice to students to force them to buy it.  The passage above reads like a parody of the many statistics textbooks that say how important it is to think carefully about what model is appropriate for your problem, but then proceed to use a normal distribution in all the examples without any discussion.  Still, lip-service to good practice is better than nothing, and much better than insistent advocacy of bad practice.
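To see how badly the “Main Rule” can go wrong, here is a minimal sketch (in Python rather than R, and not taken from the book). It assumes, purely for illustration, that the measurements are really exponential with rate 1, so that the mean and variance are both 1, and compares exact tail probabilities with those from a normal model matched to that mean and variance. The function names are hypothetical.

```python
import math
from statistics import NormalDist

# Assumed example (not from the book): measurements that are really
# exponential with rate 1, so the true mean and variance are both 1.
mean, variance = 1.0, 1.0
normal_model = NormalDist(mu=mean, sigma=math.sqrt(variance))

def exact_tail(x):
    """P(X > x) for an exponential(1) variable."""
    return math.exp(-x) if x >= 0 else 1.0

def normal_tail(x):
    """P(X > x) under the normal model with the same mean and variance."""
    return 1.0 - normal_model.cdf(x)

for x in (2.0, 4.0, 6.0):
    print(f"P(X > {x}): exact {exact_tail(x):.5f}, "
          f"normal model {normal_tail(x):.7f}")

# The normal model also puts probability on impossible negative values.
prob_negative = normal_model.cdf(0.0)
print(f"P(X < 0) under the normal model: {prob_negative:.4f}")
```

Under these assumptions, the exact probability of a measurement above 4 is about 0.018, while the normal model gives about 0.0013, more than ten times too small. The normal model also assigns probability of about 0.16 to negative values that cannot occur.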

Probability with R: An Introduction with Computer Science Applications seemed from its title and blurb to be even closer to what I need for my course.  For this book, two minutes of flipping through it did not provide sufficient grounds for rejection, so I started looking at it more closely, including reading it systematically from the beginning.  I found lots of careless errors (some corrected in the on-line errata), and lots to quibble about, along with nothing that was particularly impressive, but it wasn’t until page 83 that I encountered something seriously wrong:

Returning to the birthday problem …, instead of using permutations and counting, we could view it as a series of k events and apply the multiplication law of probability.

$B_i$ is the birthday of the $i$th student.

E is the event that all students have different birthdays.

For example with two students, the probability of different birthdays is that the second student has a birthday different from that of the first,

$P(E)\ =\ P(B_2|\overline B_1)\ =\ 364/365$

that is, the second student can have a birthday on any of the days of the year except the birthday of the first student.

With three students, the probability that the third is different from the previous two is

$P(B_3|\overline B_1 \cap \overline B_2)\ =\ 363/365$

that is, the third student can have a birthday on any of the days of the year, except the two of the previous two students.

The example continues like this, with equations that on the right side have numerical values that are correct (in terms of the subsequent explanation in words), while on the left side are probability statements that are complete nonsense, since of course $B_i$, “the birthday of the $i$th student”, is not an event. Nor can I see any alternative definition of $B_i$ that would lead to these probability statements making sense.
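One way to state the multiplication-law argument with genuine events is to let $D_i$ be the event that student $i$'s birthday differs from those of all earlier students, so that $P(E) = P(D_2)\,P(D_3 \mid D_2) \cdots P(D_k \mid D_2 \cap \cdots \cap D_{k-1})$. The sketch below (in Python; the notation $D_i$ and the function names are illustrative, not the book's) computes this product and checks it by simulation, assuming 365 equally likely birthdays that are independent across students.

```python
import random

def prob_all_different(k, days=365):
    """P(all k birthdays differ), built up one student at a time."""
    prob = 1.0
    for i in range(k):
        # Given that the first i students all differ, the next student
        # must avoid the i days already taken.
        prob *= (days - i) / days
    return prob

def simulate_all_different(k, trials=100_000, days=365, seed=1):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        birthdays = [rng.randrange(days) for _ in range(k)]
        if len(set(birthdays)) == k:  # no repeated birthday
            hits += 1
    return hits / trials

for k in (2, 3, 23):
    print(k, round(prob_all_different(k), 4),
          round(simulate_all_different(k), 4))
```

For two students the product is 364/365, and for three it is (364/365)(363/365), matching the numerical values on the right-hand sides of the book's equations.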

Maybe “anyone” could make a mistake like this, but maybe not; I do wonder whether the author actually understands elementary probability theory.  I lost all confidence in her ability to apply it in practice on reading the following on page 192, in the section on “Machine learning and the binomial distribution”:

Suppose there are three classifiers used to classify a new example. The probability that any of these classifiers correctly classifies the new case is 0.7, and therefore 0.3 of making an error. If a majority decision is made, what is the probability that the new case will be correctly classified?

Let X be the number of correct classifications made by the three classifiers. For a majority vote we need X≥2. Because the classifiers are independent, X follows a binomial distribution with parameters n=3 and p=0.7…

We have improved the probability of a correct classification from 0.7 with one classifier to 0.784 with three…

Obviously, by increasing the number of classifiers, we can improve classification accuracy further…

With 21 classifiers let us calculate the probability that a majority decision will be in error for various values of p, the probability that any one classifier will be in error…

Thus the key to successful ensemble methods is to construct individual classifiers with error rates below 0.5.

No, I haven’t omitted any phrase like “assuming the classifiers make independent errors”.  The author just says “because the classifiers are independent” as if this were totally obvious.  Of course, in any real ensemble of classifiers, the mistakes they make are not independent, and it is not enough to just produce lots of classifiers with error rates below 0.5.

Many, many textbooks give examples in which they say things like “assume that whether one patient dies after surgery is independent of whether another patient dies” without considering the many reasons why this might not be so.  But at least they do say they are making an assumption, and at least it is possible to imagine situations in which the assumption is approximately correct.  There is no realistic machine learning situation in which multiple classifiers in an ensemble will make errors independently, even approximately. The example in the book is totally misleading.
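A small sketch makes the dependence on independence concrete (Python; the correlated model is a toy assumption chosen for illustration, not a description of any real ensemble, and the function names and the `share` parameter are hypothetical). The first function reproduces the book's binomial calculation. The second simulates classifiers that each still have accuracy p, but that all give the same answer on a fraction `share` of cases.

```python
import random
from math import comb

def majority_correct_independent(n, p):
    """P(more than half of n independent classifiers are correct)."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(need, n + 1))

def majority_correct_correlated(n, p, share, trials=100_000, seed=1):
    """Toy model: with probability `share`, all classifiers agree on a case
    (all correct with probability p, else all wrong); otherwise they are
    correct independently. Each classifier alone still has accuracy p."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        if rng.random() < share:
            correct = n if rng.random() < p else 0
        else:
            correct = sum(rng.random() < p for _ in range(n))
        wins += correct > n / 2  # majority vote is correct
    return wins / trials

for n in (3, 21):
    print(n, round(majority_correct_independent(n, 0.7), 4),
          round(majority_correct_correlated(n, 0.7, share=0.8), 4))
```

With `share=0` the simulation agrees with the binomial value of 0.784 for three classifiers. With `share=1` the majority vote is correct with probability 0.7 however many classifiers are used, so adding classifiers with error rates below 0.5 gains nothing.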

As well as being seriously flawed, neither of these books makes particularly good use of R to explain probability concepts or to demonstrate the use of simulation. The fragments of R code they contain are very short, basically using it as a calculator and plot program.

So, that’s it for these books.  If any readers know of good books on probability that either have good computer science applications, or use R for simulations and to clarify probability concepts, or preferably both, please let me know!