[This article was first published on Statistical Modeling, Causal Inference, and Social Science » R, and kindly contributed to R-bloggers.]

I know next to nothing about golf. My mini-golf scores typically approach the maximum of 7 per hole, and I’ve never actually played macro-golf. I did publish a paper on golf once (A Probability Model for Golf Putting, with Deb Nolan), but it’s not so rare for people to publish papers on topics they know nothing about. Those who can’t, research.

But I certainly have the ability to post other people’s ideas. Charles Murray writes:

I [Murray] am playing around with the likelihood of Tiger Woods breaking Nicklaus’s record in the Majors. I went on record two years ago with the reason why he won’t, but now I’m looking at it from a non-psychological perspective. Given the history of the majors, how far above the average _for other great golfers_ does Tiger have to perform?

Here’s the procedure I’ve been working on:

1. For all golfers who have won at least one major since 1934 (the year the Masters began), create 120 lines: one for each Major for each year from the year the golfer turned 20 through the year he turned 49. Here’s the draft I use to explain this:

In trying to estimate how well golfers do after their mid-thirties, we’re not interested in the number of wins per attempt, but in the number of wins divided by the number of majors that occurred during the remaining years when a professional golfer might plausibly win a major championship. Seve Ballesteros is an example of why this distinction is important. Ballesteros won five major championships, the last at age 31. Then he developed back problems, and his ability to win majors (or any other tournament) was effectively ended by the time he was in his late thirties. But golfers tend to develop physical problems as they age. The failure of a championship golfer to compete in major championships is an extremely good indicator that he would not have been able to win had he competed.

Therefore the data are based on the total number of majors that occurred from the year that the golfer turned 20 (when a golfer might plausibly win a major championship, given that Tiger Woods did so at 21) through the year that the golfer turned 49 (a plausible upper bound, because Julius Boros won a PGA championship at age 48). Operationally, this means that the database contains 30×4=120 lines for each subject over the entire course of his career. Years that were age-eligible but predate 1934 or postdate 2012 are deleted from the database.

For the analysis, I will look at the results for several subsamples. The one that appeals to me a priori consists of golfers who were born after the beginning of 1910 (a way of defining modern golf–it barely gets in both Hogan and Snead, the earliest golfers who intuitively seem to belong) and won at least two Majors (winnowing out the flukes). From now on, that’s what I mean by “the sample.”

2. Create a binary variable WIN for each line, scored 0 if the subject did not win and 1 if he did.

3. Create a variable FLOORAGE that is the floor of the age of the subject at the time the tournament occurred.

4. Create SUCCESSRATE for each FLOORAGE. In Stata, I created this with

tabstat win if majors>=2 & born2>=3654,by(floorage) stat(mean sum count)

These were the results:

floorage mean sum N

20 0 0 159
21 .0061728 1 162
22 .0116959 2 171
23 .0282486 5 177
24 .0277778 5 180
25 .0552486 10 181
26 .0540541 10 185
27 .0597826 11 184
28 .048913 9 184
29 .0382514 7 183
30 .0621469 11 177
31 .0584795 10 171
32 .0982659 17 173
33 .0738636 13 176
34 .0594595 11 185
35 .0860215 16 186
36 .0376344 7 186
37 .0326087 6 184
38 .048913 9 184
39 .0434783 8 184
40 .0274725 5 182
41 .0222222 4 180
42 .0115607 2 173
43 .0180723 3 166
44 .0060976 1 164
45 .0060976 1 164
46 .0062893 1 159
47 0 0 154
48 .0065789 1 152
49 0 0 80
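Steps 1 through 4 can be sketched in Python. This is only an illustration under stated assumptions, not the Stata workflow used above: the function names, the example golfer, and the fixed calendar month assigned to each major are all hypothetical, and the sketch assumes four majors were played in every year from 1934 through 2012 (it ignores cancelled tournaments).

```python
from collections import defaultdict
from datetime import date

# Assumption: one approximate month per major (hypothetical simplification).
MAJOR_MONTHS = {"Masters": 4, "US Open": 6, "British Open": 7, "PGA": 8}
FIRST_YEAR, LAST_YEAR = 1934, 2012  # years kept in the database


def floor_age(born, on):
    """Whole years of age on a given date (step 3, FLOORAGE)."""
    years = on.year - born.year
    if (on.month, on.day) < (born.month, born.day):
        years -= 1
    return years


def golfer_rows(name, born, wins):
    """Step 1: one row per major per year, from the year the golfer
    turned 20 through the year he turned 49. `wins` is a set of
    (year, major) pairs; step 2 turns it into a 0/1 WIN variable."""
    rows = []
    for year in range(born.year + 20, born.year + 50):
        if not FIRST_YEAR <= year <= LAST_YEAR:
            continue  # drop age-eligible years outside 1934-2012
        for major, month in MAJOR_MONTHS.items():
            played = date(year, month, 15)  # assumed mid-month date
            rows.append({
                "golfer": name,
                "year": year,
                "major": major,
                "floorage": floor_age(born, played),
                "win": int((year, major) in wins),
            })
    return rows


def success_rate(rows):
    """Step 4: (mean, sum, count) of WIN for each FLOORAGE."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        totals[row["floorage"]][0] += row["win"]
        totals[row["floorage"]][1] += 1
    return {age: (w / n, w, n) for age, (w, n) in sorted(totals.items())}


# Hypothetical golfer, invented purely to exercise the code.
rows = golfer_rows("Example Golfer", date(1950, 3, 1),
                   {(1980, "Masters"), (1984, "PGA")})
rates = success_rate(rows)
```

With real data, the rows for every golfer in the sample would be pooled before calling `success_rate`, which plays the role of the `tabstat` command.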

So among that sample, at age 36, we have an n of 186, 7 wins, for a SUCCESSRATE of .0376. I think my shaky grasp of probabilities tells me that the probability of one of these golfers winning at least one major at age 36 was 14.2%. But that grasp doesn’t extend to calculating the probability of winning an aggregate number of majors over several years. Specifically, in the case of Tiger, I’d like a way of expressing his relationship to the sample from age 20-33 (from his debut to the Thanksgiving catastrophe), and a way of expressing how far above the experience of the sample he has to perform to get at least 5 more Majors from now through age 49 (5 is the number he needs to break Jack’s record).

Any thoughts on how to do what I [Murray] have planned and what might be a better strategy would be appreciated.
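One way to make the probability question concrete is sketched below. It assumes each major is an independent trial whose win probability is the pooled SUCCESSRATE for the golfer's age, which is a strong simplification. Under that assumption the 14.2% figure is 1 − (1 − p)^4 for the four majors in a year, and the number of wins over several years follows a Poisson-binomial distribution that a short dynamic program can compute. The function names, the `multiplier` argument (a crude stand-in for "how far above the sample"), and the starting age of 37 are illustrative choices, not part of the original analysis.

```python
# Age-specific success rates copied from the table above (ages 36-49).
SUCCESS_RATE = {
    36: 0.0376344, 37: 0.0326087, 38: 0.048913, 39: 0.0434783,
    40: 0.0274725, 41: 0.0222222, 42: 0.0115607, 43: 0.0180723,
    44: 0.0060976, 45: 0.0060976, 46: 0.0062893, 47: 0.0,
    48: 0.0065789, 49: 0.0,
}


def prob_at_least_one(p, n):
    """P(at least one win in n independent tries, each with probability p)."""
    return 1.0 - (1.0 - p) ** n


def win_count_distribution(probs):
    """Poisson-binomial pmf: dist[k] = P(exactly k wins), given one
    win probability per tournament, assuming independence."""
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1.0 - p)   # lose this tournament
            new[k + 1] += mass * p       # win this tournament
        dist = new
    return dist


def prob_at_least(k, start_age, end_age=49, majors_per_year=4, multiplier=1.0):
    """P(at least k wins from start_age through end_age) when every
    age-specific rate is scaled by `multiplier` (capped at 1)."""
    probs = [min(1.0, multiplier * SUCCESS_RATE[age])
             for age in range(start_age, end_age + 1)
             for _ in range(majors_per_year)]
    return sum(win_count_distribution(probs)[k:])


# Reproduces the age-36 figure in the text: about 0.142.
p_age_36 = prob_at_least_one(SUCCESS_RATE[36], 4)

# Chance of 5 or more majors from age 37 on, at the sample's rates
# and at three times those rates (37 is an assumed starting age).
p_five_at_sample_rates = prob_at_least(5, start_age=37)
p_five_at_triple_rates = prob_at_least(5, start_age=37, multiplier=3.0)
```

Varying `multiplier` until `prob_at_least(5, ...)` reaches some chosen threshold gives one rough answer to "how far above the sample" a golfer would need to perform, but the answer inherits the independence assumption and the noise in the pooled rates.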

Here are my suggestions, as a statistician:

1. Don’t just look at “majors.” There’s information in every tournament that a player competes in. By looking only at a small subset, you’re just reducing your sample size. To put it another way: it’s fine to make inference about that subset but it’s best to use all available data to fit the model that you’ll use to make those inferences. If you want, you can throw in a predictor for the importance of the tournament, although I doubt that will be necessary.

2. Don’t just look at win/loss. Instead I think it makes more sense to code the player’s position (rank among the winners, or maybe score compared to the top score in the tournament). Again, even if all you care about is predicting wins, you’ll do better to include more information in your analyses.

Points 1 and 2 (which are really just two cases of the same general point) are surprisingly (to me) difficult for people to grasp. Even respected researchers will, for example, study elections by looking at win/loss rather than vote share. There is a logical appeal to this sort of “reduced-form” model but all the logical appeal in the world pales beside the imperative of data.
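A small simulation can illustrate the point. The sketch below uses entirely synthetic data: each event produces a continuous "margin" (think of vote share minus 50%, or a score relative to the best competing score) drawn from a normal distribution, and a win is a positive margin. It then estimates the win probability two ways, from the observed win/loss record alone and from the mean and standard deviation of the margins, and compares the root-mean-square errors. The normal model, the parameter values, and the function name are assumptions made for illustration only.

```python
import random
from statistics import NormalDist, mean, stdev


def compare_estimators(n_events=40, true_mean=-1.5, true_sd=1.0,
                       n_reps=2000, seed=1):
    """Return (true win probability, RMSE of the win/loss estimate,
    RMSE of the margin-based estimate) over n_reps simulated careers."""
    rng = random.Random(seed)
    true_p = 1.0 - NormalDist(true_mean, true_sd).cdf(0.0)
    sq_err_binary = 0.0
    sq_err_margin = 0.0
    for _ in range(n_reps):
        margins = [rng.gauss(true_mean, true_sd) for _ in range(n_events)]
        # Estimate 1: fraction of events won (uses only win/loss).
        p_binary = mean(1.0 if m > 0 else 0.0 for m in margins)
        # Estimate 2: fit a normal to the margins, then compute P(margin > 0).
        fitted = NormalDist(mean(margins), stdev(margins))
        p_margin = 1.0 - fitted.cdf(0.0)
        sq_err_binary += (p_binary - true_p) ** 2
        sq_err_margin += (p_margin - true_p) ** 2
    return (true_p,
            (sq_err_binary / n_reps) ** 0.5,
            (sq_err_margin / n_reps) ** 0.5)


true_p, rmse_binary, rmse_margin = compare_estimators()
print(f"true P(win) = {true_p:.4f}")
print(f"RMSE from win/loss only = {rmse_binary:.4f}")
print(f"RMSE from margins       = {rmse_margin:.4f}")
```

In this setup the margin-based estimate has the smaller error even though the target is still just the probability of a win. The size of the gain depends on the assumed model being roughly right, which is something real golf or election data would have to be checked against.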