**Revolutions**, and kindly contributed to R-bloggers)

A National Institute for Occupational Safety and Health study, published in March, found that professional American football (NFL) players lived longer, on average, than similar "mere mortals" in the general population. Football is a dangerous sport, so that might seem surprising at first, until you consider that NFL players are elite sportsmen: only the strongest, fastest and healthiest members of the population get a chance to play. On top of that, they have access to excellent healthcare, even after they retire.

But it's important to note what this study does **NOT** say: it does not claim that playing professional football will make you live longer. (In fact, throwing a couple of dozen men selected at random into a scrimmage with the Dallas Cowboys is most likely to *shorten* their lifespan, given the inevitable injuries that would entail.) All it says is that the population of men selected to play in the NFL tends to live longer than similar counterparts in the general population.

Sports journalist Bill Barnwell attempted to answer the question using a different method: why not compare NFL players to baseball players, who are also elite athletes? But as R user and biostatistician Gregory Matthews reports, Barnwell's "Mere Mortals" article makes some profound statistical errors in claiming that MLB players live shorter lives than NFL players.

The basic error is that the populations of MLB players and NFL players are *not* directly comparable, mainly because baseball players tend to be older than football players. Matthews created the age distribution chart below to illustrate the difference (blue is baseball, red is football):

If you merely count the baseball players and football players who have died and compare the resulting mortality rates, you'll find a higher rate among baseball players:

| | Baseball | Football |
|---|---|---|
| Qualifying Players | 1,494 | 3,088 |
| Alive | 1,256 | 2,694 |
| Deceased | 238 | 394 |
| Mortality Rate | 15.9 percent | 12.8 percent |

Even though that difference is statistically significant (using Fisher's exact test), this is hardly surprising: because the average baseball player in the sample was older than the average football player, you'd expect a larger share of the baseball players to have died already. But if you include player age in a logistic regression model, as Gregory Matthews used R to do, the effect of the sport (baseball vs football) disappears. It's only the age difference between the two samples that causes the discrepancy in the mortality rates above.
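
The rates and the significance claim can be checked from the four counts in the table alone. The sketch below is an illustration in Python using only the standard library, not the R code Matthews ran: it recomputes the two mortality rates and a two-sided Fisher's exact test p-value. The function names (`log_comb`, `fisher_exact_two_sided`) are made up for this example, and the small tolerance used to handle floating-point ties is an assumption of this sketch.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Holds the row and column totals fixed and sums the hypergeometric
    probabilities of every table that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    log_total = log_comb(row1 + row2, col1)

    def log_prob(x):
        # Log-probability that the top-left cell equals x, given the margins.
        return log_comb(row1, x) + log_comb(row2, col1 - x) - log_total

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    cutoff = log_prob(a) + 1e-7  # tolerance for floating-point ties (assumption)
    log_probs = (log_prob(x) for x in range(lowest, highest + 1))
    return min(1.0, sum(exp(lp) for lp in log_probs if lp <= cutoff))


# Counts taken from the table above.
baseball_deceased, baseball_alive = 238, 1256
football_deceased, football_alive = 394, 2694

baseball_rate = baseball_deceased / (baseball_deceased + baseball_alive)
football_rate = football_deceased / (football_deceased + football_alive)
p_value = fisher_exact_two_sided(
    baseball_deceased, baseball_alive, football_deceased, football_alive
)

print(f"Baseball mortality rate: {baseball_rate:.1%}")
print(f"Football mortality rate: {football_rate:.1%}")
print(f"Fisher's exact test, two-sided p-value: {p_value:.4f}")
```

A small p-value here only says the raw gap is unlikely to be chance. It says nothing about why the gap exists, which is where age comes in.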

There are two lessons to learn from this tale:

- When you read reports in the newspaper about how 'X causes Y', always ask yourself whether the direction of causality is correct. While researchers and journals are (for the most part) careful to merely report that there's an *association* between X and Y, journalists often upgrade that to direct causality. Sometimes it's just the case that people who are afflicted by Y tend to be people who are likely to have consumed/performed/been exposed to X.
- Unless a formal randomized trial has been conducted, whenever two (or more) groups A and B are compared, it's almost never appropriate to directly compare averages or other statistics between the groups. The analysis *must* control for any factors (age, sex, location, lifestyle, genetics, …) that might influence the outcome, using regression or other appropriate statistical techniques. And even then, you can never be certain that all possible variables have been controlled for: it can strengthen confidence in the association, but never prove causality outright.
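
To see how an age gap alone can manufacture a difference in raw death rates, here is a minimal sketch with entirely made-up numbers. It uses direct age standardization, a simpler stand-in for the logistic regression Matthews fitted in R, and none of the counts below come from the baseball or football data. The two hypothetical groups have identical death rates within every age band. Only their age mix differs.

```python
# HYPOTHETICAL data, invented for illustration: {age band: (players, deaths)}.
# Within each age band the two groups have the same death rate;
# group A simply has more older members.
group_a = {"under 40": (200, 2), "40-59": (400, 20), "60+": (400, 160)}
group_b = {"under 40": (500, 5), "40-59": (400, 20), "60+": (100, 40)}


def crude_rate(group):
    """Overall death rate, ignoring age entirely."""
    players = sum(n for n, _ in group.values())
    deaths = sum(d for _, d in group.values())
    return deaths / players


def standardized_rate(group, reference_mix):
    """Direct standardization: apply the group's own age-specific death
    rates to a shared reference age distribution."""
    total = sum(reference_mix.values())
    weighted = sum(
        (deaths / players) * reference_mix[band]
        for band, (players, deaths) in group.items()
    )
    return weighted / total


# Use the pooled head counts of both groups as the reference age mix.
reference_mix = {band: group_a[band][0] + group_b[band][0] for band in group_a}

crude_a, crude_b = crude_rate(group_a), crude_rate(group_b)
adjusted_a = standardized_rate(group_a, reference_mix)
adjusted_b = standardized_rate(group_b, reference_mix)

print(f"Crude rates:        A = {crude_a:.1%}, B = {crude_b:.1%}")
print(f"Age-adjusted rates: A = {adjusted_a:.2%}, B = {adjusted_b:.2%}")
```

The crude rates differ sharply, but once both groups are evaluated on the same age mix the gap vanishes, because it was never anything more than the age difference. A regression model with age as a covariate addresses the same problem without having to bin ages into bands.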

Thanks go to Gregory Matthews for illustrating these important lessons in his critique of Bill Barnwell's "Mere Mortals" article, linked below.

Stats in the Wild: Mere Mortals: Retract This Article
