More useless statistics

[This article was first published on 4D Pie Charts » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Over at the ExploringDataBlog, Ron Pearson just wrote a post about the cases when means are useless. In fact, it’s possible to calculate a whole load of stats on your data and still not really understand it. The canonical dataset for demonstrating this (spoiler alert: if you are doing an intro to stats course, you will see this example soon) is the Anscombe quartet.

The data set is available in R as anscombe, but it requires a little reshaping to be useful.

anscombe2 <- with(anscombe, data.frame(
  x     = c(x1, x2, x3, x4),
  y     = c(y1, y2, y3, y4),
  group = gl(4, nrow(anscombe))
))

Note the use of gl to autogenerate factor levels.

So we have four sets of x-y data, which we can easily calculate summary statistics from using ddply from the plyr package. In this case we calculate the mean and standard deviation of y, the correlation between x and y, and run a linear regression.

library(plyr)
(stats <- ddply(anscombe2, .(group), summarize,
  mean = mean(y),
  std_dev = sd(y),
  correlation = cor(x, y),
  lm_intercept = lm(y ~ x)$coefficients[1],
  lm_x_effect = lm(y ~ x)$coefficients[2]
))

  group     mean  std_dev correlation lm_intercept lm_x_effect
1     1 7.500909 2.031568   0.8164205     3.000091   0.5000909
2     2 7.500909 2.031657   0.8162365     3.000909   0.5000000
3     3 7.500000 2.030424   0.8162867     3.002455   0.4997273
4     4 7.500909 2.030579   0.8165214     3.001727   0.4999091

Each of the statistics is almost identical between the groups, so the data must be almost identical in each case, right? Wrong. Take a look at the visualisation. (I won’t reproduce the plot here and spoil the surprise; but please run the code yourself.)

library(ggplot2)
(p <- ggplot(anscombe2, aes(x, y)) +
  geom_point() +
  facet_wrap(~ group)
)

Each dataset is really different – the statistics we routinely calculate don’t fully describe the data. Which brings me to the second statistics joke.

A physicist, an engineer and a statistician go hunting. 50m away from them they spot a deer. The physicist calculates the trajectory of the bullet in a vacuum, raises his rifle and shoots. The bullet lands 5m short. The engineer adds a term to account for air resistance, lifts his rifle a little higher and shoots. The bullet lands 5m long. The statistician yells “we got him!”.


Tagged: anscombe, mean, r, statistics, stats-jokes

To leave a comment for the author, please follow the link and comment on their blog: 4D Pie Charts » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)