More useless statistics

August 22, 2011

(This article was first published on 4D Pie Charts » R, and kindly contributed to R-bloggers)

Over at the ExploringDataBlog, Ron Pearson just wrote a post about the cases when means are useless. In fact, it’s possible to calculate a whole load of stats on your data and still not really understand it. The canonical dataset for demonstrating this (spoiler alert: if you are doing an intro to stats course, you will see this example soon) is the Anscombe quartet.

The data set is available in R as `anscombe`, but it requires a little reshaping to be useful.

```
anscombe2 <- with(anscombe, data.frame(
  x     = c(x1, x2, x3, x4),
  y     = c(y1, y2, y3, y4),
  group = gl(4, nrow(anscombe))
))
```

Note the use of `gl` to autogenerate factor levels.
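As a quick illustration of what `gl` does, here is a minimal sketch using a toy factor (the name `g` is just for this example): `gl(n, k)` generates a factor with `n` levels, each repeated `k` times in a row.

```r
# gl(n, k): n levels, each repeated k times consecutively.
g <- gl(4, 3)
print(g)

# 4 levels, 4 * 3 = 12 elements in total.
nlevels(g)
length(g)

# With k = nrow(anscombe), each block of labels lines up with one
# stacked x/y column pair from the built-in anscombe data set.
table(gl(4, nrow(anscombe)))
```

Because the x and y columns are concatenated in the order 1 to 4, the consecutive blocks of labels that `gl` produces match the stacking order.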

So we have four sets of x-y data, from which we can easily calculate summary statistics using `ddply` from the `plyr` package. In this case we calculate the mean and standard deviation of y, the correlation between x and y, and the intercept and slope of a linear regression of y on x.

```
library(plyr)
(stats <- ddply(anscombe2, .(group), summarize,
  mean         = mean(y),
  std_dev      = sd(y),
  correlation  = cor(x, y),
  lm_intercept = lm(y ~ x)$coefficients[1],
  lm_x_effect  = lm(y ~ x)$coefficients[2]
))

#>   group     mean  std_dev correlation lm_intercept lm_x_effect
#> 1     1 7.500909 2.031568   0.8164205     3.000091   0.5000909
#> 2     2 7.500909 2.031657   0.8162365     3.000909   0.5000000
#> 3     3 7.500000 2.030424   0.8162867     3.002455   0.4997273
#> 4     4 7.500909 2.030579   0.8165214     3.001727   0.4999091
```
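If you don't have `plyr` installed, the same summary can be cross-checked with base R. This is only a sketch of one possible alternative, not the approach used in the post; the names `stats_base` and `fit` are made up for this example. It also fits each regression once rather than twice.

```r
# Rebuild the long-format data so this block runs on its own.
anscombe2 <- with(anscombe, data.frame(
  x     = c(x1, x2, x3, x4),
  y     = c(y1, y2, y3, y4),
  group = gl(4, nrow(anscombe))
))

# Split by group, compute the same five statistics for each piece,
# then bind the one-row results back into a single data frame.
stats_base <- do.call(rbind, lapply(split(anscombe2, anscombe2$group), function(d) {
  fit <- lm(y ~ x, data = d)  # fit once, reuse for both coefficients
  data.frame(
    mean         = mean(d$y),
    std_dev      = sd(d$y),
    correlation  = cor(d$x, d$y),
    lm_intercept = unname(coef(fit)[1]),
    lm_x_effect  = unname(coef(fit)[2])
  )
}))
print(stats_base)
```

The row names of `stats_base` are the group labels, so the result can be compared line by line with the `ddply` output.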

Each of the statistics is almost identical between the groups, so the data must be almost identical in each case, right? Wrong. Take a look at the visualisation. (I won’t reproduce the plot here and spoil the surprise; but please run the code yourself.)

```
library(ggplot2)
(p <- ggplot(anscombe2, aes(x, y)) +
  geom_point() +
  facet_wrap(~ group)
)
```
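As a numeric complement to the plot, here is a rough sketch of a couple of statistics that were not in the table above: the per-group medians and the spread of y. The choice of these particular statistics, and the names `medians` and `y_spread`, are mine rather than the original post's; other robust summaries would serve equally well.

```r
# Rebuild the long-format data so this block runs on its own.
anscombe2 <- with(anscombe, data.frame(
  x     = c(x1, x2, x3, x4),
  y     = c(y1, y2, y3, y4),
  group = gl(4, nrow(anscombe))
))

# Median of x and y within each group.
medians <- aggregate(cbind(x, y) ~ group, data = anscombe2, FUN = median)
print(medians)

# Width of the range of y (max minus min) within each group.
y_spread <- aggregate(y ~ group, data = anscombe2,
                      FUN = function(v) diff(range(v)))
print(y_spread)
```

Unlike the means and standard deviations, these values are not the same across the four groups, which hints at the differences the plot makes obvious.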

Each dataset is really different – the statistics we routinely calculate don’t fully describe the data. Which brings me to the second statistics joke.

A physicist, an engineer and a statistician go hunting. 50m away from them they spot a deer. The physicist calculates the trajectory of the bullet in a vacuum, raises his rifle and shoots. The bullet lands 5m short. The engineer adds a term to account for air resistance, lifts his rifle a little higher and shoots. The bullet lands 5m long. The statistician yells “we got him!”.

Tagged: anscombe, mean, r, statistics, stats-jokes
