
The IID assumption (independent and identically distributed) is pretty important. Ignoring it can lead you to make incorrect conclusions (usually through pseudoreplication). Here’s a quick example.

You have 50 bags, each filled with 1 red and 9 green balls. You randomly draw 1 ball from each bag and record its colour. You draw the red ball 5 times and a green ball 45 times. Let’s put that into a 2×2 table.

x1 <- matrix(
  c(405, 45, 45, 5),
  nrow = 2,
  dimnames = list(
    c("Green", "Red"),
    c("Not drawn", "Drawn")
  )
)

> x1
Not drawn Drawn
Green       405    45
Red          45     5


I’ll run a chi-squared test to see if there is a difference between drawing a green ball and drawing a red ball. Maybe the red balls are bigger, rougher, lighter, or stickier, something that makes them more likely to be drawn from the bag than the green balls.

> chisq.test(x1)

Pearson's Chi-squared test with Yates' continuity correction

data:  x1
X-squared = 0, df = 1, p-value = 1


Nope. The p-value is 1 because this is exactly what we would expect to draw if the balls behaved the same and differed only in colour.

Now suppose you have 40 bags, each with 3 red balls and 1 green ball. In this case you’re more likely to draw red. Again drawing 1 ball from each bag, you draw 30 red balls and 10 green balls. Let’s put it into a 2×2 table and run the chi-squared test again.

x2 <- matrix(
  c(30, 90, 10, 30),
  nrow = 2,
  dimnames = list(
    c("Green", "Red"),
    c("Not drawn", "Drawn")
  )
)

> x2
Not drawn Drawn
Green        30    10
Red          90    30

> chisq.test(x2)

Pearson's Chi-squared test with Yates' continuity correction

data:  x2
X-squared = 0, df = 1, p-value = 1


Look at that: the p-value is 1 because we drew exactly what was expected. Again we would conclude that the red and green balls are no different, and that the red balls are not more likely to be drawn than the green balls.

Because we’re lazy, let’s combine the draws and run the chi-squared test again.

> x1+x2
Not drawn Drawn
Green       435    55
Red         135    35

> chisq.test(x1+x2)

Pearson's Chi-squared test with Yates' continuity correction

data:  x1 + x2
X-squared = 8.6183, df = 1, p-value = 0.003328


Now, seemingly by magic, there is a difference. We might conclude that there is a difference between the balls and that the red ones are more likely to be drawn.

We know this isn’t correct, though: we set this up so that each test was not significant and we drew exactly the expected number of red balls. So why does the pooled table now show a strong association?

It’s because the observations are not IID. The observation is the colour of the single ball drawn from each bag, not all 10 (or 4) balls the bag contains. The balls within a bag are not independent: if one ball is drawn, none of the others can be.

The probability of drawing a red ball is 1/10 in the first set and 3/4 in the second. They are not the same, and the model needs to be set up in a way that accounts for that.
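One way to account for it (my sketch, not from the original analysis) is to keep the two sets as separate strata and run a Cochran–Mantel–Haenszel test on the 2×2×2 table of colour by drawn status by set:

```r
# The two experiments as strata of a 2x2x2 table: colour x drawn x set.
# A stratified (Cochran-Mantel-Haenszel) test respects the different
# draw probabilities in each set instead of pooling the counts.
x3 <- array(
  c(405, 45, 45, 5,   # set 1: bags with 1 red, 9 green
    30, 90, 10, 30),  # set 2: bags with 3 red, 1 green
  dim = c(2, 2, 2),
  dimnames = list(
    c("Green", "Red"),
    c("Not drawn", "Drawn"),
    c("Set 1", "Set 2")
  )
)

mantelhaen.test(x3, correct = FALSE)  # X-squared = 0, p-value = 1
```

Because the observed counts equal the expected counts within each stratum, the statistic is 0 and the p-value is 1: no association.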

This problem can also be structured as a regression problem and you’ll get the same result: in isolation the first and second examples are not significant, but pooled together they are.
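As a sketch of that regression framing (the variable names and model here are my own), a binomial GLM on the drawn/not-drawn counts shows the same thing: ignoring which set a count came from reproduces the spurious association, while including the set as a covariate removes it.

```r
# Each row is a colour-by-set cell with counts of drawn and not-drawn balls
d <- data.frame(
  colour    = c("Green", "Red", "Green", "Red"),
  set       = c("Set 1", "Set 1", "Set 2", "Set 2"),
  drawn     = c(45, 5, 10, 30),
  not_drawn = c(405, 45, 30, 90)
)

# Pooled model: ignores which set each count came from and
# reproduces the spurious colour effect
m_pooled <- glm(cbind(drawn, not_drawn) ~ colour, family = binomial, data = d)

# Stratified model: 'set' absorbs the different draw probabilities,
# and the colour effect disappears
m_strat <- glm(cbind(drawn, not_drawn) ~ colour + set, family = binomial, data = d)

summary(m_pooled)$coefficients["colourRed", "Pr(>|z|)"]  # significant
summary(m_strat)$coefficients["colourRed", "Pr(>|z|)"]   # ~1, no effect
```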

So what?

It seems fairly benign, just poor stats, but imagine replacing the bags with job vacancies and the colour of the balls with some demographic variable of the applicants, e.g. age, gender, race, or ethnicity. Suddenly you find yourself in a situation.

I’m sure you could think of other examples where this would be a problem. Unfortunately, I’ve seen it on more than one occasion in the real world, most recently regarding first boots in Survivor. You can check out that post if you wish (you will see the same example though).

In this case, the observation isn’t the balls in the bag but the one that is drawn. Assuming we know the contents of each bag, the correct way to test for an association between colour and being drawn is as follows.

• The observed number of red balls is 5 in the first set and 30 in the second.
• The expected value is 50*1/10 = 5 in the first and 40*3/4 = 30 in the second.
• The test compares observed with expected counts: χ² = Σ (O − E)² / E = 0 in both sets, giving a p-value of 1.

And we have observed exactly what was expected, ergo, no further action.
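In R, the test described above can be run as a chi-squared goodness-of-fit test against the known bag proportions (a sketch of the procedure, one test per set):

```r
# Goodness-of-fit test per set: observed draws vs the known bag proportions.
# Set 1: 50 draws from bags that are 9/10 green, 1/10 red.
gof1 <- chisq.test(x = c(45, 5), p = c(9/10, 1/10))
# Set 2: 40 draws from bags that are 1/4 green, 3/4 red.
gof2 <- chisq.test(x = c(10, 30), p = c(1/4, 3/4))

gof1  # X-squared = 0, p-value = 1
gof2  # X-squared = 0, p-value = 1
```

Passing `p` to `chisq.test` makes it a goodness-of-fit test, so no continuity correction is applied, and since the observed counts equal the expected counts exactly, both statistics are 0.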

The takeaway: be careful how you analyse your data, and keep this in mind when reading others’ analyses.

The post Ignoring the IID assumption isn’t a great idea appeared first on Dan Oehm | Gradient Descending.