Correcting for confounded variables with GLMs

[This article was first published on Bluecology blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Correcting for confounded variables with GLMs

General (and generalized) linear models can be useful for analyzing field data, where sampling is often distributed unevenly across different environmental gradients or treatment groups. They help us correct for confounded gradients and discover effects that are hidden in plots of raw data.

For instance, we used GLMs in a meta-analysis of the rates species are shifting their ranges under climate change. We used the GLMs to correct for differences in ways different studies had measured species ranges, so we could then study the unique effects of ecological variables.

You can also think of these GLMs with multiple covariates as ‘statistically’ (rather than experimentally) controlling for the effects of each variable when looking at the effects of the other variables.

In this post I’ll demonstrate this application for statistical controls with a simple example.

Simulate data

We’ll start by simulating some data. Let’s say the data represent fish abundance at sites. Fish vary in the area of fish habitat available at that site. There are also two bioregions we’ve sampled. We’ll assume the fish only occur on one bioregion, even though habitat is available in both bioregions.

We’ll assume data are poisson distributed, to account for the fact that fish numbers can’t be <0, and that they should be integers.

To simulate the data we will specify an intercept and an habitat effect b_habitat. We’ll assume two bioregions, so we only need to specify one bioregion effect (which is the mean difference from bioregion B and A)

nsites <- 88

intercept <- 0
b_habitat <- 1
b_region <- -20

Then we need to make up the covariate data.

region <- c(rep(0, nsites/2), rep(1, nsites/2))

habitat_area <- rnorm(nsites, 1.5, 0.5)
habitat_area <- habitat_area*(region + 1)

We made it so habitat area was on average twice as big in the second region.

Now assemble a ‘design matrix’ (X) and matrix of effects (ie our ). These help us then make predictions of mean abundance with a simple matrix multiplication:

coefs_true <- matrix(c(intercept, b_habitat, b_region))
X <- matrix(cbind(1, habitat_area, region), ncol = 3)

Now make the ‘true’ means for our simulation and the fish abundance data:

y_mean <- exp(X %*% coefs_true)
y <- rpois(nsites, y_mean)

We took an exponent to ensure positive values.

If this confuses you so far, you can read more about linear models and then generalized linear models on my other blogs.

Plot simulations

plot(habitat_area, y)

So not strong evidence of any effect of habitat area on abundance. What about if we colour points by bioregion?

plot(habitat_area, y, col = region + 1)

So it appears the red region (the region with 1 in the region variable) has not habitat relationship, whereas the black region (region with 0 in region variable) has a positive relationship. This occurs because we set the region intercept to -20 (and a poisson with mean exp(-20) will predict basically always zero).

Naive GLM

Now let’s fit a GLM assuming we don’t know about bioregion

m1 <- glm(y ~ habitat_area, family = "poisson")

## Call:
## glm(formula = y ~ habitat_area, family = "poisson")
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max  
## -3.071  -1.783  -1.281   0.229   7.194  
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.91396    0.15321  12.493  < 2e-16 ***
## habitat_area -0.53476    0.08272  -6.464 1.02e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for poisson family taken to be 1)
##     Null deviance: 401.78  on 87  degrees of freedom
## Residual deviance: 354.11  on 86  degrees of freedom
## AIC: 501.23
## Number of Fisher Scoring iterations: 7

Habitat is significant (p < 0.001), but the estimate is in the wrong direction! It is saying there are fewer fish with more habitat. This is because the second region has on average twice as much habitat, but our fish doesn’t ever occur there.

To see this, we can do a plot with the visreg package:

visreg(m1, scale = "response")

So predicting an effect of habitat area that decreases with habitat area.

Region adjusted GLM

Ok, so now try including region:

m2 <- glm(y ~ habitat_area + region, family = "poisson")

## Call:
## glm(formula = y ~ habitat_area + region, family = "poisson")
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max  
## -2.43472  -0.26179  -0.00006  -0.00002   2.46518  
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.2340     0.2294   1.020    0.308    
## habitat_area    0.8851     0.1319   6.713 1.91e-11 ***
## region        -23.0326  2100.7940  -0.011    0.991    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for poisson family taken to be 1)
##     Null deviance: 401.777  on 87  degrees of freedom
## Residual deviance:  54.311  on 85  degrees of freedom
## AIC: 203.43
## Number of Fisher Scoring iterations: 18

Habitat area is significant and positive now. Note also the effect size (0.885) is within the error bounds (SE = 0.132) for the true effect size (=1). Let’s also confirm our better model (most parsimonious) with the AIC (you can read more about what the AIC is on this blog).

AIC(m1, m2)

##    df      AIC
## m1  2 501.2268
## m2  3 203.4282

So there’s basically no evidence for the model without bioregion compared to the model with that effect.

But notice that the region effect is approximately correct (-23), but the standard error is huge and, consequently, the region effect is non-signficant (p = 0.99). This happens because the abundance is always zero in region 2, so the poisson model can’t get a good estimate of the mean abundance. Put another way, the poisson must have mean >0, but it can’t distinguish between a mean of 0.000000001 versus say 0.0001 (they both predict basically all zeros), so the SE are very broad.

We see this if we do a plot, and habitat area appears insignificant:

visreg(m2, xvar = "habitat_area")

Now replot the predictions, asking for habitat area just in the first bioregion (labelled 0 in the data):

visreg(m2, xvar = "habitat_area",
       cond = list(region = 0), scale = "response")

Now we see the trend.

So our GLM has corrected for the absence of fish in bioregion 2 and correctly identified the positive relationship between abundance and fish biomass. The GLM with the region effect also gets the magnitude of the trend approximately correct (within the margin of error from the true value).

To leave a comment for the author, please follow the link and comment on their blog: Bluecology blog. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)