# Correcting for confounded variables with GLMs

**Bluecology blog**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Correcting for confounded variables with GLMs

General (and generalized) linear models can be useful for analyzing

field data, where sampling is often distributed unevenly across different

environmental gradients or treatment groups. They help us correct for

confounded gradients and discover effects that are hidden in plots of

raw data.

For instance, we used GLMs in a meta-analysis of the rates species are shifting their ranges under climate change. We used the GLMs to correct for differences in ways different studies had measured species ranges, so we could then study the unique effects of ecological variables.

You can also think of these GLMs with multiple covariates as ‘statistically’ (rather than experimentally) controlling for the effects of each variable when looking at the effects of the other variables.

In this post I’ll demonstrate this application for statistical controls with a simple example.

## Simulate data

We’ll start by simulating some data. Let’s say the data represent fish

abundance at sites. Fish vary in the area of fish habitat available at

that site. There are also two bioregions we’ve sampled. We’ll assume the

fish only occur on one bioregion, even though habitat is available in

both bioregions.

We’ll assume data are poisson distributed, to account for the fact that

fish numbers can’t be <0, and that they should be integers.

To simulate the data we will specify an intercept and an habitat effect

`b_habitat`

. We’ll assume two bioregions, so we only need to specify one

bioregion effect (which is the mean difference from bioregion B and A)

nsites <- 88 #Effects intercept <- 0 b_habitat <- 1 b_region <- -20

Then we need to make up the covariate data.

set.seed(50) region <- c(rep(0, nsites/2), rep(1, nsites/2)) habitat_area <- rnorm(nsites, 1.5, 0.5) habitat_area <- habitat_area*(region + 1)

We made it so habitat area was on average twice as big in the second

region.

Now assemble a ‘design matrix’ (`X`

) and matrix of effects (ie our ).

These help us then make predictions of mean abundance with a simple

matrix multiplication:

coefs_true <- matrix(c(intercept, b_habitat, b_region)) X <- matrix(cbind(1, habitat_area, region), ncol = 3)

Now make the ‘true’ means for our simulation and the fish abundance

data:

y_mean <- exp(X %*% coefs_true) y <- rpois(nsites, y_mean)

We took an exponent to ensure positive values.

If this confuses you so far, you can read more about linear models and then generalized linear models on my other blogs.

## Plot simulations

plot(habitat_area, y)

So not strong evidence of any effect of habitat area on abundance. What

about if we colour points by bioregion?

plot(habitat_area, y, col = region + 1)

So it appears the red region (the region with `1`

in the region

variable) has not habitat relationship, whereas the black region (region

with `0`

in region variable) has a positive relationship. This occurs

because we set the region intercept to -20 (and a poisson with mean

exp(-20) will predict basically always zero).

## Naive GLM

Now let’s fit a GLM assuming we don’t know about bioregion

m1 <- glm(y ~ habitat_area, family = "poisson") summary(m1) ## ## Call: ## glm(formula = y ~ habitat_area, family = "poisson") ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -3.071 -1.783 -1.281 0.229 7.194 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 1.91396 0.15321 12.493 < 2e-16 *** ## habitat_area -0.53476 0.08272 -6.464 1.02e-10 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 401.78 on 87 degrees of freedom ## Residual deviance: 354.11 on 86 degrees of freedom ## AIC: 501.23 ## ## Number of Fisher Scoring iterations: 7

Habitat is significant (p < 0.001), but the estimate is in the wrong direction! It is saying there are fewer fish with more habitat. This is because the second region has on average twice as much habitat, but our fish doesn’t ever occur there.

To see this, we can do a plot with the `visreg`

package:

library(visreg) visreg(m1, scale = "response")

So predicting an effect of habitat area that decreases with habitat

area.

## Region adjusted GLM

Ok, so now try including region:

m2 <- glm(y ~ habitat_area + region, family = "poisson") summary(m2) ## ## Call: ## glm(formula = y ~ habitat_area + region, family = "poisson") ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.43472 -0.26179 -0.00006 -0.00002 2.46518 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.2340 0.2294 1.020 0.308 ## habitat_area 0.8851 0.1319 6.713 1.91e-11 *** ## region -23.0326 2100.7940 -0.011 0.991 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 401.777 on 87 degrees of freedom ## Residual deviance: 54.311 on 85 degrees of freedom ## AIC: 203.43 ## ## Number of Fisher Scoring iterations: 18

Habitat area is significant and positive now. Note also the effect size

(0.885) is within the error bounds (SE = 0.132) for the true effect size

(=1). Let’s also confirm our better model (most parsimonious) with the

AIC (you can read more about what the AIC is on this blog).

AIC(m1, m2) ## df AIC ## m1 2 501.2268 ## m2 3 203.4282

So there’s basically no evidence for the model without bioregion

compared to the model with that effect.

But notice that the region effect is approximately correct (-23), but

the standard error is huge and, consequently, the region effect is

non-signficant (p = 0.99). This happens because the abundance is always

zero in region 2, so the poisson model can’t get a good estimate of the

mean abundance. Put another way, the poisson must have mean >0, but

it can’t distinguish between a mean of 0.000000001 versus say 0.0001

(they both predict basically all zeros), so the SE are very broad.

We see this if we do a plot, and habitat area appears insignificant:

visreg(m2, xvar = "habitat_area")

Now replot the predictions, asking for habitat area just in the first

bioregion (labelled `0`

in the data):

visreg(m2, xvar = "habitat_area", cond = list(region = 0), scale = "response")

Now we see the trend.

So our GLM has corrected for the absence of fish in bioregion 2 and

correctly identified the positive relationship between abundance and

fish biomass. The GLM with the region effect also gets the magnitude of

the trend approximately correct (within the margin of error from the

true value).

**leave a comment**for the author, please follow the link and comment on their blog:

**Bluecology blog**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.