Count data and GLMs: choosing among Poisson, negative binomial, and zero-inflated models

[This article was first published on Datavore Consulting » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Ecologists commonly collect data representing counts of organisms. Generalized linear models (GLMs) provide a powerful tool for analyzing count data. [As mentioned previously, you should generally not transform your data to fit a linear model and, particularly, do not log-transform count data.] The starting point for count data is a GLM with Poisson-distributed errors, but not all count data meet the assumptions of the Poisson distribution. Thus, we need to test if the variance>mean or if the number of zeros is greater than expected. Below, we will walk through the basic steps to determine which GLM to use to analyze your data.

First, we will define a few of the variables that are used repeatedly throughout the subsequent code. [Note: click here to get all code from this post in a single file.] We are using an unrealistic sample size for most ecological studies because we do not want to be misled by a strange draw from any of the distributions.

n = 100                            # No. of samples per treatment
mean.A = 10                        # Mean count for treatment A
mean.B = 5                         # Mean count for treatment B
nd = data.frame(Trt = c("A","B"))  # Used in predict( ) function

Poisson data

We generate random variates from a Poisson distribution with the rpois( ) function. Because mean=variance in a Poisson distribution, we only need to specify the number of random variates and the expected mean of the distribution.

data.pois = data.frame(Trt = c(rep("A", n), rep("B", n)), 
                       Response = c(rpois(n, mean.A), rpois(n, mean.B))
                       )

Poisson model

Now, we run the GLM and set the error distribution to Poisson,

model.pois = glm(Response ~ Trt, data = data.pois, family = poisson)
summary(model.pois)

which produces the following output.

Call:
glm(formula = Response ~ Trt, family = poisson, data = data.pois)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.1000  -0.6647   0.1045   0.6032   2.2611  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.30558    0.03158   73.02  < 2e-16 ***
TrtB        -0.74323    0.05562  -13.36  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 374.94  on 199  degrees of freedom
Residual deviance: 183.85  on 198  degrees of freedom
AIC: 934.08

Number of Fisher Scoring iterations: 4

We test for goodness-of-fit of the model with a chi-square test based on the residual deviance and degrees of freedom.

1 - pchisq(summary(model.pois)$deviance, 
           summary(model.pois)$df.residual
           )

[1] 0.7565471

The GOF test indicates that the Poisson model fits the data (p > 0.05). If this were your actual data, you could breathe a sigh of relief because you could stop here. Well, not quite here. You will still want to use the model to predict mean counts for each treatment and standard errors for each parameter.

cbind(nd, 
      Mean = predict(model.pois, newdata=nd, type="response"), 
      SE = predict(model.pois, newdata=nd, type="response", se.fit=T)$se.fit
      )

Trt  Mean        SE
1   A 10.03 0.3167015
2   B  4.77 0.2184031

Because we used a large sample size, the predicted means are similar to the expected means of 10 and 5.

Negative binomial data

Next we will use the 'MASS' package to generate random deviates from a negative binomial distribution, which involves a parameter, theta, that controls the variance of the distribution.

library(MASS)
data.nb = data.frame(Trt = c(rep("A", n), rep("B", n)), 
                     Response=c(rnegbin(n, mean.A, 5), rnegbin(n, mean.B, 5))
                     )

Poisson model

We first test if a Poisson model fits this data. Remember that we are trying to simulate the steps you would take if you did not know the properties of the distribution (beyond knowing that you have integers bound at zero and infinity).

model.pois.2 = glm(Response ~ Trt, data = data.nb, family = poisson)
summary(model.pois.2)

Call:
glm(formula = Response ~ Trt, family = poisson, data = data.nb)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.7258  -1.4732  -0.0846   1.1024   4.5505  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.32923    0.03120   74.64  < 2e-16 ***
TrtB        -0.70589    0.05428  -13.01  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 740.63  on 199  degrees of freedom
Residual deviance: 560.82  on 198  degrees of freedom
AIC: 1283.8

Number of Fisher Scoring iterations: 5

1 - pchisq(summary(model.pois.2)$deviance,
           summary(model.pois.2)$df.residual
           )

[1] 0

As expected, the Poisson model does not fit the data (p < 0.05), but we will still take a look at the predicted values.

cbind(nd, 
      Mean = predict(model.pois.2, newdata = nd, type = "response"), 
      SE = predict(model.pois.2, newdata = nd, type = "response", se.fit = T)$se.fit
      )

Trt  Mean        SE
1   A 10.27 0.3204684
2   B  5.07 0.2251666

Make a note of the SEs in this output because we will come back to this after we run a GLM based on a negative binomial error distribution.

Negative binomial model

model.nb = glm.nb(Response ~ Trt, data = data.nb)
summary(model.nb)

Call:
glm.nb(formula = Response ~ Trt, data = data.nb, init.theta = 4.114111123, 
    link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5705  -0.8806  -0.0454   0.5676   2.2889  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.32923    0.05835  39.920  < 2e-16 ***
TrtB        -0.70589    0.08836  -7.989 1.36e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for Negative Binomial(4.1141) family taken to be 1)

    Null deviance: 280.56  on 199  degrees of freedom
Residual deviance: 216.42  on 198  degrees of freedom
AIC: 1136.8

Number of Fisher Scoring iterations: 1

              Theta:  4.114 
          Std. Err.:  0.668 

 2 x log-likelihood:  -1130.825

The model estimates the dispersion parameter as 4.114; we set theta = 5 when generating random variates.

1 - pchisq(summary(model.nb)$deviance,
           summary(model.nb)$df.residual
           )

[1] 0.1756968

The GOF test indicates that the negative binomial model fits the data.

cbind(nd, 
      Mean = predict(model.nb, newdata = nd, type = "response"), 
      SE = predict(model.nb, newdata = nd, type="response", se.fit = T)$se.fit
      )

Trt  Mean        SE
1   A 10.27 0.5992233
2   B  5.07 0.3364221

Here you see the 'danger' of ignoring overdispersion in the Poisson model. The SE estimates are lower for the Poisson model than for the negative binomial model, which increases the likelihood of incorrectly detecting a significant treatment effect in the Poisson model.

Zero-inflated Poisson data

Lastly, we will add more more layer of complication to the story. If you have lots of zeros in your data, and have determined that Poisson and negative binomial models do not fit your data well, then you should turn to zero-inflated models with either Poisson or negative binomial error distributions. We need the 'VGAM' package to generate random variates from a zero-inflated Poisson distribution using the rzipois( ) function. The 3rd argument to the rzipois( ) function specifies the probability of drawing a zero beyond the expected number of zeros for a Poisson distribution with the specified mean. Here were are introducing a relatively small proportion of extra zeros and the same proportion for each treatment.

library(VGAM)
data.zip = data.frame(Trt = c(rep("A", n), rep("B", n)), 
                      Response = c(rzipois(n, mean.A, 0.2), rzipois(n, mean.B, 0.2))
                      )

Poisson model

We first fit the Poisson model.

model.pois.3 = glm(Response ~ Trt, data = data.zip, family = poisson)
summary(model.pois.3)

Call:
glm(formula = Response ~ Trt, family = poisson, data = data.zip)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.9166  -1.8131   0.1183   1.3324   2.8984  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.03732    0.03611   56.42  < 2e-16 ***
TrtB        -0.64107    0.06147  -10.43  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 869.33  on 199  degrees of freedom
Residual deviance: 754.93  on 198  degrees of freedom
AIC: 1337.3

Number of Fisher Scoring iterations: 5

1 - pchisq(summary(model.pois.3)$deviance,
           summary(model.pois.3)$df.residual
           )

[1] 0

The Poisson model does not fit the data.

cbind(nd, 
      Mean = predict(model.pois.3, newdata = nd, type = "response"), 
      SE = predict(model.pois.3, newdata = nd, type = "response", se.fit = T)$se.fit
      )

Trt Mean        SE
1   A 7.67 0.2769476
2   B 4.04 0.2009973

The Poisson model also does not predict the correct mean counts.

Negative binomial model

Next, we fit a negative binomial model.

model.nb.2 = glm.nb(Response ~ Trt, data = data.zip)
summary(model.nb.2)

Call:
glm.nb(formula = Response ~ Trt, data = data.zip, init.theta = 1.47231218, 
    link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3189  -0.9259   0.0472   0.5432   1.2916  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.03732    0.08998  22.643  < 2e-16 ***
TrtB        -0.64107    0.13177  -4.865 1.14e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for Negative Binomial(1.4723) family taken to be 1)

    Null deviance: 280.22  on 199  degrees of freedom
Residual deviance: 256.67  on 198  degrees of freedom
AIC: 1122

Number of Fisher Scoring iterations: 1

              Theta:  1.472 
          Std. Err.:  0.234 

 2 x log-likelihood:  -1116.006

1 - pchisq(summary(model.nb.2)$deviance,
           summary(model.nb.2)$df.residual
           )

[1] 0.003149441

The negative binomial model does not fit the data.

cbind(nd, 
      Mean = predict(model.nb.2, newdata = nd, type = "response"), 
      SE = predict(model.nb.2, newdata = nd, type = "response", se.fit = T)$se.fit
      )

Trt Mean        SE
1   A 7.67 0.6901218
2   B 4.04 0.3889176

And does not predict the correct means.

Zero-inflated Poisson models

We load the 'pscl' package to fit the zero-inflated model. First, we fit a model where we assume that the probability of zero is the same for both treatments (with '~ Trt|1').

library(pscl)
model.zip = zeroinfl(Response ~ Trt|1, data = data.zip)
summary(model.zip)

Call:
zeroinfl(formula = Response ~ Trt | 1, data = data.zip)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.5073 -0.9286  0.1330  0.8941  2.1819 

Count model coefficients (poisson with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.26040    0.03612  62.584  < 2e-16 ***
TrtB        -0.55430    0.06203  -8.937  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.1903     0.1681  -7.082 1.42e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 13 
Log-likelihood:  -477 on 3 Df

As expected, the model output indicates that there are significantly more zeros than expected for a Poisson distribution.

cbind(nd, 
      Count = predict(model.zip, newdata = nd, type = "count"),
      Zero = predict(model.zip, newdata = nd, type = "zero")
      )

Trt    Count      Zero
1   A 9.586959 0.2332005
2   B 5.507434 0.2332005

The zero-inflated model predicts the correct mean counts and probability of zero.

If we fit a zero-inflated model to test a treatment effect for both the counts and the zeros (with '~ Trt|Trt'),

model.zip.2 = zeroinfl(Response ~ Trt|Trt, data = data.zip)
summary(model.zip.2)

Call:
zeroinfl(formula = Response ~ Trt | Trt, data = data.zip)

Pearson residuals:
     Min       1Q   Median       3Q      Max 
-1.62159 -0.96198  0.06977  0.91545  2.20244 

Count model coefficients (poisson with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.26039    0.03612  62.580  < 2e-16 ***
TrtB        -0.55348    0.06194  -8.936  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.3866     0.2501  -5.545 2.94e-08 ***
TrtB          0.3769     0.3383   1.114    0.265    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 16 
Log-likelihood: -476.4 on 4 Df

we see that there are significantly more zeros than expected, but that the probability of zero is not significantly different between the two treatments.

cbind(nd, 
      Count = predict(model.zip.2, newdata = nd, type = "count"),
      Zero = predict(model.zip.2, newdata = nd, type = "zero")
      )

Trt    Count      Zero
1   A 9.586842 0.1999451
2   B 5.511897 0.2670400

Zero-inflated negative binomial model

We can test for overdispersion in the count part of the zero-inflated model by specifying a negative binomial distribution.

model.zip.3 = zeroinfl(Response ~ Trt|1, data = data.zip, dist = "negbin")
summary(model.zip.3)

Call:
zeroinfl(formula = Response ~ Trt | 1, data = data.zip, dist = "negbin")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.5072 -0.9285  0.1330  0.8940  2.1818 

Count model coefficients (negbin with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.26040    0.03612  62.574  < 2e-16 ***
TrtB        -0.55431    0.06203  -8.935  < 2e-16 ***
Log(theta)  10.26954   66.78568   0.154    0.878    

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.1903     0.1681  -7.082 1.42e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 28840.7026 
Number of iterations in BFGS optimization: 32 
Log-likelihood:  -477 on 4 Df

The estimated theta parameter is not significant indicating that the zero-inflated Poisson model is appropriate.

To leave a comment for the author, please follow the link and comment on their blog: Datavore Consulting » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)