# Veterinary Epidemiologic Research: Count and Rate Data – Zero Counts

May 6, 2013
By

(This article was first published on denis haine » R, and kindly contributed to R-bloggers)

Continuing on the examples from the book Veterinary Epidemiologic Research, we look today at modelling count when the count of zeros may be higher or lower than expected from a Poisson or negative binomial distribution. When there’s an excess of zero counts, you can fit either a zero-inflated model or a hurdle model. If zero counts are not possible, a zero-truncated model can be use.

Zero-inflated models

Zero-inflated models manage an excess of zero counts by simultaneously fitting a binary model and a Poisson (or negative binomial) model. In R, you can fit zero-inflated models (and hurdle models) with the library pscl. We use the fec dataset which give the fecal egg counts of gastro-intestinal nematodes from 313 cows in 38 dairy herds where half of the observations have zero counts. The predictors in the model are lactation (primiparous vs. multiparous), access to pasture, manure spread on heifer pasture, and manure spread on cow pasture.

temp <- tempfile()
"http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip",
temp)

library(pscl)
mod3 <- zeroinfl(fec ~ lact + past_lact + man_heif + man_lact, data = fec, dist = "negbin")
summary(mod3)

Call:
zeroinfl(formula = fec ~ lact + past_lact + man_heif + man_lact, data = fec,
dist = "negbin")

Pearson residuals:
Min      1Q  Median      3Q     Max
-0.5055 -0.4537 -0.3624 -0.1426 27.3064

Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.36949    0.13117  18.064  < 2e-16 ***
lact        -1.15847    0.10664 -10.864  < 2e-16 ***
past_lact    0.53666    0.14233   3.771 0.000163 ***
man_heif    -0.99849    0.14216  -7.024 2.16e-12 ***
man_lact     1.07858    0.15789   6.831 8.43e-12 ***
Log(theta)  -1.35981    0.05425 -25.065  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.9303     0.2932  -3.173  0.00151 **
lact          0.8701     0.3083   2.822  0.00477 **
past_lact    -1.8003     0.3989  -4.513  6.4e-06 ***
man_heif     -0.7702     0.4669  -1.650  0.09903 .
man_lact    -12.2380   167.9185  -0.073  0.94190
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta = 0.2567
Number of iterations in BFGS optimization: 47
Log-likelihood: -5239 on 11 Df


We can assess if the zero-inflated model fits the data better than a Poisson or negative binomial model with a Vuong test. If the value of the test is 1.96 indicates superiority of model 1 over model 2. If the value lies between -1.96 and 1.96, neither model is preferred.

### fit same model with negative binomial
library(MASS)
mod3.nb <- glm.nb(fec ~ lact + past_lact + man_heif + man_lact, data = fec)
### Vuong test
vuong(mod3, mod3.nb)
Vuong Non-Nested Hypothesis Test-Statistic: 3.308663
(test-statistic is asymptotically distributed N(0,1) under the
null that the models are indistinguishible)
in this case:
model1 > model2, with p-value 0.0004687128

### alpha
1/mod3\$theta
[1] 3.895448


The Vuong statistic is 3.3 (p < 0.001) indicating the first model, the zero-inflated one, is superior to the regular negative binomial. Note that R does not estimate $\alpha$ but its inverse, $\theta$. $\alpha$ is 3.9, suggesting a negative binomial is preferable to a Poisson model.
The parameter modelled in the binary part is the probability of a zero count: having lactating cows on pasture reduced the probability of a zero count ($\beta$ = -1.8), and increased the expected count if it was non-zero ($\beta$ = 0.54).

Hurdle models

A hurdle model has also 2 components but it is based on the assumption that zero counts arise from only one process and non-zero counts are determined by a different process. The odds of non-zero count vs. a zero count is modelled by a binomial model while the distribution of non-zero counts is modelled by a zero-truncated model. We refit the fec dataset above with a negative binomial hurdle model:

mod4 <- hurdle(fec ~ lact + past_lact + man_heif + man_lact, data = fec, dist = "negbin")
summary(mod4)

Call:
hurdle(formula = fec ~ lact + past_lact + man_heif + man_lact, data = fec,
dist = "negbin")

Pearson residuals:
Min      1Q  Median      3Q     Max
-0.4598 -0.3914 -0.3130 -0.1479 23.6774

Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.1790     0.4801   2.456  0.01407 *
lact         -1.1349     0.1386  -8.187 2.69e-16 ***
past_lact     0.5813     0.1782   3.261  0.00111 **
man_heif     -0.9854     0.1832  -5.379 7.50e-08 ***
man_lact      1.0139     0.1998   5.075 3.87e-07 ***
Log(theta)   -2.9111     0.5239  -5.556 2.76e-08 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.13078    0.10434  -1.253  0.21006
lact        -0.84485    0.09728  -8.684  < 2e-16 ***
past_lact    0.84113    0.11326   7.427 1.11e-13 ***
man_heif    -0.35576    0.13582  -2.619  0.00881 **
man_lact     0.85947    0.15337   5.604 2.10e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta: count = 0.0544
Number of iterations in BFGS optimization: 25
Log-likelihood: -5211 on 11 Df


We can compare zero-inflated and hurdle models by their log-likelihood. The hurdle model fits better:

logLik(mod4)
'log Lik.' -5211.18 (df=11)
logLik(mod3)
'log Lik.' -5238.622 (df=11)
### Null model for model 3:
mod3.null <- update(mod3, . ~ 1)
logLik(mod3.null)
'log Lik.' -5428.732 (df=3)


Zero-truncated model

Sometimes zero counts are not possible. In a zero-truncated model, the probability of a zero count is computed from a Poisson (or negative binomial) distribution and this value is subtracted from 1. The remaining probabilities are rescaled based on this difference so they total 1. We use the daisy2 dataset and look at the number of services required for conception (which cannot be zero…) with the predictors parity, days from calving to first service, and presence/absence of vaginal discharge.

temp <- tempfile()
"http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp)

library(VGAM)
mod5 <- vglm(spc ~ parity + cf + vag_disch, family = posnegbinomial(), data = daisy2)
summary(mod5)

Call:
vglm(formula = spc ~ parity + cf + vag_disch, family = posnegbinomial(),
data = daisy2)

Pearson Residuals:
Min       1Q    Median      3Q    Max
log(munb)  -1.1055 -0.90954  0.071527 0.82039 3.5591
log(size) -17.4891 -0.39674 -0.260103 0.82155 1.4480

Coefficients:
Estimate Std. Error z value
(Intercept):1  0.1243178 0.08161432  1.5232
(Intercept):2 -0.4348170 0.10003096 -4.3468
parity         0.0497743 0.01213893  4.1004
cf            -0.0040649 0.00068602 -5.9254
vag_disch      0.4704433 0.10888570  4.3205

Number of linear predictors:  2

Names of linear predictors: log(munb), log(size)

Dispersion Parameter for posnegbinomial family:   1

Log-likelihood: -10217.68 on 13731 degrees of freedom

Number of iterations: 5


As cows get older, the number of services required increase, and the longer the first service was delayed, the fewer services were required.
The first intercept is the usual intercept. The second intercept is the over dispersion parameter $\alpha$.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...