Veterinary Epidemiologic Research: Count and Rate Data – Poisson Regression and Risk Ratios

May 10, 2013
By

(This article was first published on denis haine » R, and kindly contributed to R-bloggers)

As noted on paragraph 18.4.1 of the book Veterinary Epidemiologic Research, logistic regression is widely used for binary data, with the estimates reported as odds ratios (OR). If it’s appropriate for case-control studies, risk ratios (RR) are preferred for cohort studies as RR provides estimates of probabilities directly. Moreover, it is often forgotten the assumption of rare event rate, and the OR will overestimate the true effect.

One approach to get RR is to fit a generalised linear model (GLM) with a binomial distribution and a log link. But these models may sometimes fail to converge (Zou, 2004). Another option is to use a Poisson regression with no exposure or offset specified (McNutt, 2003). It gives estimates with very little bias but confidence intervals that are too wide. However, using robust standard errors gives correct confidence intervals (Greenland, 2004, Zou, 2004).

We use data on culling of dairy cows to demonstrate this.

temp <- tempfile()
download.file(
"http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip",
temp)
load(unz(temp, "ver2_data_R/culling.rdata"))
unlink(temp)

table(culling$culled)

  0   1 
255 466 
### recoding number of lactation
culling$lact <- with(culling, ifelse(lact_c3 > 1, 2, lact_c3))

The logistic regression:

log.reg <- glm(culled ~ lact, family = binomial("logit"), data = culling)
summary(log.reg)

Call:
glm(formula = culled ~ lact, family = binomial("logit"), data = culling)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.546  -1.199   0.849   0.849   1.156  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -0.734      0.299   -2.45    0.014 *  
lact           0.784      0.171    4.59  4.4e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 936.86  on 720  degrees of freedom
Residual deviance: 915.84  on 719  degrees of freedom
AIC: 919.8

Number of Fisher Scoring iterations: 4

cbind(exp(coef(log.reg)), exp(confint(log.reg)))
Waiting for profiling to be done...
                 2.5 % 97.5 %
(Intercept) 0.48 0.267  0.864
lact        2.19 1.568  3.065

We could compute by hand the OR and RR:

with(culling, table(culled, lact))
      lact
culled   1   2
     0  97 158
     1 102 364
## OR is 2.19:
(364 / 158) / (102 / 97)
[1] 2.19
## or the ratio between the cross-products
(364 * 97) / (158 * 102)
[1] 2.19
## or with epicalc
library(epicalc)
with(culling, cc(culled, lact, graph = FALSE))

       lact
culled    1   2 Total
  0      97 158   255
  1     102 364   466
  Total 199 522   721

OR =  2.19 
Exact 95% CI =  1.54, 3.1  
Chi-squared = 21.51, 1 d.f., P value = 0
Fisher's exact test (2-sided) P value = 0 

# RR is 1.36:
(364 / 522) / (102 / 199)
[1] 1.36

The GLM with binomial distribution and log link:

log.reg2 <- glm(culled ~ lact, family = binomial(link = "log"), data = culling)
summary(log.reg2)

Call:
glm(formula = culled ~ lact, family = binomial(link = "log"), 
    data = culling)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.546  -1.199   0.849   0.849   1.156  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.9762     0.1412   -6.91  4.8e-12 ***
lact          0.3078     0.0749    4.11  4.0e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 936.86  on 720  degrees of freedom
Residual deviance: 915.84  on 719  degrees of freedom
AIC: 919.8

Number of Fisher Scoring iterations: 5

cbind(exp(coef(log.reg2)), exp(confint(log.reg2)))
Waiting for profiling to be done...
                  2.5 % 97.5 %
(Intercept) 0.377  0.28  0.488
lact        1.360  1.18  1.588

The modified Poisson with robust estimator as described by Zou, 2004 (GEE with individual observations treated as a repeated measure):

## create id for each observation
culling$id <- 1:length(culling$herd)
library(geepack)
zou.mod <- geeglm(culled ~ lact, family = poisson(link = "log"), id = id, corstr = "exchangeable", data = culling)
summary(zou.mod)

Call:
geeglm(formula = culled ~ lact, family = poisson(link = "log"), 
    data = culling, id = id, corstr = "exchangeable")

 Coefficients:
            Estimate Std.err Wald Pr(>|W|)    
(Intercept)  -0.9762  0.1412 47.8  4.8e-12 ***
lact          0.3078  0.0749 16.9  4.0e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Estimated Scale Parameters:
            Estimate Std.err
(Intercept)    0.354   0.021

Correlation: Structure = exchangeable  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha        0       0
Number of clusters:   721   Maximum cluster size: 1 
## exponentiated coefficients
exp(coef(zou.mod))
(Intercept)        lact 
      0.377       1.360 
## getting confidence intervals
library(doBy)
zou.mod.coefci <- esticon(zou.mod, diag(2))
zou.mod.expci <- exp(cbind(zou.mod.coefci$Estimate, zou.mod.coefci$Lower, zou.mod.coefci$Upper))
rownames(zou.mod.expci) <- names(coef(zou.mod))
colnames(zou.mod.expci) <- c("RR", "Lower RR", "Upper RR")
zou.mod.expci
               RR Lower RR Upper RR
(Intercept) 0.377    0.286    0.497
lact        1.360    1.175    1.576

Or the same using glm() and then getting robust standard errors:

pois.reg <- glm(culled ~ lact, family = poisson(link = "log"), data = culling)
library(sandwich) # to get robust estimators
library(lmtest) # to test coefficients
coeftest(pois.reg, vcov = sandwich)

z test of coefficients:

            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.9762     0.1412   -6.91  4.8e-12 ***
lact          0.3078     0.0749    4.11  4.0e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

## RR
exp(coef(pois.reg))
(Intercept)        lact 
      0.377       1.360 
## CI
0.3078 + qnorm(0.05 / 2) * 0.0749 # upper 95% CI
[1] 0.161
exp(0.3078 + qnorm(0.05 / 2) * 0.0749)
[1] 1.17
0.3078 + qnorm(1 - 0.05 / 2) * 0.0749 # lower 95% CI
[1] 0.455
exp(0.3078 + qnorm(1 - 0.05 / 2) * 0.0749)
[1] 1.58

So no excuse to use odds ratios when risk ratios are more appropriate!

Addition: Plotting the Risk Ratios

zou.mod2 <- geeglm(culled ~ lact + johnes, family = poisson(link = "log"),
                   id = id, corstr = "exchangeable", data = culling)
zou.mod2.coefci <- esticon(zou.mod2, diag(3))
zou.mod2.expci <- exp(cbind(zou.mod2.coefci$Estimate, zou.mod2.coefci$Lower,
                           zou.mod2.coefci$Upper))
rownames(zou.mod2.expci) <- names(coef(zou.mod2))
colnames(zou.mod2.expci) <- c("RR", "LowerCI", "UpperCI")
zou.df <- as.data.frame(zou.mod2.expci)
zou.df$var <- row.names(zou.df)
library(ggplot2)
ggplot(zou.df[2:3, ], aes(y = RR, x = reorder(var, RR))) +
  geom_point() +
  geom_pointrange(aes(ymin = LowerCI, ymax = UpperCI)) + 
  scale_x_discrete(labels = c("Lactation", "Johnes")) + 
  scale_y_log10(breaks = seq(1, 2, by = 0.1)) +
  xlab("") + 
  geom_hline(yintercept = 1, linetype = "dotdash", colour = "blue") +
  coord_flip()

Thanks to Tal Galili for suggesting this addition.


To leave a comment for the author, please follow the link and comment on his blog: denis haine » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.