Veterinary Epidemiologic Research: Count and Rate Data – Poisson Regression and Risk Ratios
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
As noted on paragraph 18.4.1 of the book Veterinary Epidemiologic Research, logistic regression is widely used for binary data, with the estimates reported as odds ratios (OR). If it’s appropriate for case-control studies, risk ratios (RR) are preferred for cohort studies as RR provides estimates of probabilities directly. Moreover, it is often forgotten the assumption of rare event rate, and the OR will overestimate the true effect.
One approach to get RR is to fit a generalised linear model (GLM) with a binomial distribution and a log link. But these models may sometimes fail to converge (Zou, 2004). Another option is to use a Poisson regression with no exposure or offset specified (McNutt, 2003). It gives estimates with very little bias but confidence intervals that are too wide. However, using robust standard errors gives correct confidence intervals (Greenland, 2004, Zou, 2004).
We use data on culling of dairy cows to demonstrate this.
temp <- tempfile() download.file( "http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp) load(unz(temp, "ver2_data_R/culling.rdata")) unlink(temp) table(culling$culled) 0 1 255 466 ### recoding number of lactation culling$lact <- with(culling, ifelse(lact_c3 > 1, 2, lact_c3))
The logistic regression:
log.reg <- glm(culled ~ lact, family = binomial("logit"), data = culling)
summary(log.reg)
Call:
glm(formula = culled ~ lact, family = binomial("logit"), data = culling)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.546 -1.199 0.849 0.849 1.156
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.734 0.299 -2.45 0.014 *
lact 0.784 0.171 4.59 4.4e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 936.86 on 720 degrees of freedom
Residual deviance: 915.84 on 719 degrees of freedom
AIC: 919.8
Number of Fisher Scoring iterations: 4
cbind(exp(coef(log.reg)), exp(confint(log.reg)))
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.48 0.267 0.864
lact 2.19 1.568 3.065
We could compute by hand the OR and RR:
with(culling, table(culled, lact))
lact
culled 1 2
0 97 158
1 102 364
## OR is 2.19:
(364 / 158) / (102 / 97)
[1] 2.19
## or the ratio between the cross-products
(364 * 97) / (158 * 102)
[1] 2.19
## or with epicalc
library(epicalc)
with(culling, cc(culled, lact, graph = FALSE))
lact
culled 1 2 Total
0 97 158 255
1 102 364 466
Total 199 522 721
OR = 2.19
Exact 95% CI = 1.54, 3.1
Chi-squared = 21.51, 1 d.f., P value = 0
Fisher's exact test (2-sided) P value = 0
# RR is 1.36:
(364 / 522) / (102 / 199)
[1] 1.36
The GLM with binomial distribution and log link:
log.reg2 <- glm(culled ~ lact, family = binomial(link = "log"), data = culling)
summary(log.reg2)
Call:
glm(formula = culled ~ lact, family = binomial(link = "log"),
data = culling)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.546 -1.199 0.849 0.849 1.156
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.9762 0.1412 -6.91 4.8e-12 ***
lact 0.3078 0.0749 4.11 4.0e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 936.86 on 720 degrees of freedom
Residual deviance: 915.84 on 719 degrees of freedom
AIC: 919.8
Number of Fisher Scoring iterations: 5
cbind(exp(coef(log.reg2)), exp(confint(log.reg2)))
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.377 0.28 0.488
lact 1.360 1.18 1.588
The modified Poisson with robust estimator as described by Zou, 2004 (GEE with individual observations treated as a repeated measure):
## create id for each observation
culling$id <- 1:length(culling$herd)
library(geepack)
zou.mod <- geeglm(culled ~ lact, family = poisson(link = "log"), id = id, corstr = "exchangeable", data = culling)
summary(zou.mod)
Call:
geeglm(formula = culled ~ lact, family = poisson(link = "log"),
data = culling, id = id, corstr = "exchangeable")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) -0.9762 0.1412 47.8 4.8e-12 ***
lact 0.3078 0.0749 16.9 4.0e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 0.354 0.021
Correlation: Structure = exchangeable Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha 0 0
Number of clusters: 721 Maximum cluster size: 1
## exponentiated coefficients
exp(coef(zou.mod))
(Intercept) lact
0.377 1.360
## getting confidence intervals
library(doBy)
zou.mod.coefci <- esticon(zou.mod, diag(2))
zou.mod.expci <- exp(cbind(zou.mod.coefci$Estimate, zou.mod.coefci$Lower, zou.mod.coefci$Upper))
rownames(zou.mod.expci) <- names(coef(zou.mod))
colnames(zou.mod.expci) <- c("RR", "Lower RR", "Upper RR")
zou.mod.expci
RR Lower RR Upper RR
(Intercept) 0.377 0.286 0.497
lact 1.360 1.175 1.576
Or the same using glm() and then getting robust standard errors:
pois.reg <- glm(culled ~ lact, family = poisson(link = "log"), data = culling)
library(sandwich) # to get robust estimators
library(lmtest) # to test coefficients
coeftest(pois.reg, vcov = sandwich)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.9762 0.1412 -6.91 4.8e-12 ***
lact 0.3078 0.0749 4.11 4.0e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## RR
exp(coef(pois.reg))
(Intercept) lact
0.377 1.360
## CI
0.3078 + qnorm(0.05 / 2) * 0.0749 # upper 95% CI
[1] 0.161
exp(0.3078 + qnorm(0.05 / 2) * 0.0749)
[1] 1.17
0.3078 + qnorm(1 - 0.05 / 2) * 0.0749 # lower 95% CI
[1] 0.455
exp(0.3078 + qnorm(1 - 0.05 / 2) * 0.0749)
[1] 1.58
So no excuse to use odds ratios when risk ratios are more appropriate!
Addition: Plotting the Risk Ratios
zou.mod2 <- geeglm(culled ~ lact + johnes, family = poisson(link = "log"),
id = id, corstr = "exchangeable", data = culling)
zou.mod2.coefci <- esticon(zou.mod2, diag(3))
zou.mod2.expci <- exp(cbind(zou.mod2.coefci$Estimate, zou.mod2.coefci$Lower,
zou.mod2.coefci$Upper))
rownames(zou.mod2.expci) <- names(coef(zou.mod2))
colnames(zou.mod2.expci) <- c("RR", "LowerCI", "UpperCI")
zou.df <- as.data.frame(zou.mod2.expci)
zou.df$var <- row.names(zou.df)
library(ggplot2)
ggplot(zou.df[2:3, ], aes(y = RR, x = reorder(var, RR))) +
geom_point() +
geom_pointrange(aes(ymin = LowerCI, ymax = UpperCI)) +
scale_x_discrete(labels = c("Lactation", "Johnes")) +
scale_y_log10(breaks = seq(1, 2, by = 0.1)) +
xlab("") +
geom_hline(yintercept = 1, linetype = "dotdash", colour = "blue") +
coord_flip()
Thanks to Tal Galili for suggesting this addition.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
