# Veterinary Epidemiologic Research: GLM – Logistic Regression

March 14, 2013

(This article was first published on denis haine » R, and kindly contributed to R-bloggers)

We continue to explore the book Veterinary Epidemiologic Research and today we’ll have a look at generalized linear models (GLM), specifically the logistic regression (chapter 16). In veterinary epidemiology, often the outcome is dichotomous (yes/no), representing the presence or absence of disease or mortality. We code 1 for the presence of the outcome and 0 for its absence. This way, the mean of the outcome represents the proportion of subjects with the event, and its expectation $\pi_i$ represents the probability for the event to occur. By definition, the expectation must lie in the range 0 to 1. We can’t use a linear regression to analyse these data because:

• the error terms ($\epsilon$) from this model cannot be normally distributed: for a given value of the predictors they can take only two values,
• the probability of the outcome occurring, $\pi = p(Y = 1)$, follows a binomial distribution and depends on the value of the predictor variables, $X$. Since the variance of a binomial distribution is a function of the probability, the error variance will also vary with the level of $X$ and as such the assumption of homoscedasticity will be violated,
• the mean responses are constrained to lie within 0 and 1. With a linear regression model, the predicted values might fall outside of this range.
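
To see the last point concretely, here is a small sketch with simulated data (not from the book): fitting an ordinary linear model to a 0/1 outcome is not constrained to predict within the 0–1 range.

```r
## Simulated example (not from the book): a linear model on a binary outcome
set.seed(42)
x <- 0:100
p <- plogis(-4 + 0.08 * x)               # true event probabilities
y <- rbinom(length(x), size = 1, prob = p)
fit.lm <- lm(y ~ x)                       # naive linear regression
## extrapolated predictions typically fall below 0 and above 1
predict(fit.lm, newdata = data.frame(x = c(-20, 120)))
```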

One way of getting around the problems described above is to use a suitable transformation, the logistic or logit function of $\pi$ and model this as a linear function of a set of predictor variables. This leads to the model:

$logit(\pi) = ln(\frac{\pi}{1 - \pi}) = \beta_0 + \beta_1x_1 + \ldots + \beta_jx_j$

curve(log(x / (1 - x)), 0, 1)


Figure: the logit function plotted over (0, 1)

The logit transform is $ln(\frac{\pi}{1 - \pi})$: the logit of a probability is the log of the odds of the response taking the value 1. Inverting the equation above expresses the probability directly as

$\pi(x_1, x_2, \ldots, x_j) = \frac{exp(\beta_0 + \beta_1x_1 + \ldots + \beta_jx_j)}{1 + exp(\beta_0 + \beta_1x_1 + \ldots + \beta_jx_j)}$.

While the logit function can take any real value, the associated probability always lies within the bounds of 0 and 1. Within a logistic regression model, the parameter $\beta_j$ associated with the explanatory variable $x_j$ is such that $exp(\beta_j)$ is the odds ratio: the multiplicative change in the odds of the response taking the value 1 when $x_j$ increases by one unit, conditional on the other explanatory variables remaining constant.
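
As a complement to the curve() call above, the inverse logit can be visualised the same way; plogis() is R's built-in inverse logit (and qlogis() the logit itself):

```r
## The inverse logit maps any real number into (0, 1)
curve(plogis(x), -6, 6,
      xlab = "linear predictor", ylab = "probability")
plogis(c(-10, 0, 10))
# [1] 4.539787e-05 5.000000e-01 9.999546e-01
```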

As with linear regression, the logistic regression model involves a linear combination of explanatory variables, even though the binary response is not modelled directly but through a logistic transformation. Modelling the response via a linear combination of predictors can be extended to other types of response than continuous or binary, defining the wider class of generalized linear models described by Nelder and Wedderburn. GLMs have 3 main features:

• an error distribution giving the distribution of the response around its mean. For ANOVA and linear regression it is the normal; for logistic regression it is the binomial. These distributions come from the same exponential family of probability distributions,
• a link function: how the linear function of the explanatory variables is related to the expected value of the response. For logistic regression it is the logit function (but can also be the probit ($\Phi^{-1}(p)$ where $\Phi^{-1}$ is the inverse normal cumulative distribution function) or the complementary log-log ($ln(-ln(1 - p))$)); for analysis of variance and regression it is the identity function,
• the variance function: how the variance of the response variable depends on the mean.
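
The three binomial links mentioned above can be evaluated directly in R; as an illustration, here they are applied to p = 0.25:

```r
p <- 0.25
qlogis(p)             # logit: ln(p / (1 - p)) = -1.0986
qnorm(p)              # probit = -0.6745
log(-log(1 - p))      # complementary log-log = -1.2459
## make.link() exposes the same functions glm() uses internally
make.link("cloglog")$linkfun(p)
```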

The first example is from the Nocardia dataset: a case-control study of Nocardia spp. mastitis, from 54 case herds and 54 control herds, with the predictors related to the management of the cows during the dry period.

temp <- tempfile()
download.file(
  "http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip",
  temp)
load(unz(temp, "ver2_data_R/nocardia.rdata")) # path inside the zip assumed
unlink(temp)

library(Hmisc)
nocardia <- upData(nocardia,
                   labels = c(id = 'Herd identification number',
                              casecont = 'Case/control status of herd',
                              numcow = 'Number of cows milked',
                              prod = 'Average milk production of the herd',
                              bscc = 'Average bulk tank SCC over the first 6 months of 1988',
                              dbarn = 'Type of barn dry cows kept in',
                              dout = 'Type of outdoor area used for dry cows',
                              dcprep = 'Method of teat end preparation prior to dry cow therapy administration',
                              dcpct = 'Percent of dry cows treated with dry-cow therapy',
                              dneo = 'Dry-cow product containing neomycin used on farm in last year',
                              dclox = 'Dry cow product containing cloxacillin used on farm in last year',
                              doth = 'Other dry cow products used on farm in last year'),
                   levels = list(casecont = list('Control' = 0, 'Case' = 1),
                                 dbarn = list('freestall' = 1, 'tiestall' = 2, 'other' = 3),
                                 dout = list('pasture' = 1, 'yard/drylot' = 2, 'none' = 3, 'other' = 4),
                                 dcprep = list('no prep.' = 1, 'washed only' = 2, 'washed and disinfected' = 3, 'dry cow therapy not used' = 4),
                                 dneo = list('No' = 0, 'Yes' = 1),
                                 dclox = list('No' = 0, 'Yes' = 1)),
                   units = c(prod = "kg/cow/day",
                             bscc = "'000s of cells/ml",
                             dcpct = "%"))


Example 16.1 is a logistic regression with 3 predictor variables. We can first have a look at a conditional density plot of the response variable given the proportion of dry cows treated with dry cow therapy.

cdplot(as.factor(casecont) ~ dcpct, data = nocardia)


Conditional density plot of case/control status of the herd given proportion of cows receiving dry cow therapy

The logistic regression model is:

mod1 <- glm(casecont ~ dcpct + dneo + dclox,
family = binomial("logit"),
data = nocardia) # "logit" can be omitted as it is the default
(mod1.sum <- summary(mod1))

Call:
glm(formula = casecont ~ dcpct + dneo + dclox, family = binomial("logit"),
data = nocardia)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-1.8425  -0.8915   0.1610   0.6361   2.3745

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.984272   0.772196  -3.865 0.000111 ***
dcpct        0.022668   0.007187   3.154 0.001610 **
dneoYes      2.212564   0.578012   3.828 0.000129 ***
dcloxYes    -1.412516   0.557197  -2.535 0.011243 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 149.72  on 107  degrees of freedom
Residual deviance: 107.99  on 104  degrees of freedom
AIC: 115.99

Number of Fisher Scoring iterations: 4

c(0.022668-1.96*0.007187, 0.022668+1.96*0.007187)
[1] 0.00858148 0.03675452
### or can also do:
confint.default(mod1, parm = "dcpct")
2.5 %     97.5 %
dcpct 0.00858227 0.03675422


All variables are significant at the 5% level. A one-unit increase in the percentage of cows receiving dry cow therapy increases the log-odds of being a case herd by 0.02. A confidence interval can be constructed using the normal approximation for the parameter estimate: $\hat{\beta}_i \pm z_{\alpha/2}SE(\hat{\beta}_i)$. The 95% confidence interval is (0.009, 0.037). But it is better to construct a profile likelihood-based confidence interval (confint() uses the profiling code from the MASS package):

confint(mod1, parm = "dcpct")
Waiting for profiling to be done...
2.5 %      97.5 %
0.009211164 0.037694948


We can get the odds ratios by exponentiating the estimates:

cbind(exp(coef(mod1)), exp(confint(mod1)))
2.5 %     97.5 %
(Intercept) 0.05057629 0.009587881  0.2007101
dcpct       1.02292712 1.009253717  1.0384144
dneoYes     9.13911806 3.139924972 31.3596110
dcloxYes    0.24352979 0.078283239  0.7094844
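
A one-unit change in dcpct corresponds to a modest odds ratio of about 1.023; for a more tangible contrast we can rescale the coefficient already fitted above, for example to a 10-point increase in the percentage of cows treated:

```r
## odds ratio for a 10-point increase in dcpct: exp(10 * beta), about 1.25
exp(10 * coef(mod1)["dcpct"])
```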


The statistic to determine the overall significance of a logistic model is the likelihood ratio test. It compares the likelihood of the full model (with all the predictors included) with the likelihood of the null model (which contains only the intercept). It is analogous to the overall F-test in linear regression. The likelihood ratio test statistic is: $G_0^2 = 2ln\frac{L}{L_0}$ where $L$ is the likelihood of the full model and $L_0$ is the likelihood of the null model. The likelihood ratio test statistic ($G_0^2$ = 41.73) can be compared to a $\chi^2$ distribution with 3 degrees of freedom. The predictor variables are significant predictors of case-control status.

mod1.null <- glm(casecont ~ 1, family = binomial, data = nocardia)
lr.mod1 <- -(deviance(mod1) / 2)
lr.mod1.null <- -(deviance(mod1.null) / 2)
(lr <- 2 * (lr.mod1 - lr.mod1.null))
[1] 41.73248
1 - pchisq(lr, 3) # 3 df: one per predictor in the full model
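
The same likelihood ratio test can be obtained in a single call with anova(), which compares the two nested models and reports the drop in deviance (41.73 on 3 df here) as a $\chi^2$ statistic:

```r
anova(mod1.null, mod1, test = "Chisq")
```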


We can also compare the likelihood of the model under investigation to the likelihood of a fully saturated model (one in which one parameter is fitted for each data point):

pchisq(deviance(mod1), df.residual(mod1), lower.tail = FALSE)
[1] 0.3748198


The p-value is above 0.05, so there is no evidence of lack of fit: the model fits the data adequately.

Finally, we can evaluate a set of coefficients with a Wald test. If we add barn type to the model, we can test for an overall effect of barn type (in wald.test(), b gives the coefficients, Sigma the variance-covariance matrix of the parameter estimates, and Terms indicates which coefficients to test, in the order they appear in the table of coefficients). We can also test the "other" barn type against tie-stall barns.

mod1b <- glm(casecont ~ dcpct + dneo + dclox + as.factor(dbarn),
family = binomial, data = nocardia)
summary(mod1b)

Call:
glm(formula = casecont ~ dcpct + dneo + dclox + as.factor(dbarn),
family = binomial, data = nocardia)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-1.6949  -0.7853   0.1021   0.7692   2.6801

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)              -2.445696   0.854328  -2.863  0.00420 **
dcpct                     0.021604   0.007657   2.821  0.00478 **
dneoYes                   2.685280   0.677273   3.965 7.34e-05 ***
dcloxYes                 -1.235266   0.580976  -2.126  0.03349 *
as.factor(dbarn)tiestall -1.333732   0.631925  -2.111  0.03481 *
as.factor(dbarn)other    -0.218350   1.154293  -0.189  0.84996
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 149.72  on 107  degrees of freedom
Residual deviance: 102.32  on 102  degrees of freedom
AIC: 114.32

Number of Fisher Scoring iterations: 5

library(aod)
wald.test(b = coef(mod1b), Sigma = vcov(mod1b),
Terms = 5:6) # positions of the two dbarn coefficients

comp <- cbind(0, 0, 0, 0, 1, -1)
wald.test(b = coef(mod1b), Sigma = vcov(mod1b), L = comp)
Wald test:
----------

Chi-squared test:
X2 = 1.1, df = 1, P(> X2) = 0.29


That’s it for now…
