Veterinary Epidemiologic Research: GLM – Evaluating Logistic Regression Models (part 3)

Posted on March 19, 2013 by denishaine in R bloggers | 0 Comments

[This article was first published on denis haine » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Third part on logistic regression (first here, second here).
Two steps in assessing the fit of the model: first is to determine if the model fits using summary measures of goodness of fit or by assessing the predictive ability of the model; second is to deterime if there’s any observations that do not fit the model or that have an influence on the model.

Covariate pattern
A covariate pattern is a unique combination of values of predictor variables.

mod3 <- glm(casecont ~ dcpct + dneo + dclox + dneo*dclox,
+             family = binomial("logit"), data = nocardia)
summary(mod3)

Call:
glm(formula = casecont ~ dcpct + dneo + dclox + dneo * dclox, 
    family = binomial("logit"), data = nocardia)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9191  -0.7682   0.1874   0.5876   2.6755  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -3.776896   0.993251  -3.803 0.000143 ***
dcpct             0.022618   0.007723   2.928 0.003406 ** 
dneoYes           3.184002   0.837199   3.803 0.000143 ***
dcloxYes          0.445705   1.026026   0.434 0.663999    
dneoYes:dcloxYes -2.551997   1.205075  -2.118 0.034200 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 149.72  on 107  degrees of freedom
Residual deviance: 103.42  on 103  degrees of freedom
AIC: 113.42

Number of Fisher Scoring iterations: 5

library(epiR)
Package epiR 0.9-45 is loaded
Type help(epi.about) for summary information

mod3.mf <- model.frame(mod3)
(mod3.cp <- epi.cp(mod3.mf[-1]))
$cov.pattern
    id  n dcpct dneo dclox
1    1  7     0   No    No
2    2 38   100  Yes    No
3    3  1    25   No    No
4    4  1     1   No    No
5    5 11   100   No   Yes
8    6  1    25  Yes   Yes
10   7  1    14  Yes    No
12   8  4    75  Yes    No
13   9  1    90  Yes   Yes
14  10  1    30   No    No
15  11  3     5  Yes    No
17  12  9   100  Yes   Yes
22  13  2    20  Yes    No
23  14  8   100   No    No
25  15  2    50  Yes   Yes
26  16  1     7   No    No
27  17  4    50  Yes    No
28  18  1    50   No    No
31  19  1    30  Yes    No
34  20  1    99   No    No
35  21  1    99  Yes   Yes
40  22  1    80  Yes   Yes
48  23  1     3  Yes    No
59  24  1     1  Yes    No
77  25  1    10   No    No
84  26  1    83   No   Yes
85  27  1    95  Yes    No
88  28  1    99  Yes    No
89  29  1    25  Yes    No
105 30  1    40  Yes    No

$id
  [1]  1  2  3  4  5  1  1  6  5  7  5  8  9 10 11 11 12  1 12  1  5 13 14  2 15
 [26] 16 17 18  1  2 19  2 14 20 21 12 14  5  8 22 14  5  5  5  1 14 14 23  2 12
 [51] 14 12 11  5 15  2  8  2 24  2  2  8  2 17  2  2  2  2 12 12 12  2  2  2  5
 [76]  2 25  2 17  2  2  2  2 26 27 13 17 28 29 14  2 12  2  2  2  2  2  2  2  2
[101]  2  2  2  2 30  2  2  5

There are 30 covariate patterns in the dataset. The pattern dcpct=100, dneo=Yes, dclox=No appears 38 times.

Pearson and deviance residuals
Residuals represent the difference between the data and the model. The Pearson residuals are comparable to standardized residuals used for linear regression models. Deviance residuals represent the contribution of each observation to the overall deviance.

residuals(mod3) # deviance residuals
residuals(mod3, "pearson") # pearson residuals

Goodness-of-fit test
All goodness-of-fit tests are based on the premise that the data will be divided into subsets and within each subset the predicted number of outcomes will be computed and compared to the observed number of outcomes. The Pearson $\chi^2$ and the deviance $\chi^2$ are based on dividing the data up into the natural covariate patterns. The Hosmer-Lemeshow test is based on a more arbitrary division of the data.

The Pearson $\chi^2$ is similar to the residual sum of squares used in linear models. It will be close in size to the deviance, but the model is fit to minimize the deviance and not the Pearson $\chi^2$ . It is thus possible even if unlikely that the $\chi^2$ could increase as a predictor is added to the model.

sum(residuals(mod3, type = "pearson")^2)
[1] 123.9656
deviance(mod3)
[1] 103.4168
1 - pchisq(deviance(mod3), df.residual(mod3))
[1] 0.4699251

The p-value is large indicating no evidence of lack of fit. However, when using the deviance statistic to assess the goodness-of-fit for a nonsaturated logistic model, the $\chi^2$ approximation for the likelihood ratio test is questionable. When the covariate pattern is almost as large as N, the deviance cannot be assumed to have a $\chi^2$ distribution.
Now the Hosmer-Lemeshow test, usually dividing by 10 the data:

hosmerlem <- function (y, yhat, g = 10) {
+   cutyhat <- cut(yhat, breaks = quantile(yhat, probs = seq(0, 1, 1/g)),
+                  include.lowest = TRUE)
+   obs <- xtabs(cbind(1 - y, y) ~ cutyhat)
+   expect <- xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
+   chisq <- sum((obs - expect)^2 / expect)
+   P <- 1 - pchisq(chisq, g - 2)
+   c("X^2" = chisq, Df = g - 2, "P(>Chi)" = P)
+ }
hosmerlem(y = nocardia$casecont, yhat = fitted(mod3))
Erreur dans cut.default(yhat, breaks = quantile(yhat, probs = seq(0, 1, 1/g)),  (depuis #2) :
  'breaks' are not unique

The model used has many ties in its predicted probabilities (too few covariate values?) resulting in an error when running the Hosmer-Lemeshow test. Using fewer cut-points (g = 5 or 7) does not solve the problem. This is a typical example when not to use this test. A better goodness-of-fit test than Hosmer-Lemeshow and Pearson / deviance $\chi^2$ tests is the le Cessie – van Houwelingen – Copas – Hosmer unweighted sum of squares test for global goodness of fit (also here) implemented in the rms package (but you have to implement your model with the lrm function of this package):

mod3b <- lrm(casecont ~ dcpct + dneo + dclox + dneo*dclox, nocardia,
+              method = "lrm.fit", model = TRUE, x = TRUE, y = TRUE,
+              linear.predictors = TRUE, se.fit = FALSE)
residuals(mod3b, type = "gof")
Sum of squared errors     Expected value|H0                    SD 
           16.4288056            16.8235055             0.2775694 
                    Z                     P 
           -1.4219860             0.1550303

The p-value is 0.16 so there’s no evidence the model is incorrect. Even better than these tests would be to check for linearity of the predictors.

Overdispersion
Sometimes we can get a deviance that is much larger than expected if the model was correct. It can be due to the presence of outliers, sparse data or clustering of data. The approach to deal with overdispersion is to add a dispersion parameter $\sigma^2$ . It can be estimated with: $\hat{\sigma}^2 = \frac{\chi^2}{n - p}$ (p = probability of success). A half-normal plot of the residuals can help checking for outliers:

library(faraway)
halfnorm(residuals(mod1))

Half-normal plot of the residuals

The dispesion parameter of model 1 can be found as:

(sigma2 <- sum(residuals(mod1, type = "pearson")^2) / 104)
[1] 1.128778
drop1(mod1, scale = sigma2, test = "F")
Single term deletions

Model:
casecont ~ dcpct + dneo + dclox

scale:  1.128778 

       Df Deviance    AIC F value    Pr(>F)    
<none>      107.99 115.99                      
dcpct   1   119.34 124.05 10.9350  0.001296 ** 
dneo    1   125.86 129.82 17.2166 6.834e-05 ***
dclox   1   114.73 119.96  6.4931  0.012291 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Message d'avis :
In drop1.glm(mod1, scale = sigma2, test = "F") :
  le test F implique une famille 'quasibinomial'

The dispersion parameter is not very different than one (no dispersion). If dispersion was present, you could use it in the F-tests for the predictors, adding scale to drop1.

Predictive ability of the model
A ROC curve can be drawn:

predicted <- predict(mod3)
library(ROCR)
prob <- prediction(predicted, nocardia$casecont, 
+                    label.ordering = c('Control', 'Case'))
tprfpr <- performance(prob, "tpr", "fpr")
tpr <- unlist(slot(tprfpr, "y.values"))
fpr <- unlist(slot(tprfpr, "x.values"))
roc <- data.frame(tpr, fpr)
ggplot(roc) + geom_line(aes(x = fpr, y = tpr)) + 
+   geom_abline(intercept = 0, slope = 1, colour = "gray") + 
+   ylab("Sensitivity") + 
+   xlab("1 - Specificity")

ROC curve

Identifying important observations
Like for linear regression, large positive or negative standardized residuals allow to identify points which are not well fit by the model. A plot of Pearson residuals as a function of the logit for model 1 is drawn here, with bubbles relative to size of the covariate pattern. The plot should be an horizontal band with observations between -3 and +3. Covariate patterns 25 and 26 are problematic.

nocardia$casecont.num <- as.numeric(nocardia$casecont) - 1
mod1 <- glm(casecont.num ~ dcpct + dneo + dclox, family = binomial("logit"),
+             data = nocardia) # "logit" can be omitted as it is the default
mod1.mf <- model.frame(mod1)
mod1.cp <- epi.cp(mod1.mf[-1])
nocardia.cp <- as.data.frame(cbind(cpid = mod1.cp$id,
+                                    nocardia[ , c(1, 9:11, 13)],
+                                    fit = fitted(mod1)))
### Residuals and delta betas based on covariate pattern:
mod1.obs <- as.vector(by(as.numeric(nocardia.cp$casecont.num),
+                          as.factor(nocardia.cp$cpid), FUN = sum))
mod1.fit <- as.vector(by(nocardia.cp$fit, as.factor(nocardia.cp$cpid),
+                          FUN = min))
mod1.res <- epi.cpresids(obs = mod1.obs, fit = mod1.fit,
+                          covpattern = mod1.cp)

mod1.lodds <- as.vector(by(predict(mod1), as.factor(nocardia.cp$cpid),
+                            FUN = min))

plot(mod1.lodds, mod1.res$spearson,
+      type = "n", ylab = "Pearson Residuals", xlab = "Logit")
text(mod1.lodds, mod1.res$spearson, labels = mod1.res$cpid, cex = 0.8)
symbols(mod1.lodds, mod1.res$pearson, circles = mod1.res$n, add = TRUE)

Bubble plot of standardized residuals

The hat matrix is used to calculate leverage values and other diagnostic parameters. Leverage measures the potential impact of an observation. Points with high leverage have a potential impact. Covariate patterns 2, 14, 12 and 5 have the largest leverage values.

mod1.res[sort.list(mod1.res$leverage, decreasing = TRUE), ]
cpid  leverage
2   0.74708052
14  0.54693851
12  0.54017700
5   0.42682684
11  0.21749664
1   0.19129427
...

Delta-betas provides an overall estimate of the effect of the $j^{th}$ covariate pattern on the regression coefficients. It is analogous to Cook’s distance in linear regression. Covariate pattern 2 has the largest delta-beta (and represents 38 observations).

mod1.res[sort.list(mod1.res$sdeltabeta, decreasing = TRUE), ]
cpid sdeltabeta
2    7.890878470
14   3.983840529
...

To leave a comment for the author, please follow the link and comment on their blog: denis haine » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Veterinary Epidemiologic Research: GLM – Evaluating Logistic Regression Models (part 3)

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)