Generalized Linear Models – Poisson Regression

June 26, 2011

(This article was first published on Software for Exploratory Data Analysis and Statistical Modelling » R Environment, and kindly contributed to R-bloggers)

The Generalized Linear Model (GLM) allows us to model responses with distributions other than the Normal distribution; Normality of the response is one of the assumptions underlying linear regression as it is commonly used. When the data are counts of events (or items), a discrete distribution is usually more appropriate than a continuous approximation, especially as counts are bounded below at zero: negative counts do not make sense.

To investigate Poisson regression via the GLM framework, consider a small data set on failure modes (here).

> failure.df = read.table("twomodes.dat", header = TRUE)
> failure.df
  Mode1 Mode2 Failures
1  33.3  25.3       15
2  52.2  14.4        9
3  64.7  32.5       14
4 137.0  20.5       24
5 125.9  97.6       27
6 116.3  53.6       27
7 131.7  56.6       23
8  85.0  87.3       18
9  91.9  47.8       22
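
For readers without access to twomodes.dat, a minimal sketch that rebuilds the same data frame from the values printed above may help. It assumes the printed listing is the complete file; the sanity checks and the names counts.ok, n.obs and total.failures are illustrative additions, not part of the original analysis.

```r
# Rebuild the data set from the printed listing (assumes the listing is complete).
failure.df = data.frame(
  Mode1    = c(33.3, 52.2, 64.7, 137.0, 125.9, 116.3, 131.7, 85.0, 91.9),
  Mode2    = c(25.3, 14.4, 32.5, 20.5, 97.6, 53.6, 56.6, 87.3, 47.8),
  Failures = c(15, 9, 14, 24, 27, 27, 23, 18, 22)
)

# Basic sanity checks for count data: whole numbers, bounded below at zero.
counts.ok = all(failure.df$Failures >= 0) &&
  all(failure.df$Failures == round(failure.df$Failures))

n.obs = nrow(failure.df)                    # number of rows in the listing
total.failures = sum(failure.df$Failures)   # total failures across all rows
```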

The machinery is run in two modes, and the objective of the analysis is to determine whether the number of failures depends on how long the machine is run in mode 1 or mode 2, and whether there is an interaction between the times spent in each mode that increases or decreases the number of failures.

The response for this set of data is the number of failures (a count), so a Poisson regression model is considered.
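
Written out, the Poisson model with both modes and their interaction can be sketched as follows. This assumes the log link, which is the default for family = poisson in R; the symbols \mu_i and \beta_j are notation introduced here rather than taken from the original post.

```latex
\begin{aligned}
\text{Failures}_i &\sim \text{Poisson}(\mu_i), \\
\log \mu_i &= \beta_0 + \beta_1\,\text{Mode1}_i + \beta_2\,\text{Mode2}_i
              + \beta_3\,\text{Mode1}_i \times \text{Mode2}_i .
\end{aligned}
```

Because \mu_i is the exponential of the linear predictor, the fitted mean is always positive, which respects the lower bound of zero on counts.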

> fmod1 = glm(Failures ~ Mode1 * Mode2, data = failure.df, family = poisson)
> summary(fmod1)
 
Call:
glm(formula = Failures ~ Mode1 * Mode2, family = poisson, data = failure.df)
 
Deviance Residuals: 
       1         2         3         4         5         6         7         8         9  
 0.91003  -1.15601  -0.28328  -0.10398   0.03526   0.84825  -0.49211  -0.57298   0.64821  
 
Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.105e+00  4.481e-01   4.698 2.63e-06 ***
Mode1        7.687e-03  4.285e-03   1.794   0.0729 .  
Mode2        4.703e-03  1.163e-02   0.405   0.6858    
Mode1:Mode2 -1.978e-05  1.037e-04  -0.191   0.8487    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
 
(Dispersion parameter for poisson family taken to be 1)
 
    Null deviance: 16.996  on 8  degrees of freedom
Residual deviance:  3.967  on 5  degrees of freedom
AIC: 55.024
 
Number of Fisher Scoring iterations: 4

The model output does not provide any support for an interaction between the amounts of time spent in the two modes of operation (the Mode1:Mode2 term has p = 0.85). If we remove the interaction term and re-fit the model, using the update function, we get:

> fmod2 = update(fmod1, . ~ . - Mode1:Mode2)
> summary(fmod2)
 
Call:
glm(formula = Failures ~ Mode1 + Mode2, family = poisson, data = failure.df)
 
Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.21984  -0.44735  -0.05893   0.68351   0.87510  
 
Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 2.175168   0.255456   8.515  < 2e-16 ***
Mode1       0.007015   0.002429   2.888  0.00387 ** 
Mode2       0.002549   0.002835   0.899  0.36852    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
 
(Dispersion parameter for poisson family taken to be 1)
 
    Null deviance: 16.9964  on 8  degrees of freedom
Residual deviance:  4.0033  on 6  degrees of freedom
AIC: 53.06
 
Number of Fisher Scoring iterations: 4
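
The decision to drop the interaction rested on the Wald z-test in the summary table. As a cross-check, the two nested models can also be compared with a likelihood-ratio (deviance) test. The sketch below is an addition to the original analysis; it rebuilds the data inline from the printed listing, and the names lrt, dev.diff and p.manual are illustrative.

```r
# Data rebuilt from the printed listing (assumed to be the complete file).
failure.df = data.frame(
  Mode1    = c(33.3, 52.2, 64.7, 137.0, 125.9, 116.3, 131.7, 85.0, 91.9),
  Mode2    = c(25.3, 14.4, 32.5, 20.5, 97.6, 53.6, 56.6, 87.3, 47.8),
  Failures = c(15, 9, 14, 24, 27, 27, 23, 18, 22)
)

fmod1 = glm(Failures ~ Mode1 * Mode2, data = failure.df, family = poisson)
fmod2 = update(fmod1, . ~ . - Mode1:Mode2)

# Likelihood-ratio (deviance) test: smaller model first, larger model second.
lrt = anova(fmod2, fmod1, test = "Chisq")
print(lrt)

# The same test by hand: the drop in deviance is compared with a chi-squared
# distribution on 1 degree of freedom (one parameter was removed).
dev.diff = deviance(fmod2) - deviance(fmod1)
p.manual = pchisq(dev.diff, df = 1, lower.tail = FALSE)
```

With the residual deviances reported above (4.0033 and 3.967), the difference is about 0.036 on 1 degree of freedom, which agrees with the Wald test in giving no evidence of an interaction.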

This output suggests that the time of operation in mode 1 is important for determining the number of failures, whereas there is no evidence that the time of operation in mode 2 matters (p = 0.37). Dropping Mode2 as a last step gives us:

> fmod3 = update(fmod2, . ~ . - Mode2)
> summary(fmod3)
 
Call:
glm(formula = Failures ~ Mode1, family = poisson, data = failure.df)
 
Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.43194  -0.56958  -0.00745   0.66742   0.82231  
 
Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 2.237196   0.243053   9.205  < 2e-16 ***
Mode1       0.007705   0.002264   3.403 0.000667 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
 
(Dispersion parameter for poisson family taken to be 1)
 
    Null deviance: 16.9964  on 8  degrees of freedom
Residual deviance:  4.8078  on 7  degrees of freedom
AIC: 51.865
 
Number of Fisher Scoring iterations: 4
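
Because the model uses a log link, the Mode1 coefficient is on the log scale, and exponentiating it gives a multiplicative effect on the expected number of failures. The following sketch shows one way to read the final model; it is an addition to the original post, the value of 100 time units is a hypothetical input chosen for illustration, and the interval is a simple Wald interval from confint.default.

```r
# Data rebuilt from the printed listing (assumed to be the complete file).
failure.df = data.frame(
  Mode1    = c(33.3, 52.2, 64.7, 137.0, 125.9, 116.3, 131.7, 85.0, 91.9),
  Mode2    = c(25.3, 14.4, 32.5, 20.5, 97.6, 53.6, 56.6, 87.3, 47.8),
  Failures = c(15, 9, 14, 24, 27, 27, 23, 18, 22)
)

fmod3 = glm(Failures ~ Mode1, data = failure.df, family = poisson)

# Multiplicative change in expected failures per extra unit of time in mode 1.
rate.ratio = exp(coef(fmod3)["Mode1"])

# Wald 95% interval for that rate ratio (estimate +/- 1.96 standard errors,
# computed on the log scale and then exponentiated).
ci = exp(confint.default(fmod3)["Mode1", ])

# Expected number of failures for a hypothetical run of 100 time units in mode 1.
new.run = data.frame(Mode1 = 100)
expected = predict(fmod3, newdata = new.run, type = "response")
```

Using the printed estimate, exp(0.007705) is roughly 1.0077, so each additional unit of time in mode 1 is associated with about 0.8% more expected failures.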

The diagnostic plots shown below do not indicate any major problems with the final model, especially given the small number of data points.

[Figure: Residual plots for the Poisson regression model]

Four diagnostic plots for a Poisson regression model based on total failures
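
The post does not show the code behind the figure. One plausible way to obtain a similar four-panel display is R's default plot method for fitted models; this is an assumption about how the figure was produced, and the Pearson dispersion estimate is an extra check that is not part of the original analysis.

```r
# Data rebuilt from the printed listing (assumed to be the complete file).
failure.df = data.frame(
  Mode1    = c(33.3, 52.2, 64.7, 137.0, 125.9, 116.3, 131.7, 85.0, 91.9),
  Mode2    = c(25.3, 14.4, 32.5, 20.5, 97.6, 53.6, 56.6, 87.3, 47.8),
  Failures = c(15, 9, 14, 24, 27, 27, 23, 18, 22)
)

fmod3 = glm(Failures ~ Mode1, data = failure.df, family = poisson)

# Pearson dispersion estimate: values well above 1 would hint at
# overdispersion relative to the Poisson assumption.
dispersion = sum(residuals(fmod3, type = "pearson")^2) / df.residual(fmod3)

# Default 2 x 2 diagnostic panel: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage.
op = par(mfrow = c(2, 2))
plot(fmod3)
par(op)   # restore the previous graphics settings
```

A related rough check uses the summary above: the residual deviance of 4.8078 on 7 degrees of freedom gives a ratio below 1, so there is no sign of overdispersion from that measure.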

Other useful resources are provided on the Supplementary Material page.
