[This article was first published on Modeling with R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

## Introduction:

When we deal with data that has a response variable of integer type, using a linear regression may violate the normality assumption and hence all the classical statistic tests would fail to evaluate the model. However, as we do with logistic regression models, the generalized linear model GLM can be used instead here by specifying the suitable distribution.

The possible distributions for this type of data are the discrete distributions poisson and negative binomial. The former is the best choice if the mean and the variance of the response variable are closer to each other, if they are not however and we persist using this distribution we may cause the rise of the overdispersion problem of the residuals. As a solution thus, we can use the latter distribution that does not have this restriction.

There is another alternative if neither the poisson distribution nor the negative binomial are suitable called the Quasi maximum likelihood. The advantage of this method is that uses only the relationship between the mean and the variance and does not require any prespecified distribution. Moreover, its estimators are approximately as efficient as the maximum likelihood estimators.

## Data preparation

To well understand how to model the count data we are going be using Doctorvisits data from AER package, in which the variable visits will be our target variable, so let’s call this data with the packages that we need along this article.

ssh <- suppressPackageStartupMessages
ssh(library(performance))
ssh(library(ModelMetrics))
ssh(library(corrr))
ssh(library(purrr))
ssh(library(MASS))
ssh(library(tidyverse))
ssh(library(AER))
ssh(library(broom))
data("DoctorVisits")
doc <- DoctorVisits
glimpse(doc)
Observations: 5,190
Variables: 12
$visits <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, ...$ gender    <fct> female, female, male, male, male, female, female, female,...
$age <dbl> 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.1...$ income    <dbl> 0.55, 0.45, 0.90, 0.15, 0.45, 0.35, 0.55, 0.15, 0.65, 0.1...
$illness <dbl> 1, 1, 3, 1, 2, 5, 4, 3, 2, 1, 1, 2, 3, 4, 3, 2, 1, 1, 1, ...$ reduced   <dbl> 4, 2, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 13, 7, 1, 0, 0, 1, 0,...
$health <dbl> 1, 1, 0, 0, 1, 9, 2, 6, 5, 0, 0, 2, 1, 6, 0, 7, 5, 0, 0, ...$ private   <fct> yes, yes, no, no, no, no, no, no, yes, yes, no, no, no, n...
$freepoor <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, n...$ freerepat <fct> no, no, no, no, no, no, no, no, no, no, no, yes, no, no, ...
$nchronic <fct> no, no, no, no, yes, yes, no, no, no, no, no, no, yes, ye...$ lchronic  <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, n...

This data from Australian health survey where visits is the number of doctor visits in past two weeks with 11 features listed above.

First we list the summary of the data to inspect any unwanted issue.

summary(doc)
visits          gender          age             income
Min.   :0.0000   male  :2488   Min.   :0.1900   Min.   :0.0000
1st Qu.:0.0000   female:2702   1st Qu.:0.2200   1st Qu.:0.2500
Median :0.0000                 Median :0.3200   Median :0.5500
Mean   :0.3017                 Mean   :0.4064   Mean   :0.5832
3rd Qu.:0.0000                 3rd Qu.:0.6200   3rd Qu.:0.9000
Max.   :9.0000                 Max.   :0.7200   Max.   :1.5000
illness         reduced            health       private    freepoor
Min.   :0.000   Min.   : 0.0000   Min.   : 0.000   no :2892   no :4968
1st Qu.:0.000   1st Qu.: 0.0000   1st Qu.: 0.000   yes:2298   yes: 222
Median :1.000   Median : 0.0000   Median : 0.000
Mean   :1.432   Mean   : 0.8619   Mean   : 1.218
3rd Qu.:2.000   3rd Qu.: 0.0000   3rd Qu.: 2.000
Max.   :5.000   Max.   :14.0000   Max.   :12.000
freerepat  nchronic   lchronic
no :4099   no :3098   no :4585
yes:1091   yes:2092   yes: 605



As we see we do not have missing values and the visits values ranges from 0 to 9 but it should be of integer type rather than double. Similarly, the variable illness should be converted to factor type since it has a few different values.

doc$visits<-as.integer(doc$visits)
doc$illness <- as.factor(doc$illness)
tab <- table(doc$visits) tab 0 1 2 3 4 5 6 7 8 9 4141 782 174 30 24 9 12 12 5 1  The best thing we do to start analyzing the data is by displaying the correlation coefficient of each pair variables we have. Thus, any particular predictor that has high correlation with the target variable could be highly likely to be relevant in our future model. Notice that our target variable is not continuous hence we will use the spearman correlation. As required by correlate function from corrr package, all the variables must be of numeric type so we convert all the factor to integer. doc1 <-modify_if(doc, is.factor, as.integer) notice that we have stored the result in another object doc1 to keep save our original data. M <- correlate(doc1, method="spearman") rplot(shave(M), colours=c("red", "white", "blue" ))+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) Looking at this plot all the correlations has low values. however, these correlations assess only the monotonic relations, they say nothing about any other form of relation. First let’s compare the empirical distribution of the variable visits and the theoretical poisson distribution with $$\lambda$$ equals the visits mean mean(doc$visits), and the total number of observations is 5190.

pos <- dpois(0:9,0.302)*5190
both <- numeric(20)
both[1:20 %% 2 != 0] <- tab
both[1:20 %% 2 == 0] <- pos
labels<-character(20)
labels[1:20 %% 2==0]<-as.character(0:9)
barplot(both,col=rep(c("red","yellow"),10),names=labels)

As we see the two distributions are more or less closer to each other. Let’s now check the negative binomial distribution by first estimate the clumping parameter $$k=\frac{\bar x^2}{s^2-\bar x}$$.

k<-mean(doc$visits)^2/(var(doc$visits)-mean(doc$visits)) bin<-dnbinom(0:9,0.27,mu=0.302)*5190 both1<-numeric(20) both1[1:20 %% 2 != 0]<-tab both1[1:20 %% 2 == 0]<-bin labels<-character(20) labels[1:20 %% 2==0]<-as.character(0:9) barplot(both1,col=rep(c("red","yellow"),10),names=labels) With this distribution it seems that the empiricall distribution is more closer to the negative binomial than the poisson distribution. Note: This data has very large number of zeros for the outcome compared to the other values which means that any trained model that does not take into account this anomaly will be biased to predict more likely the zero value. However, at the end of this article I will show two famous models to handel this type of count data called Haurdle model and zero_inflated model. ## Data partition In oreder to evaluate our model we held out 20% of the data as testing set. set.seed(123) index<-sample(2,nrow(doc),replace = TRUE,p=c(.8,.2)) train<-doc[index==1,] test<-doc[index==2,] ## Poisson model This model belongs to the generalized linear model families, so in the function glm we set the argument family to poisson. In practice this model is sufficient with a wide range of count data. set.seed(123) model1<-glm(visits~., data=train, family ="poisson") tidy(model1) # A tibble: 16 x 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl> 1 (Intercept) -2.70 0.141 -19.2 9.14e- 82 2 genderfemale 0.193 0.0620 3.11 1.88e- 3 3 age 0.436 0.184 2.37 1.77e- 2 4 income -0.161 0.0928 -1.74 8.23e- 2 5 illness1 0.944 0.113 8.35 6.76e- 17 6 illness2 1.21 0.118 10.3 1.15e- 24 7 illness3 1.11 0.132 8.43 3.51e- 17 8 illness4 1.28 0.140 9.13 7.14e- 20 9 illness5 1.44 0.139 10.4 2.34e- 25 10 reduced 0.126 0.00560 22.6 6.85e-113 11 health 0.0348 0.0112 3.10 1.91e- 3 12 privateyes 0.111 0.0795 1.39 1.64e- 1 13 freepooryes -0.344 0.190 -1.81 7.00e- 2 14 freerepatyes 0.0377 0.104 0.363 7.16e- 1 15 nchronicyes 0.0186 0.0732 0.254 7.99e- 1 16 lchronicyes 0.0255 0.0916 0.279 7.81e- 1 As we see all the variables are significant except for the income so we remove this variable and reestimate again. set.seed(123) model1<-glm(visits~.-income, data=train, family ="poisson") tidy(model1) # A tibble: 15 x 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl> 1 (Intercept) -2.83 0.121 -23.4 7.36e-121 2 genderfemale 0.213 0.0609 3.51 4.56e- 4 3 age 0.479 0.183 2.62 8.91e- 3 4 illness1 0.946 0.113 8.38 5.44e- 17 5 illness2 1.21 0.118 10.3 8.29e- 25 6 illness3 1.12 0.132 8.50 1.93e- 17 7 illness4 1.28 0.140 9.17 4.71e- 20 8 illness5 1.45 0.139 10.5 1.05e- 25 9 reduced 0.126 0.00560 22.6 1.12e-112 10 health 0.0350 0.0112 3.11 1.84e- 3 11 privateyes 0.100 0.0793 1.27 2.06e- 1 12 freepooryes -0.290 0.188 -1.55 1.22e- 1 13 freerepatyes 0.0683 0.102 0.667 5.05e- 1 14 nchronicyes 0.0171 0.0731 0.235 8.15e- 1 15 lchronicyes 0.0282 0.0914 0.308 7.58e- 1 For the interpretation of the coefficient estimates, we should exponentiate these values to get the marginal effect since the poisson model uses the log link function to preclude negative values. For continuous predictor, say age, if this predictor increases by one year, ceteris-paribus, we expect the doctor visits will be $$exp(0.47876624)=1.614082$$ times larger. whereas, for categorical predictor, say gender, the female has $$exp(0.21342446)=1.23791$$ larger doctor visits than male. By looking at p-values all the predictors are significant. However, we have to check other statistics and metrics. glance(model1) # A tibble: 1 x 7 null.deviance df.null logLik AIC BIC deviance df.residual <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> 1 4565. 4154 -2685. 5399. 5494. 3486. 4140 since the deviance value 3485.905 is lower than the degrees of freedom 4140, we will then worry about overdispersion problem. Fortunateley, the AER package provides a super easy way to test the significance of this difference via the function dispersiontest. dispersiontest(model1) Overdispersion test data: model1 z = 6.278, p-value = 1.714e-10 alternative hypothesis: true dispersion is greater than 1 sample estimates: dispersion 1.397176  If our target variable really follows poisson distribution then its variance $$V$$ should be approximately equal to its mean $$\mu$$, which is the null hypothesis of the following dispersiontest test against the alternative hypothesis that the variance of the form: $V=\mu+\alpha.trafo(\mu)$ Where the trafo is an hyperparameter that should be specified as an argument of this test. The popular choices for this argument are: • trafo = NULL (default): $$V=(1+\alpha)\mu$$ • trafo = 1: $$V=\mu+\alpha.\mu$$ • trafo = 2: $$V=\mu+\alpha.\mu^2$$ For the first choice if true, then the data will be better modeled by quasi-poisson model than poisson model. For the last ones if one of them is true then the negative binomial will be better than poisson model. Now once the trafo is defined the test estimates $$\alpha$$, such that: • if $$\alpha = 0$$ : equidispersion (The null hypothesis) • if $$\alpha < 0$$ : underdispersion • if $$\alpha > 0$$ : overdispersion Therefore, the result of the test will depend on the direction of the test, where we have two.sided, greater (default) for the overdispersion, and less for underdispersion. With this in mind the output of the above test (with the default values) tested the overdispersion against the quasi-poisson model, and since the p-value is very small 1.714e-10 then we have overdispersion problem, suggesting the use of quasi-poisson model instead. Now let’s test the negative binomial now. dispersiontest(model1, trafo = 1) Overdispersion test data: model1 z = 6.278, p-value = 1.714e-10 alternative hypothesis: true alpha is greater than 0 sample estimates: alpha 0.3971763  The test suggested the use of negative binomial with linear function for the variance with very tiny p-value 1.714e-10. This model is known as NB1 (with linear variance function). dispersiontest(model1, trafo = 2) Overdispersion test data: model1 z = 7.4723, p-value = 3.939e-14 alternative hypothesis: true alpha is greater than 0 sample estimates: alpha 0.95488  If the relation is in quadratic form then this model is called NB2. And since this p-value 3.939e-14 is smaller than the previous one then NB2 could be more appropriate than NB1. ## Quasi poisson model The first test suggested the use of quasi-poisson model, so let’s train this model with the same predictors as the previous one. set.seed(123) model2<-glm(visits~.-income, data=train, family ="quasipoisson") tidy(model2) # A tibble: 15 x 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl> 1 (Intercept) -2.83 0.140 -20.2 1.25e-86 2 genderfemale 0.213 0.0705 3.03 2.47e- 3 3 age 0.479 0.212 2.26 2.39e- 2 4 illness1 0.946 0.131 7.24 5.36e-13 5 illness2 1.21 0.136 8.89 9.15e-19 6 illness3 1.12 0.152 7.34 2.49e-13 7 illness4 1.28 0.162 7.92 2.91e-15 8 illness5 1.45 0.160 9.06 2.01e-19 9 reduced 0.126 0.00647 19.5 4.79e-81 10 health 0.0350 0.0130 2.69 7.14e- 3 11 privateyes 0.100 0.0917 1.09 2.74e- 1 12 freepooryes -0.290 0.217 -1.34 1.81e- 1 13 freerepatyes 0.0683 0.119 0.576 5.64e- 1 14 nchronicyes 0.0171 0.0846 0.203 8.39e- 1 15 lchronicyes 0.0282 0.106 0.266 7.90e- 1 This model uses the quasi maximum likelihood which gives the same coefficient estimates but with different (corrected) standard errors. Since here also all the variables are significant We see that the models are the same except the correction of the standard errors which are now more larger. In other words, the poisson distribution under overdispersion underestimates the standard errors and hence the t test would be biased towards the rejection of the null hypothesis. To better understand what is going on with quasi-poisson model let’s put the estimates and the standard errors of both models into one table, and we add a column that resulted from dividing the second standard errors vector by the first one. D1 <- tidy(model1) colnames(D1) <- NULL D2 <- tidy(model2) colnames(D2) <- NULL cbind(term=D1[,1], estimate1=D1[,2], std1=D1[,3],estimate2=D2[,2], std2=D2[,3]) %>% mutate(dispersion= std2/std1) term estimate1 std1 estimate2 std2 dispersion 1 (Intercept) -2.82868582 0.121003941 -2.82868582 0.140023371 1.15718 2 genderfemale 0.21342446 0.060883445 0.21342446 0.070453120 1.15718 3 age 0.47876624 0.183054884 0.47876624 0.211827495 1.15718 4 illness1 0.94643852 0.112984298 0.94643852 0.130743198 1.15718 5 illness2 1.21007304 0.117661029 1.21007304 0.136155018 1.15718 6 illness3 1.11941845 0.131726935 1.11941845 0.152431807 1.15718 7 illness4 1.28108306 0.139695671 1.28108306 0.161653071 1.15718 8 illness5 1.45198517 0.138529459 1.45198517 0.160303554 1.15718 9 reduced 0.12621548 0.005595111 0.12621548 0.006474552 1.15718 10 health 0.03497647 0.011228772 0.03497647 0.012993714 1.15718 11 privateyes 0.10033918 0.079266327 0.10033918 0.091725427 1.15718 12 freepooryes -0.29008562 0.187570838 -0.29008562 0.217053268 1.15718 13 freerepatyes 0.06829904 0.102422323 0.06829904 0.118521089 1.15718 14 nchronicyes 0.01714304 0.073082122 0.01714304 0.084569188 1.15718 15 lchronicyes 0.02818992 0.091425522 0.02818992 0.105795808 1.15718 Note: The first two columns are for the model1, and the last one are for the model 2. Not surprisingly that the result of the last column is constant since this is exactely what the quasi maximum likelihood does, it computes the corrected standard errors from the original ones as follows $$std2=dispersion*std1$$, with the dispersion value being estimated as 1.15718. if you want to know where this value came from, the answer is simple. this model computes the sigma of the standardized residuals resulted from the original model. we can thus get this value by specifying the argument type to pear then computing sigma by hand as follows: resid <- resid(model1, type = "pear") sqrt(sum(resid^2)/4140) [1] 1.15718 Now to test the prediction qualities of our models we use the testing set test by ploting the original and the predicted values. Let’s start by the model1 pred<- predict.glm(model1,newdata=test[test$visits!=0,],type = "response")
plot(test$visits[test$visits!=0],type = "b",col="red")
lines(round(pred),col="blue")

If you noticed, and due to the large number of zero’s of the target variable, i have intentionally removed all theses values in order to get clearer plot. From this plot we can say that the model does not fit well the data especially the larger values that are not well captured, however this may due to the fact that the data are very skewed towards zero.

To compare different models we can use the root mean-square error and mean absolute error (all the data with zero’s included). Note: Here we are using the rmse function from ModelMetrics that expects the inpute to be two vectors, and not that with the same name from the performance package that expects the input to be a model object . To avoid thus any such ambiguity you should type this command ModelMetrics::rmse.

pred <- predict.glm(model1, newdata = test, type = "response")
rmsemodelp <- ModelMetrics::rmse(test$visits,round(pred)) maemodelp <- mae(test$visits,round(pred))
rmsemodelp
[1] 0.7381921
maemodelp
[1] 0.284058

By the same way, Now let’s evaluate the quasi-poisson model.

predq<- predict.glm(model2,newdata=test[test$visits!=0,],type = "response") plot(test$visits[test$visits!=0],type = "b",col="red") lines(round(predq),col="blue") This plot does not seem to be very different from the previous plot. The rmse and mae for this model are computed as follows. predq <- predict.glm(model2,newdata=test, type = "response") rmsemodelqp <- ModelMetrics::rmse(test$visits,round(predq))
maemodelqp <- mae(test$visits,round(predq)) rmsemodelqp [1] 0.7381921 maemodelqp [1] 0.284058 we will not compare this two models until we finish with all the incoming models and we compare all the models at once. ## Negative binomial model The negative binomial distribution is used as an alternative for the poisson distribution under overdispersion problem. set.seed(123) model3<-glm.nb(visits~.-income, data=train) summary(model3) Call: glm.nb(formula = visits ~ . - income, data = train, init.theta = 0.9715923611, link = log) Deviance Residuals: Min 1Q Median 3Q Max -1.8413 -0.6894 -0.5335 -0.3540 3.6726 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.894820 0.135760 -21.323 < 2e-16 *** genderfemale 0.258968 0.075352 3.437 0.000589 *** age 0.511867 0.230297 2.223 0.026240 * illness1 0.880644 0.123264 7.144 9.04e-13 *** illness2 1.171615 0.130240 8.996 < 2e-16 *** illness3 1.118067 0.149032 7.502 6.28e-14 *** illness4 1.263367 0.165370 7.640 2.18e-14 *** illness5 1.378166 0.169907 8.111 5.01e-16 *** reduced 0.141389 0.008184 17.275 < 2e-16 *** health 0.041364 0.015029 2.752 0.005918 ** privateyes 0.086188 0.095173 0.906 0.365149 freepooryes -0.375471 0.223857 -1.677 0.093487 . freerepatyes 0.144928 0.127751 1.134 0.256602 nchronicyes 0.022111 0.087590 0.252 0.800705 lchronicyes 0.091622 0.114965 0.797 0.425477 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for Negative Binomial(0.9716) family taken to be 1) Null deviance: 3208.0 on 4154 degrees of freedom Residual deviance: 2431.2 on 4140 degrees of freedom AIC: 5159.6 Number of Fisher Scoring iterations: 1 Theta: 0.972 Std. Err.: 0.103 2 x log-likelihood: -5127.587  As before we visualize the performance of this model as follows. prednb<- predict.glm(model3,newdata=test[test$visits!=0,],type = "response")
plot(test$visits[test$visits!=0],type = "b",col="red")
lines(round(prednb),col="blue")

Again this plot also seems to be the same as the previous ones, so to figure out which model is best we use statistic metrics.

prednb<- predict.glm(model3,newdata=test,type = "response")
rmsemodelnb<-ModelMetrics::rmse(test$visits,round(prednb)) maemodelnb<-mae(test$visits,round(prednb))
rmsemodelnb
[1] 0.7808085
maemodelnb
[1] 0.2966184

we will use these ouputs further.

## Hurdle model

Originally proposed by Mullahy (1986) this model can take into account the fact that the data has more zeros and also can handle the overdispersion problem. It has two components (or steps), truncated count component defined by the chosen discrete distribution such as poisson or negative binomial, and a hurdle components models zero vs larger counts (that uses censored count distribution or binomial model). In other words, this models asumes that two population distributions underlying the data, one distribution for zero values, and another different distribution the psotive values. For more detail about hurdle and zero inflated models click here

To perform this model we make use of the function hurdle from the package pscl.

### hurdle model with poisson distribution.

This model works in two steps. In the first step it uses binary classification to discriminate between the zero values and the positive values, and in the second step uses the traditional (poisson or binomial model, and here we use poisson model) model for positive values.

library(pscl)
set.seed(123)
modelhp<-hurdle(visits~. -income, data=train,dist = "poisson")
summary(modelhp)

Call:
hurdle(formula = visits ~ . - income, data = train, dist = "poisson")

Pearson residuals:
Min      1Q  Median      3Q     Max
-1.5464 -0.4686 -0.3306 -0.2075 11.0887

Count model coefficients (truncated poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.977535   0.261835  -3.733 0.000189 ***
genderfemale  0.073326   0.098034   0.748 0.454480
age           0.032762   0.287991   0.114 0.909427
illness1      0.370071   0.251920   1.469 0.141833
illness2      0.403514   0.256363   1.574 0.115489
illness3      0.201724   0.278757   0.724 0.469277
illness4      0.420285   0.277573   1.514 0.129990
illness5      0.762209   0.269809   2.825 0.004728 **
reduced       0.111640   0.007967  14.013  < 2e-16 ***
health        0.007682   0.016452   0.467 0.640554
privateyes   -0.215649   0.129860  -1.661 0.096787 .
freepooryes   0.066277   0.269699   0.246 0.805879
freerepatyes -0.434941   0.166196  -2.617 0.008870 **
nchronicyes   0.109660   0.125380   0.875 0.381779
lchronicyes   0.135612   0.142766   0.950 0.342166
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)  -3.224621   0.156469 -20.609  < 2e-16 ***
genderfemale  0.305324   0.089648   3.406 0.000660 ***
age           0.700345   0.276089   2.537 0.011191 *
illness1      0.885148   0.136951   6.463 1.02e-10 ***
illness2      1.238227   0.146059   8.478  < 2e-16 ***
illness3      1.263698   0.169344   7.462 8.50e-14 ***
illness4      1.405167   0.195388   7.192 6.40e-13 ***
illness5      1.445585   0.208425   6.936 4.04e-12 ***
reduced       0.154858   0.013488  11.481  < 2e-16 ***
health        0.070464   0.019142   3.681 0.000232 ***
privateyes    0.271192   0.112751   2.405 0.016163 *
freepooryes  -0.546177   0.277942  -1.965 0.049406 *
freerepatyes  0.423220   0.153994   2.748 0.005991 **
nchronicyes  -0.006256   0.102033  -0.061 0.951106
lchronicyes   0.070658   0.140587   0.503 0.615251
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of iterations in BFGS optimization: 22
Log-likelihood: -2581 on 30 Df

As we see this output has two tables. The above one is for the poisson model performed only on the truncated positive values, and the below one is the result of the logistic regression with only two classes (zero or positive value)
As we did before we plot the results.

predhp<- predict(modelhp,newdata=test[test$visits!=0,],type = "response") plot(test$visits[test$visits!=0],type = "b",col="red") lines(round(predhp),col="blue") As before by only looking at the plot we can not decide which model is the best. So it is better to use the statistic metrics. predhp<- predict(modelhp,newdata=test, type = "response") rmsemodelhp<-ModelMetrics::rmse(test$visits,round(predhp))
maemodelhp<-mae(test$visits,round(predhp)) rmsemodelhp [1] 0.7375374 maemodelhp [1] 0.2850242 ### hurdle model with negative binomial distribution. Now let’s try to use the negative binomial instead of poisson distribution. set.seed(123) modelhnb<-hurdle(visits~.-income, data=train,dist = "negbin") summary(modelhnb) Call: hurdle(formula = visits ~ . - income, data = train, dist = "negbin") Pearson residuals: Min 1Q Median 3Q Max -0.9078 -0.4515 -0.3201 -0.2022 10.6552 Count model coefficients (truncated negbin with log link): Estimate Std. Error z value Pr(>|z|) (Intercept) -3.68462 2.65037 -1.390 0.1645 genderfemale 0.07432 0.17299 0.430 0.6675 age 0.26774 0.54001 0.496 0.6200 illness1 0.27678 0.34564 0.801 0.4233 illness2 0.37093 0.35241 1.053 0.2925 illness3 0.04728 0.39747 0.119 0.9053 illness4 0.40386 0.40517 0.997 0.3189 illness5 0.68213 0.41357 1.649 0.0991 . reduced 0.15813 0.01935 8.171 3.05e-16 *** health 0.01891 0.03291 0.575 0.5656 privateyes -0.45711 0.23118 -1.977 0.0480 * freepooryes 0.03334 0.55282 0.060 0.9519 freerepatyes -0.59189 0.30437 -1.945 0.0518 . nchronicyes 0.08737 0.21061 0.415 0.6783 lchronicyes 0.30274 0.25846 1.171 0.2415 Log(theta) -2.80552 2.80120 -1.002 0.3166 Zero hurdle model coefficients (binomial with logit link): Estimate Std. Error z value Pr(>|z|) (Intercept) -3.224621 0.156469 -20.609 < 2e-16 *** genderfemale 0.305324 0.089648 3.406 0.000660 *** age 0.700345 0.276089 2.537 0.011191 * illness1 0.885148 0.136951 6.463 1.02e-10 *** illness2 1.238227 0.146059 8.478 < 2e-16 *** illness3 1.263698 0.169344 7.462 8.50e-14 *** illness4 1.405167 0.195388 7.192 6.40e-13 *** illness5 1.445585 0.208425 6.936 4.04e-12 *** reduced 0.154858 0.013488 11.481 < 2e-16 *** health 0.070464 0.019142 3.681 0.000232 *** privateyes 0.271192 0.112751 2.405 0.016163 * freepooryes -0.546177 0.277942 -1.965 0.049406 * freerepatyes 0.423220 0.153994 2.748 0.005991 ** nchronicyes -0.006256 0.102033 -0.061 0.951106 lchronicyes 0.070658 0.140587 0.503 0.615251 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Theta: count = 0.0605 Number of iterations in BFGS optimization: 31 Log-likelihood: -2524 on 31 Df And let’s plot the difference between the predicted and the actual values of the testing set. . predhnb<- predict(modelhnb,newdata=test[test$visits!=0,],type = "response")
plot(test$visits[test$visits!=0],type = "b",col="red")
lines(round(predhnb),col="blue")

And for the metrics.

predhnb<- predict(modelhnb,newdata=test,type = "response")
rmsemodelhnb<-ModelMetrics::rmse(test$visits,round(predhnb)) maemodelhnb<-mae(test$visits,round(predhnb))
rmsemodelhnb
[1] 0.7408052
maemodelhnb
[1] 0.2879227

## Zero inflated model

Such as the previous model type , this model also combines two components but with the difference that this model performs a mixture of binomial distribution (between zero and positive values) and the poisson (or negative binomial) distribution for the rest of the values (with the zero included).

### Zero inflated model with poisson distribution

Here also we fit tow models one with poisson and one with negative binomial

set.seed(123)
modelzp<-zeroinfl(visits~.-income, data=train,dist = "poisson")
summary(modelzp)

Call:
zeroinfl(formula = visits ~ . - income, data = train, dist = "poisson")

Pearson residuals:
Min      1Q  Median      3Q     Max
-1.6247 -0.4791 -0.3326 -0.1783 12.3448

Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.86466    0.23608  -3.663  0.00025 ***
genderfemale  0.03280    0.09078   0.361  0.71788
age           0.13330    0.26922   0.495  0.62050
illness1      0.32985    0.21846   1.510  0.13107
illness2      0.34799    0.22426   1.552  0.12072
illness3      0.20399    0.24152   0.845  0.39833
illness4      0.44019    0.24324   1.810  0.07034 .
illness5      0.72462    0.23632   3.066  0.00217 **
reduced       0.09679    0.00809  11.964  < 2e-16 ***
health        0.02269    0.01609   1.410  0.15855
privateyes   -0.26390    0.12796  -2.062  0.03918 *
freepooryes   0.04860    0.27675   0.176  0.86060
freerepatyes -0.51895    0.17070  -3.040  0.00236 **
nchronicyes   0.08577    0.11489   0.747  0.45536
lchronicyes   0.10876    0.12745   0.853  0.39347

Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)   2.44814    0.34932   7.008 2.41e-12 ***
genderfemale -0.48766    0.19038  -2.561 0.010422 *
age          -0.88817    0.59030  -1.505 0.132426
illness1     -0.80835    0.31247  -2.587 0.009683 **
illness2     -1.41462    0.35338  -4.003 6.25e-05 ***
illness3     -1.69205    0.44028  -3.843 0.000121 ***
illness4     -1.52225    0.46334  -3.285 0.001018 **
illness5     -1.08745    0.46493  -2.339 0.019338 *
reduced      -0.14462    0.03861  -3.746 0.000180 ***
health       -0.05795    0.04486  -1.292 0.196418
privateyes   -0.73945    0.22597  -3.272 0.001066 **
freepooryes   0.73370    0.41402   1.772 0.076372 .
freerepatyes -1.75457    0.53937  -3.253 0.001142 **
nchronicyes   0.13230    0.22623   0.585 0.558678
lchronicyes   0.03646    0.30619   0.119 0.905217
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of iterations in BFGS optimization: 80
Log-likelihood: -2579 on 30 Df
predzp<- predict(modelzp,newdata=test[test$visits!=0,],type = "response") plot(test$visits[test$visits!=0],type = "b",col="red") lines(round(predzp),col="blue") predzp<- predict(modelzp,newdata=test,type = "response") rmsemodelzp<-ModelMetrics::rmse(test$visits,round(predzp))
maemodelzp<-mae(test$visits,round(predzp)) rmsemodelzp [1] 0.7485897 maemodelzp [1] 0.2898551 ### Zero inflated model with negative binomial distribution Let’s this time try the negative binomial distribution. set.seed(123) modelznb<-zeroinfl(visits~., data=train,dist = "negbin") summary(modelznb) Call: zeroinfl(formula = visits ~ ., data = train, dist = "negbin") Pearson residuals: Min 1Q Median 3Q Max -1.0440 -0.4582 -0.3031 -0.1680 14.2061 Count model coefficients (negbin with log link): Estimate Std. Error z value Pr(>|z|) (Intercept) -1.383088 0.241780 -5.720 1.06e-08 *** genderfemale 0.038982 0.090534 0.431 0.666775 age 0.028836 0.290843 0.099 0.921022 income -0.196594 0.147560 -1.332 0.182763 illness1 0.613261 0.174653 3.511 0.000446 *** illness2 0.692295 0.179665 3.853 0.000117 *** illness3 0.664614 0.196063 3.390 0.000699 *** illness4 0.760167 0.204138 3.724 0.000196 *** illness5 0.944761 0.206098 4.584 4.56e-06 *** reduced 0.102651 0.008776 11.697 < 2e-16 *** health 0.044012 0.015536 2.833 0.004612 ** privateyes -0.168877 0.138680 -1.218 0.223321 freepooryes -0.422737 0.306654 -1.379 0.168034 freerepatyes -0.383568 0.163995 -2.339 0.019341 * nchronicyes 0.033375 0.107881 0.309 0.757040 lchronicyes 0.065839 0.128987 0.510 0.609748 Log(theta) 0.473940 0.142626 3.323 0.000891 *** Zero-inflation model coefficients (binomial with logit link): Estimate Std. Error z value Pr(>|z|) (Intercept) 2.280216 0.532170 4.285 1.83e-05 *** genderfemale -0.726916 0.275312 -2.640 0.00828 ** age -2.002886 0.920168 -2.177 0.02951 * income -0.180281 0.393267 -0.458 0.64665 illness1 -0.332734 0.347965 -0.956 0.33896 illness2 -1.111993 0.449603 -2.473 0.01339 * illness3 -0.953342 0.512702 -1.859 0.06296 . illness4 -1.551433 0.739821 -2.097 0.03599 * illness5 -1.229820 0.859693 -1.431 0.15256 reduced -1.298180 0.457644 -2.837 0.00456 ** health -0.001444 0.055085 -0.026 0.97909 privateyes -0.817933 0.317762 -2.574 0.01005 * freepooryes 0.239439 0.664835 0.360 0.71874 freerepatyes -13.563347 519.415149 -0.026 0.97917 nchronicyes 0.045017 0.298234 0.151 0.88002 lchronicyes -0.163689 0.495085 -0.331 0.74093 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Theta = 1.6063 Number of iterations in BFGS optimization: 68 Log-likelihood: -2512 on 33 Df predznb<- predict(modelznb,newdata=test,type = "response") rmsemodelznb<-ModelMetrics::rmse(test$visits,round(predznb))
maemodelznb<-mae(test\$visits,round(predznb))
rmsemodelznb
[1] 0.7309579
maemodelznb
[1] 0.2753623

Finally let’s compare all the above models.

rmse<-c(rmsemodelp,rmsemodelqp,rmsemodelnb,rmsemodelhp,rmsemodelhnb,
rmsemodelzp,rmsemodelznb)
mae<-c(maemodelp,maemodelqp,maemodelnb,maemodelhp,maemodelhnb,
maemodelzp,maemodelznb)
models<-c("pois","q_pois","nb","h_pois","h_nb","zer_pois","zer_nb")

data.frame(models,rmse,mae)%>%
arrange(rmse)
models      rmse       mae
1   zer_nb 0.7309579 0.2753623
2   h_pois 0.7375374 0.2850242
3     pois 0.7381921 0.2840580
4   q_pois 0.7381921 0.2840580
5     h_nb 0.7408052 0.2879227
6 zer_pois 0.7485897 0.2898551
7       nb 0.7808085 0.2966184

Both metrics have chosen the zero inflated negative binomial model as the best model with minimum rmse value 0.7309579, and minimum mae value 0.2753623. this result is in line with the fact that this kind of models take care of the zero inflated data and at the same time the overdispersion problem.

## Conclusion:

If the data is truly follows Poisson distribution then all the the other models have extra parameters that, during training process, converges to the optimum parameter values for poisson, this relation is like the linear regression to the Generealized least squares. However, if the data is very skewed towards zero then it should be better to use the last two models to take care of this issue.

## Furhter reading:

To leave a comment for the author, please follow the link and comment on their blog: Modeling with R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts.(You will not see this message again.)

Click here to close (This popup will not appear again)