# Count data Models

**Modeling with R**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

## Introduction:

When we deal with data that has a response variable of integer type, using a linear regression may violate the normality assumption and hence all the classical statistic tests would fail to evaluate the model. However, as we do with logistic regression models, the generalized linear model GLM can be used instead here by specifying the suitable distribution.

The possible distributions for this type of data are the discrete distributions poisson and negative binomial. The former is the best choice if the mean and the variance of the response variable are closer to each other, if they are not however and we persist using this distribution we may cause the rise of the overdispersion problem of the residuals. As a solution thus, we can use the latter distribution that does not have this restriction.

There is another alternative if neither the poisson distribution nor the negative binomial are suitable called the Quasi maximum likelihood. The advantage of this method is that uses only the relationship between the mean and the variance and does not require any prespecified distribution. Moreover, its estimators are approximately as efficient as the maximum likelihood estimators.

## Data preparation

To well understand how to model the count data we are going be using **Doctorvisits** data from **AER** package, in which the variable **visits** will be our target variable, so let’s call this data with the packages that we need along this article.

```
ssh <- suppressPackageStartupMessages
ssh(library(performance))
ssh(library(ModelMetrics))
ssh(library(corrr))
ssh(library(purrr))
ssh(library(MASS))
ssh(library(tidyverse))
ssh(library(AER))
ssh(library(broom))
data("DoctorVisits")
doc <- DoctorVisits
glimpse(doc)
```

```
Observations: 5,190
Variables: 12
$ visits
``` 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, ...
$ gender female, female, male, male, male, female, female, female,...
$ age 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.1...
$ income 0.55, 0.45, 0.90, 0.15, 0.45, 0.35, 0.55, 0.15, 0.65, 0.1...
$ illness 1, 1, 3, 1, 2, 5, 4, 3, 2, 1, 1, 2, 3, 4, 3, 2, 1, 1, 1, ...
$ reduced 4, 2, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 13, 7, 1, 0, 0, 1, 0,...
$ health 1, 1, 0, 0, 1, 9, 2, 6, 5, 0, 0, 2, 1, 6, 0, 7, 5, 0, 0, ...
$ private yes, yes, no, no, no, no, no, no, yes, yes, no, no, no, n...
$ freepoor no, no, no, no, no, no, no, no, no, no, no, no, no, no, n...
$ freerepat no, no, no, no, no, no, no, no, no, no, no, yes, no, no, ...
$ nchronic no, no, no, no, yes, yes, no, no, no, no, no, no, yes, ye...
$ lchronic no, no, no, no, no, no, no, no, no, no, no, no, no, no, n...

This data from Australian health survey where **visits** is the number of doctor visits in past two weeks with 11 features listed above.

First we list the summary of the data to inspect any unwanted issue.

`summary(doc)`

```
visits gender age income
Min. :0.0000 male :2488 Min. :0.1900 Min. :0.0000
1st Qu.:0.0000 female:2702 1st Qu.:0.2200 1st Qu.:0.2500
Median :0.0000 Median :0.3200 Median :0.5500
Mean :0.3017 Mean :0.4064 Mean :0.5832
3rd Qu.:0.0000 3rd Qu.:0.6200 3rd Qu.:0.9000
Max. :9.0000 Max. :0.7200 Max. :1.5000
illness reduced health private freepoor
Min. :0.000 Min. : 0.0000 Min. : 0.000 no :2892 no :4968
1st Qu.:0.000 1st Qu.: 0.0000 1st Qu.: 0.000 yes:2298 yes: 222
Median :1.000 Median : 0.0000 Median : 0.000
Mean :1.432 Mean : 0.8619 Mean : 1.218
3rd Qu.:2.000 3rd Qu.: 0.0000 3rd Qu.: 2.000
Max. :5.000 Max. :14.0000 Max. :12.000
freerepat nchronic lchronic
no :4099 no :3098 no :4585
yes:1091 yes:2092 yes: 605
```

As we see we do not have missing values and the visits values ranges from 0 to 9 but it should be of integer type rather than double. Similarly, the variable **illness** should be converted to factor type since it has a few different values.

```
doc$visits<-as.integer(doc$visits)
doc$illness <- as.factor(doc$illness)
tab <- table(doc$visits)
tab
```

```
0 1 2 3 4 5 6 7 8 9
4141 782 174 30 24 9 12 12 5 1
```

The best thing we do to start analyzing the data is by displaying the **correlation coefficient** of each pair variables we have. Thus, any particular predictor that has high correlation with the target variable could be highly likely to be relevant in our future model. Notice that our target variable is not continuous hence we will use the spearman correlation. As required by **correlate** function from **corrr** package, all the variables must be of numeric type so we convert all the factor to integer.

`doc1 <-modify_if(doc, is.factor, as.integer)`

notice that we have stored the result in another object **doc1** to keep save our original data.

```
M <- correlate(doc1, method="spearman")
rplot(shave(M), colours=c("red", "white", "blue" ))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

Looking at this plot all the correlations has low values. however, these correlations assess only the monotonic relations, they say nothing about any other form of relation.

First let’s compare the empirical distribution of the variable **visits** and the theoretical poisson distribution with \(\lambda\) equals the visits mean `mean(doc$visits)`

, and the total number of observations is 5190.

```
pos <- dpois(0:9,0.302)*5190
both <- numeric(20)
both[1:20 %% 2 != 0] <- tab
both[1:20 %% 2 == 0] <- pos
labels<-character(20)
labels[1:20 %% 2==0]<-as.character(0:9)
barplot(both,col=rep(c("red","yellow"),10),names=labels)
```

As we see the two distributions are more or less closer to each other.

Let’s now check the negative binomial distribution by first estimate the clumping parameter \(k=\frac{\bar x^2}{s^2-\bar x}\).

```
k<-mean(doc$visits)^2/(var(doc$visits)-mean(doc$visits))
bin<-dnbinom(0:9,0.27,mu=0.302)*5190
both1<-numeric(20)
both1[1:20 %% 2 != 0]<-tab
both1[1:20 %% 2 == 0]<-bin
labels<-character(20)
labels[1:20 %% 2==0]<-as.character(0:9)
barplot(both1,col=rep(c("red","yellow"),10),names=labels)
```

With this distribution it seems that the empiricall distribution is more closer to the negative binomial than the poisson distribution.

**Note**: This data has very large number of zeros for the outcome compared to the other values which means that any trained model that does not take into account this anomaly will be biased to predict more likely the **zero** value. However, at the end of this article I will show two famous models to handel this type of count data called **Haurdle** model and **zero_inflated** model.

## Data partition

In oreder to evaluate our model we held out 20% of the data as testing set.

```
set.seed(123)
index<-sample(2,nrow(doc),replace = TRUE,p=c(.8,.2))
train<-doc[index==1,]
test<-doc[index==2,]
```

## Poisson model

This model belongs to the generalized linear model families, so in the function **glm** we set the argument **family** to poisson. In practice this model is sufficient with a wide range of count data.

```
set.seed(123)
model1<-glm(visits~., data=train, family ="poisson")
tidy(model1)
```

```
# A tibble: 16 x 5
term estimate std.error statistic p.value
```
1 (Intercept) -2.70 0.141 -19.2 9.14e- 82
2 genderfemale 0.193 0.0620 3.11 1.88e- 3
3 age 0.436 0.184 2.37 1.77e- 2
4 income -0.161 0.0928 -1.74 8.23e- 2
5 illness1 0.944 0.113 8.35 6.76e- 17
6 illness2 1.21 0.118 10.3 1.15e- 24
7 illness3 1.11 0.132 8.43 3.51e- 17
8 illness4 1.28 0.140 9.13 7.14e- 20
9 illness5 1.44 0.139 10.4 2.34e- 25
10 reduced 0.126 0.00560 22.6 6.85e-113
11 health 0.0348 0.0112 3.10 1.91e- 3
12 privateyes 0.111 0.0795 1.39 1.64e- 1
13 freepooryes -0.344 0.190 -1.81 7.00e- 2
14 freerepatyes 0.0377 0.104 0.363 7.16e- 1
15 nchronicyes 0.0186 0.0732 0.254 7.99e- 1
16 lchronicyes 0.0255 0.0916 0.279 7.81e- 1

As we see all the variables are significant except for the income so we remove this variable and reestimate again.

```
set.seed(123)
model1<-glm(visits~.-income, data=train, family ="poisson")
tidy(model1)
```

```
# A tibble: 15 x 5
term estimate std.error statistic p.value
```
1 (Intercept) -2.83 0.121 -23.4 7.36e-121
2 genderfemale 0.213 0.0609 3.51 4.56e- 4
3 age 0.479 0.183 2.62 8.91e- 3
4 illness1 0.946 0.113 8.38 5.44e- 17
5 illness2 1.21 0.118 10.3 8.29e- 25
6 illness3 1.12 0.132 8.50 1.93e- 17
7 illness4 1.28 0.140 9.17 4.71e- 20
8 illness5 1.45 0.139 10.5 1.05e- 25
9 reduced 0.126 0.00560 22.6 1.12e-112
10 health 0.0350 0.0112 3.11 1.84e- 3
11 privateyes 0.100 0.0793 1.27 2.06e- 1
12 freepooryes -0.290 0.188 -1.55 1.22e- 1
13 freerepatyes 0.0683 0.102 0.667 5.05e- 1
14 nchronicyes 0.0171 0.0731 0.235 8.15e- 1
15 lchronicyes 0.0282 0.0914 0.308 7.58e- 1

For the interpretation of the coefficient estimates, we should exponentiate these values to get the marginal effect since the poisson model uses the log link function to preclude negative values. For continuous predictor, say age, if this predictor increases by one year, ceteris-paribus, we expect the doctor visits will be \(exp(0.47876624)=1.614082\) times larger. whereas, for categorical predictor, say gender, the female has \(exp(0.21342446)=1.23791\) larger doctor visits than male.

By looking at p-values all the predictors are significant. However, we have to check other statistics and metrics.

`glance(model1)`

```
# A tibble: 1 x 7
null.deviance df.null logLik AIC BIC deviance df.residual
```
1 4565. 4154 -2685. 5399. 5494. 3486. 4140

since the deviance value **3485.905** is lower than the degrees of freedom **4140**, we will then worry about **overdispersion** problem.

Fortunateley, the **AER** package provides a super easy way to test the significance of this difference via the function **dispersiontest**.

`dispersiontest(model1)`

```
Overdispersion test
data: model1
z = 6.278, p-value = 1.714e-10
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion
1.397176
```

If our target variable really follows poisson distribution then its variance \(V\) should be approximately equal to its mean \(\mu\), which is the null hypothesis of the following **dispersiontest** test against the alternative hypothesis that the variance of the form:

\[V=\mu+\alpha.trafo(\mu)\]

Where the **trafo** is an hyperparameter that should be specified as an argument of this test. The popular choices for this argument are:

- trafo = NULL (default): \(V=(1+\alpha)\mu\)
- trafo = 1: \(V=\mu+\alpha.\mu\)
- trafo = 2: \(V=\mu+\alpha.\mu^2\)

For the first choice if true, then the data will be better modeled by quasi-poisson model than poisson model.

For the last ones if one of them is true then the negative binomial will be better than poisson model.

Now once the trafo is defined the test estimates \(\alpha\), such that:

- if \(\alpha = 0\) : equidispersion (The null hypothesis)
- if \(\alpha < 0\) : underdispersion
- if \(\alpha > 0\) : overdispersion

Therefore, the result of the test will depend on the direction of the test, where we have **two.sided**, **greater** (default) for the overdispersion, and **less** for underdispersion.

With this in mind the output of the above test (with the default values) tested the overdispersion against the quasi-poisson model, and since the p-value is very small **1.714e-10** then we have overdispersion problem, suggesting the use of quasi-poisson model instead.

Now let’s test the negative binomial now.

`dispersiontest(model1, trafo = 1)`

```
Overdispersion test
data: model1
z = 6.278, p-value = 1.714e-10
alternative hypothesis: true alpha is greater than 0
sample estimates:
alpha
0.3971763
```

The test suggested the use of negative binomial with linear function for the variance with very tiny p-value **1.714e-10**. This model is known as NB1 (with linear variance function).

`dispersiontest(model1, trafo = 2)`

```
Overdispersion test
data: model1
z = 7.4723, p-value = 3.939e-14
alternative hypothesis: true alpha is greater than 0
sample estimates:
alpha
0.95488
```

If the relation is in quadratic form then this model is called NB2. And since this p-value **3.939e-14** is smaller than the previous one then NB2 could be more appropriate than NB1.

## Quasi poisson model

The first test suggested the use of quasi-poisson model, so let’s train this model with the same predictors as the previous one.

```
set.seed(123)
model2<-glm(visits~.-income, data=train, family ="quasipoisson")
tidy(model2)
```

```
# A tibble: 15 x 5
term estimate std.error statistic p.value
```
1 (Intercept) -2.83 0.140 -20.2 1.25e-86
2 genderfemale 0.213 0.0705 3.03 2.47e- 3
3 age 0.479 0.212 2.26 2.39e- 2
4 illness1 0.946 0.131 7.24 5.36e-13
5 illness2 1.21 0.136 8.89 9.15e-19
6 illness3 1.12 0.152 7.34 2.49e-13
7 illness4 1.28 0.162 7.92 2.91e-15
8 illness5 1.45 0.160 9.06 2.01e-19
9 reduced 0.126 0.00647 19.5 4.79e-81
10 health 0.0350 0.0130 2.69 7.14e- 3
11 privateyes 0.100 0.0917 1.09 2.74e- 1
12 freepooryes -0.290 0.217 -1.34 1.81e- 1
13 freerepatyes 0.0683 0.119 0.576 5.64e- 1
14 nchronicyes 0.0171 0.0846 0.203 8.39e- 1
15 lchronicyes 0.0282 0.106 0.266 7.90e- 1

This model uses the quasi maximum likelihood which gives the same coefficient estimates but with different (corrected) standard errors.

Since here also all the variables are significant We see that the models are the same except the correction of the standard errors which are now more larger. In other words, the poisson distribution under overdispersion underestimates the standard errors and hence the **t test** would be biased towards the rejection of the null hypothesis.

To better understand what is going on with quasi-poisson model let’s put the estimates and the standard errors of both models into one table, and we add a column that resulted from dividing the second standard errors vector by the first one.

```
D1 <- tidy(model1)
colnames(D1) <- NULL
D2 <- tidy(model2)
colnames(D2) <- NULL
cbind(term=D1[,1], estimate1=D1[,2], std1=D1[,3],estimate2=D2[,2], std2=D2[,3]) %>%
mutate(dispersion= std2/std1)
```

```
term estimate1 std1 estimate2 std2 dispersion
1 (Intercept) -2.82868582 0.121003941 -2.82868582 0.140023371 1.15718
2 genderfemale 0.21342446 0.060883445 0.21342446 0.070453120 1.15718
3 age 0.47876624 0.183054884 0.47876624 0.211827495 1.15718
4 illness1 0.94643852 0.112984298 0.94643852 0.130743198 1.15718
5 illness2 1.21007304 0.117661029 1.21007304 0.136155018 1.15718
6 illness3 1.11941845 0.131726935 1.11941845 0.152431807 1.15718
7 illness4 1.28108306 0.139695671 1.28108306 0.161653071 1.15718
8 illness5 1.45198517 0.138529459 1.45198517 0.160303554 1.15718
9 reduced 0.12621548 0.005595111 0.12621548 0.006474552 1.15718
10 health 0.03497647 0.011228772 0.03497647 0.012993714 1.15718
11 privateyes 0.10033918 0.079266327 0.10033918 0.091725427 1.15718
12 freepooryes -0.29008562 0.187570838 -0.29008562 0.217053268 1.15718
13 freerepatyes 0.06829904 0.102422323 0.06829904 0.118521089 1.15718
14 nchronicyes 0.01714304 0.073082122 0.01714304 0.084569188 1.15718
15 lchronicyes 0.02818992 0.091425522 0.02818992 0.105795808 1.15718
```

**Note**: The first two columns are for the model1, and the last one are for the model 2.

Not surprisingly that the result of the last column is constant since this is exactely what the quasi maximum likelihood does, it computes the corrected standard errors from the original ones as follows \(std2=dispersion*std1\), with the dispersion value being estimated as **1.15718**. if you want to know where this value came from, the answer is simple. this model computes the sigma of the standardized residuals resulted from the original model. we can thus get this value by specifying the argument type to **pear** then computing sigma by hand as follows:

```
resid <- resid(model1, type = "pear")
sqrt(sum(resid^2)/4140)
```

`[1] 1.15718`

Now to test the prediction qualities of our models we use the testing set **test** by ploting the original and the predicted values.

Let’s start by the model1

```
pred<- predict.glm(model1,newdata=test[test$visits!=0,],type = "response")
plot(test$visits[test$visits!=0],type = "b",col="red")
lines(round(pred),col="blue")
```

If you noticed, and due to the large number of zero’s of the target variable, i have intentionally removed all theses values in order to get clearer plot.

From this plot we can say that the model does not fit well the data especially the larger values that are not well captured, however this may due to the fact that the data are very skewed towards zero.

To compare different models we can use the **root mean-square error** and **mean absolute error** (all the data with zero’s included).

**Note**: Here we are using the **rmse** function from **ModelMetrics** that expects the inpute to be two vectors, and not that with the same name from the **performance** package that expects the input to be a model object . To avoid thus any such ambiguity you should type this command `ModelMetrics::rmse`

.

```
pred <- predict.glm(model1, newdata = test, type = "response")
rmsemodelp <- ModelMetrics::rmse(test$visits,round(pred))
maemodelp <- mae(test$visits,round(pred))
rmsemodelp
```

`[1] 0.7381921`

`maemodelp`

`[1] 0.284058`

By the same way, Now let’s evaluate the quasi-poisson model.

```
predq<- predict.glm(model2,newdata=test[test$visits!=0,],type = "response")
plot(test$visits[test$visits!=0],type = "b",col="red")
lines(round(predq),col="blue")
```

This plot does not seem to be very different from the previous plot.

The rmse and mae for this model are computed as follows.

```
predq <- predict.glm(model2,newdata=test, type = "response")
rmsemodelqp <- ModelMetrics::rmse(test$visits,round(predq))
maemodelqp <- mae(test$visits,round(predq))
rmsemodelqp
```

`[1] 0.7381921`

`maemodelqp`

`[1] 0.284058`

we will not compare this two models until we finish with all the incoming models and we compare all the models at once.

## Negative binomial model

The negative binomial distribution is used as an alternative for the poisson distribution under overdispersion problem.

```
set.seed(123)
model3<-glm.nb(visits~.-income, data=train)
summary(model3)
```

```
Call:
glm.nb(formula = visits ~ . - income, data = train, init.theta = 0.9715923611,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8413 -0.6894 -0.5335 -0.3540 3.6726
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.894820 0.135760 -21.323 < 2e-16 ***
genderfemale 0.258968 0.075352 3.437 0.000589 ***
age 0.511867 0.230297 2.223 0.026240 *
illness1 0.880644 0.123264 7.144 9.04e-13 ***
illness2 1.171615 0.130240 8.996 < 2e-16 ***
illness3 1.118067 0.149032 7.502 6.28e-14 ***
illness4 1.263367 0.165370 7.640 2.18e-14 ***
illness5 1.378166 0.169907 8.111 5.01e-16 ***
reduced 0.141389 0.008184 17.275 < 2e-16 ***
health 0.041364 0.015029 2.752 0.005918 **
privateyes 0.086188 0.095173 0.906 0.365149
freepooryes -0.375471 0.223857 -1.677 0.093487 .
freerepatyes 0.144928 0.127751 1.134 0.256602
nchronicyes 0.022111 0.087590 0.252 0.800705
lchronicyes 0.091622 0.114965 0.797 0.425477
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(0.9716) family taken to be 1)
Null deviance: 3208.0 on 4154 degrees of freedom
Residual deviance: 2431.2 on 4140 degrees of freedom
AIC: 5159.6
Number of Fisher Scoring iterations: 1
Theta: 0.972
Std. Err.: 0.103
2 x log-likelihood: -5127.587
```

As before we visualize the performance of this model as follows.

```
prednb<- predict.glm(model3,newdata=test[test$visits!=0,],type = "response")
plot(test$visits[test$visits!=0],type = "b",col="red")
lines(round(prednb),col="blue")
```

Again this plot also seems to be the same as the previous ones, so to figure out which model is best we use statistic metrics.

```
prednb<- predict.glm(model3,newdata=test,type = "response")
rmsemodelnb<-ModelMetrics::rmse(test$visits,round(prednb))
maemodelnb<-mae(test$visits,round(prednb))
rmsemodelnb
```

`[1] 0.7808085`

`maemodelnb`

`[1] 0.2966184`

we will use these ouputs further.

## Hurdle model

Originally proposed by Mullahy (1986) this model can take into account the fact that the data has more zeros and also can handle the overdispersion problem. It has two components (or steps), truncated count component defined by the chosen discrete distribution such as poisson or negative binomial, and a hurdle components models zero vs larger counts (that uses censored count distribution or binomial model). In other words, this models asumes that two population distributions underlying the data, one distribution for zero values, and another different distribution the psotive values. For more detail about hurdle and zero inflated models click here

To perform this model we make use of the function **hurdle** from the package **pscl**.

### hurdle model with poisson distribution.

This model works in two steps. In the first step it uses binary classification to discriminate between the zero values and the positive values, and in the second step uses the traditional (poisson or binomial model, and here we use poisson model) model for positive values.

```
library(pscl)
set.seed(123)
modelhp<-hurdle(visits~. -income, data=train,dist = "poisson")
summary(modelhp)
```

```
Call:
hurdle(formula = visits ~ . - income, data = train, dist = "poisson")
Pearson residuals:
Min 1Q Median 3Q Max
-1.5464 -0.4686 -0.3306 -0.2075 11.0887
Count model coefficients (truncated poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.977535 0.261835 -3.733 0.000189 ***
genderfemale 0.073326 0.098034 0.748 0.454480
age 0.032762 0.287991 0.114 0.909427
illness1 0.370071 0.251920 1.469 0.141833
illness2 0.403514 0.256363 1.574 0.115489
illness3 0.201724 0.278757 0.724 0.469277
illness4 0.420285 0.277573 1.514 0.129990
illness5 0.762209 0.269809 2.825 0.004728 **
reduced 0.111640 0.007967 14.013 < 2e-16 ***
health 0.007682 0.016452 0.467 0.640554
privateyes -0.215649 0.129860 -1.661 0.096787 .
freepooryes 0.066277 0.269699 0.246 0.805879
freerepatyes -0.434941 0.166196 -2.617 0.008870 **
nchronicyes 0.109660 0.125380 0.875 0.381779
lchronicyes 0.135612 0.142766 0.950 0.342166
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.224621 0.156469 -20.609 < 2e-16 ***
genderfemale 0.305324 0.089648 3.406 0.000660 ***
age 0.700345 0.276089 2.537 0.011191 *
illness1 0.885148 0.136951 6.463 1.02e-10 ***
illness2 1.238227 0.146059 8.478 < 2e-16 ***
illness3 1.263698 0.169344 7.462 8.50e-14 ***
illness4 1.405167 0.195388 7.192 6.40e-13 ***
illness5 1.445585 0.208425 6.936 4.04e-12 ***
reduced 0.154858 0.013488 11.481 < 2e-16 ***
health 0.070464 0.019142 3.681 0.000232 ***
privateyes 0.271192 0.112751 2.405 0.016163 *
freepooryes -0.546177 0.277942 -1.965 0.049406 *
freerepatyes 0.423220 0.153994 2.748 0.005991 **
nchronicyes -0.006256 0.102033 -0.061 0.951106
lchronicyes 0.070658 0.140587 0.503 0.615251
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 22
Log-likelihood: -2581 on 30 Df
```

As we see this output has two tables. The above one is for the poisson model performed only on the truncated positive values, and the below one is the result of the logistic regression with only two classes (zero or positive value)

As we did before we plot the results.

```
predhp<- predict(modelhp,newdata=test[test$visits!=0,],type = "response")
plot(test$visits[test$visits!=0],type = "b",col="red")
lines(round(predhp),col="blue")
```

As before by only looking at the plot we can not decide which model is the best. So it is better to use the statistic metrics.

```
predhp<- predict(modelhp,newdata=test, type = "response")
rmsemodelhp<-ModelMetrics::rmse(test$visits,round(predhp))
maemodelhp<-mae(test$visits,round(predhp))
rmsemodelhp
```

`[1] 0.7375374`

`maemodelhp`

`[1] 0.2850242`

### hurdle model with negative binomial distribution.

Now let’s try to use the negative binomial instead of poisson distribution.

```
set.seed(123)
modelhnb<-hurdle(visits~.-income, data=train,dist = "negbin")
summary(modelhnb)
```

```
Call:
hurdle(formula = visits ~ . - income, data = train, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-0.9078 -0.4515 -0.3201 -0.2022 10.6552
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.68462 2.65037 -1.390 0.1645
genderfemale 0.07432 0.17299 0.430 0.6675
age 0.26774 0.54001 0.496 0.6200
illness1 0.27678 0.34564 0.801 0.4233
illness2 0.37093 0.35241 1.053 0.2925
illness3 0.04728 0.39747 0.119 0.9053
illness4 0.40386 0.40517 0.997 0.3189
illness5 0.68213 0.41357 1.649 0.0991 .
reduced 0.15813 0.01935 8.171 3.05e-16 ***
health 0.01891 0.03291 0.575 0.5656
privateyes -0.45711 0.23118 -1.977 0.0480 *
freepooryes 0.03334 0.55282 0.060 0.9519
freerepatyes -0.59189 0.30437 -1.945 0.0518 .
nchronicyes 0.08737 0.21061 0.415 0.6783
lchronicyes 0.30274 0.25846 1.171 0.2415
Log(theta) -2.80552 2.80120 -1.002 0.3166
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.224621 0.156469 -20.609 < 2e-16 ***
genderfemale 0.305324 0.089648 3.406 0.000660 ***
age 0.700345 0.276089 2.537 0.011191 *
illness1 0.885148 0.136951 6.463 1.02e-10 ***
illness2 1.238227 0.146059 8.478 < 2e-16 ***
illness3 1.263698 0.169344 7.462 8.50e-14 ***
illness4 1.405167 0.195388 7.192 6.40e-13 ***
illness5 1.445585 0.208425 6.936 4.04e-12 ***
reduced 0.154858 0.013488 11.481 < 2e-16 ***
health 0.070464 0.019142 3.681 0.000232 ***
privateyes 0.271192 0.112751 2.405 0.016163 *
freepooryes -0.546177 0.277942 -1.965 0.049406 *
freerepatyes 0.423220 0.153994 2.748 0.005991 **
nchronicyes -0.006256 0.102033 -0.061 0.951106
lchronicyes 0.070658 0.140587 0.503 0.615251
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 0.0605
Number of iterations in BFGS optimization: 31
Log-likelihood: -2524 on 31 Df
```

And let’s plot the difference between the predicted and the actual values of the testing set. .

```
predhnb<- predict(modelhnb,newdata=test[test$visits!=0,],type = "response")
plot(test$visits[test$visits!=0],type = "b",col="red")
lines(round(predhnb),col="blue")
```

And for the metrics.

```
predhnb<- predict(modelhnb,newdata=test,type = "response")
rmsemodelhnb<-ModelMetrics::rmse(test$visits,round(predhnb))
maemodelhnb<-mae(test$visits,round(predhnb))
rmsemodelhnb
```

`[1] 0.7408052`

`maemodelhnb`

`[1] 0.2879227`

## Zero inflated model

Such as the previous model type , this model also combines two components but with the difference that this model performs a mixture of binomial distribution (between zero and positive values) and the poisson (or negative binomial) distribution for the rest of the values (with the zero included).

### Zero inflated model with poisson distribution

Here also we fit tow models one with poisson and one with negative binomial

```
set.seed(123)
modelzp<-zeroinfl(visits~.-income, data=train,dist = "poisson")
summary(modelzp)
```

```
Call:
zeroinfl(formula = visits ~ . - income, data = train, dist = "poisson")
Pearson residuals:
Min 1Q Median 3Q Max
-1.6247 -0.4791 -0.3326 -0.1783 12.3448
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.86466 0.23608 -3.663 0.00025 ***
genderfemale 0.03280 0.09078 0.361 0.71788
age 0.13330 0.26922 0.495 0.62050
illness1 0.32985 0.21846 1.510 0.13107
illness2 0.34799 0.22426 1.552 0.12072
illness3 0.20399 0.24152 0.845 0.39833
illness4 0.44019 0.24324 1.810 0.07034 .
illness5 0.72462 0.23632 3.066 0.00217 **
reduced 0.09679 0.00809 11.964 < 2e-16 ***
health 0.02269 0.01609 1.410 0.15855
privateyes -0.26390 0.12796 -2.062 0.03918 *
freepooryes 0.04860 0.27675 0.176 0.86060
freerepatyes -0.51895 0.17070 -3.040 0.00236 **
nchronicyes 0.08577 0.11489 0.747 0.45536
lchronicyes 0.10876 0.12745 0.853 0.39347
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.44814 0.34932 7.008 2.41e-12 ***
genderfemale -0.48766 0.19038 -2.561 0.010422 *
age -0.88817 0.59030 -1.505 0.132426
illness1 -0.80835 0.31247 -2.587 0.009683 **
illness2 -1.41462 0.35338 -4.003 6.25e-05 ***
illness3 -1.69205 0.44028 -3.843 0.000121 ***
illness4 -1.52225 0.46334 -3.285 0.001018 **
illness5 -1.08745 0.46493 -2.339 0.019338 *
reduced -0.14462 0.03861 -3.746 0.000180 ***
health -0.05795 0.04486 -1.292 0.196418
privateyes -0.73945 0.22597 -3.272 0.001066 **
freepooryes 0.73370 0.41402 1.772 0.076372 .
freerepatyes -1.75457 0.53937 -3.253 0.001142 **
nchronicyes 0.13230 0.22623 0.585 0.558678
lchronicyes 0.03646 0.30619 0.119 0.905217
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 80
Log-likelihood: -2579 on 30 Df
```

```
predzp<- predict(modelzp,newdata=test[test$visits!=0,],type = "response")
plot(test$visits[test$visits!=0],type = "b",col="red")
lines(round(predzp),col="blue")
```

```
predzp<- predict(modelzp,newdata=test,type = "response")
rmsemodelzp<-ModelMetrics::rmse(test$visits,round(predzp))
maemodelzp<-mae(test$visits,round(predzp))
rmsemodelzp
```

`[1] 0.7485897`

`maemodelzp`

`[1] 0.2898551`

### Zero inflated model with negative binomial distribution

Let’s this time try the negative binomial distribution.

```
set.seed(123)
modelznb<-zeroinfl(visits~., data=train,dist = "negbin")
summary(modelznb)
```

```
Call:
zeroinfl(formula = visits ~ ., data = train, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-1.0440 -0.4582 -0.3031 -0.1680 14.2061
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.383088 0.241780 -5.720 1.06e-08 ***
genderfemale 0.038982 0.090534 0.431 0.666775
age 0.028836 0.290843 0.099 0.921022
income -0.196594 0.147560 -1.332 0.182763
illness1 0.613261 0.174653 3.511 0.000446 ***
illness2 0.692295 0.179665 3.853 0.000117 ***
illness3 0.664614 0.196063 3.390 0.000699 ***
illness4 0.760167 0.204138 3.724 0.000196 ***
illness5 0.944761 0.206098 4.584 4.56e-06 ***
reduced 0.102651 0.008776 11.697 < 2e-16 ***
health 0.044012 0.015536 2.833 0.004612 **
privateyes -0.168877 0.138680 -1.218 0.223321
freepooryes -0.422737 0.306654 -1.379 0.168034
freerepatyes -0.383568 0.163995 -2.339 0.019341 *
nchronicyes 0.033375 0.107881 0.309 0.757040
lchronicyes 0.065839 0.128987 0.510 0.609748
Log(theta) 0.473940 0.142626 3.323 0.000891 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.280216 0.532170 4.285 1.83e-05 ***
genderfemale -0.726916 0.275312 -2.640 0.00828 **
age -2.002886 0.920168 -2.177 0.02951 *
income -0.180281 0.393267 -0.458 0.64665
illness1 -0.332734 0.347965 -0.956 0.33896
illness2 -1.111993 0.449603 -2.473 0.01339 *
illness3 -0.953342 0.512702 -1.859 0.06296 .
illness4 -1.551433 0.739821 -2.097 0.03599 *
illness5 -1.229820 0.859693 -1.431 0.15256
reduced -1.298180 0.457644 -2.837 0.00456 **
health -0.001444 0.055085 -0.026 0.97909
privateyes -0.817933 0.317762 -2.574 0.01005 *
freepooryes 0.239439 0.664835 0.360 0.71874
freerepatyes -13.563347 519.415149 -0.026 0.97917
nchronicyes 0.045017 0.298234 0.151 0.88002
lchronicyes -0.163689 0.495085 -0.331 0.74093
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta = 1.6063
Number of iterations in BFGS optimization: 68
Log-likelihood: -2512 on 33 Df
```

```
predznb<- predict(modelznb,newdata=test,type = "response")
rmsemodelznb<-ModelMetrics::rmse(test$visits,round(predznb))
maemodelznb<-mae(test$visits,round(predznb))
rmsemodelznb
```

`[1] 0.7309579`

`maemodelznb`

`[1] 0.2753623`

Finally let’s compare all the above models.

```
rmse<-c(rmsemodelp,rmsemodelqp,rmsemodelnb,rmsemodelhp,rmsemodelhnb,
rmsemodelzp,rmsemodelznb)
mae<-c(maemodelp,maemodelqp,maemodelnb,maemodelhp,maemodelhnb,
maemodelzp,maemodelznb)
models<-c("pois","q_pois","nb","h_pois","h_nb","zer_pois","zer_nb")
data.frame(models,rmse,mae)%>%
arrange(rmse)
```

```
models rmse mae
1 zer_nb 0.7309579 0.2753623
2 h_pois 0.7375374 0.2850242
3 pois 0.7381921 0.2840580
4 q_pois 0.7381921 0.2840580
5 h_nb 0.7408052 0.2879227
6 zer_pois 0.7485897 0.2898551
7 nb 0.7808085 0.2966184
```

Both metrics have chosen the zero inflated negative binomial model as the best model with minimum rmse value **0.7309579**, and minimum mae value **0.2753623**. this result is in line with the fact that this kind of models take care of the zero inflated data and at the same time the overdispersion problem.

## Conclusion:

If the data is truly follows Poisson distribution then all the the other models have extra parameters that, during training process, converges to the optimum parameter values for poisson, this relation is like the linear regression to the Generealized least squares. However, if the data is very skewed towards zero then it should be better to use the last two models to take care of this issue.

## Furhter reading:

- Michael J. Crawley, The R book, WILEY, UK, 2013. http://www.bio.ic.ac.uk/research/mjcraw/therbook/index.htm

**leave a comment**for the author, please follow the link and comment on their blog:

**Modeling with R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.