More Flexible Approaches to Model Frequency

statcompute

6 years ago

[This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers].

(The post below is motivated by my friend Matt Flynn https://www.linkedin.com/in/matthew-flynn-1b443b11)

In the context of operational loss forecast models, the standard Poisson regression is the most popular way to model frequency measures. Conceptually, the standard Poisson regression rests on a restrictive assumption, namely Equi-Dispersion, which requires the conditional mean to equal the conditional variance such that E(Y|X) = var(Y|X). In real-world frequency outcomes, however, this assumption rarely holds. Instead, empirical data often show either excessive variance, namely Over-Dispersion, or insufficient variance, namely Under-Dispersion. Applying a standard Poisson regression to over-dispersed data leads to understated standard errors of the parameter estimates and therefore inflated t-statistics.
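
A quick way to see whether Equi-Dispersion is plausible is to fit a Poisson model and look at the Pearson dispersion statistic, which should be close to 1 when the assumption holds. The sketch below uses simulated data rather than any data from this post; the variable names and the simulation settings are assumptions chosen purely for illustration.

```r
# Minimal sketch: detect Over-Dispersion after a standard Poisson fit.
# The data are simulated (hypothetical), not taken from the post.
set.seed(2024)
n  <- 2000
x  <- rnorm(n)
mu <- exp(0.5 + 0.3 * x)

# Negative Binomial draws with size = 1, so var(Y|X) = mu + mu^2 > mu
y <- rnbinom(n, mu = mu, size = 1)

fit <- glm(y ~ x, family = poisson)

# Pearson chi-square divided by residual degrees of freedom;
# values well above 1 point to Over-Dispersion, well below 1 to Under-Dispersion
pearson_dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(pearson_dispersion)
```

With data generated this way, the statistic should come out clearly above 1, which is the situation where Poisson standard errors are too small.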

In cases of Over-Dispersion, the Negative Binomial (NB) regression has been the most common alternative to the standard Poisson regression, since it includes a dispersion parameter to accommodate the excess variance in the data. In the NB formulation, the variance is expressed as a quadratic function of the conditional mean, so the variance is guaranteed to be higher than the conditional mean. For the same reason, NB regression cannot represent Under-Dispersion and is not flexible enough to cover both cases. Therefore, more general approaches are called for.
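
For reference, the quadratic variance function mentioned above can be written out as follows. This assumes the common NB2 parameterization, which the post does not name explicitly; the symbols \mu and \alpha are introduced here for illustration.

```latex
\operatorname{Var}(Y \mid X) = \mu + \alpha\,\mu^{2},
\qquad \mu = \operatorname{E}(Y \mid X), \quad \alpha > 0
```

Because \alpha\,\mu^{2} is strictly positive, the variance always exceeds the mean, and the Poisson case is only approached in the limit as \alpha goes to 0.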

Two additional frequency modeling methods, Quasi-Poisson (QP) regression and Conway-Maxwell Poisson (CMP) regression, are discussed here. In the case of Quasi-Poisson, E(Y) = λ and var(Y) = θ • λ. A value of θ > 1 corresponds to Over-Dispersion, and θ < 1 to Under-Dispersion. Since QP regression is estimated with QMLE, likelihood-based statistics, such as AIC and BIC, are not available; quasi-AIC and quasi-BIC are used instead. In the case of Conway-Maxwell Poisson, E(Y) ≈ λ^(1/ν) – (ν – 1) / (2 • ν) and var(Y) ≈ (1/ν) • λ^(1/ν), where both expressions are approximations and λ no longer represents the conditional mean but acts as a location parameter. A value of ν < 1 models a long-tailed distribution, reflected as Over-Dispersion, while ν > 1 models a short-tailed distribution, reflected as Under-Dispersion. Since CMP regression is estimated with MLE, likelihood-based statistics, such as AIC and BIC, are available, but at a high computing cost.
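
To make the role of the CMP dispersion parameter concrete, the sketch below compares the approximate mean and variance quoted above with values computed directly from the CMP probability mass function, P(Y = y) proportional to λ^y / (y!)^ν. It uses base R only; the helper names (cmp_pmf, cmp_moments, cmp_approx) and the truncation point y_max are hypothetical choices for illustration and are not part of any package.

```r
# CMP probability mass function on 0..y_max, normalised by a truncated sum
cmp_pmf <- function(lambda, nu, y_max = 1000) {
  y     <- 0:y_max
  log_w <- y * log(lambda) - nu * lgamma(y + 1)  # log of unnormalised weights
  w     <- exp(log_w - max(log_w))               # subtract max for numerical stability
  w / sum(w)
}

# Mean and variance computed directly from the truncated pmf
cmp_moments <- function(lambda, nu, y_max = 1000) {
  y <- 0:y_max
  p <- cmp_pmf(lambda, nu, y_max)
  m <- sum(y * p)
  c(mean = m, var = sum((y - m)^2 * p))
}

# Closed-form approximations quoted in the text
cmp_approx <- function(lambda, nu) {
  c(mean = lambda^(1 / nu) - (nu - 1) / (2 * nu),
    var  = lambda^(1 / nu) / nu)
}

for (nu in c(0.5, 1, 2)) {
  lambda <- 10^nu  # chosen so that lambda^(1/nu) = 10 in every case
  cat("nu =", nu, "\n")
  print(rbind(direct = cmp_moments(lambda, nu), approx = cmp_approx(lambda, nu)))
}
```

With ν = 1 the distribution reduces to the Poisson, so mean and variance coincide; with ν = 0.5 the variance exceeds the mean, and with ν = 2 it falls below it.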

The code below demonstrates how to estimate QP and CMP regressions with R and compares their computing times. If the modeling purpose is mainly prediction, without a focus on statistical inference, QP regression would be an excellent choice for most practitioners. Otherwise, CMP regression is an elegant model that addresses various levels of dispersion parsimoniously.

# data source: www.jstatsoft.org/article/view/v027i08
load("../Downloads/DebTrivedi.rda")

library(rbenchmark)  # benchmark()
library(CompGLM)     # glm.comp()

# Time three replications of each fit. The CMP fit is started from the
# quasi-Poisson coefficients, so m1 has to be estimated first.
benchmark(replications = 3, order = "user.self",
  quasi.poisson = {
    m1 <- glm(ofp ~ health + hosp + numchron + privins + school + gender + medicaid, data = DebTrivedi, family = "quasipoisson")
  },
  conway.maxwell = {
    m2 <- glm.comp(ofp ~ health + hosp + numchron + privins + school + gender + medicaid, data = DebTrivedi, lamStart = m1$coefficients)
  }
)
#             test replications elapsed relative user.self sys.self user.child
# 1  quasi.poisson            3   0.084    1.000     0.084    0.000          0
# 2 conway.maxwell            3  42.466  505.548    42.316    0.048          0

summary(m1)
summary(m2)

Quasi-Poisson Regression

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      0.886462   0.069644  12.729  < 2e-16 ***
healthpoor       0.235673   0.046284   5.092 3.69e-07 ***
healthexcellent -0.360188   0.078441  -4.592 4.52e-06 ***
hosp             0.163246   0.015594  10.468  < 2e-16 ***
numchron         0.144652   0.011894  12.162  < 2e-16 ***
privinsyes       0.304691   0.049879   6.109 1.09e-09 ***
school           0.028953   0.004812   6.016 1.93e-09 ***
gendermale      -0.092460   0.033830  -2.733   0.0063 **
medicaidyes      0.297689   0.063787   4.667 3.15e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 6.697556)

    Null deviance: 26943  on 4405  degrees of freedom
Residual deviance: 23027  on 4397  degrees of freedom
AIC: NA
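
The AIC: NA line above reflects that the quasi-Poisson family has no proper likelihood. As a rough sketch of how a quasi-AIC could be computed instead, the example below takes the log-likelihood from an ordinary Poisson fit and scales it by the estimated dispersion. It runs on simulated data, not on DebTrivedi, and the convention of counting the dispersion as one extra parameter is an assumption; other conventions exist.

```r
# Minimal sketch of a quasi-AIC on simulated (hypothetical) data
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rnbinom(n, mu = exp(0.5 + 0.3 * x), size = 1)  # over-dispersed counts

fit_p <- glm(y ~ x, family = poisson)       # supplies the log-likelihood
fit_q <- glm(y ~ x, family = quasipoisson)  # supplies the dispersion estimate

c_hat <- summary(fit_q)$dispersion          # estimated dispersion (theta)
k     <- length(coef(fit_p)) + 1            # coefficients plus one for c_hat (assumed convention)

# QAIC = -2 * logLik / c_hat + 2 * k
qaic <- -2 * as.numeric(logLik(fit_p)) / c_hat + 2 * k

print(AIC(fit_q))  # NA, as in the summary output
print(qaic)
```

The value is only meaningful for comparing candidate models that share the same dispersion estimate, typically the one from the largest model under consideration.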

Conway-Maxwell Poisson Regression

Beta:
                   Estimate   Std.Error  t.value p.value
(Intercept)     -0.23385559  0.16398319  -1.4261 0.15391
healthpoor       0.03226830  0.01325437   2.4345 0.01495 *
healthexcellent -0.08361733  0.00687228 -12.1673 < 2e-16 ***
hosp             0.01743416  0.01500555   1.1618 0.24536
numchron         0.02186788  0.00209274  10.4494 < 2e-16 ***
privinsyes       0.05193645  0.00184446  28.1581 < 2e-16 ***
school           0.00490214  0.00805940   0.6083 0.54305
gendermale      -0.01485663  0.00076861 -19.3292 < 2e-16 ***
medicaidyes      0.04861617  0.00535814   9.0733 < 2e-16 ***

Zeta:
              Estimate  Std.Error t.value   p.value
(Intercept) -3.4642316  0.0093853 -369.11 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

AIC: 24467.13
Log-Likelihood: -12223.56
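
To read the Zeta block, the sketch below converts the reported intercept into the dispersion parameter ν. It assumes that glm.comp models ν on the log scale, so that ν = exp(zeta) in an intercept-only specification; that link is an assumption to verify against the CompGLM documentation, and the variable names are illustrative.

```r
# Zeta intercept copied from the CMP output above
zeta_hat <- -3.4642316

# Assumed log link for the dispersion parameter: nu = exp(zeta)
nu_hat <- exp(zeta_hat)
print(nu_hat)

# nu < 1 indicates a long-tailed, over-dispersed count distribution
print(nu_hat < 1)
```

Under that assumption ν is roughly 0.03, far below 1, which points to strong Over-Dispersion and agrees in direction with the quasi-Poisson dispersion estimate of about 6.7 reported earlier.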
