Model Non-Negative Numeric Outcomes with Zeros

September 17, 2017
By

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

As mentioned in the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm/), we often need to model non-negative numeric outcomes with zeros in the operational loss model development. Tweedie GLM provides a convenient interface to model non-negative losses directly by assuming that aggregated losses are the Poisson sum of Gamma outcomes, which however might not be well supported empirically from the data generation standpoint.

In examples below, we demonstrated another flexible option, namely Zero-Adjusted (ZA) models, in both scenarios of modeling non-negative numeric outcomes, one with a small number of zeros and the other with a large number of zeros. The basic idea of ZA models is very intuitive and similar to the concept of Hurdle models for count outcomes. In a nutshell, non-negative numeric outcomes can be considered two data generation processes, one for point-mass at zeros and the other governed by a statistical distribution for positive outcomes. The latter could be either Gamma or Inverse Gaussian.

First of all, we sampled down an auto-claim data in a way that only 10 claims are zeros and the rest are all positive. While 10 is an arbitrary choice in the example, other small numbers should show similar results.

pkgs <- list("cplm", "gamlss", "MLmetrics")
lapply(pkgs, require, character.only = T)

data(AutoClaim, package = "cplm")
df1 <- na.omit(AutoClaim)

# SMALL NUMBER OF ZEROS
set.seed(2017)
smp <- sample(seq(nrow(df1[df1$CLM_AMT == 0, ])), size = 10, replace = FALSE)
df2 <- rbind(df1[df1$CLM_AMT > 0, ], df1[df1$CLM_AMT == 0, ][smp, ])

Next, we applied both Tweedie and zero-adjusted Gamma (ZAGA) models to the data with only 10 zero outcomes. It is worth mentioning that ZAGA doesn’t have to be overly complex in this case. As shown below, while we estimated the Gamma Mu parameter with model attributes, the Nu parameter to separate zeros is just a constant with the intercept = -5.4. Both Tweedie and GAZA models gave very similar estimated parameters and predictive measures with MAPE = 0.61.

tw <- cpglm(CLM_AMT ~ BLUEBOOK + NPOLICY, data = df2)
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 8.194e+00  7.234e-02 113.277  < 2e-16 ***
# BLUEBOOK    2.047e-05  3.068e-06   6.671 3.21e-11 ***
# NPOLICY     7.274e-02  3.102e-02   2.345   0.0191 *  

MAPE(df2$CLM_AMT, fitted(tw))
# 0.6053669

zaga0 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, data = df2, family = "ZAGA")
# Mu Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 8.203e+00  4.671e-02 175.629  < 2e-16 ***
# BLUEBOOK    2.053e-05  2.090e-06   9.821  < 2e-16 ***
# NPOLICY     6.948e-02  2.057e-02   3.377 0.000746 ***
# Nu Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  -5.3886     0.3169     -17   <2e-16 ***

MAPE(df2$CLM_AMT, (1 - fitted(zaga0, what = "nu")) * fitted(zaga0, what = "mu"))
# 0.6053314

In the next case, we used the full data with a large number of zeros in the response and then applied both Tweedie and ZAGA models again. However, in ZAGA model, we estimated two sub-models this time, one for the Nu parameter to separate zeros from non-zeros and the other for the Mu parameter to model non-zero outcomes. As shown below, ZAGA outperformed Tweedie in terms of MAPE due to the advantage that ZAGA is able to explain two data generation schemes separately with different model attributes, which is the capability beyond what Tweedie can provide.

# LARGE NUMBER OF ZEROS
tw <- cpglm(CLM_AMT ~ BLUEBOOK + NPOLICY + CLM_FREQ5 + MVR_PTS + INCOME, data = df1)
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  6.854e+00  1.067e-01  64.241  < 2e-16 ***
# BLUEBOOK     1.332e-05  4.495e-06   2.963  0.00305 ** 
# NPOLICY      4.380e-02  3.664e-02   1.195  0.23196    
# CLM_FREQ5    2.064e-01  2.937e-02   7.026 2.29e-12 ***
# MVR_PTS      1.066e-01  1.510e-02   7.063 1.76e-12 ***
# INCOME      -4.606e-06  8.612e-07  -5.348 9.12e-08 ***

MAPE(df1$CLM_AMT, fitted(tw))
# 1.484484

zaga1 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, nu.formula = ~(CLM_FREQ5 + MVR_PTS + INCOME), data = df1, family = "ZAGA")
# Mu Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 8.203e+00  4.682e-02 175.218  < 2e-16 ***
# BLUEBOOK    2.053e-05  2.091e-06   9.816  < 2e-16 ***
# NPOLICY     6.948e-02  2.067e-02   3.362 0.000778 ***
# Nu Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  1.153e+00  5.077e-02   22.72   <2e-16 ***
# CLM_FREQ5   -3.028e-01  2.283e-02  -13.26   <2e-16 ***
# MVR_PTS     -1.509e-01  1.217e-02  -12.41   <2e-16 ***
# INCOME       7.285e-06  6.269e-07   11.62   <2e-16 ***

MAPE(df1$CLM_AMT, (1 - fitted(zaga1, what = "nu")) * fitted(zaga1, what = "mu"))
# 1.470228

Given the great flexibility of ZA models, we also have the luxury to explore other candidates than ZAGA. For instance, if the positive part of non-negative outcomes demonstrates a high variance, we can also try a zero-inflated Inverse Gaussian (ZAIG) model, which turned out slightly better than ZAGA.

zaig1 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, nu.formula = ~(CLM_FREQ5 + MVR_PTS + INCOME), data = df1, family = "ZAIG")
# Mu Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 8.205e+00  5.836e-02 140.591  < 2e-16 ***
# BLUEBOOK    2.163e-05  2.976e-06   7.268 3.97e-13 ***
# NPOLICY     5.898e-02  2.681e-02   2.200   0.0278 *  
# Nu Coefficients:
#               Estimate Std. Error t value Pr(>|t|)
# (Intercept)  1.153e+00  5.077e-02   22.72   <2e-16 ***
# CLM_FREQ5   -3.028e-01  2.283e-02  -13.26   <2e-16 ***
# MVR_PTS     -1.509e-01  1.217e-02  -12.41   <2e-16 ***
# INCOME       7.285e-06  6.269e-07   11.62   <2e-16 ***

MAPE(df1$CLM_AMT, (1 - fitted(zaig1, what = "nu")) * fitted(zaig1, what = "mu"))
# 1.469236

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)