GAMs and scams: Part 1

[This article was first published on R – Data Science, Insurance, Bikes, and the Meaning of Life, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

People who do statistical modeling for insurance applications usually know their way around a GLM pretty well. In pricing applications, GLMs can produce a reasonable model to serve as the basis of a rating plan, but in my experience they are usually followed by a round of “selections” – a process to incorporate business considerations and adjust the “indicated” rate relativities to arrive at those which will be implemented (and filed with regulators, if required). Selections can be driven by various constraints and considerations external to the data such as:

  • What is in the market currently?
  • What levels of rate disruption to the existing book are acceptable?
  • Would indicated rating relativities produce unaffordable rates for any customer segments?
  • Do rating differences between customer segments rise to the level of being “unfairly discriminatory”?
  • What will a regulator approve, or conversely, object to?
  • What are the IT/systems implementation constraints?
    • Deploying models as containerized prediction APIs would eliminate such a question… but instead the prediction formulas are usually written in a stored procedure with SQL tables to hold parameter values
  • Are the proposed rating factors “intuitive”, or is there a believable causal relationship to the predicted losses?

Many or all of these questions will always need consideration post-modeling since they are not driven by shortcomings in the modeling approach or data quality. Often other questions also motivate adjustments to the modeling, such as:

  • Will the proposed factors impact the customer experience negatively?
    • For example, will there be “reversals” such that rates increase, then decrease, then increase again as a policy renews?
  • Do we have sufficient experience to give high confidence in our parameter estimates?
  • Where is our experience thin?
  • Are we extrapolating wildly beyond the range of our experience, producing rates that are too high or low in the tails?

These considerations are of a different sort – they could be avoided in whole or in part with better modeling techniques.

Enter Generalized Additive Models (GAMs) as a more flexible approach than the vanilla GLM. I'm a big fan of using GAMs for most things where a GLM may be the go-to – in both pricing and non-pricing (operational, underwriting, marketing, etc.) applications. GAMs are just an extension of the GLM and as such are accesible to actuaries and others who are already familiar with that framework.

GAM Prediction Formula

From the prediction perspective, the output of a GLM is an estimate of the expected value (mean) for an observation of the form

\[
\mu_i = g^{-1}(\eta) = g^{-1}(\beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + … + \beta_p x_{i,p})
\]
\(g^-1\) is the inverse of a monotonic link function, and the linear predictor (\(\eta\)) is a linear combination of covariates and estimated coefficients (aka weights, parameters). This is a pretty inflexible model. In fact, on the link scale it's a linear model – the linear predictor surface is a hyperplane slicing through a feature space of dimension \(p\). Compare to the relationship modeled by a GAM:

\[
\mu_i = g^{-1}(\eta) = g^{-1}(\beta_0 + f_1(x_{i,1}) + f_2(x_{i,2}) + … + f_p(x_{i,p}))
\]
Here we only assume \(x\) enters the linear predictor via some function \(f_x\), rather than directly in proportion to some \(\beta\). This looser assumption allows for more flexible models if we can reliably estimate the \(p\) non-linear functions \(f\) somehow.

It turns out that we can produce useful estimates for these functions – called a basis expansion of \(x\). We will take a practical look at how that is done using the approach of thin plate splines, but first let's establish our baseline for comparison using the GLMs we are familiar with.

Data

First we load some standard libraries for manipulating and visualizing data. The mgcv package was an important development in our computational ability to estimate GAMs and remains probably the most popular package in use today.

library(CASdatasets)
library(dplyr)
library(tibble)
library(magrittr)
library(ggplot2)
library(tidyr)
library(mgcv)

We will use the french motor dataset from the CASdatasets library for fooling around here.

data("freMPL6")

Let's have a glimpse at the data

glimpse(freMPL6)

## Rows: 42,400
## Columns: 20
## $ Exposure           0.333, 0.666, 0.207, 0.163, 0.077, 0.347, 0.652, 0.355, 0.644, 0.379, 0.333, 0.666, 0.407, 0.333, 0.666, …
## $ LicAge             468, 472, 169, 171, 180, 165, 169, 429, 433, 194, 318, 322, 522, 444, 448, 403, 401, 406, 530, 534, 267, …
## $ RecordBeg          2004-01-01, 2004-05-01, 2004-01-01, 2004-03-16, 2004-12-03, 2004-01-01, 2004-05-06, 2004-01-01, 2004-05-…
## $ RecordEnd          2004-05-01, NA, 2004-03-16, 2004-05-15, NA, 2004-05-06, NA, 2004-05-09, NA, 2004-05-18, 2004-05-01, NA, …
## $ Gender             Male, Male, Male, Male, Male, Female, Female, Female, Female, Male, Male, Male, Female, Male, Male, Male,…
## $ MariStat           Other, Other, Other, Other, Other, Alone, Alone, Other, Other, Other, Other, Other, Other, Other, Other, …
## $ SocioCateg         CSP50, CSP50, CSP50, CSP50, CSP50, CSP50, CSP50, CSP50, CSP50, CSP50, CSP22, CSP22, CSP50, CSP50, CSP50, …
## $ VehUsage           Private, Private, Private+trip to office, Private+trip to office, Private+trip to office, Private+trip to…
## $ DrivAge            67, 68, 32, 32, 33, 32, 32, 59, 59, 51, 44, 45, 61, 55, 55, 55, 54, 54, 62, 63, 40, 40, 55, 55, 38, 38, 7…
## $ HasKmLimit         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ClaimAmount        0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 75.94985, 0.00000, 0.00000, 59.24230, 0.00000, 0.00000, 0.00…
## $ ClaimNbResp        1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 0, 0, 0, 0, 3, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ ClaimNbNonResp     0, 0, 2, 2, 2, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ClaimNbParking     0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ClaimNbFireTheft   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ClaimNbWindscreen  0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ OutUseNb           0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ RiskArea           9, 9, 7, 7, 7, 6, 6, 9, 9, 11, 9, 9, 7, 5, 5, 9, 9, 9, 6, 6, 10, 10, 11, 11, 11, 11, 10, 10, 7, 7, 10, 10…
## $ BonusMalus         50, 50, 72, 68, 68, 63, 59, 51, 50, 50, 50, 50, 50, 53, 50, 95, 118, 100, 50, 50, 50, 50, 58, 72, 50, 50,…
## $ ClaimInd           0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …

We have some typical policy elements here, including a pre-calculated exposure column, as well as some claim elements joined to the exposures. I'm curious how Exposure was calculated given the NAs that exist in the RecordEnd column, but we will just trust it for our purposes.

I notice also that we have a column named ClaimInd rather than ClaimCount – let's confirm this is indeed a binary indicator of 0 or 1 (or more?) claims in the exposure period.

table(freMPL6$ClaimInd)

## 
##     0     1 
## 38186  4214

One thing I want to change is to express LicAge in integral years rather than months. We would not throw out information like this normally, but for our purposes it will simplify some plotting.

freMPL6 %<>%
  as_tibble() %>%
  mutate(LicAge = floor(LicAge/12))

Let's start by looking at annualized claim frequency as a function of LicAge in years.

dat_summ <- freMPL6 %>%
  group_by(LicAge) %>%
  summarise(claim_cnt = sum(ClaimInd),
            Exposure = sum(Exposure),
            obs_freq = claim_cnt / Exposure)

## `summarise()` ungrouping output (override with `.groups` argument)

dat_summ %>%
  ggplot(aes(x = LicAge, y = obs_freq)) +
  geom_line() +
  ggtitle("Observed Claim Frequency by License Age")

plot of chunk unnamed-chunk-7

This looks close to but not quite a linear effect. So now let's move into some modeling.

A GLM Baseline

The GLM is in a sense just a special case of a GAM, where we are limited to estimates of the form \(f(x) = \beta x\). Insurance practitioners will often approach a modeling analysis by fitting a GLM as follows…

The Lame GLM

Fit a model to the raw feature(s)

lame_glm <- glm(ClaimInd ~ LicAge,
                offset = log(Exposure),
                family = poisson(link = "log"), ## skip tests of poisson assumption, save for a future blog post
                data = freMPL6)

And then check for “significant” predictors

summary(lame_glm)

## 
## Call:
## glm(formula = ClaimInd ~ LicAge, family = poisson(link = "log"), 
##     data = freMPL6, offset = log(Exposure))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7390  -0.5252  -0.3913  -0.2396   3.2793  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.278794   0.034191 -37.401  < 2e-16 ***
## LicAge      -0.009577   0.001177  -8.139 3.98e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18267  on 42399  degrees of freedom
## Residual deviance: 18200  on 42398  degrees of freedom
## AIC: 26632
## 
## Number of Fisher Scoring iterations: 6

Finally, visually inspect the fit against the mean observed response (we use the summarised data for this)

dat_summ %>%
  mutate(lame_fit = predict(lame_glm, ., type = "response") / Exposure  ) %>%
  gather(key = "metric", value = "freq", obs_freq, lame_fit) %>%
  ggplot(aes(x = LicAge, y = freq, color = metric)) +
  geom_line() +
  ggtitle("Fit vs Observed")

plot of chunk unnamed-chunk-10

Good enough, back to studying for an exam.

Well maybe not…what's going on at the tails of our domain? Are we just seeing the process variance due to low exposure? How good is this linear curve?

gridExtra::grid.arrange(

  dat_summ %>%
    mutate(lame_fit = predict(lame_glm, ., type = "response") / Exposure  ) %>%
    gather(key = "metric", value = "freq", obs_freq, lame_fit) %>%
    ggplot(aes(x = LicAge, y = freq, color = metric)) +
    geom_line() +
    ggtitle("Fit vs Observed") +
    theme(axis.title.x = element_blank(),
          axis.ticks.x = element_blank(), 
          axis.text.x = element_blank(),
          legend.position = c(.15,.2) ),

  dat_summ %>%
    ggplot(aes(x = LicAge, y = Exposure)) +
    geom_bar(stat = 'identity'),

  heights = c(2,1)
)

plot of chunk unnamed-chunk-11

The exposure is certainly lower in the each tail, but too low? Is pattern in the observed mean at the tails purely “noise” or is there a credible “signal” here? Of course we can't tell from a graph – we need some statistics. At this point, depending on the practitioner, the linear fit might be accepted and some post-modeling selections proposed to better fit the observed if it is believed to be signal and not noise.

A little better

Given the data above an attentive modeler would probably introduce some “feature engineering” into the analysis by creating a new column, a log-transform of LicAge, to test in a second (or perhaps the first) candidate model. For many, some standard transformations of raw features are created at the outset before any models are fit.

freMPL6 %<>%  ## mutate and update data
  mutate(log_LicAge = log(LicAge))

better_glm <- glm(ClaimInd ~ log_LicAge, ## new term
                 offset = log(Exposure),
                 family = poisson(link = "log"),
                 data = freMPL6)

We recreate the summarised data for plotting

dat_summ <- freMPL6 %>%
  group_by(LicAge) %>%
  summarise(claim_cnt = sum(ClaimInd),
            Exposure = sum(Exposure),
            obs_freq = claim_cnt / Exposure,
            log_LicAge = first(log_LicAge))

## `summarise()` ungrouping output (override with `.groups` argument)

and compare the two curves graphically

gridExtra::grid.arrange(

  dat_summ %>%
    mutate(lame_fit = predict(lame_glm, ., type = "response") / Exposure,
           better_fit = predict(better_glm, ., type = "response") / Exposure) %>%
    gather(key = "metric", value = "freq", obs_freq, lame_fit, better_fit) %>%
    ggplot(aes(x = LicAge, y = freq, color = metric)) +
    geom_line() +
    ggtitle("Fit vs Observed") +
    theme(axis.title.x = element_blank(),
          axis.ticks.x = element_blank(), 
          axis.text.x = element_blank(),
          legend.position = c(.15,.2) ),

  dat_summ %>%
    ggplot(aes(x = LicAge, y = Exposure)) +
    geom_bar(stat = 'identity'),

  heights = c(2,1)
)

plot of chunk unnamed-chunk-14

The model with log(LicAge) looks like an improvement. These models are not nested and we are using 1 degree of freedom for each curve, so model comparison is straightforward – we would simply choose the model with lower residual deviance, or equivalently when model complexity is the same, the lowest AIC.

AIC(lame_glm, better_glm)

##            df      AIC
## lame_glm    2 26632.23
## better_glm  2 26615.96

Now we are running into the natural limitations of our GLM. We can model using the raw features or create simple functional transforms – log, square root, and so on are some common choices. Another option is to create a polynomial basis expansion of \(x\) – an orthogonolized basis of \({x^1, x^2,…,x^k}\) – and test each of these polynomials in the model. Actually this is a popular approach; it is an option built into some widely used modeling software and can be done in R with the poly() method. If you're already familiar with this then you are familiar with GAMs since it is what I would call a poor-man's GAM. Some of the shortcomings of this approach are:

  1. A polynomial basis is still a limited set of shapes for the basis functions
  2. Including multiple polynomials often produces wild behaviors in the tails
  3. Each basis function is either fully in or fully omitted from the model – that is, assuming we stick with the glm() method (we could move into regularization with lasso or ridge regressions)
  4. Terms are typically tested manually for inclusion starting from the lowest order and increasing
  5. Just my experience, but we often begin to over-fit the data once we go beyond an order of 2 or maybe 3

Here is how a polynomial basis expansion can be implemented

poly_glm <- glm(ClaimInd ~ poly(LicAge, degree = 3),
                 offset = log(Exposure),
                 family = poisson(link = "log"),
                 data = freMPL6)

dat_summ %>%
    mutate(poly_fit = predict(poly_glm, ., type = "response") / Exposure) %>%
    gather(key = "metric", value = "freq", obs_freq, poly_fit) %>%
    ggplot(aes(x = LicAge, y = freq, color = metric)) +
    geom_line() +
    ggtitle("GLM w/ 3rd order Polynomial Basis Expansion vs Observed") +
    theme(axis.title.x = element_blank(),
          axis.ticks.x = element_blank(), 
          axis.text.x = element_blank(),
          legend.position = c(.15,.2) )

plot of chunk unnamed-chunk-16

As you can see, the fitted curve shows a characteristic cubic shape which could become dangerous if this is the type of variable where some large outliers in \(x\) are possible.

Take a look at the model summary and make note of the terms in the model

summary(poly_glm)

## 
## Call:
## glm(formula = ClaimInd ~ poly(LicAge, degree = 3), family = poisson(link = "log"), 
##     data = freMPL6, offset = log(Exposure))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8152  -0.5262  -0.3917  -0.2411   3.2924  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -1.53972    0.01548 -99.459  < 2e-16 ***
## poly(LicAge, degree = 3)1 -25.16797    3.16467  -7.953 1.82e-15 ***
## poly(LicAge, degree = 3)2   7.58463    3.25564   2.330   0.0198 *  
## poly(LicAge, degree = 3)3  -6.66467    3.28154  -2.031   0.0423 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18267  on 42399  degrees of freedom
## Residual deviance: 18189  on 42396  degrees of freedom
## AIC: 26625
## 
## Number of Fisher Scoring iterations: 6

Our leading model so far seems to be the GLM with a log-transform – but it looks like we under-estimate the risk at the low end of LicAge. Suppose our carrier plans to begin writing more non-standard risks in the lower age group in future years. This is a compelling reason to obtain the most accurate predictions we can for this segment where our historical experience is thin. Is this the best we can do?

GAM w/ Thin Plate Splines

Let us turn to 'proper' GAMs, specifically using thin plate spline bases. When estimating GAMs, you are really just fitting a GLM with some basis expansions of your covariates, similar to our example using poly() above. The differences between that approach and proper GAMs (with TPS) are:

  1. Basis expansions will be created for us starting from a much larger function space (about the same dimension as there are observations in the dataset, if I understand correctly), and then truncating it down to the \(k\) most important functions in this space via eigen decomposition
  2. The basis functions will not have a simple closed form expression like \(b(x) = x^3\), but rather a more complicated representation in terms of a radial basis kernel transformation applied to \(x\). We don't deal with the math behind this.
  3. The coefficient estimates will be regularized or “shrunk” via inclusion of a complexity penalty in the optimization

From an estimation perspective, the quantity being minimized to “fit” the GAM is (for a gaussian error)

\[\min_{f}\sum_{i=1}^{N}[y_i-f(x_i)]^2 + \lambda J(f)\]

We've seen pieces of this before – the first term is squared error loss. The \(J(f)\) in the second term is a measure of “wiggliness” of fitted function \(f\). Thinking about how one measures wiggliness, it is intuitive that we would want to consider second derivatives of \(f\) evaluated all along \(x\), squaring them so that concave and convex areas do not cancel each other out, and sum them all up. Well that is effectively how it is done:

\[J(f) =\int f’‘(x)^2 dx\]

\(\lambda\) is the smoothing parameter which controls how much penalty enters the function being optimized. It is a hyperparameter, so it must be tuned/estimated for us – this is done either through GCV (default) or through (RE)ML which is recommended by the author since it is less prone to undersmoothing, though also a bit slower. Those familiar with penalized regressions such as the Lasso and Ridge estimators (as implemented in the glmnet package for example) will recognize \(\lambda\) is also used there to denote the complexity parameter. Instead of \(L_1\) or \(L_2\) norms as measures of complexity (Lasso and Ridge, respectively), here we have the quantity \(J(f)\) measuring wiggliness.

Default setting GAM

A GAM is fit using the gam() and s() methods as follows

gam_tps_1 <- freMPL6 %>%
  gam(ClaimInd ~ s(LicAge, bs = "tp"), #TPS basis
      offset = log(Exposure),
      family = poisson(link = "log"),
      data = .)

Let's see what the summary() tells us about our model

summary(gam_tps_1)

## 
## Family: poisson 
## Link function: log 
## 
## Formula:
## ClaimInd ~ s(LicAge, bs = "tp")
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.54069    0.01549  -99.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##             edf Ref.df Chi.sq p-value    
## s(LicAge) 6.065  7.077  97.18  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0313   Deviance explained = 0.517%
## UBRE = -0.57107  Scale est. = 1         n = 42400

We notice that we spend about 6 degrees of freedom estimating the smooth term s(LicAge).

Compare the GAM fit to the GLM with logged feature

gridExtra::grid.arrange(

  dat_summ %>%
    mutate(better_fit = predict(better_glm, ., type = "response") / Exposure,
           gam_fit_1 = predict(gam_tps_1, ., type = "response")) %>%
    gather(key = "metric", value = "freq", obs_freq, better_fit, gam_fit_1) %>%
    ggplot(aes(x = LicAge, y = freq, color = metric)) +
    geom_line() +
    ggtitle("Fit vs Observed") +
    theme(axis.title.x = element_blank(),
          axis.ticks.x = element_blank(), 
          axis.text.x = element_blank(),
          legend.position = c(.15,.2) ),

  dat_summ %>%
    ggplot(aes(x = LicAge, y = Exposure)) +
    geom_bar(stat = 'identity'),

  heights = c(2,1)
)

plot of chunk unnamed-chunk-20

Basis Dimension

Recall a GAM models the response as a linear combination of some smooth functions \(f(x)\), for each \(x\) in the set of covariates \(\{x_1…x_p\}\).

Here we only have one \(x\), and the spline basis \(f_1(x_1)\) is estimated as

\[f_1(x_1) = \sum^{k}_{j=1}b_j(x_1)\beta_j\]

\(k\) represents the basis dimension for the smooth. What is the dimension of our selected spline basis? Well since we know we should be estimating one \(\beta\) for each basis function, let's check

coef(gam_tps_1)

## (Intercept) s(LicAge).1 s(LicAge).2 s(LicAge).3 s(LicAge).4 s(LicAge).5 s(LicAge).6 s(LicAge).7 s(LicAge).8 s(LicAge).9 
##  -1.5406883   0.3491719   0.6261047  -0.2770389   0.3752651   0.2409351  -0.3306367   0.1200613  -0.9979342  -0.4636694

So we have a spline basis of dimension 9, producing a smooth

\[f_{LicAge}(LicAge) = \sum^{9}_{j=1}b_j(LicAge)\beta_j\]

The spline basis is itself a linear combination of estimated weights (\(\beta_j\)) each scaling a basis function \(b(x)\). Add up all the \(b_j(x_1)\beta_j\) and you get the smooth function \(f_1(x_1)\).

Let's plot the 9 basis functions \(b_j(x_1)\) which we used for the basis expansion of LicAge

dat_summ %>%
  cbind(., predict(gam_tps_1, newdata = ., type = "lpmatrix") ) %>%  
  as_tibble() %>%
  gather(key = "b_j", value = "b_j_x", 7:15) %>%
  ggplot(aes(x = LicAge, y = b_j_x, color = b_j)) +
  geom_line(linetype = "dashed") +
  scale_y_continuous(name = "b_j(x)") +
  geom_line(data = . %>% mutate(gam_fit_1 = predict(gam_tps_1, ., type = "response")),
            aes(LicAge, gam_fit_1),
            color = "black")

plot of chunk unnamed-chunk-22

Notice that the first basis function is linear.

Now let's confirm our understanding by getting predictions a few different ways, we should get the same result from each.

cbind(
# default output on linear predictor scale and apply inverse link function
predict(gam_tps_1, newdata = tibble(LicAge = 1:10)) %>% as.vector() %>% exp(),

# ask for output already on response scale
predict(gam_tps_1, newdata = tibble(LicAge = 1:10), type = "response"),

# calculate dot product of Intercept and spline basis for x with corresponding estimated weights, apply inverse link
(predict(gam_tps_1, newdata = tibble(LicAge = 1:10), type = "lpmatrix") %*% coef(gam_tps_1)) %>%
  as.vector() %>%
  exp()
)

##         [,1]      [,2]      [,3]
## 1  0.4168438 0.4168438 0.4168438
## 2  0.3889351 0.3889351 0.3889351
## 3  0.3629308 0.3629308 0.3629308
## 4  0.3388892 0.3388892 0.3388892
## 5  0.3169581 0.3169581 0.3169581
## 6  0.2972915 0.2972915 0.2972915
## 7  0.2800022 0.2800022 0.2800022
## 8  0.2651402 0.2651402 0.2651402
## 9  0.2526882 0.2526882 0.2526882
## 10 0.2425674 0.2425674 0.2425674

How did we end up with 9 basis functions? By default this is set for us by the library and is also essentially arbitrary – for a univariate smooth like we have the argument k controls the basis dimension and is defaulted to a value of 10, producing a basis of dimension 9 (k-1). These are selected in order of the amount of variance explained in the radial basis transformation of \(x\) mentioned earlier. This is a vector space, and of very high dimension, so for computational efficiency we want to select a small subset of the most important functions in that space.

Since smooth functions are estimated with penalized regression (by default), k is really setting an upper limit to the degrees of freedom spent in the estimation. The d.f. upper limit \(k\) = k-1 since 1 d.f. is recovered due to an “identifiability constraint” – a constant basis function is removed from the basis expansion to maintain model identifiability since an intercept is already included in the model specification outside of the smooth term. In practice, we just want to check that the effective degrees of freedom of the smooth is not too close to the upper limit of k-1. If it is, we would increase k to select a few more basis functions from that big vector space and include them in the basis expansion of x, producing more wiggliness. This approach of creating a large vector space and eigen decomposing it to select a subset of important funtions is the computational innovation of Simon Wood and the mgcv package.

Effective Degrees of Freedom

Let's look at the summary again to inspect how complex our model is.

summary(gam_tps_1)

## 
## Family: poisson 
## Link function: log 
## 
## Formula:
## ClaimInd ~ s(LicAge, bs = "tp")
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.54069    0.01549  -99.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##             edf Ref.df Chi.sq p-value    
## s(LicAge) 6.065  7.077  97.18  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0313   Deviance explained = 0.517%
## UBRE = -0.57107  Scale est. = 1         n = 42400

The value “edf” refers to the estimated or effective degrees of freedom and is a measure of curve complexity – it is sort of like how many parameters are being estimated from the data. In a normal MLE [generalized] linear regression, 1 parameter estimate burns 1 d.f., but in a penalized regression context the complexity penalty has the effect of “shrinking” parameters out of the model. These shrunken estimates do not use a full degree of freedom because they do not fully key-in to the data – they are a compromise between the data and the penalty being imposed against complexity (similar to a prior in the bayesian framework). This is the case with our smooth term – despite a basis dimension of 9, it is more like we are estimating 5 full parameters from the data, due to shrsinkage.

Let's look at where we are spending these degrees of freedom

gam_tps_1$edf

## (Intercept) s(LicAge).1 s(LicAge).2 s(LicAge).3 s(LicAge).4 s(LicAge).5 s(LicAge).6 s(LicAge).7 s(LicAge).8 s(LicAge).9 
##  1.00000000  0.99555993  0.69450590  1.00662557  0.16930169  0.80850259 -0.01463086  0.38449257  1.02098007  1.00000000

gam_tps_1$edf %>% sum

## [1] 7.065337

Because the reported edf are approximate, terms may be slightly below 0 or above 1. We can interpret this as indicating the corresponding basis function was shrunken nearly out of the smooth term, or not shrunken at all, respectively. Here we have one basis function shrunken completely

Our edf sum balances with what is reported in the model summary (6.065, plus 1 for the Intercept). But we should keep in mind that we are also estimating a smoothing parameter which controls the amount of penalty applied. If we want edf's that accounts for this parameter as well, then you must look at edf2.

gam_tps_1$edf2

## NULL

gam_tps_1$edf2 %>% sum()

## [1] 0

Why 'Thin Plate'?

So far we have only considered the univariate case. The real power of GAMs + thin plate spline smooths comes through when we move into smooths of 2 (or more) covariates and the plots become much cooler.

gam_tps_2X <- freMPL6 %>%
  gam(ClaimInd ~ s(LicAge, DrivAge, bs = "tp", k = 15), # a smooth surface on (LicAge, DrivAge)
      offset = log(Exposure),
      family = poisson(link = "log"),
      method = "REML", ## The recommended method which is less prone to over-smoothing compared to GCV
      data = .)

## plotting with persp()
steps <- 30
LicAge <- with(freMPL6, seq(min(LicAge), max(LicAge), length = steps))
DrivAge <- with(freMPL6, seq(min(DrivAge), max(DrivAge), length = steps))
newdat <- expand.grid(LicAge = LicAge,
                      DrivAge = DrivAge)

fit_2X <- matrix(predict(gam_tps_2X, newdat, type = "response"), steps, steps)

persp(LicAge, DrivAge, fit_2X, theta = 120, col = "yellow", ticktype = "detailed")

plot of chunk unnamed-chunk-27

Here we have a “thin plate” spline prediction surface estimated jointly on DrivAge * LicAge. The model specification is equivalent to including an interaction between two covariates in the GLM context. We can see that, generally speaking, risk decreases with DrivAge and with LicAge. But the story is a little more nuanced than either a tilted plane (2 linear terms) or a “saddle” (2 linear terms plus an interaction between them), with a ridge and some depressions appearing on the surface, and some tugging upward of the corners of the surface.

We can increase or decrease the tension of the thin plate, leading to more or less pressure required to flex it. This is the physical interpretation of the penalty-controlling parameter \(\lambda\) we saw earlier. Normally the optimal lambda is estimated for us; let's manually override that and make it fairly large to induce more penalty on wiggliness.

gam_tps_2X <- freMPL6 %>%
  gam(ClaimInd ~ s(LicAge, DrivAge, bs = "tp", k = 15, sp = 100), #large lambda = more tension
      offset = log(Exposure),
      family = poisson(link = "log"),
      method = "REML",
      data = .)

fit_2X <- matrix(predict(gam_tps_2X, newdat, type = "response"), steps, steps)

persp(LicAge, DrivAge, fit_2X, theta = 120, col = "yellow", ticktype = "detailed")

plot of chunk unnamed-chunk-28

Here we can see that the “pressure” of the signal in our data is only strong enough to bend the plate at the corner where both DrivAge and LicAge are low, and elsewhere we have essentially a flat plate, i.e. a plane. We can infer from this that we did not shrink out the linear parts of the smooths, only the wiggly bits. That is the default behavior – shrinkage is applied in the “range space” containing the non-linear basis functions, not to the “null space” with the linear components of each term. But there are a couple ways to also apply shrinkage here if we like.

Shrinkage on the null space

There are two ways to induce shrinkage on the linear terms – shrinkage smoothers and the double penalty approach.

The shinkrage smoother approach is used by setting bs="ts" in a call to s(). We can do this for some or all terms – one advantage of this approach. This approach assumes a priori that non-linear bits should be shrunk more than the linear terms.

gam_tps_shrink <- freMPL6 %>%
  gam(ClaimInd ~ s(LicAge, DrivAge, bs = "ts", k = 15), #shrinkage version of TPS basis
      offset = log(Exposure),
      family = poisson(link = "log"),
      method = "REML",
      data = .)

fit_shrink <- matrix(predict(gam_tps_shrink, newdat, type = "response"), steps, steps)
persp(LicAge, DrivAge, fit_shrink, theta = 120, col = "yellow", ticktype = "detailed")

plot of chunk unnamed-chunk-29

The double penalty approach is used by setting select=TRUE in the gam() call. The disadvantages of this approach are that it is either on or off for all smooth terms, and now instead of one smoothing parameter we must estimate two – one each for the null space (linear terms) and the range space (non-linear terms). The advantages are that it treats both linear and non-linear terms the same from the point of view of shrinkage, and that this tends to produce more robust estimates according to the package author.

gam_tps_dblpen <- freMPL6 %>%
  gam(ClaimInd ~ s(LicAge, DrivAge, bs = "tp", k = 15),
      offset = log(Exposure),
      family = poisson(link = "log"),
      method = "REML",
      select = TRUE, ## an extra penalty placed on linear terms of smooths
      data = .)

fit_dblpen <- matrix(predict(gam_tps_dblpen, newdat, type = "response"), steps, steps)
persp(LicAge, DrivAge, fit_dblpen, theta = 120, col = "yellow", ticktype = "detailed")

plot of chunk unnamed-chunk-30

The double penalty approach produces more regularlization on the linear components than the shrinkage splines, resulting in a more level surface. These options become powerful when you have more than a handful of features and want to perform automatic feature selection, as in a Lasso regression.

Business Constraints

So far we've seen how we can improve our estimator with more flexible shapes to capture non-linear patterns and with regularization to reduce estimator variance by controlling over-fitting of the data. So let's turn back to the motivation of better reflecting business constraints.

Looking at both our univariate and bivariate models, we see that better data fitting can mean curves/surfaces that are wavy. This can be totally fine depending on the application – for example if we are modeling the result of some operational process where customers and regulators are not seeing or feeling the effects of modeling decisions so directly. On the other hand, this may be undesireably in a pricing application from the customer experience perspective – in particular when we have rating factors which increment over time (driver age, license age) then the result may be rates that swing up and down with each policy renewal/year.

Can we achieve more flexible curves while also maintaining monotonicity as we had with our log-transform GLM?

Shape Constraints with scam

library(scam)

To do this we need to turn to the scam package. scam is very similar to mgcv in that it allows us to fit GAMs with a variety of spline bases, but scam is special because it allows us to also place shape constraints on these models (“scam” stands for shape constrained additive models).

Let's take a look at a model for LicAge where we impose a monotonic decreasing condition on the spline basis

scam_mono_d <- freMPL6 %>%
  scam(ClaimInd ~ s(LicAge, bs = "mpd"), ## monotonic decreasing constraint on basis expansion
       offset = log(Exposure),
       family = poisson(link = "log"),
       data = .)

dat_summ %>%
  mutate(gam_fit_1 = predict(gam_tps_1, ., type = "response"),
         scam_fit_1 = predict(scam_mono_d, ., type = "response")) %>%
  gather(key = "metric", value = "freq", obs_freq, gam_fit_1, scam_fit_1) %>%
  ggplot(aes(x = LicAge, y = freq, color = metric)) +
  geom_line()

plot of chunk unnamed-chunk-32

We've expressed our desire for a non-wavy curve as a shape constraint within the spline basis itself, and estimated the optimal curve from the data under that condition. This is also possible in the bivariate case by using the argument bs="tedmd", though I will not do so here because it generates warnings due to not being an appropriate model for this data.

Finally let's look at the in-sample estimate of predictive performance for three of our models – the GLM w/ log-transform, the GAM without shape constraints, and our model incorporating business considerations via a monotone decreasing spline basis.

AIC(better_glm, gam_tps_1,  scam_mono_d)

##                   df      AIC
## better_glm  2.000000 26615.96
## gam_tps_1   7.065337 26614.70
## scam_mono_d 5.958760 26616.27

Based on AIC (we wouldn't entirely base these decisions on AIC) which penalizes prediction improvement for added model complexity, we would select the unconstrained GAM as our best model. But based on the business considerations and the desire to write more risks in the low age range prospectively, we would choose the GAM with monotonic constraint.

Conclusion

We started with a GLM which fit the observed experience pretty well with a simple log feature. The unconstrained GAM is able to better match the experience at the low end because it is not constrained by any functional form. But it also finds a wavy pattern throughout the experience and maybe this is not something we would choose to implement (or even a true effect we would believe). By imposing a constraint on the spline basis we can achieve a model which captures the relative risk better than the GLM, but maintains the desired shape, and still performs comparably with the GLM based on AIC (a penalized estimate of out-of-sample prediction performance).

In my mind, one of the greatest advantages in using the more flexible GAMs/non-linear methods over a GLM is the perception by our stakeholders and business partners around the quality of our analysis. With less flexible linear methods we must attach more qualifications when sharing visuals of the model output – statements like “this curve is the the best fit according to the model but some adjustments are certainly warranted here and there.” It doesn't sound like your model does a very good job then! I prefer sharing results where I can confidently say that the curves are a best estimate of the underlying signal under reasonable assumptions/constraints – allowing non-linearity but assuming monotonicity; “credibility-weighting” observed experience in areas with thin data. We may even save our own and our colleagues' time from meetings spent fiddling with curves in excel trying to find the best compromise fit between higher observed loss and the thin experience which generated it.

In a future Part 2 post I will plan to look at additional modeling capabilities, focusing on estimating random effects/mixed models within mgcv and from there moving into fully Bayesian GAMs with brms and stan.

To leave a comment for the author, please follow the link and comment on their blog: R – Data Science, Insurance, Bikes, and the Meaning of Life.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)