We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as NonLife Insurance Pricing with Generalized Linear Models by Esbjörn Ohlsson and Börn Johansson (Amazon UK  US).
At this stage, our purpose is to reproduce the analysis from the book using the R statistical computing and analysis platform, and to answer the data analysis elements of the exercises and case studies. Any critique of the approach and of pricing and modeling in the Insurance industry in general will wait for a later article.
In the following, we will assume that the reader has a copy of the book and a working installation of R. We will be using the data.table, foreach, and ggplot2 packages which are not part of the standard distribution, so the reader should install them first (e.g. by executing install.packages(c("data.table", "foreach", "ggplot2"), dependencies = TRUE)
from within an R session).
The beginning of the code is not so important, but here goes:
#!/usr/bin/Rscript ## PricingGLM2.r  Code for Chapter 2 of "NonLife Insurance Pricing with GLM" ## Copyright © 2012 CYBAEA Limited (http://www.cybaea.net) ## @book{ohlsson2010non, ## title={NonLife Insurance Pricing with Generalized Linear Models}, ## author={Ohlsson, E. and Johansson, B.}, ## isbn={9783642107900}, ## series={Eaa Series: Textbook}, ## url={http://books.google.com/books?id=l4rjeflJ\_bIC}, ## year={2010}, ## publisher={Springer Verlag} ## }
Example 2.5: Moped insurance continued
We continue the moped insurance example, and we use the data that we saved in our Chapter 1 session. The goal is to reproduce Table 2.7 so we start building that as a data frame after loading the data.
## Load the data from last. if (!exists("table.1.2")) load("table.1.2.RData") library("foreach") ## We are looking to reproduce table 2.7 which we start building here, ## adding columns as we go. table.2.7 < data.frame(rating.factor = c(rep("Vehicle class", nlevels(table.1.2$premiekl)), rep("Vehicle age", nlevels(table.1.2$moptva)), rep("Zone", nlevels(table.1.2$zon))), class = c(levels(table.1.2$premiekl), levels(table.1.2$moptva), levels(table.1.2$zon)), stringsAsFactors = FALSE)
We next calculate the duration and number of claims for each level of each rating factor. We also set the contrasts for the levels, using the same idiom as in our Chapter 1 session. The foreach package is convenient to use here, but you can of course do it with a normal loop and a couple of variables.
## Calculate duration per rating factor level and also set the ## contrasts (using the same idiom as in the code for the previous ## chapter). We use foreach here to execute the loop both for its ## sideeffect (setting the contrasts) and to accumulate the sums. new.cols < foreach (rating.factor = c("premiekl", "moptva", "zon"), .combine = rbind) %do% { nclaims < tapply(table.1.2$antskad, table.1.2[[rating.factor]], sum) sums < tapply(table.1.2$dur, table.1.2[[rating.factor]], sum) n.levels < nlevels(table.1.2[[rating.factor]]) contrasts(table.1.2[[rating.factor]]) < contr.treatment(n.levels)[rank(sums, ties.method = "first"), ] data.frame(duration = sums, n.claims = nclaims) } table.2.7 < cbind(table.2.7, new.cols) rm(new.cols)
Next, we need to build the frequency and severity models separately, as in the discussion at the beginning of section 2.3.4 on page 34.
Frequency model
Note here the use of the offset()
term as opposed to the weights=
argument we have used before. The offset is log(dur)
because our link function is log
.
See help("Insurance", package = "MASS")
for a similar example and sections 7.1 (p. 189190) and 7.3 of Modern Applied Statistics with S
(Amazon
UK 
US
)
for a (very brief) discussion.
model.frequency < glm(antskad ~ premiekl + moptva + zon + offset(log(dur)), data = table.1.2, family = poisson) rels < coef( model.frequency ) rels < exp( rels[1] + rels[1] ) / exp( rels[1] ) table.2.7$rels.frequency < c(c(1, rels[1])[rank(table.2.7$duration[1:2], ties.method = "first")], c(1, rels[2])[rank(table.2.7$duration[3:4], ties.method = "first")], c(1, rels[3:8])[rank(table.2.7$duration[5:11], ties.method = "first")])
Severity model
There are a couple of points to note here:
 We will model using a Gamma distribution for the errors, following the discussion on page 20 and also pages 3334. Note the point that this is only one of several plausible candidate distributions.
 Because we are using the Gamma distribution we need to remove the zero values from the data; we do this using the
table.1.2[table.1.2$medskad > 0, ]
construct.  To reproduce the values from the book, we use the noncanonical
"log"
link function even though the canonical function ("inverse")
gives a slightly better fit (residual deviance 5.9 versus 8.0 on 16 degrees of freedom). This follows the approach discussed in Example 2.3 on page 30.
model.severity < glm(medskad ~ premiekl + moptva + zon, data = table.1.2[table.1.2$medskad > 0, ], family = Gamma("log"), weights = antskad) rels < coef( model.severity ) rels < exp( rels[1] + rels[1] ) / exp( rels[1] ) ## Aside: For the canonical link function use ## rels < rels[1] / (rels[1] + rels[1]) table.2.7$rels.severity < c(c(1, rels[1])[rank(table.2.7$duration[1:2], ties.method = "first")], c(1, rels[2])[rank(table.2.7$duration[3:4], ties.method = "first")], c(1, rels[3:8])[rank(table.2.7$duration[5:11], ties.method = "first")])
Combining the models
Now it is trivial to combine and display the results.
table.2.7$rels.pure.premium < with(table.2.7, rels.frequency * rels.severity) print(table.2.7, digits = 2)
rating.factor  class  duration  n.claims  rels.frequency  rels.severity  rels.pure.premium  

1  Vehicle class  1  9833.20  391  1.00  1.00  1.00 
2  Vehicle class  2  8825.10  395  0.78  0.55  0.42 
11  Vehicle age  1  1918.40  141  1.55  1.79  2.78 
21  Vehicle age  2  16739.90  645  1.00  1.00  1.00 
12  Zone  1  1451.40  206  7.10  1.21  8.62 
22  Zone  2  2486.30  209  4.17  1.07  4.48 
3  Zone  3  2888.70  132  2.23  1.07  2.38 
4  Zone  4  10069.10  207  1.00  1.00  1.00 
5  Zone  5  246.10  6  1.20  1.21  1.46 
6  Zone  6  1369.20  23  0.79  0.98  0.78 
7  Zone  7  147.50  3  1.00  1.20  1.20 
2.4 Case Study: Motorcycle Insurance
The authors use the term Case Study for larger exercises. There is quite a lot to cover here, and some of the questions are more about discussing the results. We will focus on getting the results.
The Data
Obtaining the data
First we read the (fixed width) data. The column names are the same (based on Swedish language) as in the book.
columns < c(agarald = 2L, kon = 1L, zon = 1L, mcklass = 1L, fordald = 2L, bonuskl = 1L, duration = 8L, antskad = 4L, skadkost = 8L) column.classes < c("integer", rep("factor", 3), "integer", "factor", "numeric", rep("integer", 2)) stopifnot(length(columns) == length(column.classes)) con < url("http://www2.math.su.se/~esbj/GLMbook/mccase.txt") mccase < read.fwf(con, widths = columns, header = FALSE, col.names = names(columns), colClasses = column.classes, na.strings = NULL, comment.char = "") try(close(con), silent = TRUE) rm(columns, column.classes, con) mccase$mcklass < ordered(mccase$mcklass) mccase$bonuskl < ordered(mccase$bonuskl)
Adding metadata information
We are more than a little obsessed with documenting our data, so here goes. See the book for the details.
comment(mccase) < c("Title: Partial casco insurance for motorcycles from Wasa, 19941998", "Source: http://www2.math.su.se/~esbj/GLMbook/mccase.txt", "Copyright: http://www2.math.su.se/~esbj/GLMbook/") comment(mccase$agarald) < c("The owner's age, between 0 and 99", "Name: Age of Owner") comment(mccase$kon) < c("Name: Gender of Owner", "Code: M=Male", "Code: K=Female") comment(mccase$zon) < c("Name: Geographic Zone", "Code: 1=Central and semicentral parts of Sweden's three largest cities", "Code: 2=suburbs and middlesized towns", "Code: 3=Lesser towns, except those in 5 or 7", "Code: 4=Small towns and countryside, except 57", "Code: 5=Northern towns", "Code: 6=Northern countryside", "Code: 7=Gotland (Sweden's largest island)") comment(mccase$mcklass) < c("Name: MC Class", "Description: A classification by the EV ratio, defined as (engine power in kW × 100) / (vehicle weight in kg + 75), rounded to the nearest lower integer.", "Code: 1=EV ratio <= 5", "Code: 2=EV ratio 68", "Code: 3=EV ratio 912", "Code: 4=EV ratio 1315", "Code: 5=EV ratio 1619", "Code: 6=EV ratio 2024", "Code: 7=EV ratio >= 25") comment(mccase$fordald) < c("Vehicle age, between 0 and 99", "Name: Vehicle Age") comment(mccase$bonuskl) < c("Name: Bonus Class", "Description: A driver starts with bonus class 1; for each claimfree year the bonus class is increased by 1. After the first claim the bonus is decreased by 2; the driver cannot return to class 7 with less than 6 consecutive claim free years.") comment(mccase$duration) < c("Name: Duration", "Comment: The number of policy years", "Unit: year") comment(mccase$antskad) < c("Name: Number of Claims") comment(mccase$skadkost) < c("Name: Cost of Claims", "Unit: SEK")
Rating factors
We finally add the rating factors from the book.
mccase$rating.1 < mccase$zon mccase$rating.2 < mccase$mcklass mccase$rating.3 < cut(mccase$fordald, breaks = c(0, 1, 4, 99), labels = as.character(1:3), include.lowest = TRUE, ordered_result = TRUE) mccase$rating.4 < ordered(mccase$bonuskl) # Drop comments levels(mccase$rating.4) < # Combine levels c("1", "1", "2", "2", rep("3", 3))
Save the data
Never forget to save.
save(mccase, file = "mccase.RData")
Now we can start tackling the exercises.
Problem 1: Aggregate to cells of current tariff
Aggregate the data to the cells of the current tariff. Compute the empirical claim frequency and severity at this level.
We really like the data.table
package. The syntax is much easier on the eye (and the hand). As a bonus, it scales well to much larger data sets than data.frame
. If you are not already using it, now is the time to start.
if (!exists("mccase")) load("mccase.RData") ## Conver to data.table library("data.table") mccase < data.table(mccase, key = paste("rating", 1:4, sep = ".")) ## Aggregate to levels of current rating factors mccase.current < mccase[, list(duration = sum(duration), antskad = sum(antskad), skadkost = sum(skadkost), num.policies = .N), by = key(mccase)] ## Claim frequency and severity. Change NaN to NA. mccase.current$claim.freq < with(mccase.current, ifelse(duration != 0, antskad / duration, NA_real_)) mccase.current$severity < with(mccase.current, ifelse(antskad != 0, skadkost / antskad, NA_real_)) ## Save save(mccase.current, file = "mccase.current.RData")
Problem 2: Determine how the duration and number of claims is distributed
Determine how the duration and number of claims is distributed for each of the rating factor classes, as an indication of the accuracy of the statistical analysis.
This is one of those 'there is no right answer' questions, but let us have a look at the data. First load it if needed and also load the libraries we will be using.
## Load data if needed library("data.table") if (!exists("mccase")) load("mccase.RData") if (!exists("mccase.current")) load("mccase.current.RData") if (!is(mccase, "data.table")) mccase < data.table(mccase, key = paste("rating", 1:4, sep = ".")) library("grid") library("ggplot2")
1. Number of claims (antskad)
Let us start with the number of claims for the undelying data. First we plot the number of policies for each number of claims (range of number of claims is {0, 1, 2}); this is one way to do it:
plot.titles < c("Geo. zone", "MC class", "Vehicle age", "Bonus class") plots < lapply(1:4, function(i) ggplot(mccase, aes(antskad)) + geom_histogram(aes(weight = duration)) + scale_x_discrete(limits = c(0, 2)) + scale_y_log10() + facet_grid(paste("rating.", i, " ~ .", sep = ""), scales = "fixed") ## We drop the axis titles to make more room for the data + opts(axis.title.x = theme_blank(), axis.title.y = theme_blank(), axis.text.x = theme_blank(), axis.text.y = theme_blank(), axis.title.y = theme_blank(), axis.ticks = theme_blank(), title = plot.titles[i]) ) grid.newpage() pushViewport(viewport(layout = grid.layout(nrow = 1, ncol = 4))) ## We can ignore the warnings from displaying the plots for now for (i in 1:4) print(plots[[i]], vp = viewport(layout.pos.col = i))
There is certainly some variation there, especially on the first rating factor.
Recall from the discussion on page 18 the assumption that the number of claims for a single policy is a Poisson process and the distribution of the number of claims in a tariff cell is Poisson distributed. Let us first look at the whole data set (and note that the data.table notation is used; if using data.frame be sure to have drop=FALSE):
data < mccase[order(antskad), list(N = .N, w = sum(duration)), by = antskad] M < glm(N ~ antskad, family = poisson(), weights = w, data = data) data$predicted < round(predict(M, data[, list(antskad)], type = "response")) print(data[, list(antskad, N, predicted)], digits = 1)
antskad  n  predicted  

1  0  63878  63878.0 
2  1  643  646.0 
3  2  27  7.0 
Not too bad, but then it can't really be with that heavy weighting to the first value of antskad. We can show the same split by the levels of the first rating factor as an example:
data < mccase[order(rating.1, antskad), list(N = .N, w = sum(duration)), by = list(rating.1, antskad)] modelPoisson < function(data) { M < glm(N ~ antskad, family = poisson(), weights = w, data = data) return(M) } predictionPoisson < function (data) { M < modelPoisson(data) p < predict(M, data[, list(antskad)], type = "response") return(p) } data$predicted < unlist(lapply(levels(data$rating.1), function (l) round(predictionPoisson(data[rating.1 == l])))) print(data[, list(rating.1, antskad, N, predicted)], digits = 1)
rating.1  antskad  n  predicted  

1  1  0  8409  8409.0 
2  1  1  163  164.0 
3  1  2  10  3.0 
4  2  0  11632  11632.0 
5  2  1  157  157.0 
6  2  2  5  2.0 
7  3  0  12604  12604.0 
8  3  1  113  113.0 
9  3  2  5  1.0 
10  4  0  24626  24626.0 
11  4  1  184  185.0 
12  4  2  6  1.0 
13  5  0  2368  2368.0 
14  5  1  9  9.0 
15  6  0  3867  3867.0 
16  6  1  16  16.0 
17  6  2  1  0.0 
18  7  0  372  372.0 
19  7  1  1  1.0 
You get the idea....
If we can't really see very much at the individual claims level, how does it appear when we look at the levels aggregated to the current tariff cells? Here, of course, we have much more data with up to 33 claims in one cell, but we will limit the display to the first few number of claims.
high.limit < 7L # Only display up to this many claims plots < lapply(1:4, function(i) ggplot(mccase.current, aes(antskad)) + geom_histogram(breaks = 0:high.limit, aes(weight = duration)) + scale_x_discrete(limits = c(0, high.limit), breaks = 0:high.limit) ## Following xlim() needed for scale_x_discrete only; see ## https://groups.google.com/d/msg/ggplot2/wLWGCUz8K6k/DeVudyfXyKgJ + xlim(0, high.limit) + scale_y_log10() + facet_grid(paste("rating.", i, " ~ .", sep = ""), scales = "fixed") ## We drop the axis titles to make more room for the data + opts(axis.title.x = theme_blank(), axis.title.y = theme_blank(), axis.text.x = theme_blank(), axis.text.y = theme_blank(), axis.title.y = theme_blank(), axis.ticks = theme_blank(), title = plot.titles[i]) ) grid.newpage() pushViewport(viewport(layout = grid.layout(nrow = 1, ncol = 4))) for (i in 1:4) print(plots[[i]], vp = viewport(layout.pos.col = i))
There is certainly something going on for low bonus classes and high vehicle ages that is not straightforward Poisson distribution. Looking at the two in combination we see the old motorcycle  low bonus class (rating.3 = 3 and rating.4 = 1) is the main culpit, but also note that there is relatively little data in these cells.
ggplot(mccase.current, aes(antskad)) + geom_histogram(breaks = 0:high.limit, aes(weight = duration)) + scale_x_discrete(limits = c(0, high.limit), breaks = 0:high.limit) + xlim(0, high.limit) + scale_y_log10() + facet_grid(rating.4 ~ rating.3, scales = "fixed", labeller = "label_both")
We can fit the poisson distribution using the same predictionPoisson() function as before. We also look at the fit in a simple way (the  much!  better approach is to use the residuals from the fitted model)
## rating.3 is the vehicle age. data < mccase.current[order(rating.3, antskad), list(N = .N, w = sum(duration)), by = list(rating.3, antskad)] data$predicted < unlist(lapply(levels(data$rating.3), function (l) round(predictionPoisson(data[rating.3 == l])))) ## Show the fit print(data[antskad <= high.limit, list(rating.3, antskad, N, predicted)], digits = 1) ## Show simplistic residuals per level of the rating factor print(data[antskad <= high.limit, list(res = sum(abs(predicted  N))), by = rating.3]) ## rating.4 is the bonus class data < mccase.current[order(rating.4, antskad), list(N = .N, w = sum(duration)), by = list(rating.4, antskad)] data$predicted < unlist(lapply(levels(data$rating.4), function (l) round(predictionPoisson(data[rating.4 == l])))) ## Show the fit print(data[antskad <= high.limit, list(rating.4, antskad, N, predicted)], digits = 1) ## Show simplistic residuals per rating factor print(data[antskad <= high.limit, list(res = sum(abs(predicted  N))), by = rating.4])
We will not show the tables here (as we said: there is a better way). Instead, we can show the fitted Poisson distribution:
ggplot(data, aes(x = antskad, y = N)) + geom_bar(breaks = 0L:high.limit, stat = "identity") + facet_wrap( ~ rating.4) + geom_line(aes(y = predicted), data = data, colour = "red", size = 1) + xlim(0.5, high.limit + 0.5)
2. Duration
We can plot the distribution of duration as before. Here 440 is the 90% quantile for the duration (412) rounded up to the next multiplier of binwidth.
plots < lapply(1:4, function(i) ggplot(mccase.current, aes(duration)) + geom_histogram(binwidth = 40) + xlim(0, 440) + scale_y_log10() + facet_grid(paste("rating.", i, " ~ .", sep = ""), scales = "fixed") ## We drop the axis titles to make more room for the data + opts(axis.title.x = theme_blank(), axis.title.y = theme_blank(), axis.text.x = theme_blank(), axis.text.y = theme_blank(), axis.title.y = theme_blank(), axis.ticks = theme_blank(), title = plot.titles[i]) ) grid.newpage() pushViewport(viewport(layout = grid.layout(nrow = 1, ncol = 4))) ## We can ignore the warnings from displaying the plots for now for (i in 1:4) print(plots[[i]], vp = viewport(layout.pos.col = i))
However, in this case it is much more illuminating to look at the overall distribution:
ggplot(mccase, aes(duration)) + geom_histogram(binwidth = 0.05) + xlim(0, 3) + opts(title = "Duration of motorcycle policies") + annotate("text", 3.0, 1e4, label = "Histogram bin width = 0.05", size = 3, hjust = 1, vjust = 1)
Big spikes near 1/2 and 1 year (and in general spikes around 1/2 year multiples) suggests that either (1) this is not a random sample of policies or (2) there is a strong seasonal effect in the time of year people initially sign up for the policy.
Problem 3: Determine the relativities for claim frequency and severity
Determine the relativities for claim frequency and severity separately, by using GLMs; use the results to get relativities for the pure premium.
We first load the data (if needed) and create a data frame to hold the output, following the structure of Table 2.8 in the book.
library("data.table") if (!exists("mccase.current")) load("mccase.current.RData") case.2.4 < data.frame(rating.factor = c(rep("Zone", nlevels(mccase.current$rating.1)), rep("MC class", nlevels(mccase.current$rating.2)), rep("Vehicle age", nlevels(mccase.current$rating.3)), rep("Bonus class", nlevels(mccase.current$rating.4))), class = with(mccase.current, c(levels(rating.1), levels(rating.2), levels(rating.3), levels(rating.4))), ## These are the values from Table 2.8 in the book: relativity = c(7.678, 4.227, 1.336, 1.000, 1.734, 1.402, 1.402, 0.625, 0.769, 1.000, 1.406, 1.875, 4.062, 6.873, 2.000, 1.200, 1.000, 1.250, 1.125, 1.000), stringsAsFactors = FALSE) print(case.2.4, digits = 3)
rating.factor  class  relativity  

1  zone  1  7.678 
2  zone  2  4.227 
3  zone  3  1.336 
4  zone  4  1.000 
5  zone  5  1.734 
6  zone  6  1.402 
7  zone  7  1.402 
8  mc class  1  0.625 
9  mc class  2  0.769 
10  mc class  3  1.000 
11  mc class  4  1.406 
12  mc class  5  1.875 
13  mc class  6  4.062 
14  mc class  7  6.873 
15  vehicle age  1  2.000 
16  vehicle age  2  1.200 
17  vehicle age  3  1.000 
18  bonus class  1  1.250 
19  bonus class  2  1.125 
20  bonus class  3  1.000 
Close enough to the book.
First we set the contrasts so the baseline for the models is the level with the highest duration. This is the same approach we used before.
library("foreach") new.cols < foreach (rating.factor = paste("rating", 1:4, sep = "."), .combine = rbind) %do% { totals < mccase.current[, list(D = sum(duration), N = sum(antskad), C = sum(skadkost)), by = rating.factor] n.levels < nlevels(mccase.current[[rating.factor]]) contrasts(mccase.current[[rating.factor]]) < contr.treatment(n.levels)[rank(totals[["D"]], ties.method = "first"), ] data.frame(duration = totals[["D"]], n.claims = totals[["N"]], skadkost = totals[["C"]]) } case.2.4 < cbind(case.2.4, new.cols) rm(new.cols)
We next fit the models and combine the results as in Example 2.5 above. Here we also make a note of a simple measure of the goodness of fit: the residual deviance and the degrees of freedom. The frequency model is a good fit, the severity model is not.
## Model the frequency model.frequency < glm(antskad ~ rating.1 + rating.2 + rating.3 + rating.4 + offset(log(duration)), data = mccase.current[duration > 0], family = poisson) ## Res. dev. 360 on 389 dof rels < coef( model.frequency ) rels < exp( rels[1] + rels[1] ) / exp( rels[1] ) case.2.4$rels.frequency < c(c(1, rels[1:6])[rank(case.2.4$duration[1:7], ties.method = "first")], c(1, rels[7:12])[rank(case.2.4$duration[8:14], ties.method = "first")], c(1, rels[13:14])[rank(case.2.4$duration[15:17], ties.method = "first")], c(1, rels[15:16])[rank(case.2.4$duration[18:20], ties.method = "first")]) ## Model the severity. We stick with the noncanonical link function ## for the time being. model.severity < glm(skadkost ~ rating.1 + rating.2 + rating.3 + rating.4, data = mccase.current[skadkost > 0,], family = Gamma("log"), weights = antskad) ## Res.dev. 516 on 164 dof rels < coef( model.severity ) rels < exp( rels[1] + rels[1] ) / exp( rels[1] ) case.2.4$rels.severity < c(c(1, rels[1:6])[rank(case.2.4$duration[1:7], ties.method = "first")], c(1, rels[7:12])[rank(case.2.4$duration[8:14], ties.method = "first")], c(1, rels[13:14])[rank(case.2.4$duration[15:17], ties.method = "first")], c(1, rels[15:16])[rank(case.2.4$duration[18:20], ties.method = "first")]) ## Combine the frequency and severity case.2.4$rels.pure.prem < with(case.2.4, rels.frequency * rels.severity)
Finally we convert, save, and compare with the current values.
## Convert to data.table library("data.table") case.2.4 < data.table(case.2.4) ## Save save(case.2.4, file = "case.2.4.RData") ## Compare with current values print(case.2.4[, list(rating.factor, class, duration, n.claims, skadkostK = round(skadkost / 1e3), relativity, rels.pure.prem)], digits = 3)
rating.factor  class  duration  n.claims  skadkostK  relativity  rels.pure.prem  

1  Zone  1  6205.310  183  5540.000  7.678  6.239 
2  Zone  2  10103.090  167  4811.000  4.227  3.179 
3  Zone  3  11676.573  123  2523.000  1.336  0.993 
4  Zone  4  32628.493  196  3775.000  1.000  1.000 
5  Zone  5  1582.112  9  105.000  1.734  0.155 
6  Zone  6  2799.945  18  288.000  1.402  0.223 
7  Zone  7  241.288  1  1.000  1.402  0.002 
8  MC class  1  5190.351  46  993.000  0.625  0.395 
9  MC class  2  3990.115  57  883.000  0.769  0.536 
10  MC class  3  21665.679  166  5372.000  1.000  1.000 
11  MC class  4  11739.882  98  2192.000  1.406  0.574 
12  MC class  5  13439.926  149  3297.000  1.875  1.413 
13  MC class  6  8880.134  175  4161.000  4.062  4.876 
14  MC class  7  330.723  6  145.000  6.873  0.602 
15  Vehicle age  1  4955.403  126  4964.000  2.000  4.715 
16  Vehicle age  2  9753.811  145  5507.000  1.200  2.021 
17  Vehicle age  3  50527.597  426  6570.000  1.000  1.000 
18  Bonus class  1  19893.370  207  4558.000  1.250  0.797 
19  Bonus class  2  9615.764  121  3627.000  1.125  0.604 
20  Bonus class  3  35727.677  369  8857.000  1.000  1.000 
It is really rather different from the current relativity, as we will also see below. But remember that the severity model was a poor fit: that would be my first place to start looking if staying within the insurance pricing framework outlined in this book.
Problem 4: Discussions
Just note here that the ratio for the relativity we calculated and for the existing one can be found as
with(case.2.4, max(rels.pure.prem) / min(rels.pure.prem)) # 3552 with(case.2.4, max(relativity) / min(relativity)) # 12
Note the huge range of relativities in our new model (3552 versus 12).
Jump to comments.
You may also like these posts:

R code for Chapter 1 of NonLife Insurance Pricing with GLM
Insurance pricing is backwards and primitive, harking back to an era before computers. One standard (and good) textbook on the topic is NonLife Insurance Pricing with Generalized Linear Models by Esbjorn Ohlsson and Born Johansson. We have been doing some work in this area recently. Needing a robust internal training course and documented methodology, we have been working our way through the book again and converting the examples and exercises to R , the statistical computing and analysis platform. This is part of a series of posts containing elements of the R code.

Area Plots with Intensity Coloring
I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that. Unfortunately, lines(..., type=l) does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary. We also get a nice opportunity to use the underappreciated read.fwf function.

Feature selection: Using the caret package
Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. In a previous post we looked at allrelevant feature selection using the Boruta package while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his.

Employee productivity as function of number of workers revisited
We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE100 constituent companies and find that the relation still holds four years later and across a continent.

For my sins, I have done more than my fair share of analysis in Excel. I am quite capable of building and maintaining 130Mb spreadsheets (I had a dozen of them for one client). Excel is pretty much installed everywhere, so it is sometimes the only way to …
Rbloggers.com offers daily email updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...