Insurance pricing is backwards and primitive, harking back to an era before computers. One standard (and good) textbook on the topic is NonLife Insurance Pricing with Generalized Linear Models by Esbjorn Ohlsson and Born Johansson (Amazon UK  US). We have been doing some work in this area recently. Needing a robust internal training course and documented methodology, we have been working our way through the book again and converting the examples and exercises to R, the statistical computing and analysis platform. This is part of a series of posts containing elements of the R code.
#!/usr/bin/Rscript ## PricingGLM1.r  Code for Chapter 1 of "NonLife Insurance Pricing with GLM" ## Copyright © 2012 CYBAEA Limited (http://www.cybaea.net/) ## @book{ohlsson2010non, ## title={NonLife Insurance Pricing with Generalized Linear Models}, ## author={Ohlsson, E. and Johansson, B.}, ## isbn={9783642107900}, ## series={Eaa Series: Textbook}, ## url={http://books.google.com/books?id=l4rjeflJ\_bIC}, ## year={2010}, ## publisher={Springer Verlag} ## }
With the preliminaries out of the way, let us get started.
Example 1.2
We grab the data for Table 1.2 from the book's web site and store it as an R object with lots of good meta information.
################ ### Example 1.2 con < url("http://www2.math.su.se/~esbj/GLMbook/moppe.sas") data < readLines(con, n = 200L, warn = FALSE, encoding = "unknown") close(con) ## Find the data range data.start < grep("^cards;", data) + 1L data.end < grep("^;", data[data.start:999L]) + data.start  2L table.1.2 < read.table(text = data[data.start:data.end], header = FALSE, sep = "", quote = "", col.names = c("premiekl", "moptva", "zon", "dur", "medskad", "antskad", "riskpre", "helpre", "cell"), na.strings = NULL, colClasses = c(rep("factor", 3), "numeric", rep("integer", 4), "NULL"), comment.char = "") rm(con, data, data.start, data.end) # Cleanup comment(table.1.2) < c("Title: Partial casco moped insurance from Wasa insurance, 19941999", "Source: http://www2.math.su.se/~esbj/GLMbook/moppe.sas", "Copyright: http://www2.math.su.se/~esbj/GLMbook/") ## See the SAS code for this derived field table.1.2$skadfre = with(table.1.2, antskad / dur) ## English language column names as comments: comment(table.1.2$premiekl) < c("Name: Class", "Code: 1=Weight over 60kg and more than 2 gears", "Code: 2=Other") comment(table.1.2$moptva) < c("Name: Age", "Code: 1=At most 1 year", "Code: 2=2 years or more") comment(table.1.2$zon) < c("Name: Zone", "Code: 1=Central and semicentral parts of Sweden's three largest cities", "Code: 2=suburbs and middlesized towns", "Code: 3=Lesser towns, except those in 5 or 7", "Code: 4=Small towns and countryside, except 57", "Code: 5=Northern towns", "Code: 6=Northern countryside", "Code: 7=Gotland (Sweden's largest island)") comment(table.1.2$dur) < c("Name: Duration", "Unit: year") comment(table.1.2$medskad) < c("Name: Claim severity", "Unit: SEK") comment(table.1.2$antskad) < "Name: No. claims" comment(table.1.2$riskpre) < c("Name: Pure premium", "Unit: SEK") comment(table.1.2$helpre) < c("Name: Actual premium", "Note: The premium for one year according to the tariff in force 1999", "Unit: SEK") comment(table.1.2$skadfre) < c("Name: Claim frequency", "Unit: /year") ## Save results for later save(table.1.2, file = "table.1.2.RData") ## Print the table (not as pretty as the book) print(table.1.2) ################
premiekl  moptva  zon  dur  medskad  antskad  riskpre  helpre  skadfre  

1  1  1  1  62.9000  18256  17  4936  2049  0.2703 
2  1  1  2  112.9000  13632  7  845  1230  0.0620 
3  1  1  3  133.1000  20877  9  1411  762  0.0676 
4  1  1  4  376.6000  13045  7  242  396  0.0186 
5  1  1  5  9.4000  0  0  0  990  0.0000 
6  1  1  6  70.8000  15000  1  212  594  0.0141 
7  1  1  7  4.4000  8018  1  1829  396  0.2273 
8  1  2  1  352.1000  8232  52  1216  1229  0.1477 
9  1  2  2  840.1000  7418  69  609  738  0.0821 
10  1  2  3  1378.3000  7318  75  398  457  0.0544 
11  1  2  4  5505.3000  6922  136  171  238  0.0247 
12  1  2  5  114.1000  11131  2  195  594  0.0175 
13  1  2  6  810.9000  5970  14  103  356  0.0173 
14  1  2  7  62.3000  6500  1  104  238  0.0161 
15  2  1  1  191.6000  7754  43  1740  1024  0.2244 
16  2  1  2  237.3000  6933  34  993  615  0.1433 
17  2  1  3  162.4000  4402  11  298  381  0.0677 
18  2  1  4  446.5000  8214  8  147  198  0.0179 
19  2  1  5  13.2000  0  0  0  495  0.0000 
20  2  1  6  82.8000  5830  3  211  297  0.0362 
21  2  1  7  14.5000  0  0  0  198  0.0000 
22  2  2  1  844.8000  4728  94  526  614  0.1113 
23  2  2  2  1296.0000  4252  99  325  369  0.0764 
24  2  2  3  1214.9000  4212  37  128  229  0.0305 
25  2  2  4  3740.7000  3846  56  58  119  0.0150 
26  2  2  5  109.4000  3925  4  144  297  0.0366 
27  2  2  6  404.7000  5280  5  65  178  0.0124 
28  2  2  7  66.3000  7795  1  118  119  0.0151 
That was easy. Now for something a little harder.
Example 1.3
Here we are concerned with replicating Table 1.4. We do it slowly, stepbystep, for pedagogical reasons.
################ ### Example 1.3 if (!exists("table.1.2")) load("table.1.2.RData") ## We calculate each of the columns individually and slowly here ## to show each step ## First we have simply the labels of the table rating.factor < with(table.1.2, c(rep("Vehicle class", nlevels(premiekl)), rep("Vehicle age", nlevels(moptva)), rep("Zone", nlevels(zon)))) ## The Class column class.num < with(table.1.2, c(levels(premiekl), levels(moptva), levels(zon))) ## The Duration is the sum of durations within each class duration.total < c(with(table.1.2, tapply(dur, premiekl, sum)), with(table.1.2, tapply(dur, moptva, sum)), with(table.1.2, tapply(dur, zon, sum))) ## Calculate relativities in the tariff ## The denominator of the fraction is the class with the highest exposure ## (i.e. the maximum total duration): we make that explicit with the ## which.max() construct. We also set the contrasts to use this as the base, ## which will be useful for the glm() model later. class.base < which.max(duration.total[1:2]) age.base < which.max(duration.total[3:4]) zone.base < which.max(duration.total[5:11]) rt.class < with(table.1.2, tapply(helpre, premiekl, sum)) rt.class < rt.class / rt.class[class.base] rt.age < with(table.1.2, tapply(helpre, moptva, sum)) rt.age < rt.age / rt.age[age.base] rt.zone < with(table.1.2, tapply(helpre, zon, sum)) rt.zone < rt.zone / rt.zone[zone.base] contrasts(table.1.2$premiekl) < contr.treatment(nlevels(table.1.2$premiekl))[rank(duration.total[1:2], ties.method = "first"), ] contrasts(table.1.2$moptva) < contr.treatment(nlevels(table.1.2$moptva))[rank(duration.total[3:4], ties.method = "first"), ] contrasts(table.1.2$zon) < contr.treatment(nlevels(table.1.2$zon))[rank(duration.total[5:11], ties.method = "first"), ]
The contrasts could also have been set with the base=
argument, e.g. contrasts(table.1.2$zon) < contr.treatment(nlevels(table.1.2$zon), base = zone.base)
, which would be closer in spirit to the SAS code. But I like the idiom presented here where we follow the duration order; it also extends well to other (i.e. not treatment) contrasts. I just wish rank()
had an decreasing=
argument like order()
which I think would be clearer than using rank(x)
to get a decreasing sort order.
That was the easy part. At this stage in the book you are not really expected to understand the next step so do not despair! We just show how easy it is to replicate the SAS code in R. An alternative approach using direct optimization is outlined in Exercise 1.3 below.
## Relativities of MMT; we use the glm approach here as per the book's ## SAS code at http://www2.math.su.se/~esbj/GLMbook/moppe.sas m < glm(riskpre ~ premiekl + moptva + zon, data = table.1.2, family = poisson("log"), weights = dur) ## If the next line is a mystery then you need to ## (1) read up on contrasts or ## (2) remember that the link function is log() which is why we use exp here rels < exp( coef(m)[1] + coef(m)[1] ) / exp(coef(m)[1]) rm.class < c(1, rels[1]) # See rm.zone below for the rm.age < c(rels[2], 1) # general approach rm.zone < c(1, rels[3:8])[rank(duration.total[5:11], ties.method = "first")] ## Create and save the data frame table.1.4 < data.frame(Rating.factor = rating.factor, Class = class.num, Duration = duration.total, Rel.tariff = c(rt.class, rt.age, rt.zone), Rel.MMT = c(rm.class, rm.age, rm.zone)) save(table.1.4, file = "table.1.4.RData") print(table.1.4, digits = 3) rm(rating.factor, class.num, duration.total, class.base, age.base, zone.base, rt.class, rt.age, rt.zone, rm.class, rm.age, rm.zone, m, rels) ################
The result is something like this:
Rating.factor  Class  Duration  Rel.tariff  Rel.MMT  

1  Vehicle class  1  9833.20  1.00  1.00 
2  Vehicle class  2  8825.10  0.50  0.43 
3  Vehicle age  1  1918.40  1.67  2.73 
4  Vehicle age  2  16739.90  1.00  1.00 
5  Zone  1  1451.40  5.17  8.97 
6  Zone  2  2486.30  3.10  4.19 
7  Zone  3  2888.70  1.92  2.52 
8  Zone  4  10069.10  1.00  1.00 
9  Zone  5  246.10  2.50  1.24 
10  Zone  6  1369.20  1.50  0.74 
11  Zone  7  147.50  1.00  1.23 
Note the rather unusual and apparently inconsistent rounding in the book: 147, 1.66, and 5.16 would be better as 148 (the value is 147.5), 1.67, and 5.17.
Exercise 1.3
Here it gets interesting as we get a different value from the authors. Possibly a small bug on our part but at least we provide the code for you to check. So if you spot a problem let us know in the comments.
################ ## Exercise 1.3 ## The values from the book g0 < 0.03305 g12 < 2.01231 g22 < 0.74288 dim.names < list(Milage = c("Low", "High"), Age = c("New", "Old")) pyears < matrix(c(47039, 56455, 190513, 28612), nrow = 2, dimnames = dim.names) claims < matrix(c(0.033, 0.067, 0.025, 0.049), nrow = 2, dimnames = dim.names) ## Function to calculate the error of the estimate GvalsError < function (gvals) { ## The current estimates g0 < gvals[1] g12 < gvals[2] g22 < gvals[3] ## The current estimates in convenient matrix form G < matrix(c(1, 1, g12, g22), nrow = 2) G1 < matrix(c(1, g12), nrow = 2, ncol = 2) G2 < matrix(c(1, g22), nrow = 2, ncol = 2, byrow = TRUE) ## The calculated values G0 < addmargins(claims * pyears)["Sum", "Sum"] / ( sum(pyears * G1 * G2) ) G12 < addmargins(claims * pyears)["High", "Sum"] / ( g0 * addmargins(pyears * G2)["High", "Sum"] ) G22 < addmargins(claims * pyears)["Sum", "Old"] / ( g0 * addmargins(pyears * G1)["Sum", "Old"] ) ## The sum of squared errors error < (g0  G0)^2 + (g12  G12)^2 + (g22  G22)^2 return(error) } ## Minimize the error function to obtain our estimate gamma < optim(c(g0, g12, g22), GvalsError) stopifnot(gamma$convergence == 0) gamma < gamma$par values < data.frame(legend = c("Our calculation", "Book value"), g0 = c(gamma[1], g0), g12 = c(gamma[2], g12), g22 = c(gamma[3], g22), row.names = "legend") print(values, digits = 4) ## Close, but not the same. rm(g0, g12, g22, dim.names, pyears, claims, gamma, values) ################
The resulting table is something like:
g0  g12  g22  

Our calculation  0.0334  1.9951  0.7452 
Book value  0.0331  2.0123  0.7429 
Close, but not the same. Perhaps they used a different error function.
Jump to comments.
You may also like these posts:

R code for Chapter 2 of NonLife Insurance Pricing with GLM
We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as NonLife Insurance Pricing with Generalized Linear Models by Esbjörn Ohlsson and Börn Johansson (Amazon UK  US ). At this stage, our purpose is to reproduce the analysis from the book using the R statistical computing and analysis platform, and to answer the data analysis elements of the exercises and case studies. Any critique of the approach and of pricing and modeling in the Insurance industry in general will wait for a later article.

Area Plots with Intensity Coloring
I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that. Unfortunately, lines(..., type=l) does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary. We also get a nice opportunity to use the underappreciated read.fwf function.

Feature selection: Using the caret package
Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. In a previous post we looked at allrelevant feature selection using the Boruta package while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his.

Excel Tip: Array boolean operator
I learn something new every day. Thinking I knew pretty much everythging there is to know about Microsofts Excel spreadsheet application, I was surprised to see that you could turn any array into a boolean array depending on a condition by simply writing ( array = value ) , as in these examples: (A1:A10=foo) SUMPRODUCT((B2:B6=B10)*1, C2:C6) This works in Gnumeric but not in OpenOffice 1.4. More notes and examples below.

Employee productivity as function of number of workers revisited
We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE100 constituent companies and find that the relation still holds four years later and across a continent.
Rbloggers.com offers daily email updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...