I’m a cowboy…

[This article was first published on aRsing about in R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


OK, I concede, that was provocative – but hey, would it have caught your attention if I provided the more apt title, “data wrangling for reserving actuaries”? Probably not.

In terms of the reserving actuary’s role, the ChainLadder package in R by Markus Gesmann et al. is a real boon – see here.  In this post I will indicate how to use various functions to visualise your claims data and/or reformat it depending on the situation at hand.  In many respects this post follows guidance provided in the vignette to ChainLadder;  however, the post extends things in some areas and closes with an example of a more customised visualisation using the ggplot2 package.


Let’s get started.  First up, a number of code libraries (called packages), must be accessible by R.  To achieve this, we use the following code:


Displaying claims development triangles

ChainLadder comes with a number of inbuilt data sets; let’s look at the RAA data set, which relates to historic claims sourced from the Reinsurance Association of America:

#load triangle:
plot(RAA/1e3,lattice=TRUE, main="RAA Data by Accident Year (USD000s)",
xlab="Dev Yrs")

This returns the annual-annual triangle

origin    1     2     3     4     5     6     7     8     9    10
  1981 5012  8269 10907 11805 13539 16181 18009 18608 18662 18834
  1982  106  4285  5396 10666 13782 15599 15496 16169 16704    NA
  1983 3410  8992 13873 16141 18735 22214 22863 23466    NA    NA
  1984 5655 11555 15766 21266 23425 26083 27067    NA    NA    NA
  1985 1092  9565 15836 22169 25955 26180    NA    NA    NA    NA
  1986 1513  6445 11702 12935 15852    NA    NA    NA    NA    NA
  1987  557  4020 10946 12314    NA    NA    NA    NA    NA    NA
  1988 1351  6947 13112    NA    NA    NA    NA    NA    NA    NA
  1989 3133  5395    NA    NA    NA    NA    NA    NA    NA    NA
  1990 2063    NA    NA    NA    NA    NA    NA    NA    NA    NA

and an associated chart, depicting the development of individual years’ claims:


The term “annual-annual” arises because the (re)insurer’s loss data has been arranged based on the year in which the claims originated (denoted “origin” above and running down the page) and how the aggregate value of claims developed over successive one year time steps (denoted “dev” above and running across the page).

At first glance, the chart indicates that the loss progression for each year (i.e. 1981 through 1989 say) is smooth. This is good; however, and perhaps unsurprisingly for reinsurance data, it seems like only the oldest origin year (i.e. 1981) has reached something approaching a flat-line profile with claims from that origin year settling at their “ultimate” value.  This remark is significant because the reserving actuary’s primary goal is to forecast the ultimate value of claims for each origin year.

Slice and dice…

One issue with the above triangle is the NAs.  R needs these to ensure no errors in onward processing, e.g. when running algorithms to determine development factors from one time step to another. While this is fine in a computational sense, it’s unlikely to win you fans when presenting at Board level, say. Similarly, we may need:

  • triangles showing incremental movements (in order to understand typical cash flows in each time step), or
  • to recast our output in a “long-format” more suitable for pivot or database tables.

The good news is that all of these situations can be accommodated easily:

#convert cumulative results to incremental
( x = cum2incr(RAA) )
#Make it pretty, aka. NAs begone!
print(x, na.print="")
#convert incremental data to cumulative data:
( y = incr2cum(x) )
#check - this will return a triangle of zeroes as expected:
( RAA-y )
#convert triangles to flat data format
(pivot.format = as.data.frame(RAA))

For brevity, I only show the result of the first line of code here – i.e. the incremental triangle:

origin    1    2    3    4    5    6    7   8   9  10
  1981 5012 3257 2638  898 1734 2642 1828 599  54 172
  1982  106 4179 1111 5270 3116 1817 -103 673 535    
  1983 3410 5582 4881 2268 2594 3479  649 603        
  1984 5655 5900 4211 5500 2159 2658  984            
  1985 1092 8473 6271 6333 3786  225                 
  1986 1513 4932 5257 1233 2917                      
  1987  557 3463 6926 1368                           
  1988 1351 5596 6165                                
  1989 3133 2262                                     
  1990 2063  

The role of the actuary in claims forecasting

Upon notification of a loss, (re)insurers generally establish a “case estimate”.  This figure details the claims handler’s view on how much additional money is needed to settle the claim taking into account specific details of the loss. Given that case estimates are available, non-insurance readers may wonder why actuaries are needed by insurance companies. In short, there are three areas of uncertainty:

  1. Firstly, a natural delay exists between the timing of a claim event and it being reported to the (re)insurer.  If the (re)insurer relies solely on case estimates, it is likely that they will misrepresent the total value of claims liabilities on their balance sheet.
  2. Secondly, it is almost impossible for a claims handler to forecast case estimates with 100% accuracy.  As a result, a method is needed to account for potential inaccuracies in the overall case estimates held.
  3. Lastly, claims may get reopened – e.g. in response to changes in the legal environment or due to the discovery of new information.

Allowance for all three areas of uncertainty is shown on the (re)insurer’s balance sheet under the “incurred but not reported” (or IBNR) line item. To establish the value of IBNR, reserving actuaries often project triangles to ultimate using payment only amounts as well as triangles containing the sum of the paid and case estimates – aka the “incurred” or “reported” loss data.

While I like the look of the chart outputs available from ChainLadder, they only return a graphical view of paid or incurred claims results.  In practice, I often want to see the paid, incurred and ultimate loss forecasts all at the same time.

In this way, it is possible to grasp immediately the size of the case estimates (i.e. the incurred line less the paid line) as well as the IBNR projected (i.e. ultimate line less the incurred line).

The panel plot command in ChainLadder relies on the lattice package to render the graphics above.  While it’s possible to alter the way in which the call to lattice is made, I prefer using ggplot2 for this purpose.

Go on; have your cake and eat it…

The RAA dataset only provides a view on one set of losses – i.e. incurred or paid.  To make an example that’s more realistic, I have opted to use a different data set from the US Casualty Actuarial Society.  The file provides details on the progression of motor losses for a host of individual US insurers. Let’s download the data and then select results for a specific insurer, State Farm Mutual:

mydat = fread('http://www.casact.org/research/reserve_data/ppauto_pos.csv')

#if you had problems using the prior line, which directly reads data in, download  the csv to your working directory then run the following code instead:

#Home-in on one company and pluck-out columns of interest:
 StateFarm = subset(mydat,subset = mydat$GRNAME=="State Farm Mut Grp")
 StateFarm.Data = StateFarm[,c(2:3,5:8,11)]
 StateFarm.Data$AccidentYear = as.numeric(StateFarm.Data$AccidentYear)
 StateFarm.Data$DevelopmentLag = as.numeric(StateFarm.Data$DevelopmentLag)
 head(StateFarm.Data) # check results - shows first 6 rows only

I chose State Farm Mutual Group as it was one of the largest companies in the sample. As a result of the above code, we now have our data in flat table or pivot format:

GRNAME AccidentYear DevelopmentLag IncurLoss_B CumPaidLoss_B BulkLoss_B EarnedPremNet_B
1601 State Farm Mut Grp         1988              1     6906902       2439272    2143806         7809394
1602 State Farm Mut Grp         1988              2     6943321       4722902     887402         7809394
1603 State Farm Mut Grp         1988              3     6919414       5705646     567865         7809394
1604 State Farm Mut Grp         1988              4     6903460       6238289     304773         7809394
1605 State Farm Mut Grp         1988              5     6877558       6519491     168709         7809394
1606 State Farm Mut Grp         1988              6     6870358       6677426      97305         7809394

Using this flat table we can triangulate the results using the as.triangle  command. However, if you do this, it becomes apparent that the data includes claims development over 10 years for each origin period. In reality our most recent years have less data. To make the example more consistent with the typical situation, and to prepare the data for ggplot2 the following changes were made: first, the data was converted to triangular format; then values post-1997 calendar year were removed and the information was then recast in flat table format. Two final steps were then applied, first an assumption was made that each origin year will run at a 90% loss ratio – i.e. US$90 will be paid out by State Farm Mutual for each US$100 they receive – and lastly, the paid, incurred and ultimate information was combined into a single file before tidying up. These steps are captured below:

#now we convert this into a format that ChainLadder likes...
 (paid.tri = as.triangle(StateFarm.Data,origin="AccidentYear",dev="DevelopmentLag",value="CumPaidLoss_B"))
 (inc.tri = as.triangle(StateFarm.Data,origin="AccidentYear",dev="DevelopmentLag",value="IncurLoss_B"))

 #lastly overwrite data values so it appears as if we're working with data as at 1997 year end - i.e. the normal situation
 paid.tri[row(paid.tri)+col(paid.tri)>nrow(paid.tri)+1] = NA
 inc.tri[row(inc.tri)+col(inc.tri)>nrow(inc.tri)+1] = NA

 #convert triangles to frame format and purge N/As
 paid.frame = as.data.frame(paid.tri)
 paid.frame$type = "paid"
 paid.frame = subset(paid.frame,subset = paid.frame$value!="NA")

 inc.frame = as.data.frame(inc.tri)
 inc.frame$type = "incurred"
 inc.frame = subset(inc.frame,subset = inc.frame$value!="NA")

 #in practical situations we are likely to have a view on the ultimate loss cost.  For this blogpost, let's suppose that we want to run with a 90% loss ratio for all years
 ULR = 0.9
 net.prem = unique(StateFarm.Data$EarnedPremNet_B)
 net.ult = data.frame(cbind(AccidentYear=1988:1997,Ult=ULR*net.prem))

 #create a frame showing the ultimates is the same format as the paid and incurred information - good tip here on vlookup approach, https://www.rforexcelusers.com/vlookup-in-r/
 ult.frame = merge(net.ult[, c("AccidentYear", "Ult")],
 inc.frame[, c("AccidentYear", "DevelopmentLag","type")])

 #reorder and rename columns, then override the type
 ult.frame = ult.frame[,c(1,3,2,4)]
 colnames(ult.frame) = colnames(inc.frame)
 ult.frame$type = "Ult"
 ult.frame = ult.frame[ order(ult.frame$AccidentYear, ult.frame$DevelopmentLag), ]

 #combine the three frames together:
 data.combined = data.frame(rbind(paid.frame,inc.frame,ult.frame))

 #clean up temp variables

Having performed all of that manipulation, we can finally punch-in our charting code:

p = ggplot(data=data.combined, aes(x=DevelopmentLag, y = value/1e6, color=type)) +
 geom_point()+geom_line() + facet_wrap(~AccidentYear)

 #ok, let's improve the output in a few ways - firstly, limits and ticks
 p = p+ylim(0,20)
 p = p+ scale_x_continuous(breaks=seq(0, 10, 2)) # Ticks from 0-10, every 2

 p = p+labs(title = "State Farm Mutual - Motor Loss Development",
 subtitle = "Incurred losses show downward development, highlighting redundancy of case reserves",
 caption = "Source: http://www.casact.org/research/reserve_data/ppauto_pos.csv",
 x = "Development Yr", y = "USDm") 

 p = p+theme(axis.text=element_text(size=8),

 #color scale
 p = p+ scale_colour_brewer(palette = "Dark2") #other options are available here: https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf

to generate the intended visualisation including details of specific notes for the reader’s interest:


The final output shows that the incurred loss position generally reduces with time.  In other words, it appears as though State Farm Mutual’s case estimates are prudent and there is support for posting negative IBNR.  In a future post, I will discuss how one can derive a more appropriate ultimate value than just applying a flat 90% loss ratio assumption for all years.

Closing thoughts

From my own experience, while Excel offers fast and easy ways to produce initial visualisations, if you have multiple charts it is painful to arrange them in a panel and, typically, each chart requires separate and sequential updates.  While automating is possible using Excel VBA, I find that the syntax for chart objects is tricky particularly when dealing with series.collections.  In contrast, R has no issues with panel plots, tends to be more resilient and is very customisable.

So, that’s me signing off. I leave you with the following thoughts on data and the imperfect world we deal with as actuaries… lastly,  if you’d like a copy of the full R script, please visit here

“Life is made up of a series of judgments on insufficient data, and if we waited to run down all our doubts, it would flow past us.” – Learned Hand

To leave a comment for the author, please follow the link and comment on their blog: aRsing about in R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)