I’m a cowboy…
OK, I concede, that was provocative – but hey, would it have caught your attention if I provided the more apt title, “data wrangling for reserving actuaries”? Probably not.
In terms of the reserving actuary’s role, the ChainLadder package in R by Markus Gesmann et al. is a real boon – see here. In this post I will indicate how to use various functions to visualise your claims data and/or reformat it depending on the situation at hand. In many respects this post follows guidance provided in the vignette to ChainLadder; however, the post extends things in some areas and closes with an example of a more customised visualisation using the ggplot2 package.
Let’s get started. First up, a number of code libraries (called packages) must be accessible to R. To achieve this, we use the following code:
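The original listing for this step didn’t survive the page extraction; a minimal setup, assuming the three packages used later in the post (ChainLadder, data.table for fread, and ggplot2) are already installed:

```r
#install.packages(c("ChainLadder", "data.table", "ggplot2"))  #run once if not yet installed
library(ChainLadder)  #reserving triangles and related utilities
library(data.table)   #provides fread() for fast csv import
library(ggplot2)      #used for the customised panel plot later in the post
```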
Displaying claims development triangles
ChainLadder comes with a number of inbuilt data sets; let’s look at the RAA data set, which relates to historic claims sourced from the Reinsurance Association of America:
#load triangle:
RAA

#view
plot(RAA/1e3, lattice=TRUE, main="RAA Data by Accident Year (USD000s)", xlab="Dev Yrs")
This returns the annual-annual triangle
      dev
origin     1     2     3     4     5     6     7     8     9    10
  1981  5012  8269 10907 11805 13539 16181 18009 18608 18662 18834
  1982   106  4285  5396 10666 13782 15599 15496 16169 16704    NA
  1983  3410  8992 13873 16141 18735 22214 22863 23466    NA    NA
  1984  5655 11555 15766 21266 23425 26083 27067    NA    NA    NA
  1985  1092  9565 15836 22169 25955 26180    NA    NA    NA    NA
  1986  1513  6445 11702 12935 15852    NA    NA    NA    NA    NA
  1987   557  4020 10946 12314    NA    NA    NA    NA    NA    NA
  1988  1351  6947 13112    NA    NA    NA    NA    NA    NA    NA
  1989  3133  5395    NA    NA    NA    NA    NA    NA    NA    NA
  1990  2063    NA    NA    NA    NA    NA    NA    NA    NA    NA
and an associated chart, depicting the development of individual years’ claims:
The term “annual-annual” arises because the (re)insurer’s loss data has been arranged based on the year in which the claims originated (denoted “origin” above and running down the page) and how the aggregate value of claims developed over successive one year time steps (denoted “dev” above and running across the page).
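To make the arrangement concrete, here is a toy illustration in base R (hypothetical figures, not the RAA data) showing how incremental claims by origin and development year cumulate across the page:

```r
#toy incremental claims: rows = origin year, cols = development year (hypothetical figures)
inc = matrix(c(100, 50, 10,
               120, 60, NA,
                90, NA, NA),
             nrow = 3, byrow = TRUE,
             dimnames = list(origin = 2021:2023, dev = 1:3))

#cumulate across each row to obtain the cumulative (run-off) triangle
cum = t(apply(inc, 1, cumsum))
cum

#the NAs in the lower-right are development periods not yet observed
```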
At first glance, the chart indicates that the loss progression for each year (i.e. 1981 through 1989, say) is smooth. This is good; however, and perhaps unsurprisingly for reinsurance data, it seems that only the oldest origin year (i.e. 1981) has reached something approaching a flat-line profile, with claims from that origin year settling at their “ultimate” value. This remark is significant because the reserving actuary’s primary goal is to forecast the ultimate value of claims for each origin year.
Slice and dice…
One issue with the above triangle is the NAs. R needs these placeholders to mark development periods that have not yet been observed and to avoid errors in onward processing, e.g. when running algorithms to determine development factors from one time step to the next. While this is fine in a computational sense, it’s unlikely to win you fans when presenting at Board level, say. Similarly, we may need:
- triangles showing incremental movements (in order to understand typical cash flows in each time step), or
- to recast our output in a “long-format” more suitable for pivot or database tables.
The good news is that all of these situations can be accommodated easily:
#convert cumulative results to incremental
( x = cum2incr(RAA) )

#Make it pretty, aka. NAs begone!
print(x, na.print="")

#convert incremental data to cumulative data:
( y = incr2cum(x) )

#check - this will return a triangle of zeroes as expected:
( RAA - y )

#convert triangles to flat data format
( pivot.format = as.data.frame(RAA) )
For brevity, I only show the result of the first line of code here – i.e. the incremental triangle:
      dev
origin    1    2    3    4    5    6    7   8   9  10
  1981 5012 3257 2638  898 1734 2642 1828 599  54 172
  1982  106 4179 1111 5270 3116 1817 -103 673 535
  1983 3410 5582 4881 2268 2594 3479  649 603
  1984 5655 5900 4211 5500 2159 2658  984
  1985 1092 8473 6271 6333 3786  225
  1986 1513 4932 5257 1233 2917
  1987  557 3463 6926 1368
  1988 1351 5596 6165
  1989 3133 2262
  1990 2063
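Under the hood, converting between cumulative and incremental triangles is just row-wise differencing and summing. A base-R sketch of the same round trip, using a hypothetical 3x3 triangle rather than RAA so it runs without ChainLadder:

```r
#hypothetical cumulative triangle (rows = origin, cols = development period)
cum = matrix(c(100, 150, 160,
               120, 180,  NA,
                90,  NA,  NA), nrow = 3, byrow = TRUE)

#cumulative -> incremental: subtract the previous column from each column
inc = cbind(cum[, 1], cum[, -1] - cum[, -ncol(cum)])

#incremental -> cumulative: row-wise running totals restore the original
cum2 = t(apply(inc, 1, cumsum))

all.equal(cum, cum2)  #TRUE - the round trip recovers the original triangle
```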
The role of the actuary in claims forecasting
Upon notification of a loss, (re)insurers generally establish a “case estimate”. This figure reflects the claims handler’s view of how much additional money is needed to settle the claim, taking into account the specific details of the loss. Given that case estimates are available, non-insurance readers may wonder why actuaries are needed by insurance companies. In short, there are three areas of uncertainty:
- Firstly, a natural delay exists between the timing of a claim event and it being reported to the (re)insurer. If the (re)insurer relies solely on case estimates, it is likely that they will misrepresent the total value of claims liabilities on their balance sheet.
- Secondly, it is almost impossible for a claims handler to forecast case estimates with 100% accuracy. As a result, a method is needed to account for potential inaccuracies in the overall case estimates held.
- Lastly, claims may get reopened – e.g. in response to changes in the legal environment or due to the discovery of new information.
Allowance for all three areas of uncertainty is shown on the (re)insurer’s balance sheet under the “incurred but not reported” (or IBNR) line item. To establish the value of IBNR, reserving actuaries often project triangles to ultimate using payment only amounts as well as triangles containing the sum of the paid and case estimates – aka the “incurred” or “reported” loss data.
While I like the look of the chart outputs available from ChainLadder, they only return a graphical view of paid or incurred claims results. In practice, I often want to see the paid, incurred and ultimate loss forecasts all at the same time.
In this way, it is possible to grasp immediately the size of the case estimates (i.e. the incurred line less the paid line) as well as the IBNR projected (i.e. ultimate line less the incurred line).
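In arithmetic terms, with hypothetical figures for a single origin year:

```r
#hypothetical figures for one origin year (not drawn from the RAA or State Farm data)
paid     = 6500   #cumulative payments to date
incurred = 6900   #paid plus case estimates
ultimate = 7000   #actuary's forecast of the final claims cost

case.estimates = incurred - paid      #outstanding case reserves: 400
ibnr           = ultimate - incurred  #incurred but not reported: 100
```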
The panel plot command in ChainLadder relies on the lattice package to render the graphics above. While it’s possible to alter the way in which the call to lattice is made, I prefer using ggplot2 for this purpose.
Go on; have your cake and eat it…
The RAA dataset only provides a view on one set of losses – i.e. incurred or paid. To make the example more realistic, I have opted to use a different data set, from the US Casualty Actuarial Society. The file provides details on the progression of motor losses for a host of individual US insurers. Let’s download the data and then select results for a specific insurer, State Farm Mutual:
mydat = fread('http://www.casact.org/research/reserve_data/ppauto_pos.csv')

#if you had problems with the prior line, which reads the data in directly, download
#the csv to your working directory then run the following code instead:
#mydat = read.csv("ppauto_pos.csv")

#home in on one company and pluck out the columns of interest:
StateFarm = subset(mydat, subset = mydat$GRNAME == "State Farm Mut Grp")
StateFarm.Data = StateFarm[, c(2:3, 5:8, 11)]
StateFarm.Data$AccidentYear = as.numeric(StateFarm.Data$AccidentYear)
StateFarm.Data$DevelopmentLag = as.numeric(StateFarm.Data$DevelopmentLag)

head(StateFarm.Data) #check results - shows first 6 rows only
I chose State Farm Mutual Group as it was one of the largest companies in the sample. As a result of the above code, we now have our data in flat table or pivot format:
                 GRNAME AccidentYear DevelopmentLag IncurLoss_B CumPaidLoss_B BulkLoss_B EarnedPremNet_B
1601 State Farm Mut Grp         1988              1     6906902       2439272    2143806         7809394
1602 State Farm Mut Grp         1988              2     6943321       4722902     887402         7809394
1603 State Farm Mut Grp         1988              3     6919414       5705646     567865         7809394
1604 State Farm Mut Grp         1988              4     6903460       6238289     304773         7809394
1605 State Farm Mut Grp         1988              5     6877558       6519491     168709         7809394
1606 State Farm Mut Grp         1988              6     6870358       6677426      97305         7809394
Using this flat table we can triangulate the results with the as.triangle command. However, doing so reveals that the data includes a full 10 years of claims development for every origin period; in reality, the most recent years would have less data. To make the example consistent with the typical situation, and to prepare the data for ggplot2, the following changes were made:
- first, the data was converted to triangular format;
- then, values beyond the 1997 calendar year were removed and the information was recast in flat table format;
- next, each origin year was assumed to run at a 90% loss ratio – i.e. US$90 will be paid out by State Farm Mutual for each US$100 of premium they receive;
- lastly, the paid, incurred and ultimate information was combined into a single file before tidying up.
These steps are captured below:
#now we convert this into a format that ChainLadder likes...
( paid.tri = as.triangle(StateFarm.Data, origin="AccidentYear", dev="DevelopmentLag", value="CumPaidLoss_B") )
( inc.tri  = as.triangle(StateFarm.Data, origin="AccidentYear", dev="DevelopmentLag", value="IncurLoss_B") )

#overwrite data values so it appears as if we're working with data as at 1997 year end - i.e. the normal situation
paid.tri[row(paid.tri) + col(paid.tri) > nrow(paid.tri) + 1] = NA
inc.tri[row(inc.tri) + col(inc.tri) > nrow(inc.tri) + 1] = NA

#convert triangles to frame format and purge NAs
paid.frame = as.data.frame(paid.tri)
paid.frame$type = "paid"
paid.frame = subset(paid.frame, subset = !is.na(paid.frame$value))

inc.frame = as.data.frame(inc.tri)
inc.frame$type = "incurred"
inc.frame = subset(inc.frame, subset = !is.na(inc.frame$value))

#in practical situations we are likely to have a view on the ultimate loss cost;
#for this blogpost, let's suppose that we want to run with a 90% loss ratio for all years
ULR = 0.9
net.prem = unique(StateFarm.Data$EarnedPremNet_B)
net.ult = data.frame(AccidentYear = 1988:1997, Ult = ULR * net.prem)

#create a frame showing the ultimates in the same format as the paid and incurred information
#good tip here on the vlookup approach: https://www.rforexcelusers.com/vlookup-in-r/
ult.frame = merge(net.ult[, c("AccidentYear", "Ult")],
                  inc.frame[, c("AccidentYear", "DevelopmentLag", "type")])

#reorder and rename columns, then override the type
ult.frame = ult.frame[, c(1, 3, 2, 4)]
colnames(ult.frame) = colnames(inc.frame)
ult.frame$type = "Ult"
ult.frame = ult.frame[order(ult.frame$AccidentYear, ult.frame$DevelopmentLag), ]

#combine the three frames together:
data.combined = data.frame(rbind(paid.frame, inc.frame, ult.frame))

#clean up temp variables
rm(paid.frame, inc.frame, ult.frame)
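The calendar-year cut in the listing above relies on a handy base-R trick: in a square triangle, a cell belongs to a future calendar period exactly when its row index plus its column index exceeds nrow + 1. A standalone toy illustration (hypothetical 4x4 matrix, not the State Farm data):

```r
#toy 4x4 "triangle" filled with dummy values
m = matrix(1:16, nrow = 4, ncol = 4)

#blank out cells beyond the latest diagonal, i.e. future calendar periods
m[row(m) + col(m) > nrow(m) + 1] = NA
m
```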
Having performed all of that manipulation, we can finally punch-in our charting code:
p = ggplot(data = data.combined, aes(x = DevelopmentLag, y = value/1e6, color = type)) +
    geom_point() + geom_line() +
    facet_wrap(~AccidentYear)

#ok, let's improve the output in a few ways - firstly, limits and ticks
p = p + ylim(0, 20)
p = p + scale_x_continuous(breaks = seq(0, 10, 2)) #ticks from 0-10, every 2

p = p + labs(title = "State Farm Mutual - Motor Loss Development",
             subtitle = "Incurred losses show downward development, highlighting redundancy of case reserves",
             caption = "Source: http://www.casact.org/research/reserve_data/ppauto_pos.csv",
             x = "Development Yr", y = "USDm")
p = p + theme(axis.text = element_text(size = 8),
              axis.title = element_text(size = 10, face = "italic"))

#color scale - other options are available here:
#https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf
p = p + scale_colour_brewer(palette = "Dark2")

#display the chart
p
to generate the intended visualisation, complete with the explanatory notes included for the reader’s interest:
The final output shows that the incurred loss position generally reduces with time. In other words, it appears as though State Farm Mutual’s case estimates are prudent and there is support for posting negative IBNR. In a future post, I will discuss how one can derive a more appropriate ultimate value than just applying a flat 90% loss ratio assumption for all years.
From my own experience, while Excel offers fast and easy ways to produce initial visualisations, if you have multiple charts it is painful to arrange them in a panel and, typically, each chart requires separate and sequential updates. While automation is possible using Excel VBA, I find the syntax for chart objects tricky, particularly when dealing with SeriesCollection objects. In contrast, R handles panel plots with ease, tends to be more resilient and is highly customisable.
So, that’s me signing off. I leave you with the following thoughts on data and the imperfect world we deal with as actuaries. Lastly, if you’d like a copy of the full R script, please visit here.
“Life is made up of a series of judgments on insufficient data, and if we waited to run down all our doubts, it would flow past us.” – Learned Hand