Faster Than Excel: Painlessly Merge Data into Actuarial Loss Development Triangles with R


The objective of this post is to create a method which easily combines loss runs, or listings of insurance claims, into triangles. Using only Excel, the common approach is to create links between the Excel files, which must be updated manually for each new evaluation. This is prone to human error and time-consuming. Using a script to merge the files first and then create a triangle is faster and more consistent.

Example of the conventional linked Excel triangle:

For accident year 2016, all the diagonal values need to be linked to a data file that was last evaluated in 2016. For 2015, the links go to the 2015 file; for 2014, the 2014 file; etc. This is time-consuming.
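
For orientation, a paid loss triangle looks roughly like the sketch below (illustrative amounts only, not data from this post): each row is an accident year, each column a maturity, and the latest diagonal comes from the most recent evaluation file.

                Maturity (months)
Accident year     12    24    36
         2014    200   350   400
         2015    150   300
         2016    100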

See Wikipedia for a definition of Loss Development Triangles and why they are useful.

Methods

I use the packages plyr, dplyr, tidyr, and lubridate to manipulate the data once it is loaded into R. The package readxl reads the Excel files from disk. It is worth noting that dplyr, tidyr, lubridate, and readxl all ship with Hadley Wickham’s tidyverse package (plyr is the tidyverse’s predecessor and is loaded separately). The package ChainLadder has a variety of tools for actuarial reserve analysis.

library(plyr)   # load plyr before dplyr so that dplyr's verbs are not masked
library(dplyr)
library(readxl)
library(tidyr)
library(ChainLadder)
library(lubridate)

Step 1: Organize the Excel Files

Copy all of the Excel files into the same folder, inside the working directory of the R project. It is important that this folder contains *only* the files that are going to be loaded into R.

On my computer, my file structure is as follows:

C:/Users/Sam/Desktop/[R project working directory]/[folder with excel claims listing files to aggregate]/[file 1, file 2, file 3, etc]

The command list.files() returns a vector of all the file names R sees in a given directory (called with no argument, it looks in the working directory). For example, R sees the data files in the folder:

wd_path = "C:/Users/Sam/Desktop/projects/Excel Aggregate Loop/excel files"
list.files(wd_path)
## [1] "claims listing 2011.xlsx" "claims listing 2012.xlsx"
## [3] "claims listing 2013.xlsx" "claims listing 2014.xlsx"
## [5] "claims listing 2015.xlsx" "claims listing 2016.xlsx"

R can then loop through each of these files and perform any action. If there were 100 files, or 1000 files in this directory, this would still work.

file_names = list.files(wd_path)
file_paths = paste(wd_path, "/", file_names, sep = "")  # full path to each file

# preview one file, dropping a few columns for display
head(read_excel(file_paths[4]) %>% select(-3, -4, -5))
## # A tibble: 4 x 6
##   name  `license number`   age `file year` `loss date`  paid
##   <chr>            <dbl> <dbl>       <dbl> <date>      <dbl>
## 1 jeff                 3    23        2014 2011-01-01    400
## 2 sue                  3    43        2014 2012-01-01    400
## 3 mark                 2    55        2014 2013-01-01    200
## 4 sarah                1   100        2014 2014-01-01    500
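
As an aside, the paste() step can be skipped entirely: list.files() has a full.names argument that returns the complete paths directly.

# equivalent one-step alternative to the paste() call above
file_paths = list.files(wd_path, full.names = TRUE)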

In order to evaluate the age of the losses, we need to take into account when each loss was evaluated. This is accomplished by going into Excel and adding a column for “file year”, which records the evaluation year of the file. For instance, in the “claims listing 2013” file, all of the claims have “2013” in the “file year” column.
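
If opening every workbook by hand is impractical, the evaluation year can instead be parsed out of the file names in R. This is only a sketch, assuming (as above) that each file name contains the four-digit evaluation year:

# hypothetical alternative: pull the 4-digit year out of each file name
file_years = as.numeric(regmatches(file_names, regexpr("[0-9]{4}", file_names)))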

Step 2: Load the Data into R

Initialize a data frame which will store the aggregated loss run data from each of the Excel files. **The names of this data frame need to be the names of the Excel file columns which need to be aggregated.** For instance, these could be “reported”, “Paid Loss”, “Case Reserves”, or “Incurred Loss”. If the Excel files have different names for the same quantities (e.g., “Paid Loss” vs. “paid loss”), then they should be renamed in Excel first.

# empty data frame whose column names match the Excel columns to aggregate
merged_data = as_data_frame(matrix(nrow = 0, ncol = 3))
names(merged_data) = c("file year", "loss date", "paid")

Someone once said, “if you need to use a ‘for’ loop in R, then you are doing something wrong”. Vectorized functions are faster and easier to implement. The function excel_file_extractor below takes in the path of an Excel file and returns a data frame with only the selected columns.

excel_file_extractor = function(cur_file_name){
  # read one loss run, keep only the columns to aggregate,
  # and append it to the (initially empty) merged data frame
  read_excel(cur_file_name, sheet = "Sheet1", col_names = TRUE) %>%
    select(`file year`, `loss date`, paid) %>%
    rbind(merged_data)
}

Apply the function to all of the files in the folder that you created. If there were 100 Excel files, this would work just as effectively.

From the plyr package, ldply takes in a list and returns a data frame. The way to remember this is by the ordering of the letters (“list”-“data frame”-“ply”). For example, if we wanted to read in a data frame and return a data frame, it would be ddply.

loss_run_data = ldply(file_paths, excel_file_extractor)
names(loss_run_data) = c("file_year", "loss_date", "paid losses")
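
As an aside (not part of the original workflow), the same merge can be written with purrr, the tidyverse’s iteration package:

# hypothetical tidyverse equivalent of the ldply() call above
library(purrr)
loss_run_data = map_df(file_paths, excel_file_extractor)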

The data now has only the columns that we selected, despite the fact that the loss run files each had different columns.

glimpse(loss_run_data)
## Variables: 3
## $ file_year   <dbl> 2011, 2012, 2012, 2013, 2013, 2013, 2014, 2014, 20...
## $ loss_date   <date> 2011-01-01, 2011-01-01, 2012-01-01, 2011-01-01, 2...
## $ paid losses <dbl> 100, 200, 300, 300, 350, 100, 400, 400, 200, 500, ...



head(loss_run_data)
##   file_year  loss_date paid losses
## 1      2011 2011-01-01         100
## 2      2012 2011-01-01         200
## 3      2012 2012-01-01         300
## 4      2013 2011-01-01         300
## 5      2013 2012-01-01         350
## 6      2013 2013-01-01         100

Step 3: Create Development Triangles

Finally, once we have the loss runs combined, we just need to create a triangle. This is made easy by the `as.triangle` function from the `ChainLadder` package.

The only manual labor required in Excel was to go into each file and create the `file year` column, which was just the year of evaluation of each loss run file.

loss_run_data = loss_run_data %>%
  mutate(accident_year = as.numeric(year(ymd(loss_date))),
         # age of each accident year as of the file's evaluation date
         maturity_in_months = (file_year - accident_year)*12)

merged_triangle = as.triangle(loss_run_data,
                              dev = "maturity_in_months",
                              origin = "accident_year",
                              value = "paid losses")

merged_triangle
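
The triangle now holds cumulative paid losses by accident year and maturity, and the rest of the ChainLadder toolkit can take over from here. For example, its ata() function computes the age-to-age development factors from a triangle (a quick sketch; the numeric results depend on your data):

# age-to-age (link) ratios between successive maturities
ata(merged_triangle)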

Within the package `ChainLadder` is a plot method for triangles, which shows the development of losses by accident year. Because these are arbitrary amounts, the plot is not realistic.

plot(merged_triangle, 
     main = "Paid Losses vs Maturity by Accident Year",
     xlab = "Maturity in Months", 
     ylab = "Paid Losses")

Conclusion

When it comes to aggregating Excel files, R can be faster and more consistent than linking each of the Excel files together. Once this script is set in place, modifications to the data can be made easily by editing the `excel_file_extractor` function.

Download the data and Rmd files for this article from my GitHub account.
