# R for actuarial science

January 10, 2013
By

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

As mentioned in the Appendix of Modern Actuarial Risk Theory, “R (and S) is the ‘lingua franca’ of data analysis and statistical computing, used in academia, climate research, computer science, bioinformatics, pharmaceutical industry, customer analytics, data mining, finance and by some insurers. Apart from being stable, fast, always up-to-date and very versatile, the chief advantage of R is that it is available to everyone free of charge. It has extensive and powerful graphics abilities, and is developing rapidly, being the statistical tool of choice in many academic environments.

R is based on the S statistical programming language developed by Joe Chambers at Bell labs in the 80’s. To be more specific, R is an open-source implementation of the S language, developed by Robert Gentlemn and Ross Ihaka. It is a vector based language, which makes it extremely interesting for actuarial computations. For instance, consider some Life Tables,

> TD[39:52,]       > TV[39:52,]
Age    Lx         Age    Lx
39  38 95237          38 97753
40  39 94997          39 97648
41  40 94746          40 97534
42  41 94476          41 97413
43  42 94182          42 97282
44  43 93868          43 97138
45  44 93515          44 96981
46  45 93133          45 96810
47  46 92727          46 96622
48  47 92295          47 96424
49  48 91833          48 96218
50  49 91332          49 95995
51  50 90778          50 95752
52  51 90171          51 95488

Those (French) Life Tables can be found here

> TD <- read.table(
+ "http://perso.univ-rennes1.fr/arthur.charpentier/TV8890.csv",sep=";",header=TRUE)

From those vectors, it is possible to construct the matrix of death probabilities, , using for instance

>  Lx <- TD$Lx > m <- length(Lx) > p <- matrix(0,m,m); d <- p > for(i in 1:(m-1)){ + p[1:(m-i),i] <- Lx[1+(i+1):m]/Lx[i+1] + d[1:(m-i),i] <- (Lx[(1+i):(m)]-Lx[(1+i):(m)+1])/Lx[i+1]} > diag(d[(m-1):1,]) <- 0 > diag(p[(m-1):1,]) <- 0 > q <- 1-p One can compute easily, e.g., the (curtate) expectation of life defined as and one can compute the vector of life expectancy, at various ages , as > life.exp = function(x){sum(p[1:nrow(p),x])} > e = Vectorize(life.exp)(1:m) An actually, any kind of actuarial quantity can be derived from those matrices. The expected present value (or actuarial value) of a temporary life annuity-due is, for instance, The code to compute those functions is here > for(j in 1:(m-1)){ adots[,j]<-cumsum(1/(1+i)^(0:(m-1))*c(1,p[1:(m-1),j])) } or consider the expected present value of a term insurance with the following code > for(j in 1:(m-1)){ A[,j]<-cumsum(1/(1+i)^(1:m)*d[,j]) } Some more details can be found in the first part of the notes of the crash courses of last summer, in Meielisalp. Vector – or matrices – are extremely convenient to work with, when dealing with life contingencies. It is also possible to model prospective mortality. Here, the mortality is not only function of the age , but also time , > t(DTF)[1:10,1:10] 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 0 64039 61635 56421 53321 52573 54947 50720 53734 47255 46997 1 12119 11293 10293 10616 10251 10514 9340 10262 10104 9517 2 6983 6091 5853 5734 5673 5494 5028 5232 4477 4094 3 4329 3953 3748 3654 3382 3283 3294 3262 2912 2721 4 3220 3063 2936 2710 2500 2360 2381 2505 2213 2078 5 2284 2149 2172 2020 1932 1770 1788 1782 1789 1751 6 1834 1836 1761 1651 1664 1433 1448 1517 1428 1328 7 1475 1534 1493 1420 1353 1228 1259 1250 1204 1108 8 1353 1358 1255 1229 1251 1169 1132 1134 1083 961 9 1175 1225 1154 1008 1089 981 1027 1025 957 885 Thus, we now have a force of mortality matrix , or surface It is also possible to use R packages to estimate a Lee-Carter model of the mortality rate, > library(demography) > MUH =matrix(DEATH$Male/EXPOSURE$Male,nL,nC) > POPH=matrix(EXPOSURE$Male,nL,nC)
> BASEH <- demogdata(data=MUH, pop=POPH, ages=AGE, years=YEAR, type="mortality",
+ label="France", name="Hommes", lambda=1)
> RES=residuals(LCH,"pearson")

One can easily study residuals, for instance as a function of the age,

or a function of the year,

Some more details can be found in the second part of the notes of the crash courses of last summer, in Meielisalp.

R is also interesting because of its huge number of libraries, that can be used for predictive modeling. One can easily use smoothing functions in regression, or regression trees,

> TREE = tree((nbr>0)~ageconducteur,data=sinistres,split="gini",mincut = 1)
> age = data.frame(ageconducteur=18:90)
> y1 = predict(TREE,age)
> reg = glm((nbr>0)~bs(ageconducteur),data=sinistres,family="binomial")
> y = predict(reg,age,type="response")

Some practitioners might be scared because the legend claims that R is not as good as SAS to handle large databases. Actually, a lot of functions can be used to import datasets. The most convenient one is probably

> baseCOUT = read.table("http://freakonometrics.free.fr/baseCOUT.csv",
>  tail(baseCOUT,4)
numeropol  debut_pol    fin_pol freq_paiement langue  type_prof alimentation type_territoire
6512     87291 2002-10-16 2003-01-22       mensuel      A Professeur   Vegetarien          Urbain
6513     87301 2002-10-01 2003-09-30       mensuel      A Technicien   Vegetarien          Urbain
6514     87417 2002-10-24 2003-10-21       mensuel      F Technicien   Vegetalien     Semi-urbain
6515     88128 2003-01-17 2004-01-16       mensuel      F     Avocat   Vegetarien     Semi-urbain
utilisation presence_alarme marque_voiture sexe exposition age duree_permis age_vehicule i   coutsin
6512 Travail-occasionnel             oui           FORD    M  0.2684932  47           29           28 1 1274.5901
6513              Loisir             oui          HONDA    M  0.9972603  44           24           25 1  278.0745
6514 Travail-occasionnel             non     VOLKSWAGEN    F  0.9917808  23            3           11 1  403.1242
6515              Loisir             non           FIAT    F  0.9972603  23            4           11 1  230.9565

But if the dataset is too large, it is also possible to specify which variables might be interesting, using

> mycols = rep("NULL", 18)
> mycols[c(1,4,5,12,13,14,18)] <- NA
numeropol freq_paiement langue sexe exposition age    coutsin
1         6        annuel      A    M  0.9945205  42   279.5839
2        27       mensuel      F    M  0.2438356  51   814.1677
3        27       mensuel      F    M  1.0000000  53   136.8634
4        76       mensuel      F    F  1.0000000  42   608.7267

It is also possible (before running a code on the entire dataset) to import only the first lines of the dataset.

> baseCOUTsubCR = read.table("http://freakonometrics.free.fr/baseCOUT.csv",
> tail(baseCOUTsubCR,4)
numeropol freq_paiement langue sexe exposition age   coutsin
97       1193       mensuel      F    F  0.9972603  55  265.0621
98       1204       mensuel      F    F  0.9972603  38 9547.7267
99       1231       mensuel      F    M  1.0000000  40  442.7267
100      1245        annuel      F    F  0.6767123  48  179.1925

It is also possible to import a zipped file. The file itself has a smaller size, and it can usually be imported faster.

> import.zip = function(file){
+ temp = tempfile()
> system.time(import.zip("http://freakonometrics.free.fr/baseFREQ.csv.zip"))
trying URL 'http://freakonometrics.free.fr/baseFREQ.csv.zip'
Content type 'application/zip' length 692655 bytes (676 Kb)
opened URL
==================================================
user  system elapsed
0.762       0.029       4.578
user  system elapsed
0.591       0.072       9.277

Finally, note that it is possible to import any kind of dataset, not only a text file. Even a Microsoft Excel folder. On a Windows computer, one can use SQL queries

> sheet = "c:\\Documents and Settings\\user\\excelsheet.xls"
> connection = odbcConnectExcel(sheet)
> query = paste("SELECT * FROM",spreadsheet$TABLE_NAME[1],sep=" ") > result = sqlQuery(connection,query) Then, once the dataset is imported, several functions can be used, > cost = aggregate(coutsin~ AgeSex,mean, data=baseCOUT) > frequency = merge(aggregate(nbsin~ AgeSex,sum, data=baseFREQ), + aggregate(exposition~ AgeSex,sum, data=baseFREQ)) > frequency$freq = frequency$nbsin/frequency$exposition
> base.freq.cost = merge(frequency, cost)

Finally, R is interesting for its graphical interface. “If you can picture it in your head, chances are good that you can make it work in R. R makes it easy to read data, generate lines and points, and place them where you want them. Its very flexible and super quick. When youve only got two or three hours until deadline, R can be brilliant” as said Amanda Cox, a graphics editor at the New York Times. “R is particularly valuable in deadline situations when data is scant and time is precious.”.
Several cases were considered on the blog http ://chartsnthings.tumblr.com/…. First, we start with a simple graph, here State Government control in the US

Then try to find a nice visual representation, e.g.

And finally, you can just print it in your favorite newspaper,

And you can get any kind of graphs,

Graphs are important. “Its not just about producing graphics for publication. Its about playing around and making a bunch of graphics that help you explore your data. This kind of graphical analysis is a really useful way to help you understand what you’re dealing with, because if you cant see it, you cant really understand it. But when you start graphing it out, you can really see what you’ve got” as said Peter Aldhous, San Francisco bureau chief of New Scientist magazine. Even for actuaries. “The commercial insurance underwriting process was rigorous but also quite subjective and based on intuition. R enables us to communicate our analytic results in appealing and innovative ways to non-technical audiences through rapid development lifecycles. R helps us show our clients how they can improve their processes and effectiveness by enabling our consultants to conduct analyses efficiently”, as explained by John Lucker, team of advanced analytics professionals at Deloitte Consulting Principal, in http://blog.revolutionanalytics.com/r-is-hot/. See also Andrew Gelman’s view, on graphs, http://www.stat.columbia.edu/…

So yes, actuaries might be interested to use R for actuarial communication, as mentioned in http ://www.londonr.org/…

The Actuarial Toolkit (see http ://www.actuaries.org.uk/…) stresses the interest of R, “The power of the language R lies with its functions for statistical modelling, data analysis and graphics ; its ability to read and write data from various data sources; as well as the opportunity to embed R in excel or other languages like VBA. In the way SAS is good for data manipulations, R is superior for modelling and graphical output“.

From 2011, Asia Capital Reinsurance Group (ACR) uses R to Solve Big Data Challenges (see http ://www.reuters.com/…). And Lloyd’s uses motion charts created with R to provide analysis to investors (as discussed on http ://blog.revolutionanalytics.com/…)

A lot of information can be found on http ://jeffreybreen.wordpress.com/…

Markus Gesmann mentioned on his blog a lot of interesting graphs used for actuarial reporting, http ://lamages.blogspot.ca/…

Further, R is free. Which can be compared with SAS, $6,000 per PC, or$28,000 per processor on a server (as mentioned on http ://en.wikipedia.org/…)

It is also becoming more and more popular, as a programming language. As mentioned on this month Transparent Language Popularity (see http ://lang-index.sourceforge.net/), R is ranked 12. Far away after C or Java, but before Matlab (22) or SAS (27). On StackOverFlow (see http ://stackoverflow.com/) is also far being C++ (399,232 occurrences) or Java (348,418), but with 21,818 occurrences, it appears before Matlab (14,580) and SAS (899). As mentioned on http ://r4stats.com/articles/popularity/ R is becoming more and more popular, on listserv discussion traffic

It is clearly the most popular software in data analysis, as mentioned by the Rexer Analytics survey, in 2009

What about actuaries ? In a survey (see http ://palisade.com/…), R was not extremely popular.

If we consider only statistical softwares, SAS is still far ahead, among UK and CAS actuaries

But, as mentioned by Mike King, Quantitative Analyst, Bank of America, “I cant think of any programming language that has such an incredible community of users. If you have a question, you can get it answered quickly by leaders in the field. That means very little downtime.” This was also mentioned by Glenn Meyers, in the Actuarial Review “The most powerful reason for using R is the community” (in http ://nytimes.com/…). For instance, http ://r-bloggers.com/ has contributions from more than 425 R users.

As said by Bo Cowgill, from Google “The best thing about R is that it was developed by statisticians. The worst thing about R is that it was developed by statisticians.

### Arthur Charpentier

Arthur Charpentier, professor at UQaM in Actuarial Science. Former professor-assistant at ENSAE Paristech, associate professor at Ecole Polytechnique and assistant professor in Economics at Université de Rennes 1. Graduated from ENSAE, Master in Mathematical Economics (Paris Dauphine), PhD in Mathematics (KU Leuven), and Fellow of the French Institute of Actuaries.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...