Blog Archives

On NCDF Climate Datasets

September 3, 2015

In mid-November, a nice workshop on big data and the environment will be organized in Argentina. We will talk a lot about climate models, and I wanted to play a little bit with those data, stored on http://dods.ipsl.jussieu.fr/mc2ipsl/. Since Ewen (aka @3wen) has been working on those datasets recently, he kindly told me how to read them (in the NetCDF format). He did show me...
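
For readers who want to try, a minimal sketch of reading a NetCDF file in R with the ncdf4 package (the file name and variable names below are hypothetical placeholders):

library(ncdf4)
nc  <- nc_open("tas_day_IPSL.nc")   # hypothetical file name
lon <- ncvar_get(nc, "lon")         # longitude grid
lat <- ncvar_get(nc, "lat")         # latitude grid
tas <- ncvar_get(nc, "tas")         # e.g. temperature, dims lon x lat x time
nc_close(nc)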

Read more »

“A 99% TVaR is generally a 99.6% VaR”

August 29, 2015

Almost 6 years ago, I posted a brief comment on a sentence I found surprising, discovered at that time in a report, claiming that the expected shortfall at the 99% level corresponds quite closely to the value-at-risk at a 99.6% level, which was inspired by a remark in the Swiss experience report: expected shortfall on a 99% confidence level […] corresponds to approximately 99.6% to...
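
As a quick sanity check (my own illustration, not the report's computation): for a Gaussian risk, the 99% expected shortfall is indeed very close to the 99.6% quantile,

alpha <- .99
dnorm(qnorm(alpha)) / (1 - alpha)   # 99% TVaR of a standard Gaussian, ~2.665
qnorm(.996)                         # 99.6% VaR, ~2.652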

Read more »

Pricing Game

August 22, 2015

In November, with Romuald Elie and Jérémie Jakubowicz, we will organize a session during the 100% Actuaires day, in Paris, based on a “pricing game“. We provide two datasets (motor insurance, third-party claims), with 2 years of experience and 100,000 policies. Each ‘team’ has to submit a premium proposal for 36,000 potential insureds for the third year (third party, material + bodily injury). We will work as a ‘price...

Read more »

Computing AIC on a Validation Sample

July 29, 2015

This afternoon, we’ve seen in the training on data science that it is possible to use the AIC criterion for model selection.

> library(splines)
> AIC(glm(dist ~ speed, data=train_cars, family=poisson(link="log")))
438.6314
> AIC(glm(dist ~ speed, data=train_cars, family=poisson(link="identity")))
436.3997
> AIC(glm(dist ~ bs(speed), data=train_cars, family=poisson(link="log")))
425.6434
> AIC(glm(dist ~ bs(speed), data=train_cars, family=poisson(link="identity")))
428.7195

And I’ve been asked...
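
A minimal sketch of the idea in the title (my own illustration, assuming a hold-out sample valid_cars with the same columns as train_cars): evaluate an AIC-like criterion on data that was not used to fit the model,

reg    <- glm(dist ~ speed, data=train_cars, family=poisson(link="log"))
lambda <- predict(reg, newdata=valid_cars, type="response")
logL   <- sum(dpois(valid_cars$dist, lambda, log=TRUE))  # validation log-likelihood
-2*logL + 2*length(coef(reg))                            # AIC-type penalized criterion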

Read more »

Modelling Occurrence of Events, with some Exposure

July 28, 2015

This afternoon, an interesting point was raised, and I wanted to get back to it (since I did publish a post on that same topic a long time ago): how can we adapt a logistic regression when the observations do not all have the same exposure? Here the model is the following: the occurrence of an event on the period ...
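
One classical way to handle unequal exposure (a sketch of the standard trick, not necessarily the approach developed in the post): if events arrive at a constant rate, the probability of at least one event over exposure E is 1 - exp(-E * lambda), which a binomial GLM with a complementary log-log link and a log-exposure offset fits directly,

# y is a 0/1 event indicator, exposure in years (made-up data frame df)
reg <- glm(y ~ x + offset(log(exposure)),
           family = binomial(link = "cloglog"), data = df)
# annualized probability of occurrence (exposure set to 1)
predict(reg, newdata = data.frame(x = 1, exposure = 1), type = "response")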

Read more »

Visualising Claims Frequency

July 28, 2015

A few years ago, I published a post to visualize the empirical claims frequency in a portfolio. I wanted to update the code. Here is a code to get a dataset,

sinistre <- read.table("http://freakonometrics.free.fr/sinistreACT2040.txt", header=TRUE, sep=";")
sinistres <- sinistre
contrat <- read.table("http://freakonometrics.free.fr/contractACT2040.txt", header=TRUE, sep=";")
# number of claims per contract
T  <- table(sinistres$nocontrat)
T1 <- as.numeric(names(T))
T2 <- as.numeric(T)
nombre1 <- data.frame(nocontrat=T1, nbre=T2)
# contracts without any claim get a count of zero
I  <- contrat$nocontrat %in% T1
T1 <- contrat$nocontrat[!I]
nombre2 <- data.frame(nocontrat=T1, nbre=0)
nombre  <- rbind(nombre1, nombre2)
basenb  <- merge(contrat, nombre)
head(basenb)
basesin <- merge(sinistres, contrat)...
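
The post then turns to the visualization itself; as a minimal sketch (my own, assuming the contract dataset has ageconducteur and exposition columns), the empirical annualized frequency by driver age would be,

freq <- aggregate(cbind(nbre, exposition) ~ ageconducteur, data=basenb, FUN=sum)
plot(freq$ageconducteur, freq$nbre/freq$exposition, type="b",
     xlab="age of the driver", ylab="empirical claims frequency")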

Read more »

Choosing a Classifier

July 21, 2015

In order to illustrate the problem of choosing a classification model, consider some simulated data,

> n = 500
> set.seed(1)
> X = rnorm(n)
> ma = 10-(X+1.5)^2*2
> mb = -10+(X-1.5)^2*2
> M = cbind(ma,mb)
> set.seed(1)
> Z = sample(1:2,size=n,replace=TRUE)
> Y = ma*(Z==1)+mb*(Z==2)+rnorm(n)*5
> df = data.frame(Z=as.factor(Z),X,Y)

A first strategy is to split the dataset...
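
A minimal sketch of that splitting strategy (my own illustration, with a logistic regression as the candidate classifier):

idx  <- sample(1:n, size=n/2)            # random half for training
fit  <- glm(Z ~ X + Y, data=df[idx,], family=binomial)
p    <- predict(fit, newdata=df[-idx,], type="response")
pred <- ifelse(p > .5, "2", "1")         # predicted class
mean(pred == df$Z[-idx])                 # out-of-sample accuracy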

Read more »

An Update on Boosting with Splines

July 2, 2015

In my previous post, An Attempt to Understand Boosting Algorithm(s), I was puzzled by the boosting convergence when I was using some spline functions (more specifically, piecewise-linear, continuous regression functions). I was using

> library(splines)
> fit=lm(y~bs(x,degree=1,df=3),data=df)

The problem with that spline function is that the knots seem to be fixed. The iterative boosting algorithm is: start with...
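
For context, a generic L2-boosting loop with that spline base learner might look as follows (a sketch under my own assumptions, with shrinkage parameter nu and a data frame df containing x and y):

library(splines)
nu   <- 0.1                       # shrinkage (learning rate)
yhat <- rep(mean(df$y), nrow(df)) # start from a constant prediction
for (t in 1:100) {
  df$r <- df$y - yhat                              # current residuals
  fit  <- lm(r ~ bs(x, degree=1, df=3), data=df)   # weak learner on residuals
  yhat <- yhat + nu * predict(fit)                 # take a small step
}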

Read more »

Variable Selection using Cross-Validation (and Other Techniques)

July 1, 2015

A natural technique to select variables in the context of generalized linear models is to use a stepwise procedure. It is natural, but controversial, as discussed by Frank Harrell in a great post, clearly worth reading. Frank mentioned about 10 points against a stepwise procedure: it yields R-squared values that are badly biased to be high; the F and chi-squared tests quoted...
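
The cross-validation alternative from the title, in a minimal sketch (my own illustration; the data frame df and the candidate formulas are placeholders): score each candidate set of variables by its k-fold out-of-sample error and keep the best one,

cv_error <- function(f, data, k=10) {
  fold <- sample(rep(1:k, length.out=nrow(data)))
  err  <- numeric(k)
  for (i in 1:k) {
    fit    <- glm(f, data=data[fold!=i,])
    pred   <- predict(fit, newdata=data[fold==i,])
    err[i] <- mean((data[fold==i, all.vars(f)[1]] - pred)^2)
  }
  mean(err)
}
# compare candidate models by out-of-sample error, e.g.
# cv_error(y ~ x1 + x2, df) versus cv_error(y ~ x1, df)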

Read more »

An Attempt to Understand Boosting Algorithm(s)

June 26, 2015

Tuesday, at the annual meeting of the French Economic Association, I was having lunch with Alfred, and while we were chatting about modeling issues (econometric models against machine-learning prediction), he asked me what boosting was. Since I could not be very specific, we looked at the Wikipedia page: boosting is a machine learning ensemble meta-algorithm for reducing bias primarily and also...

Read more »
