Blog Archives

Computing AIC on a Validation Sample

July 29, 2015

This afternoon, we saw in the data science training that it is possible to use the AIC criterion for model selection.

> library(splines)
> AIC(glm(dist ~ speed, data=train_cars, family=poisson(link="log")))
438.6314
> AIC(glm(dist ~ speed, data=train_cars, family=poisson(link="identity")))
436.3997
> AIC(glm(dist ~ bs(speed), data=train_cars, family=poisson(link="log")))
425.6434
> AIC(glm(dist ~ bs(speed), data=train_cars, family=poisson(link="identity")))
428.7195

And I’ve been asked...
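The title suggests the question that follows: can a similar criterion be computed on a validation sample? A rough sketch, under my own assumptions (the train/validation split of the built-in cars data below is illustrative, not the post's code): since held-out data were not used for fitting, no complexity penalty is needed, and each model can simply be scored by minus twice its validation log-likelihood.

set.seed(1)
idx <- sample(nrow(cars), 40)
train_cars <- cars[idx, ]    # assumed training sample
valid_cars <- cars[-idx, ]   # assumed validation sample
fit <- glm(dist ~ speed, data = train_cars, family = poisson(link = "log"))
mu <- predict(fit, newdata = valid_cars, type = "response")
# -2 x Poisson log-likelihood on the validation sample (lower is better)
-2 * sum(dpois(valid_cars$dist, lambda = mu, log = TRUE))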

Read more »

Modelling Occurrence of Events, with some Exposure

July 28, 2015

This afternoon, an interesting point was raised, and I wanted to get back to it (since I published a post on that same topic a long time ago): how can we adapt a logistic regression when the observations do not all have the same exposure? Here the model is the following: the occurrence of an event on the period ...
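A minimal sketch of one standard device for unequal exposures (my assumption here, not necessarily the post's approach): a complementary log-log link with a log-exposure offset, so that P(Y=1) = 1 - exp(-E exp(x'b)) scales with the exposure E. All names below are hypothetical.

set.seed(1)
n <- 1000
x <- rnorm(n)
E <- runif(n)                     # exposure: fraction of the period observed
lambda <- exp(-2 + x)             # true annualized event intensity
y <- as.integer(runif(n) < 1 - exp(-E * lambda))   # observed occurrence
fit <- glm(y ~ x + offset(log(E)), family = binomial(link = "cloglog"))
summary(fit)$coefficients         # should recover the intercept -2 and slope 1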

Read more »

Visualising Claims Frequency

July 28, 2015

A few years ago, I published a post on visualizing the empirical claims frequency in a portfolio. I wanted to update the code. Here is the code to build the dataset:

sinistre <- read.table("http://freakonometrics.free.fr/sinistreACT2040.txt", header=TRUE, sep=";")
sinistres = sinistre
contrat <- read.table("http://freakonometrics.free.fr/contractACT2040.txt", header=TRUE, sep=";")
# number of claims per contract
T = table(sinistres$nocontrat)
T1 = as.numeric(names(T))
T2 = as.numeric(T)
nombre1 = data.frame(nocontrat=T1, nbre=T2)
# contracts that never appear in the claims file get a count of zero
I = contrat$nocontrat %in% T1
T1 = contrat$nocontrat[I == FALSE]
nombre2 = data.frame(nocontrat=T1, nbre=0)
nombre = rbind(nombre1, nombre2)
basenb = merge(contrat, nombre)
head(basenb)
basesin = merge(sinistres, contrat)...
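From there, a possible way to draw the picture (a sketch only: I assume basenb has a driver-age column ageconducteur and an exposure column exposition, as in the usual ACT2040 contract file):

# empirical annualized claims frequency, by driver age
freq <- aggregate(cbind(nbre, exposition) ~ ageconducteur, data = basenb, FUN = sum)
freq$frequence <- freq$nbre / freq$exposition
plot(freq$ageconducteur, freq$frequence, type = "b",
     xlab = "driver age", ylab = "empirical claims frequency")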

Read more »

Choosing a Classifier

July 21, 2015

In order to illustrate the problem of choosing a classification model, consider some simulated data,

> n = 500
> set.seed(1)
> X = rnorm(n)
> ma = 10-(X+1.5)^2*2
> mb = -10+(X-1.5)^2*2
> M = cbind(ma,mb)
> set.seed(1)
> Z = sample(1:2,size=n,replace=TRUE)
> Y = ma*(Z==1)+mb*(Z==2)+rnorm(n)*5
> df = data.frame(Z=as.factor(Z),X,Y)

A first strategy is to split the dataset...
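A minimal sketch of that first strategy (my illustration, not the post's exact code), reusing n and df from above: fit a candidate classifier on one half of the data and measure its misclassification rate on the other half.

set.seed(2)
idx <- sample(n, n/2)
train <- df[idx, ]
test <- df[-idx, ]
fit <- glm(Z ~ X + Y, data = train, family = binomial)  # logistic regression as one candidate
p <- predict(fit, newdata = test, type = "response")    # estimated P(Z = 2)
pred <- ifelse(p > .5, "2", "1")
mean(pred != test$Z)                                    # out-of-sample error rate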

Read more »

An Update on Boosting with Splines

July 2, 2015

In my previous post, An Attempt to Understand Boosting Algorithm(s), I was puzzled by the convergence of boosting when using spline functions (more specifically, piecewise-linear, continuous regression functions). I was using

> library(splines)
> fit=lm(y~bs(x,degree=1,df=3),data=df)

The problem with that spline function is that the knots seem to be fixed. The iterative boosting algorithm is: start with...
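For reference, a rough sketch of the L2-boosting loop in question (my paraphrase of the idea, with an assumed shrinkage parameter nu and a data frame df containing columns x and y):

library(splines)
nu <- 0.1                            # shrinkage (assumed value)
yhat <- rep(mean(df$y), nrow(df))    # start with a constant prediction
for (t in 1:100) {
  r <- df$y - yhat                   # current residuals
  fit <- lm(r ~ bs(x, degree = 1, df = 3), data = df)  # weak learner on residuals
  yhat <- yhat + nu * predict(fit)   # take a small step
}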

Read more »

Variable Selection using Cross-Validation (and Other Techniques)

July 1, 2015

A natural technique to select variables in the context of generalized linear models is to use a stepwise procedure. It is natural, but controversial, as discussed by Frank Harrell in a great post, clearly worth reading. Frank mentioned about 10 points against a stepwise procedure: it yields R-squared values that are badly biased to be high; the F and chi-squared tests quoted...
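By contrast, here is a minimal sketch of cross-validated scoring (an illustration, not the post's code): each candidate model is scored by its k-fold out-of-sample deviance, and the model with the lowest score is kept. The helper function, and the assumption that the 0/1 response is a column named y, are mine.

cv_dev <- function(formula, data, k = 10) {
  folds <- sample(rep(1:k, length.out = nrow(data)))
  dev <- 0
  for (i in 1:k) {
    fit <- glm(formula, data = data[folds != i, ], family = binomial)
    p <- predict(fit, newdata = data[folds == i, ], type = "response")
    p <- pmin(pmax(p, 1e-12), 1 - 1e-12)   # guard against log(0)
    y <- data$y[folds == i]                # assumes the 0/1 response is 'y'
    dev <- dev - 2 * sum(y * log(p) + (1 - y) * log(1 - p))
  }
  dev
}
# e.g. compare cv_dev(y ~ x1 + x2, base) with cv_dev(y ~ x1, base)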

Read more »

An Attempt to Understand Boosting Algorithm(s)

June 26, 2015

Tuesday, at the annual meeting of the French Economic Association, I was having lunch with Alfred, and while we were chatting about modeling issues (econometric models versus machine learning prediction), he asked me what boosting was. Since I could not be very specific, we looked at the Wikipedia page: boosting is a machine learning ensemble meta-algorithm for reducing bias primarily and also...

Read more »

‘Variable Importance Plot’ and Variable Selection

June 17, 2015

Classification trees are nice. They provide an interesting alternative to logistic regression. I started to include them in my courses maybe 7 or 8 years ago. The question is nice (how to get an optimal partition), the algorithmic procedure is nice (the trick of splitting according to one variable, and only one, at each node, and then moving forward, never backward),...
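For readers who want to try this at home, a quick sketch of a variable importance plot from a random forest (an illustration on built-in data, assuming the randomForest package; not the post's code):

library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
varImpPlot(rf)   # mean decrease in accuracy and in Gini impurity, per variable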

Read more »

p-hacking, or cheating on a p-value

June 11, 2015

Yesterday evening, I discovered some interesting slides on False-Positives, p-Hacking, Statistical Power, and Evidential Value, via @UCBITSS’s post on Twitter. More precisely, there was this slide on how to cheat (because that’s basically what it is) to get a ‘good’ model (by targeting the p-value). As mentioned by @david_colquhoun, one should be careful when reading the slides: some statisticians might have a heart attack...
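To see the mechanism at work, a tiny simulation of my own (not from the slides): under the null, screening several candidate regressors and reporting only the best p-value produces ‘significant’ results far more often than 5% of the time.

set.seed(1)
n <- 100
y <- rnorm(n)      # pure-noise outcome
best <- replicate(200, {
  p <- replicate(20, summary(lm(y ~ rnorm(n)))$coefficients[2, 4])
  min(p)           # report only the best of 20 p-values
})
mean(best < .05)   # roughly 1 - .95^20, about 64%, not 5%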

Read more »

Who interacts on Twitter during a conference (#JDSLille)

June 7, 2015

Disclaimer: this is a joint post with Avner Bar-Hen, a.k.a. @a_bh, Benjamin Guedj, a.k.a. @bguedj, and Nathalie Villa, a.k.a. @Natty_V2. Organised annually since 1970 by the French Society of Statistics (SFdS), the Journées de Statistique (JdS) are the most important scientific event of the French statistical community. More than 400 researchers, teachers and practitioners meet at each edition. In 2015,...

Read more »