Blog Archives

An Attempt to Understand Boosting Algorithm(s)

June 26, 2015

Tuesday, at the annual meeting of the French Economic Association, I was having lunch with Alfred, and while we were chatting about modeling issues (econometric models against machine learning prediction), he asked me what boosting was. Since I could not be very specific, we looked at the Wikipedia page. Boosting is a machine learning ensemble meta-algorithm for reducing bias primarily and also...

‘Variable Importance Plot’ and Variable Selection

June 17, 2015

Classification trees are nice. They provide an interesting alternative to logistic regression. I started to include them in my courses maybe 7 or 8 years ago. The question is nice (how to get an optimal partition), the algorithmic procedure is nice (the trick of splitting according to one variable, and only one, at each node, and then moving forward, never backward),...

p-hacking, or cheating on a p-value

June 11, 2015

Yesterday evening, I discovered some interesting slides on False-Positives, p-Hacking, Statistical Power, and Evidential Value, via @UCBITSS’s post on Twitter. More precisely, there was this slide on how to cheat (because that’s basically what it is) to get a ‘good’ model (by targeting the p-value). As mentioned by @david_colquhoun, one should be careful when reading the slides: some statisticians might have a heart attack...

Who interacts on Twitter during a conference (#JDSLille)

June 7, 2015

Disclaimer: This is a joint post with Avner Bar-Hen, a.k.a. @a_bh, Benjamin Guedj, a.k.a. @bguedj and Nathalie Villa, a.k.a. @Natty_V2. Organised annually since 1970 by the French Society of Statistics (SFdS), the Journées de Statistique (JdS) are the most important scientific event of the French statistical community. More than 400 researchers, teachers and practitioners meet at each edition. In 2015,...

Data Science: from Small to Big Data

May 29, 2015

This Tuesday, I will be in Leuven (in Belgium) at the ACP meeting to give a talk on Data Science: from Small to Big Data. The talk will take place in the Faculty Club from 6 till 8 pm. Slides can be found online (with animated pictures). As usual, comments are welcome.

Copulas and Financial Time Series

May 12, 2015

I was recently asked to write a survey on copulas for financial time series. Unfortunately, the paper is so far only in French; it is available on https://hal.archives-ouvertes.fr/. There is a description of various models, including some graphs and statistical outputs, obtained from real data. To illustrate, I’ve been using weekly log-returns of (crude) oil prices, Brent, Dubaï and Maya....

Working with “large” datasets, with dplyr and data.table

May 4, 2015

A few months ago, I was doing some training on data science for actuaries, and I started to get interesting, puzzling questions. For instance, Fleur was working on telematic data, and she’s been challenging my (rudimentary) knowledge of R. As claimed by Donald Knuth, “we should forget about small efficiencies, say about 97% of the time: premature optimization is...

I Fought the (distribution) Law (and the Law did not win)

April 27, 2015

A few days ago, I was asked if we should spend a lot of time choosing the distribution we use, in GLMs, for (actuarial) ratemaking. On that topic, I usually claim that the family is not the most important parameter in the regression model. Consider the following dataset:

> db <- data.frame(x=c(1,2,3,4,5),y=c(1,2,4,2,6))
> plot(db,xlim=c(0,6),ylim=c(-1,8),pch=19)

To visualize a regression...

Visualising a Classification in High Dimension, part 2

April 9, 2015

A few weeks ago, I published a post on Visualising a Classification in High Dimension, based on the use of a principal component analysis, to get a projection on the first two components. Following that post, I was wondering what could be done in the context of a classification on categorical covariates. A natural idea would be to consider a...

Classification with Categorical Variables (the fuzzy side)

April 9, 2015

The Gaussian and the (log) Poisson regressions share a very interesting property, i.e. the average predicted value is the empirical mean of our sample.

> mean(predict(lm(dist~speed,data=cars)))
42.98
> mean(cars$dist)
42.98

One can prove that it is also the prediction for the average individual in our sample

> predict(lm(dist~speed,data=cars),
+ newdata=data.frame(speed=mean(cars$speed)))
42.98

The geometric interpretation is that the...
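
As a side note, here is a short sketch of why the average prediction matches the empirical mean, assuming a regression with an intercept and a canonical link (identity for the Gaussian, log for the Poisson). The notation ($y_i$ for the observations, $x_i$ for the covariate, $\hat\mu_i$ for the fitted means, $\hat\beta_0, \hat\beta_1$ for the estimates) is introduced here for illustration and is not taken from the post.

```latex
% Assumed notation: y_i observations, x_i covariate, \hat\mu_i fitted means.
\begin{aligned}
\text{first-order condition for the intercept:}\quad
  & \sum_{i=1}^{n} \left( y_i - \hat\mu_i \right) = 0
  \quad\Longrightarrow\quad
  \frac{1}{n}\sum_{i=1}^{n} \hat\mu_i = \bar y, \\
\text{Gaussian (identity link):}\quad
  & \hat\mu_i = \hat\beta_0 + \hat\beta_1 x_i, \\
\text{Poisson (log link):}\quad
  & \hat\mu_i = \exp\left( \hat\beta_0 + \hat\beta_1 x_i \right).
\end{aligned}
```

In the Gaussian case, averaging the second line gives $\bar y = \hat\beta_0 + \hat\beta_1 \bar x$, which is the prediction at the average covariate value. That last step relies on linearity, so it should not be expected to carry over to the log link.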
