Blog Archives

p-hacking, or cheating on a p-value

June 11, 2015

Yesterday evening, I discovered some interesting slides on False-Positives, p-Hacking, Statistical Power, and Evidential Value, via @UCBITSS’s post on Twitter. More precisely, there was this slide on how to cheat (because that’s basically what it is) to get a ‘good’ model (by targeting the p-value). As mentioned by @david_colquhoun, one should be careful when reading the slides: some statisticians might have a heart attack...
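To make the idea of targeting the p-value concrete, here is a minimal simulation sketch. It is not taken from the slides: the sample size, the number of candidate covariates and the helper `smallest_p()` are all hypothetical choices made for illustration. The response is pure noise, and we compare an honest test on one covariate with the "best" p-value found after trying ten unrelated covariates.

```r
# Hypothetical simulation (not from the slides): y is pure noise, yet we
# search over several unrelated covariates and keep the smallest p-value.
set.seed(1)

smallest_p <- function(n = 50, k = 10) {
  y <- rnorm(n)                          # response, independent of everything
  x <- matrix(rnorm(n * k), n, k)        # k candidate covariates, all irrelevant
  p <- apply(x, 2, function(xj) {
    summary(lm(y ~ xj))$coefficients[2, 4]   # p-value of the slope
  })
  min(p)                                 # what a p-hacker would report
}

honest   <- replicate(500, smallest_p(k = 1))   # one covariate, one test
p_hacked <- replicate(500, smallest_p(k = 10))  # best of ten tests

# Share of simulations declared significant at the 5% level
mean(honest < 0.05)
mean(p_hacked < 0.05)
```

With independent tests, the second share should be in the neighbourhood of 1 - 0.95^10, far above the nominal 5% level, even though no covariate has any effect.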

Who interacts on Twitter during a conference (#JDSLille)

June 7, 2015

Disclaimer: This is a joint post with Avner Bar-Hen, a.k.a. @a_bh, Benjamin Guedj, a.k.a. @bguedj, and Nathalie Villa, a.k.a. @Natty_V2. Organised annually since 1970 by the French Society of Statistics (SFdS), the Journées de Statistique (JdS) are the most important scientific event of the French statistical community. More than 400 researchers, teachers and practitioners meet at each edition. In 2015,...

Data Science: from Small to Big Data

May 29, 2015

This Tuesday, I will be in Leuven (in Belgium) at the ACP meeting to give a talk on Data Science: from Small to Big Data. The talk will take place in the Faculty Club from 6 till 8 pm. Slides can be found online (with animated pictures). As usual, comments are welcome.

Copulas and Financial Time Series

May 12, 2015

I was recently asked to write a survey on copulas for financial time series. The paper is, so far, unfortunately, in French, and is available on https://hal.archives-ouvertes.fr/. There is a description of various models, including some graphs and statistical outputs, obtained from real data. To illustrate, I’ve been using weekly log-returns of (crude) oil prices, Brent, Dubaï and Maya....

Working with “large” datasets, with dplyr and data.table

May 4, 2015

A few months ago, I was doing some training on data science for actuaries, and I started to get interesting, puzzling questions. For instance, Fleur was working on telematics data, and she’s been challenging my (rudimentary) knowledge of R. As claimed by Donald Knuth, “we should forget about small efficiencies, say about 97% of the time: premature optimization is...

I Fought the (distribution) Law (and the Law did not win)

April 27, 2015

A few days ago, I was asked if we should spend a lot of time choosing the distribution we use, in GLMs, for (actuarial) ratemaking. On that topic, I usually claim that the family is not the most important parameter in the regression model. Consider the following dataset
> db <- data.frame(x=c(1,2,3,4,5),y=c(1,2,4,2,6))
> plot(db,xlim=c(0,6),ylim=c(-1,8),pch=19)
To visualize a regression...
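One way to look at the role of the family on this small dataset is sketched below. This is an illustration under stated assumptions, not necessarily the comparison made in the full post: the choice of a Gaussian and a Poisson family, the common log link, and the names `fit_gaussian`, `fit_poisson`, `new_x` and `pred` are all introduced here for the example.

```r
# Same dataset as in the excerpt
db <- data.frame(x = c(1, 2, 3, 4, 5), y = c(1, 2, 4, 2, 6))

# Same linear predictor and same (log) link, two different families
# (assumed choices, for illustration only)
fit_gaussian <- glm(y ~ x, data = db, family = gaussian(link = "log"))
fit_poisson  <- glm(y ~ x, data = db, family = poisson(link = "log"))

# Predictions on the scale of y, over the range used in the plot
new_x <- data.frame(x = seq(0, 6, by = 0.5))
pred <- cbind(
  new_x,
  gaussian = predict(fit_gaussian, newdata = new_x, type = "response"),
  poisson  = predict(fit_poisson,  newdata = new_x, type = "response")
)
pred
```

The two prediction columns can then be compared side by side, or added to the scatter plot with `lines(pred$x, pred$gaussian)` and `lines(pred$x, pred$poisson, lty = 2)`.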

Visualising a Classification in High Dimension, part 2

April 9, 2015

A few weeks ago, I published a post on Visualising a Classification in High Dimension, based on the use of a principal component analysis, to get a projection on the first two components. Following that post, I was wondering what could be done in the context of a classification on categorical covariates. A natural idea would be to consider a...
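As a reminder of what a projection on the first two principal components looks like in R, here is a minimal sketch. The built-in `iris` data is an assumption made for illustration (it is not the dataset of the post), and `pca` and `proj` are names chosen here.

```r
# Illustration on the built-in iris data (an assumption, not the post's dataset)
pca  <- prcomp(iris[, 1:4], scale. = TRUE)  # PCA on the standardised numeric columns
proj <- pca$x[, 1:2]                        # coordinates on the first two components

# Share of the total variance carried by the two-dimensional projection
sum(pca$sdev[1:2]^2) / sum(pca$sdev^2)

# Scatter plot of the projection, one colour per class
plot(proj, col = as.integer(iris$Species), pch = 19,
     xlab = "PC1", ylab = "PC2")
```

This only works with numerical covariates, which is why a different tool is needed when the covariates are categorical.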

Classification with Categorical Variables (the fuzzy side)

April 9, 2015

The Gaussian and the (log) Poisson regressions share a very interesting property, i.e. the average predicted value is the empirical mean of our sample.
> mean(predict(lm(dist~speed,data=cars)))
 42.98
> mean(cars$dist)
 42.98
One can prove that it is also the prediction for the average individual in our sample
> predict(lm(dist~speed,data=cars),
+ newdata=data.frame(speed=mean(cars$speed)))
 42.98
The geometric interpretation is that the...
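The excerpt only shows the Gaussian case. A minimal sketch of the same check for the Poisson regression with log link, on the same `cars` data, could look as follows (the object name `fit` is introduced here for illustration).

```r
# Poisson regression with its canonical log link, on the same cars data
fit <- glm(dist ~ speed, data = cars, family = poisson(link = "log"))

# Average of the fitted values versus the empirical mean
mean(predict(fit, type = "response"))
mean(cars$dist)

# The first-order condition for the intercept forces the residuals
# (on the response scale) to sum to zero, up to numerical error
sum(cars$dist - fitted(fit))
```

The property comes from the likelihood equation associated with the intercept, so it holds for the canonical link with an intercept in the model. It should not be expected for an arbitrary family and link.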

Another Interactive Map for the Cholera Dataset

March 31, 2015

Following my previous post, François (aka @FrancoisKeck) posted a comment mentioning another package I could use to get an interactive map, the rleafmap package. And here, the heatmap was easy to include. This time, we do not use openstreetmap. The first part is still the same, to get the data,
> require(rleafmap)
> library(sp)
> library(rgdal)
> library(maptools)
>...

Interactive Maps for John Snow’s Cholera Data

March 28, 2015

This week, in Istanbul, for the second training on data science, we’ve been discussing classification and regression models, but also visualisation, including maps. And we did have a brief introduction to the leaflet package,
devtools::install_github("rstudio/leaflet")
require(leaflet)
To see what can be done with that package, we will use one more time John Snow’s cholera dataset, discussed in previous...
