Blog Archives

Visualising a Classification in High Dimension, part 2

April 9, 2015

A few weeks ago, I published a post on Visualising a Classification in High Dimension, based on a principal component analysis, to get a projection on the first two components. Following that post, I was wondering what could be done in the context of a classification on categorical covariates. A natural idea would be to consider a...
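The projection idea mentioned here can be sketched in a few lines. This is not the post's actual code (its dataset is not shown in this excerpt), so the built-in iris data stands in:

```r
# Sketch of the projection idea: run PCA on the numeric covariates,
# keep the scores on the first two components, and plot the classes
# in that plane. iris is a stand-in dataset, not the one from the post.
pca <- prcomp(iris[, 1:4], scale. = TRUE)  # centred, scaled PCA
Z <- pca$x[, 1:2]                          # scores on the first two components
plot(Z, col = as.integer(iris$Species), pch = 19,
     xlab = "PC1", ylab = "PC2")
```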

Read more »

Classification with Categorical Variables (the fuzzy side)

April 9, 2015

The Gaussian and the (log) Poisson regressions share a very interesting property, i.e. the average predicted value is the empirical mean of our sample.

> mean(predict(lm(dist~speed,data=cars)))
42.98
> mean(cars$dist)
42.98

One can prove that it is also the prediction for the average individual in our sample

> predict(lm(dist~speed,data=cars),
+   newdata=data.frame(speed=mean(cars$speed)))
42.98

The geometric interpretation is that the...
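The property quoted here is easy to check directly, on the same cars data the excerpt uses (the value 42.98 comes from the post itself):

```r
# For a Gaussian linear model, the mean of the fitted values equals the
# empirical mean of the response, and also equals the prediction for the
# "average individual" (covariates set to their sample means).
fit <- lm(dist ~ speed, data = cars)
m1 <- mean(predict(fit))                  # mean of fitted values
m2 <- mean(cars$dist)                     # empirical mean, 42.98
m3 <- predict(fit, newdata = data.frame(speed = mean(cars$speed)))
c(m1, m2, m3)                             # all three coincide
```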

Read more »

Another Interactive Map for the Cholera Dataset

March 31, 2015

Following my previous post, François (aka @FrancoisKeck) posted a comment mentioning another package I could use to get an interactive map, the rleafmap package. And the heatmap was easy to include here. This time, we do not use OpenStreetMap. The first part is still the same, to get the data,

> require(rleafmap)
> library(sp)
> library(rgdal)
> library(maptools)
>...

Read more »

Interactive Maps for John Snow’s Cholera Data

March 28, 2015

This week, in Istanbul, for the second training on data science, we've been discussing classification and regression models, but also visualisation, including maps. And we did have a brief introduction to the leaflet package,

devtools::install_github("rstudio/leaflet")
require(leaflet)

To see what can be done with that package, we will use, one more time, John Snow's cholera dataset, discussed in previous...

Read more »

Splitting a Node in a Tree

March 23, 2015

If we grow a tree with standard functions in R, on the same dataset used to introduce classification trees in a previous post,

> MYOCARDE=read.table(
+ "http://freakonometrics.free.fr/saporta.csv",
+ head=TRUE,sep=";")
> library(rpart)
> cart<-rpart(PRONO~.,data=MYOCARDE)

we get

> library(rpart.plot)
> library(rattle)
> prp(cart,type=2,extra=1)

The first step is to split the first node (based on the whole dataset). To split it, we...
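What "splitting the first node" amounts to can be sketched as follows. The MYOCARDE data sits behind a URL, so iris with a binary target stands in; the impurity criterion used here is the Gini index, which is rpart's default:

```r
# Scan every cutpoint of one covariate and keep the one minimising the
# weighted Gini impurity of the two children -- the CART splitting rule.
y <- iris$Species == "setosa"         # binary outcome (stand-in data)
x <- iris$Petal.Length                # one candidate covariate
gini <- function(p) 2 * p * (1 - p)   # impurity of a proportion p
cuts <- sort(unique(x))
cuts <- cuts[-length(cuts)]           # the largest value cannot be a cutpoint
score <- sapply(cuts, function(cp) {
  left <- y[x <= cp]; right <- y[x > cp]
  (length(left) * gini(mean(left)) +
     length(right) * gini(mean(right))) / length(y)
})
best <- cuts[which.min(score)]
best                                  # best cutpoint for this covariate
```

On these stand-in data the split is clean: the cutpoint isolates the setosa class with zero impurity on both sides.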

Read more »

Regression Models, It’s Not Only About Interpretation

March 22, 2015

Yesterday, I uploaded a post where I tried to show that "standard" regression models were not performing badly. At least if you include (multivariate) splines to take into account joint effects and nonlinearities. So far, I have not discussed the possibly high number of features (but with bootstrap procedures, it is possible to assess something related to...
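The "standard regression + splines" idea can be sketched in two lines. cars is a stand-in dataset, and df = 4 is an arbitrary choice, not a value from the post:

```r
# Regress on a B-spline basis of the covariate to pick up nonlinearities.
library(splines)
fit_lin <- lm(dist ~ speed, data = cars)
fit_spl <- lm(dist ~ bs(speed, df = 4), data = cars)
# the linear fit is nested in the spline fit, so in-sample R^2 cannot drop
c(linear = summary(fit_lin)$r.squared,
  spline = summary(fit_spl)$r.squared)
```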

Read more »

Forecast, Automatic Routines vs. Experience

March 18, 2015

This morning, in our Time Series course, we've been playing with some data I got from google.ca/trends/. Actually, we've been playing with an old version, downloaded 18 months ago (discussed in a previous post, in French).

> urls = "http://freakonometrics.free.fr/report-headphones-2015.csv"
> report=read.table(
+ urls,skip=4,header=TRUE,sep=",",nrows=585)
> tail(report)
                    Semaine headphones
580 2015-02-08 - 2015-02-14         53
581 2015-02-15 - 2015-02-21         52
582...

Read more »

Growing some Trees

March 18, 2015

Consider here the dataset used in a previous post, about visualising a classification (with more than 2 features),

> MYOCARDE=read.table(
+ "http://freakonometrics.free.fr/saporta.csv",
+ header=TRUE,sep=";")

The default classification tree is

> arbre = rpart(factor(PRONO)~.,data=MYOCARDE)
> rpart.plot(arbre,type=4,extra=6)

We can change the options here, such as the minimum number of observations per node,

> arbre = rpart(factor(PRONO)~.,data=MYOCARDE,
+ control=rpart.control(minsplit=10))
> rpart.plot(arbre,type=4,extra=6)

or...
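The same two calls can be run on a built-in dataset (iris stands in here, since the MYOCARDE file sits behind a URL):

```r
# minsplit is the minimum number of observations a node must contain
# before a split is attempted; lowering it allows deeper trees.
library(rpart)
arbre  <- rpart(Species ~ ., data = iris)
arbre2 <- rpart(Species ~ ., data = iris,
                control = rpart.control(minsplit = 10))
nrow(arbre$frame)   # number of nodes in the default tree
```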

Read more »

Some More Results on the Theory of Statistical Learning

March 8, 2015

Yesterday, I did mention a popular graph discussed when studying the theoretical foundations of statistical learning. But there is usually another one, which is the following. Let us get back to the underlying formulas. On the training sample, we have some empirical risk, defined as $\widehat{R}_n(h)=\frac{1}{n}\sum_{i=1}^n \ell(y_i,h(x_i))$ for some loss function $\ell$. Why is it complicated? From the law of large...
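The law-of-large-numbers point can be illustrated with a small simulation (my own toy setup, not the post's): for a fixed predictor, the empirical risk converges to the true risk as the sample grows.

```r
# Fixed predictor h(x) = x, squared loss, true model y = x + noise with
# unit variance, so the true risk is exactly 1. The empirical risk
# (average loss on the sample) approaches it as n grows.
set.seed(1)
emp_risk <- function(n) {
  x <- runif(n); y <- x + rnorm(n)
  mean((y - x)^2)                    # empirical risk of h(x) = x
}
risks <- sapply(c(100, 10000, 1000000), emp_risk)
risks                                # approaches the true risk, 1
```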

Read more »

Some Intuition About the Theory of Statistical Learning

March 7, 2015

While I was working on the Theory of Statistical Learning, and the concept of consistency, I found the following popular graph (e.g. from those slides, here in French). The curve below is the error on the training sample, as a function of the size of the training sample. Above, it is the error on a validation sample. Our learning...
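The two curves described here can be reproduced with a toy simulation (my own setup, not the one from the slides): training error of a fitted model, and error on a held-out validation sample, as functions of the training-set size.

```r
# Simulate y = x + noise; for each training size, fit a linear model and
# record the mean squared error on the training set and on a fixed
# validation sample. Training error tends to rise towards the noise level
# as n grows, while validation error falls towards it.
set.seed(1)
mse <- function(fit, d) mean((d$y - predict(fit, newdata = d))^2)
valid <- data.frame(x = runif(5000)); valid$y <- valid$x + rnorm(5000)
sizes <- c(5, 20, 100, 1000)
curves <- t(sapply(sizes, function(n) {
  train <- data.frame(x = runif(n)); train$y <- train$x + rnorm(n)
  fit <- lm(y ~ x, data = train)
  c(train_err = mse(fit, train), valid_err = mse(fit, valid))
}))
cbind(n = sizes, curves)
```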

Read more »