Blog Archives

Another Interactive Map for the Cholera Dataset

March 31, 2015

Following my previous post, François (aka @FrancoisKeck) posted a comment mentioning another package I could use to get an interactive map, the rleafmap package. And here, the heatmap was easy to include. This time, we do not use OpenStreetMap. The first part is still the same, to get the data, > require(rleafmap) > library(sp) > library(rgdal) > library(maptools) >...
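
As a minimal, self-contained sketch of the rleafmap workflow the post relies on: build a background layer with basemap(), wrap the points in a data layer with spLayer(), and render with writeMap(). The two coordinates and the tile-style name below are illustrative assumptions (from memory), not the post's data.

library(rleafmap)
library(sp)

# two illustrative points near Broad Street (placeholders, not the post's data)
deaths_sp <- SpatialPointsDataFrame(
  coords = cbind(c(-0.13613, -0.13793), c(51.51334, 51.51321)),
  data   = data.frame(id = 1:2),
  proj4string = CRS("+proj=longlat +datum=WGS84"))

bm  <- basemap("mapquest.map")   # background tile layer (style name assumed)
pts <- spLayer(deaths_sp)        # one dot per point
writeMap(bm, pts)                # renders the interactive map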

Read more »

Interactive Maps for John Snow’s Cholera Data

March 28, 2015

This week, in Istanbul, for the second training on data science, we’ve been discussing classification and regression models, but also visualisation. Including maps. And we had a brief introduction to the leaflet package, devtools::install_github("rstudio/leaflet") require(leaflet) To see what can be done with that package, we will once more use John Snow’s cholera dataset, discussed in previous...
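
A minimal sketch of what such a map looks like with leaflet: a tile background plus one circle marker per observation. The two coordinates below, near the Broad Street pump, are illustrative placeholders rather than the post's georeferenced data.

library(leaflet)

# two illustrative points near the Broad Street pump (not the post's data)
deaths <- data.frame(lon = c(-0.13613, -0.13793),
                     lat = c(51.51334, 51.51321))

leaflet(deaths) %>%
  addTiles() %>%                            # OpenStreetMap background tiles
  addCircleMarkers(~lon, ~lat, radius = 4)  # one marker per point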

Read more »

Splitting a Node in a Tree

March 23, 2015

If we grow a tree with standard functions in R, on the same dataset used to introduce classification trees in a previous post, > MYOCARDE=read.table( + "http://freakonometrics.free.fr/saporta.csv", + header=TRUE,sep=";") > library(rpart) > cart<-rpart(PRONO~.,data=MYOCARDE) we get > library(rpart.plot) > library(rattle) > prp(cart,type=2,extra=1) The first step is to split the first node (based on the whole dataset). To split it, we...
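
As a minimal sketch of what splitting a node means, here is the computation done by hand on one variable of the dataset above (INSYS): scan every candidate threshold and keep the one minimising the weighted Gini impurity, the criterion rpart uses by default.

MYOCARDE <- read.table("http://freakonometrics.free.fr/saporta.csv",
                       header = TRUE, sep = ";")

gini <- function(y) {                  # impurity of a binary node
  p <- mean(y == levels(factor(y))[1])
  2 * p * (1 - p)
}

candidates <- head(sort(unique(MYOCARDE$INSYS)), -1)   # candidate thresholds
impurity <- sapply(candidates, function(s) {
  left  <- MYOCARDE$PRONO[MYOCARDE$INSYS <= s]
  right <- MYOCARDE$PRONO[MYOCARDE$INSYS >  s]
  (length(left) * gini(left) + length(right) * gini(right)) / nrow(MYOCARDE)
})
candidates[which.min(impurity)]        # best threshold on INSYS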

Read more »

Regression Models, It’s Not Only About Interpretation

March 22, 2015

Yesterday, I uploaded a post where I tried to show that “standard” regression models were not performing badly. At least if you include splines (multivariate splines) to take into account joint effects and nonlinearities. So far, I have not discussed the possibly high number of features (but with bootstrap procedures, it is possible to assess something related to...
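
A minimal sketch of the point about multivariate splines, on simulated data (the variable names and the data-generating process below are illustrative, not the post's): a bivariate smooth in mgcv's gam captures a joint nonlinear effect that separate univariate terms would miss.

library(mgcv)

set.seed(1)
n  <- 500
x1 <- runif(n); x2 <- runif(n)
y  <- sin(4 * x1) * x2 + rnorm(n, sd = 0.1)   # joint, nonlinear signal

fit <- gam(y ~ s(x1, x2), data = data.frame(y, x1, x2))  # bivariate spline
summary(fit)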

Read more »

Forecast, Automatic Routines vs. Experience

March 18, 2015

This morning, in our Time Series course, we’ve been playing with some data I got from google.ca/trends/. Actually, we’ve been playing with an old version, downloaded 18 months ago (discussed in a previous post, in French). > urls = "http://freakonometrics.free.fr/report-headphones-2015.csv" > report=read.table( + urls,skip=4,header=TRUE,sep=",",nrows=585) > tail(report) Semaine headphones 580 2015-02-08 - 2015-02-14 53 581 2015-02-15 - 2015-02-21 52 582...
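
A minimal sketch of the "automatic routine" side of the comparison: turn the weekly series into a ts object and let forecast::auto.arima select a model (using auto.arima as the automatic routine is an assumption here; the post discusses the details).

library(forecast)

urls   <- "http://freakonometrics.free.fr/report-headphones-2015.csv"
report <- read.table(urls, skip = 4, header = TRUE, sep = ",", nrows = 585)

X   <- ts(report$headphones, frequency = 52)   # weekly data, yearly cycle
fit <- auto.arima(X)                           # automatic model selection
plot(forecast(fit, h = 52))                    # forecast one year ahead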

Read more »

Growing some Trees

March 18, 2015

Consider here the dataset used in a previous post, about visualising a classification (with more than 2 features), > MYOCARDE=read.table( + "http://freakonometrics.free.fr/saporta.csv", + header=TRUE,sep=";") The default classification tree is > arbre = rpart(factor(PRONO)~.,data=MYOCARDE) > rpart.plot(arbre,type=4,extra=6) We can change the options here, such as the minimum number of observations per node > arbre = rpart(factor(PRONO)~.,data=MYOCARDE, + control=rpart.control(minsplit=10)) > rpart.plot(arbre,type=4,extra=6) or...
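
One more knob in the same spirit (not necessarily the option the truncated excerpt goes on to discuss): the complexity parameter cp, which controls how much a split must improve the fit; printcp() then reports the cross-validated error of each subtree.

library(rpart)
library(rpart.plot)

MYOCARDE <- read.table("http://freakonometrics.free.fr/saporta.csv",
                       header = TRUE, sep = ";")

arbre <- rpart(factor(PRONO) ~ ., data = MYOCARDE,
               control = rpart.control(cp = 0.001, minsplit = 2))  # grow deep
printcp(arbre)                          # cross-validated error per subtree
rpart.plot(arbre, type = 4, extra = 6)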

Read more »

Some More Results on the Theory of Statistical Learning

March 8, 2015

Yesterday, I mentioned a popular graph discussed when studying the theoretical foundations of statistical learning. But there is usually another one, which is the following. Let us get back to the underlying formulas. On the training sample, we have some empirical risk, defined as $\widehat{R}_n(h)=\frac{1}{n}\sum_{i=1}^n \ell(y_i,h(x_i))$, for some loss function $\ell$. Why is it complicated? From the law of large...
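
A minimal simulation of the law-of-large-numbers point: for one fixed classifier h, the empirical risk computed on a sample of size n converges to the true risk as n grows (the data-generating process below is illustrative).

set.seed(1)
h <- function(x) as.numeric(x > .5)              # one fixed classifier

risk_hat <- function(n) {                        # empirical 0-1 risk
  x <- runif(n)
  y <- rbinom(n, size = 1, prob = ifelse(x > .5, .8, .2))
  mean(h(x) != y)
}

ns <- round(10^seq(1, 5, by = .5))
plot(ns, sapply(ns, risk_hat), log = "x", type = "b",
     xlab = "n", ylab = "empirical risk")
abline(h = .2, lty = 2)                          # true risk of h is 0.2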

Read more »

Some Intuition About the Theory of Statistical Learning

March 7, 2015

While I was working on the Theory of Statistical Learning, and the concept of consistency, I found the following popular graph (e.g. from these slides, in French). The curve below is the error on the training sample, as a function of the size of the training sample. Above, it is the error on a validation sample. Our learning...
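
A minimal way to reproduce that graph on simulated data (logistic regression here, purely for illustration): fit on growing training samples and track the misclassification error on the training points and on a held-out validation sample.

set.seed(1)
N <- 2000
x <- matrix(rnorm(2 * N), N, 2)
y <- rbinom(N, size = 1, prob = plogis(x[, 1] - x[, 2]))
valid <- 1001:N                                  # held-out validation sample

err <- function(n) {
  fit  <- glm(y[1:n] ~ x[1:n, ], family = binomial)
  p_tr <- predict(fit, type = "response")                 # training fit
  p_va <- plogis(cbind(1, x[valid, ]) %*% coef(fit))      # validation fit
  c(train = mean((p_tr > .5) != y[1:n]),
    valid = mean((p_va > .5) != y[valid]))
}

ns <- seq(20, 1000, by = 20)
E  <- sapply(ns, err)
matplot(ns, t(E), type = "l", lty = 1:2, col = 1,
        xlab = "size of the training sample", ylab = "error")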

Read more »

Visualising a Classification in High Dimension

March 6, 2015

So far, when discussing classification, we’ve been playing on my toy-dataset (actually, I should not claim it’s mine, it is inspired by the one used in the introduction of Boosting, by Robert Schapire and Yoav Freund). But in real life, there are more observations, and more explanatory variables. With more than two explanatory variables, it starts to be more complicated...
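
One generic device for that situation (a sketch of the idea, not necessarily the post's exact construction): project the observations on the first two principal components and colour them by the class predicted from all the explanatory variables.

MYOCARDE <- read.table("http://freakonometrics.free.fr/saporta.csv",
                       header = TRUE, sep = ";")

fit <- glm(factor(PRONO) ~ ., data = MYOCARDE, family = binomial)
p   <- predict(fit, type = "response")           # predicted probabilities

pca <- prcomp(MYOCARDE[, names(MYOCARDE) != "PRONO"], scale. = TRUE)
plot(pca$x[, 1:2], pch = 19,
     col = ifelse(p > .5, "blue", "red"))        # predicted class, in 2D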

Read more »

Supervised Classification, beyond the logistic

March 5, 2015

In our data-science class, after discussing limitations of the logistic regression, e.g. the fact that its decision boundary is a straight line, we’ve mentioned possible natural extensions. Let us consider our (now) standard dataset clr1 <- c(rgb(1,0,0,1),rgb(0,0,1,1)) clr2 <- c(rgb(1,0,0,.2),rgb(0,0,1,.2)) x <- c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85) y <- c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3) z <- c(1,1,1,1,1,0,0,1,0,0) df <- data.frame(x,y,z) plot(x,y,pch=19,cex=2,col=clr1[z+1]) One can consider a quadratic...
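
A minimal sketch of the quadratic extension the excerpt starts to describe: add squared and cross terms to the logistic regression, and the level curve p = 1/2 of the fitted probabilities becomes a curved decision boundary.

x <- c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
y <- c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
z <- c(1,1,1,1,1,0,0,1,0,0)
df <- data.frame(x, y, z)

fit <- glm(z ~ x + y + I(x^2) + I(y^2) + I(x*y),   # quadratic terms
           data = df, family = binomial)

grid <- expand.grid(x = seq(0, 1, length = 101),
                    y = seq(0, 1, length = 101))
grid$p <- predict(fit, newdata = grid, type = "response")

plot(df$x, df$y, pch = 19, cex = 2, col = c("red", "blue")[1 + df$z])
contour(unique(grid$x), unique(grid$y),
        matrix(grid$p, 101, 101), levels = .5, add = TRUE)  # boundary p = 1/2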

Read more »