Blog Archives

Growing some Trees

March 18, 2015

Consider here the dataset used in a previous post, about visualising a classification (with more than 2 features),

> MYOCARDE=read.table(
+ "http://freakonometrics.free.fr/saporta.csv",
+ header=TRUE,sep=";")

The default classification tree is

> arbre = rpart(factor(PRONO)~.,data=MYOCARDE)
> rpart.plot(arbre,type=4,extra=6)

We can change the options here, such as the minimum number of observations per node,

> arbre = rpart(factor(PRONO)~.,data=MYOCARDE,
+ control=rpart.control(minsplit=10))
> rpart.plot(arbre,type=4,extra=6)

or...
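
As a side note, here is a self-contained version of that workflow (a sketch, assuming the rpart and rpart.plot packages are installed):

library(rpart)       # recursive partitioning trees
library(rpart.plot)  # enhanced tree plotting

# myocardial infarction dataset used throughout these posts
MYOCARDE <- read.table("http://freakonometrics.free.fr/saporta.csv",
                       header = TRUE, sep = ";")

# default tree, and a deeper one allowing splits on smaller nodes
arbre  <- rpart(factor(PRONO) ~ ., data = MYOCARDE)
arbre2 <- rpart(factor(PRONO) ~ ., data = MYOCARDE,
                control = rpart.control(minsplit = 10))

rpart.plot(arbre,  type = 4, extra = 6)
rpart.plot(arbre2, type = 4, extra = 6)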

Read more »

Some More Results on the Theory of Statistical Learning

March 8, 2015

Yesterday, I mentioned a popular graph discussed when studying the theoretical foundations of statistical learning. But there is usually another one, which is the following. Let us get back to the underlying formulas. On the training sample, we have some empirical risk, defined as $\widehat{R}_n(h)=\frac{1}{n}\sum_{i=1}^n \ell(y_i,h(\boldsymbol{x}_i))$, for some loss function $\ell$. Why is it complicated? From the law of large...
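
To make the law-of-large-numbers argument concrete, here is a tiny simulation of my own (not from the post): for a fixed classifier whose true misclassification probability is assumed to be 0.3, the empirical 0-1 risk on a sample of size n converges to that value as n grows.

set.seed(1)
true_risk <- 0.3   # assumed misclassification probability of a fixed classifier
n <- 10^(1:5)
# empirical 0-1 risk: the average of n i.i.d. Bernoulli(true_risk) losses
empirical_risk <- sapply(n, function(k) mean(runif(k) < true_risk))
cbind(n, empirical_risk)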

Read more »

Some Intuition About the Theory of Statistical Learning

March 7, 2015

While I was working on the Theory of Statistical Learning, and the concept of consistency, I found the following popular graph (e.g. from those slides, here in French). The curve below is the error on the training sample, as a function of the size of the training sample. Above, it is the error on a validation sample. Our learning...
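
A quick way to reproduce that kind of graph (my own sketch, with a simulated dataset and a tree learner, not the post's code):

library(rpart)  # any learner would do; a tree is used here for illustration

set.seed(1)
n_max <- 1000
df <- data.frame(x1 = rnorm(2 * n_max), x2 = rnorm(2 * n_max))
df$y <- factor(df$x1 + df$x2 + rnorm(2 * n_max) > 0)
valid <- df[(n_max + 1):(2 * n_max), ]  # held-out validation sample

sizes <- seq(50, n_max, by = 50)
err <- t(sapply(sizes, function(n) {
  train <- df[1:n, ]
  fit <- rpart(y ~ x1 + x2, data = train)
  c(train = mean(predict(fit, train, type = "class") != train$y),
    valid = mean(predict(fit, valid, type = "class") != valid$y))
}))
matplot(sizes, err, type = "l", lty = 1, col = c("blue", "red"),
        xlab = "size of the training sample", ylab = "error")
legend("right", c("training error", "validation error"),
       lty = 1, col = c("blue", "red"))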

Read more »

Visualising a Classification in High Dimension

March 6, 2015

So far, when discussing classification, we've been playing with my toy dataset (actually, I should not claim it's mine; it is inspired by the one used in the introduction of Boosting, by Robert Schapire and Yoav Freund). But in real life, there are more observations, and more explanatory variables. With more than two explanatory variables, it starts to be more complicated...
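
One simple way to look at such a dataset (a sketch of mine, using the MYOCARDE data from the related posts; the post itself may proceed differently) is to project the explanatory variables on their first two principal components and colour points by class:

MYOCARDE <- read.table("http://freakonometrics.free.fr/saporta.csv",
                       header = TRUE, sep = ";")

# principal components of the (scaled) explanatory variables
X   <- MYOCARDE[, setdiff(names(MYOCARDE), "PRONO")]
pca <- princomp(scale(X))

plot(pca$scores[, 1], pca$scores[, 2],
     pch = 19, col = c("red", "blue")[factor(MYOCARDE$PRONO)],
     xlab = "first principal component",
     ylab = "second principal component")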

Read more »

Supervised Classification, beyond the logistic

March 5, 2015

In our data-science class, after discussing limitations of the logistic regression, e.g. the fact that the decision boundary was a straight line, we've mentioned possible natural extensions. Let us consider our (now) standard dataset

clr1 <- c(rgb(1,0,0,1),rgb(0,0,1,1))
clr2 <- c(rgb(1,0,0,.2),rgb(0,0,1,.2))
x <- c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
y <- c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
df <- data.frame(x,y,z)
z <- c(1,1,1,1,1,0,0,1,0,0)
plot(x,y,pch=19,cex=2,col=clr1)

One can consider a quadratic...
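
For instance, a quadratic extension can be obtained by adding squared and cross terms to the logistic regression; the following completion is a sketch of mine (with ten points and six parameters, glm() may warn about separation, which is fine for the illustration):

clr1 <- c(rgb(1,0,0,1), rgb(0,0,1,1))
clr2 <- c(rgb(1,0,0,.2), rgb(0,0,1,.2))
x <- c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
y <- c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
z <- c(1,1,1,1,1,0,0,1,0,0)
df <- data.frame(x, y, z)

# logistic regression with quadratic and interaction terms
reg <- glm(z ~ x + y + I(x^2) + I(y^2) + I(x * y),
           family = binomial, data = df)

# predicted class on a fine grid, drawn as a pale background
grid <- expand.grid(x = seq(0, 1, length = 101),
                    y = seq(0, 1, length = 101))
p <- predict(reg, newdata = grid, type = "response")
plot(grid$x, grid$y, pch = 19, cex = .4, col = clr2[(p > .5) + 1],
     xlab = "x", ylab = "y")
points(df$x, df$y, pch = 19, cex = 2, col = clr1[df$z + 1])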

Read more »

Supervised Classification, discriminant analysis

March 3, 2015

Another popular technique for classification (or at least, one that used to be popular) is (linear) discriminant analysis, introduced by Ronald Fisher in 1936. Consider the same dataset as in our previous post

> clr1 <- c(rgb(1,0,0,1),rgb(0,0,1,1))
> x <- c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
> y <- c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
> z <- c(1,1,1,1,1,0,0,1,0,0)
> df <- data.frame(x,y,z)
> plot(x,y,pch=19,cex=2,col=clr1)

The main interest of...
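
A runnable sketch of that analysis, using lda() from the MASS package (the grid shading is my own addition):

library(MASS)  # lda()

clr1 <- c(rgb(1,0,0,1), rgb(0,0,1,1))
x <- c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
y <- c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
z <- c(1,1,1,1,1,0,0,1,0,0)
df <- data.frame(x, y, z)

fit <- lda(z ~ x + y, data = df)

# classify a grid of points to reveal the (linear) decision boundary
grid <- expand.grid(x = seq(0, 1, length = 101),
                    y = seq(0, 1, length = 101))
cl <- predict(fit, newdata = grid)$class
plot(grid$x, grid$y, pch = ".", col = clr1[as.numeric(cl)],
     xlab = "x", ylab = "y")
points(df$x, df$y, pch = 19, cex = 2, col = clr1[df$z + 1])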

Read more »

Supervised Classification, Logistic and Multinomial

March 2, 2015

We will start, in our Data Science course, to discuss classification techniques (in the context of supervised models). Consider the following case, with 10 points, and two classes (red and blue)

> clr1 <- c(rgb(1,0,0,1),rgb(0,0,1,1))
> clr2 <- c(rgb(1,0,0,.2),rgb(0,0,1,.2))
> x <- c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
> y <- c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
> z <- c(1,1,1,1,1,0,0,1,0,0)
> df <- data.frame(x,y,z)
> plot(x,y,pch=19,cex=2,col=clr1)

To get...
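
Sketching where that leads (my own completion, not the post's code): a logistic regression fitted with glm(), and the multinomial counterpart with multinom() from the nnet package, which reduces to the logistic model when there are only two classes:

library(nnet)  # multinom()

x <- c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
y <- c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
z <- c(1,1,1,1,1,0,0,1,0,0)
df <- data.frame(x, y, z)

# logistic regression for the two-class problem
reg <- glm(z ~ x + y, family = binomial, data = df)
summary(reg)

# multinomial logit; with two classes it matches the logistic fit
mreg <- multinom(factor(z) ~ x + y, data = df)
predict(mreg, newdata = data.frame(x = .5, y = .5), type = "probs")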

Read more »

John Snow, and Google Maps

February 27, 2015

In my previous post, I discussed how to use OpenStreetMap (and standard plotting functions of R) to visualize John Snow's dataset. But it is also possible to use Google Maps (and ggplot2 types of graphs).

library(ggmap)
get_london <- get_map(c(-.137,51.513), zoom=17)
london <- ggmap(get_london)

Again, the tricky part comes from the fact that the coordinate representation system, here, is not...
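
To go one step further, here is a hedged sketch of overlaying the deaths on that map. Note that get_map() with the Google source now requires an API key (via register_google()), and the affine conversion to_lonlat() below is entirely hypothetical, with placeholder coefficients, since calibrating it is precisely the tricky part the post discusses:

library(ggmap)     # get_map(), ggmap(); attaches ggplot2
library(HistData)  # Snow.deaths
data(Snow.deaths)

get_london <- get_map(c(-.137, 51.513), zoom = 17)
london <- ggmap(get_london)

# hypothetical affine mapping from Snow's local grid to lon/lat;
# the coefficients are illustrative placeholders, not calibrated values
to_lonlat <- function(x, y)
  data.frame(lon = -0.1395 + 4e-4 * x, lat = 51.511 + 2.5e-4 * y)

pts <- to_lonlat(Snow.deaths$x, Snow.deaths$y)
london + geom_point(data = pts, aes(x = lon, y = lat),
                    col = "red", alpha = .5)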

Read more »

John Snow, and OpenStreetMap

February 27, 2015

While I was preparing a training session on data visualization, I wanted to get a nice visual for John Snow's cholera dataset. This dataset can actually be found in a great package of famous historical datasets.

library(HistData)
data(Snow.deaths)
data(Snow.streets)

One can easily visualize the deaths, on a simplified map, with the streets (here simple grey segments, see Vincent Arel-Bundock's...
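
A minimal base-R sketch along those lines (my own, in the spirit of the approach credited above):

library(HistData)
data(Snow.deaths)
data(Snow.streets)

# streets as grey segments (one polyline per street id), deaths as dots
plot(Snow.deaths$x, Snow.deaths$y, pch = 19, cex = .6, col = "red",
     xlab = "", ylab = "", axes = FALSE, asp = 1)
invisible(by(Snow.streets, Snow.streets$street,
             function(s) lines(s$x, s$y, col = "grey")))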

Read more »

Visualizing Clusters

February 24, 2015

Consider the following dataset, with (only) ten points

x=c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
y=c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
plot(x,y,pch=19,cex=2)

We want to get, say, two clusters. Or more specifically, two sets of observations, each of them sharing some similarities. Since the number of observations is rather small, it is actually possible to get an exhaustive list of all partitions, and to minimize some criterion, such...
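
One such criterion is the within-cluster sum of squares. As a sketch of my own (the post enumerates partitions exhaustively, whereas kmeans() only minimizes this criterion heuristically):

x <- c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
y <- c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)

# k-means with many restarts; with only ten points this should
# recover the optimal two-cluster partition
set.seed(1)
km <- kmeans(cbind(x, y), centers = 2, nstart = 50)
plot(x, y, pch = 19, cex = 2, col = c("red", "blue")[km$cluster])

# hierarchical clustering, cut into two groups, as an alternative
hc <- hclust(dist(cbind(x, y)))
cutree(hc, k = 2)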

Read more »
