**Freakonometrics » R-english**, and kindly contributed to R-bloggers)

In order to illustrate the problem of chosing a classification model consider some simulated data,

> n = 500 > set.seed(1) > X = rnorm(n) > ma = 10-(X+1.5)^2*2 > mb = -10+(X-1.5)^2*2 > M = cbind(ma,mb) > set.seed(1) > Z = sample(1:2,size=n,replace=TRUE) > Y = ma*(Z==1)+mb*(Z==2)+rnorm(n)*5 > df = data.frame(Z=as.factor(Z),X,Y)

A first strategy is to split the dataset in two parts, a *training *dataset, and a *testing *dataset.

> df1 = training = df[1:300,] > df2 = testing = df[301:500,]

**The Holdout Method: Training and Testing Datasets**

The two datasets can be visualised below, with the training dataset on top, and the testing dataset below

> plot(df1$X,df1$Y,pch=19,col=c(rgb(1,0,0,.4), + rgb(0,0,1,.4))[df1$Z])

We can consider a simple **classification tree.**

> library(rpart) > fit = rpart(Z~X+Y,data=df1) > pred = function(x,y) predict(fit,newdata=data.frame(X=x,Y=y))[,1]

To visualize it, use

> vx=seq(-3,3,length=101) > vy=seq(-25,25,length=101) > z=matrix(NA,length(vx),length(vy)) > for(i in 1:length(vx)){ + for(j in 1:length(vy)) + {z[i,j]=pred(vx[i],vy[j])}}

and

> image(vx,vy,z,axes=FALSE,xlab="",ylab="") > points(df1$X,df1$Y,pch=19,col=c(rgb(1,0,0,.4), + rgb(0,0,1,.4))[df1$Z])

We have on top the prediction on the training dataset, then below the prediction on the testing dataset. And finally on the bottom, the ROC curve for the testing dataset (black curve) as well as the training dataset (grey curve)

> Y1=as.numeric(df1$Z)-1 > Y2=as.numeric(df2$Z)-1 > library(ROCR) > S1 = predict(fit,newdata=df1)[,1] > S2 = predict(fit,newdata=df2)[,1] > pred <- prediction( S2, Y2 ) > perf <- performance( pred, "tpr", "fpr" ) > plot( perf ) > pred <- prediction( S1, Y1 ) > perf <- performance( pred, "tpr", "fpr" ) > plot( perf ,add=TRUE,col="grey")

From that tree, it is natural to get a prediction using a **random forest**

> library(randomForest) > fit=randomForest(Z~X+Y,data=df1)

To get a prediction, here, use

> pred=function(x,y) + predict(fit,newdata=data.frame(X=x,Y=y), + type="prob")[,2]

Not that the fit is perfect on the training sample. Another popoular model might be the **logistic regression**

> fit=glm(Z~X+Y,data=df1,family=binomial) > pred=function(x,y) + predict(fit,newdata=data.frame(X=x,Y=y), + type="response")

The later is very close to the **linear discriminent analysis**

> library(MASS) > fit=lda(Z~X+Y,data=df1,family=binomial) > pred=function(x,y) + predict(fit,newdata= + data.frame(X=x,Y=y))$posterior[,2]

or a **quadratic discriminent analysis**

> fit=qda(Z~X+Y,data=df1,family=binomial) > pred=function(x,y) + predict(fit,newdata= + data.frame(X=x,Y=y))$posterior[,2]

To try very different models, consider a ** k-nearest neighbor
**(here with 9 neighbors)

> library(caret) > fit=knn3(Z~X+Y,data=df1,k=9) > pred=function(x,y) + predict(fit,newdata=data.frame(X=x,Y=y))[,2]

some **logistic regression with bivarriate splines**

> library(mgcv) > fit=gam(Z~s(X,Y),data=df1,family=binomial) > pred=function(x,y) + predict(fit,newdata=data.frame(X=x,Y=y), + type="response")

or some **gradient boosting algorithm**

> library(dismo) > df1$Z01 = 1*(df1$Z=="2") > fit=gbm.step(data=df1, gbm.x = 2:3, gbm.y = 4, + family = "bernoulli", tree.complexity = 5, + learning.rate = 0.01, bag.fraction = 0.5) > pred = function(x,y) + predict(fit,newdata=data.frame(X=x,Y=y), + type="response",n.trees=400)

Nevertheless, the use of this holdout method to create some training and testing datasets can be difficult. It can be done only when we have a lot of observations. And even so, we can still be unluckly when creating the testing sample (simply because of bad luck). An alternative is to use cross validation.

**Using Cross-Validation**

Here, we consider one-leave-out cross-validation (but *v*-fold cross validation can also be consider, to get results faster).

Consider here a **classification tree**. To derive the cross validation ROC curve, for instance, use

> FIT=list() > for(i in 1:n) + FIT[[i]] = rpart(Z~X+Y,data=df[-i,]) > predict_i = function(i) + predict(FIT[[i]],newdata=df[i,])[,2] > S = Vectorize(predict_i)(1:n)

Here, I store the *n* models, where each time, one observation is removed, and then I get a prediction for the removed observation. And we can generate the ROC curve

> Y = as.numeric(df$Z)-1 > library(ROCR) > pred = prediction( S, Y ) > perf = performance( pred, "tpr", "fpr" ) > plot( perf )

Of course, one can easily get a code that runs faster, but that one is working on that small dataset.

If we consider a **random forest**, we get

> FIT=list() > for(i in 1:n) + FIT[[i]] = randomForest(Z~X+Y,data=df[-i,]) > predict_i = function(i) + predict(FIT[[i]],newdata=df[i,], + type="prob")[,2] > S = Vectorize(predict_i)(1:n)

One can consider a **logistic regression**

> FIT=list() > for(i in 1:n) + FIT[[i]] = glm(Z~X+Y,data=df[-i,], + family=binomial) > predict_i = function(i) + predict(FIT[[i]],newdata=df[i,], + type="response") > S = Vectorize(predict_i)(1:n)

or a **linear discriminent analysis**

> FIT=list() > for(i in 1:n) + FIT[[i]] = lda(Z~X+Y,data=df[-i,], + family=binomial) > predict_i = function(i) + predict(FIT[[i]],newdata=df[i,])$posterior[,2] > S = Vectorize(predict_i)(1:n)

while a **quadratic discriminent analysis** yields

> FIT=list() > for(i in 1:n) + FIT[[i]] = qda(Z~X+Y,data=df[-i,], + family=binomial) > predict_i = function(i) + predict(FIT[[i]],newdata=df[i,])$posterior[,2] > S = Vectorize(predict_i)(1:n)

Again, if we consider another type of model, such as a ** k-nearest neighbor** (with

*k*=5) we get

> FIT=list() > for(i in 1:n) + FIT[[i]] = knn3(Z~X+Y,data=df[-i,],k=5) > predict_i = function(i) + predict(FIT[[i]],newdata=df[i,])[,2] > S = Vectorize(predict_i)(1:n)

or a *k* nearest neighbor with *k*=9

or *k*=21

One can also consider a **logistic regression with bivariate splines**

or, last but not least, some **gradient boosting model**. On that one, we might have trouble storing 500 output from the gradient boosting function (the output is large). A faster code might be

> VS = rep(NA,n) > for(i in 1:n){ + FIT = gbm.step(data=df[-i,], + gbm.x = 2:3, gbm.y = 4, family = "bernoulli", + tree.complexity = 5, learning.rate = 0.01, + bag.fraction = 0.5) + VS[i] = predict(FIT,newdata=df[i,], + n.trees=400) + }

Actually, it is now possible to compare all those models, with all ROC curves on a single graph

**leave a comment**for the author, please follow the link and comment on their blog:

**Freakonometrics » R-english**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...