**R-english – Freakonometrics**, and kindly contributed to R-bloggers)

In our data science course, this morning, we’ve use random forrest to improve prediction on the German Credit Dataset. The dataset is

> url="http://freakonometrics.free.fr/german_credit.csv" > credit=read.csv(url, header = TRUE, sep = ",")

Almost all variables are treated a numeric, but actually, most of them are factors,

> str(credit) 'data.frame': 1000 obs. of 21 variables: $ Creditability : int 1 1 1 1 1 1 1 1 1 1 ... $ Account.Balance : int 1 1 2 1 1 1 1 1 4 2 ... $ Duration : int 18 9 12 12 12 10 8 ... $ Purpose : int 2 0 9 0 0 0 0 0 3 3 ...

(etc). Let us convert categorical variables as factors,

> F=c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20) > for(i in F) credit[,i]=as.factor(credit[,i])

Let us now create our training/calibration and validation/testing datasets, with proportion 1/3-2/3

> i_test=sample(1:nrow(credit),size=333) > i_calibration=(1:nrow(credit))[-i_test]

The first model we can fit is a logistic regression, on selected covariates

> LogisticModel <- glm(Creditability ~ Account.Balance + Payment.Status.of.Previous.Credit + Purpose + Length.of.current.employment + Sex...Marital.Status, family=binomial, data = credit[i_calibration,])

Based on that model, it is possible to draw the ROC curve, and to compute the AUC (on ne validation dataset)

> fitLog <- predict(LogisticModel,type="response", + newdata=credit[i_test,]) > library(ROCR) > pred = prediction( fitLog, credit$Creditability[i_test]) > perf <- performance(pred, "tpr", "fpr") > plot(perf) > AUCLog1=performance(pred, measure = "auc")@y.values[[1]] > cat("AUC: ",AUCLog1,"n") AUC: 0.7340997

An alternative is to consider a logistic regression on all explanatory variables

> LogisticModel <- glm(Creditability ~ ., + family=binomial, + data = credit[i_calibration,])

We might overfit, here, and we should observe that on the ROC curve

> fitLog <- predict(LogisticModel,type="response", + newdata=credit[i_test,]) > pred = prediction( fitLog, credit$Creditability[i_test]) > perf <- performance(pred, "tpr", "fpr") > plot(perf) > AUCLog2=performance(pred, measure = "auc")@y.values[[1]] > cat("AUC: ",AUCLog2,"n") AUC: 0.7609792

There is a slight improvement here, compared with the previous model, where only five explanatory variables were considered.

Consider now some regression tree (on all covariates)

> library(rpart) > ArbreModel <- rpart(Creditability ~ ., + data = credit[i_calibration,])

We can visualize the tree using

> library(rpart.plot) > prp(ArbreModel,type=2,extra=1)

The ROC curve for that model is

> fitArbre <- predict(ArbreModel, + newdata=credit[i_test,], + type="prob")[,2] > pred = prediction( fitArbre, credit$Creditability[i_test]) > perf <- performance(pred, "tpr", "fpr") > plot(perf) > AUCArbre=performance(pred, measure = "auc")@y.values[[1]] > cat("AUC: ",AUCArbre,"n") AUC: 0.7100323

As expected, a single has a lower performance, compared with a logistic regression. And a natural idea is to grow several trees using some boostrap procedure, and then to agregate those predictions.

> library(randomForest) > RF <- randomForest(Creditability ~ ., + data = credit[i_calibration,]) > fitForet <- predict(RF, + newdata=credit[i_test,], + type="prob")[,2] > pred = prediction( fitForet, credit$Creditability[i_test]) > perf <- performance(pred, "tpr", "fpr") > plot(perf) > AUCRF=performance(pred, measure = "auc")@y.values[[1]] > cat("AUC: ",AUCRF,"n") AUC: 0.7682367

Here this model is (slightly) better than the logistic regression. Actually, if we create many training/validation samples, and compare the AUC, we can observe that – on average – random forests perform better than logistic regressions,

> AUC=function(i){ + set.seed(i) + i_test=sample(1:nrow(credit),size=333) + i_calibration=(1:nrow(credit))[-i_test] + LogisticModel <- glm(Creditability ~ ., + family=binomial, + data = credit[i_calibration,]) + summary(LogisticModel) + fitLog <- predict(LogisticModel,type="response", + newdata=credit[i_test,]) + library(ROCR) + pred = prediction( fitLog, credit$Creditability[i_test]) + AUCLog2=performance(pred, measure = "auc")@y.values[[1]] + RF <- randomForest(Creditability ~ ., + data = credit[i_calibration,]) + fitForet <- predict(RF, + newdata=credit[i_test,], + type="prob")[,2] + pred = prediction( fitForet, credit$Creditability[i_test]) + AUCRF=performance(pred, measure = "auc")@y.values[[1]] + return(c(AUCLog2,AUCRF)) + } > A=Vectorize(AUC)(1:200) > plot(t(A))

**leave a comment**for the author, please follow the link and comment on their blog:

**R-english – Freakonometrics**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...