**R-english – Freakonometrics**, and kindly contributed to R-bloggers)

Before my course on « big data and economics » at the university of Barcelona in July, I wanted to upload a series of posts on classification techniques, to get an insight on machine learning tools.

According to some common idea, machine learning algorithms are black boxes. I wanted to get back on that saying. First of all, isn’t it the case also for regression models, like generalized additive models (with splines) ? Do you really know what the algorithm is doing ? Even the logistic regression. In textbooks, we can easily find math formulas. But what is really done when I run it, in R ?

When I started working on academia, someone told me something like « **if you really want to understand a theory, teach it** ». And that has been my moto for more than 15 years. I wanted to add a second part to that statement: « **if you really want to understand an algorithm, recode it **». So let’s try this… My ambition is to recode (more or less) most of the standard algorithms used in predictive modeling, from scratch, in R. What I plan to mention, within the next two weeks, will be

- the logistic regression
- the logistic regression with splines
- the logistic regression with kernels (and knn)
- the penalized logistic regression, ridge
- the penalized logistic regression, lasso
- the heuristics of neural nets
- an introduction to SVM
- classification trees
- bagging and random forests
- gradient boosting (and adaboost)

I will use two datasets to illustrate. The first one is inspired by the cover of « Foundations of Machine Learning » by Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar. At least, with this dataset, it will be possible to plot predictions (since there are only two – continuous – features)

1 2 3 4 5 | x = c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85) y = c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3) z = c(1,1,1,1,1,0,0,1,0,0) df = data.frame(x1=x,x2=y,y=as.factor(z)) plot(x,y,pch=c(1,19)[1+z]) |

Here is some code to get a visualization of the prediction (here the probability to be a black point)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 | rmatrix_model = function(model){ u = seq(0,1,length=101) p = function(x,y) predict(model,newdata=data.frame(x1=x,x2=y),type="response") v = outer(u,u,p) return(v)} nice_graph=function(v){ u = seq(0,1,length=101) image(u,u,v,xlab="Variable 1",ylab="Variable 2",col=clr10[c(1,10)],breaks=c(0,5,10)/10) points(x,y,pch=19,cex=1.5,col="white") points(x,y,pch=c(1,19)[1+z],cex=1.5) contour(u,u,v,levels = .5,add=TRUE) } reg = glm(y~x1+x2,data=df,family=binomial) nice_graph(rmatrix_model(reg)) |

Note that colors are defined here as

1 | clr10= c("#ffffff","#f7fcfd","#e5f5f9","#ccece6","#99d8c9","#66c2a4","#41ae76","#238b45","#006d2c","#00441b") |

or with some nonlinear model

The second one is a dataset I got from Gilbert Saporta, about heart attacks and decease (our binary variable).

1 2 3 4 | myocarde = read.table("http://freakonometrics.free.fr/myocarde.csv",head=TRUE, sep=";") myocarde$PRONO = (myocarde$PRONO=="SURVIE")*1 y = myocarde$PRONO X = as.matrix(cbind(1,myocarde[,1:7])) |

So far, I do not plan to talk (too much) on the choice of tunning parameters (and cross-validation), on comparing models, etc. The goal here is simply to understand what’s going on when we call either glm, glmnet, gam, random forest, svm, xgboost, or any function to get a predict model.

**leave a comment**for the author, please follow the link and comment on their blog:

**R-english – Freakonometrics**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...