Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Before my course on « big data and economics » at the university of Barcelona in July, I wanted to upload a series of posts on classification techniques, to get an insight on machine learning tools.

According to some common idea, machine learning algorithms are black boxes. I wanted to get back on that saying. First of all, isn’t it the case also for regression models, like generalized additive models (with splines) ? Do you really know what the algorithm is doing ? Even the logistic regression. In textbooks, we can easily find math formulas. But what is really done when I run it, in R ?

When I started working on academia, someone told me something like « if you really want to understand a theory, teach it ». And that has been my moto for more than 15 years. I wanted to add a second part to that statement: « if you really want to understand an algorithm, recode it ». So let’s try this… My ambition is to recode (more or less) most of the standard algorithms used in predictive modeling, from scratch, in R. What I plan to mention, within the next two weeks, will be

I will use two datasets to illustrate. The first one is inspired by the cover of « Foundations of Machine Learning » by Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar. At least, with this dataset, it will be possible to plot predictions (since there are only two – continuous – features)

x = c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
y = c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
z = c(1,1,1,1,1,0,0,1,0,0)
df = data.frame(x1=x,x2=y,y=as.factor(z))
plot(x,y,pch=c(1,19)[1+z])

Here is some code to get a visualization of the prediction (here the probability to be a black point)

rmatrix_model = function(model){
u = seq(0,1,length=101)
p = function(x,y) predict(model,newdata=data.frame(x1=x,x2=y),type="response")
v = outer(u,u,p)
return(v)}
nice_graph=function(v){
u = seq(0,1,length=101)
image(u,u,v,xlab="Variable 1",ylab="Variable 2",col=clr10[c(1,10)],breaks=c(0,5,10)/10)
points(x,y,pch=19,cex=1.5,col="white")
points(x,y,pch=c(1,19)[1+z],cex=1.5)
}
reg = glm(y~x1+x2,data=df,family=binomial)
nice_graph(rmatrix_model(reg))


Note that colors are defined here as

clr10= c("#ffffff","#f7fcfd","#e5f5f9","#ccece6","#99d8c9","#66c2a4","#41ae76","#238b45","#006d2c","#00441b")

or with some nonlinear model

The second one is a dataset I got from Gilbert Saporta, about heart attacks and decease (our binary variable).

myocarde = read.table("http://freakonometrics.free.fr/myocarde.csv",head=TRUE, sep=";")
myocarde$PRONO = (myocarde$PRONO=="SURVIE")*1
y = myocarde\$PRONO
X = as.matrix(cbind(1,myocarde[,1:7]))

So far, I do not plan to talk (too much) on the choice of tunning parameters (and cross-validation), on comparing models, etc. The goal here is simply to understand what’s going on when we call either glm, glmnet, gam, random forest, svm, xgboost, or any function to get a predict model.