Classification from scratch, overview 0/8

May 29, 2018
By

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

Before my course on « big data and economics » at the university of Barcelona in July, I wanted to upload a series of posts on classification techniques, to get an insight on machine learning tools.

According to some common idea, machine learning algorithms are black boxes. I wanted to get back on that saying. First of all, isn’t it the case also for regression models, like generalized additive models (with splines) ? Do you really know what the algorithm is doing ? Even the logistic regression. In textbooks, we can easily find math formulas. But what is really done when I run it, in R ?

When I started working on academia, someone told me something like « if you really want to understand a theory, teach it ». And that has been my moto for more than 15 years. I wanted to add a second part to that statement: « if you really want to understand an algorithm, recode it ». So let’s try this… My ambition is to recode (more or less) most of the standard algorithms used in predictive modeling, from scratch, in R. What I plan to mention, within the next two weeks, will be

I will use two datasets to illustrate. The first one is inspired by the cover of « Foundations of Machine Learning » by Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar. At least, with this dataset, it will be possible to plot predictions (since there are only two – continuous – features)

1
2
3
4
5
x = c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
y = c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
z = c(1,1,1,1,1,0,0,1,0,0)
df = data.frame(x1=x,x2=y,y=as.factor(z))
plot(x,y,pch=c(1,19)[1+z])

Here is some code to get a visualization of the prediction (here the probability to be a black point)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
rmatrix_model = function(model){
u = seq(0,1,length=101)
p = function(x,y) predict(model,newdata=data.frame(x1=x,x2=y),type="response")
v = outer(u,u,p)
return(v)}
nice_graph=function(v){
u = seq(0,1,length=101)
image(u,u,v,xlab="Variable 1",ylab="Variable 2",col=clr10[c(1,10)],breaks=c(0,5,10)/10)
points(x,y,pch=19,cex=1.5,col="white")
points(x,y,pch=c(1,19)[1+z],cex=1.5)
contour(u,u,v,levels = .5,add=TRUE)
}
reg = glm(y~x1+x2,data=df,family=binomial)
nice_graph(rmatrix_model(reg))

Note that colors are defined here as

1
clr10= c("#ffffff","#f7fcfd","#e5f5f9","#ccece6","#99d8c9","#66c2a4","#41ae76","#238b45","#006d2c","#00441b")

or with some nonlinear model

The second one is a dataset I got from Gilbert Saporta, about heart attacks and decease (our binary variable).

1
2
3
4
myocarde = read.table("http://freakonometrics.free.fr/myocarde.csv",head=TRUE, sep=";")
myocarde$PRONO = (myocarde$PRONO=="SURVIE")*1
y = myocarde$PRONO
X = as.matrix(cbind(1,myocarde[,1:7]))

So far, I do not plan to talk (too much) on the choice of tunning parameters (and cross-validation), on comparing models, etc. The goal here is simply to understand what’s going on when we call either glm, glmnet, gam, random forest, svm, xgboost, or any function to get a predict model.

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)