Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Andrew Ng has posted introductory machine learning lessons on the OpenClassRoom site. I’ve watched the first set and will here solve Exercise 2.

The exercise is to build a linear regression implementation, I’ll use R.

The point of linear regression is to come up with a mathematical function(model) that represents the data as best as possible, that is done by fitting a straight line to the observed data. This model will then allow us to make predictions on new data.

For example, the data we use here are boys ages and their corresponding heights, so when we get the mathematical model we will be able to guess the boys height from his age.

## Data

# load the data

# include ggplot2
library(ggplot2)

ex2plot = ggplot(mydata, aes(x, y)) + geom_point() +
ylab('Height in meters') +
xlab('Age in years') ## Theory

The model we will get at the end is a line that fits the data, is defined like so:

Assuming (x_0 = 1):

[ h_theta(x) = theta_0 x_0 + theta_1 x_1 + theta_2 x_2 + … ]

That can be summarized by (last is matrix notation):

[ h_theta(x) = sum_{i=0}^n theta_i x_i = theta^T x ]

Matrix representation is useful because has good support in software tools.

Goal is to get the line closest to observed data points as possible, thus we can define a cost function that returns the difference of the real data vs myModel:

[ J(theta) = frac{1}{2m} sum_{i=1}^m (h_theta(x^{(i)}) – y^{(i)})^2 ]

where (i) is each data example we have and (m) is their total.

With J we now have a metric to check if the hypotheses line is getting closer to data points or not.

Next step is to find the smaller cost as possible from J, and in fact thats exactly what the gradient descent algorithm does: starting with an inicial guess it iterates to smaller and smaller values of a given function by following the direction of the derivative:

[ x_i := x_{i-1} – epsilon f^’ (x_{i-1}) ]

Applying to our J:

[ theta_j := theta_j – alpha frac{delta}{delta theta_j} J(theta) ]

And doing a bit of calculus on derivatives we get:

[ theta_j := theta_j – alpha frac{1}{m} sum_{i=1}^m (h_theta(x^{(i)}) – y^{(i)}) x^{(i)} ]

Where (alpha) defines the size of steps of the convergence to (theta).

Now lets check if all this math really works.

## Implementation – take 1

alpha = 0.07
m = length(mydata$x) theta = c(0,0) x = mydata$x
y = mydata$y delta = function(x,y,th,m) { sum = 0 for (i in 1:m) { sum = sum + (((t(th) %*% c(1,x[i])) - y[i]) * c(1,x[i])) } return (sum) } # 1 iteration theta - alpha * 1/m * delta(x,y,theta,m)  0.07452802 0.38002167  ## Implementation – take 2 After having a peek at the Matlab solution, i learned that is possible to replace the sum in the equation with a transpose matrix multiplication(like done with the line equation): [ theta := theta – alpha frac{1}{m} x^T (xtheta^T – y) ] So we can get a full matrix implementation: alpha = 0.07 m = length(mydata$x)
theta = matrix(c(0,0), nrow=1)
x = matrix(c(rep(1,m), mydata$x), ncol=2) y = matrix(mydata$y, ncol=1)
delta = function(x,y,th) {
delta = (t(x) %*% ((x %*% t(th)) - y))
return(t(delta))
}

# 1 iteration
theta - alpha * 1/m * delta(x,y,theta)

[,1]      [,2]
[1,] 0.07452802 0.3800217


## The Model

First we run several iterations, until convergence:

for (i in 1:1500) {
theta = theta - alpha * 1/m * delta(x,y,theta)
}
theta

[,1]       [,2]
[1,] 0.7501504 0.06388338


And finally we see how well the line(model) fits the data:

ex2plot + geom_abline(intercept=theta, slope=theta) 