Machine Learning Ex5.2 – Regularized Logistic Regression

March 20, 2011

Exercise 5.2 improves the logistic regression implementation from Exercise 4 by adding a regularization parameter that reduces the problem of over-fitting. We will again be using Newton's method.

Data

Here's the data we want to fit.

# regularized logistic regression
# load the data
mydata = read.csv("http://spreadsheets.google.com/pub?key=0AnypY27pPCJydHZPN2pFbkZGd1RKeU81OFY3ZHJldWc&output=csv", header = TRUE)

# plot the data
plot(mydata$u[mydata$y == 0], mydata$v[mydata$y == 0], xlab="u", ylab="v")
points(mydata$u[mydata$y == 1], mydata$v[mydata$y == 1], col="blue", pch=3)
legend("topright", c("y=0","y=1"), pch=c(1, 3), col=c("black", "blue"), bty="n")

http://al3xandr3.github.com/img/ml-ex52-data.png

The idea is to create a mathematical model that separates the circles from the crosses in the plot above, and that will then allow us to predict, for a new pair of u and v values, the probability of it being a cross.

Theory

The hypothesis is:

\[ h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} \]

The idea of regularization is to loosen up the tight fit and obtain a more generalized one. For that we define the cost function with an added regularization term that blunts the fit, like so:

\[ J(\theta) = \frac{1}{m} \sum_{i=1}^m \left[ -y^{(i)} \log(h_\theta(x^{(i)})) - (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2 \]

\(\lambda\) is called the regularization parameter.

The iterative \(\theta\) update using Newton's method is defined as:

\[ \theta^{(t+1)} = \theta^{(t)} - H^{-1} \nabla_{\theta} J \]

The gradient and the Hessian (in vectorized form) are defined like so:

\[ \nabla_{\theta} J = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \, x^{(i)} + \frac{\lambda}{m} \theta \]

\[ H = \frac{1}{m} \sum_{i=1}^m \left[ h_\theta(x^{(i)}) \, (1 - h_\theta(x^{(i)})) \, x^{(i)} (x^{(i)})^T \right] + \frac{\lambda}{m} \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix} \]

The 0 in the top-left corner of the regularization matrix means the intercept \(\theta_0\) is not penalized; the code below enforces the same exclusion in the gradient and the cost.
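For the implementation it helps to restate these sums in matrix form. This is only a rewrite of the same formulas, with \(X\) the \(m \times n\) design matrix whose rows are the \(x^{(i)}\), and \(h_\theta(X) = g(X\theta)\) applied element-wise:

\[ \nabla_{\theta} J = \frac{1}{m} X^T (h_\theta(X) - y) + \frac{\lambda}{m} \theta \qquad H = \frac{1}{m} X^T \operatorname{diag}\big(h_\theta(X)(1 - h_\theta(X))\big) X + \frac{\lambda}{m} \operatorname{diag}(0, 1, \dots, 1) \]

(again with the first component of the \(\frac{\lambda}{m}\theta\) term zeroed). This is the form the R functions grad and H below implement.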

Implementation

Let's first define the functions above, with the added regularization parameter:

# sigmoid function
g = function (z) {
  return (1 / (1 + exp(-z) ))
} # plot(g(c(1,2,3,4,5,6)), type="l")

# build hight order feature vector
# for 2 features, for a given degree
hi.features = function (f1,f2,deg) {
  ma = matrix(rep(1,length(f1)))
  for (i in 1:deg) {
    for (j in 0:i)    
      ma = cbind(ma, f1^(i-j) * f2^j)
  }
  return(ma)
} # hi.features(c(1,2), c(3,4),2)
# creates: 1 u v u^2 uv v^2 ...

h = function (x,th) {
  return(g(x %*% th))
} # h(x,th)

# derivative of J 
grad = function (x,y,th,m,la) {
  G = (la/m * th)
  G[1,] = 0
  return((1/m * t(x) %*% (h(x,th) - y)) +  G)
} # grad(x,y,th,m,la)

H = function (x,y,th,m,la) {
  n = length(th)
  L = la/m * diag(n)
  L[1,1] = 0   # don't regularize the intercept
  hx = as.vector(h(x,th))
  # weight each row of x by h*(1-h) before forming t(x) %*% x
  return((1/m * t(x) %*% (x * (hx * (1 - hx)))) + L)
} # H(x,y,th,m,la)

# cost function
J = function (x,y,th,m,la) {
  pt = th
  pt[1] = 0
  A = (la/(2*m))* t(pt) %*% pt
  return((1/m * sum(-y * log(h(x,th)) - (1 - y) * log(1 - h(x,th)))) + A)
} # J(x,y,th,m,la)
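
As a quick sanity check (a small usage example, not part of the exercise), the helpers behave as the comments describe:

# degree-2 features for two sample points: columns are 1, u, v, u^2, u*v, v^2
hi.features(c(1,2), c(3,4), 2)
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    1    3    1    3    9
# [2,]    1    2    4    4    8   16
# the sigmoid maps 0 to 0.5, so with th = 0 every prediction starts at 0.5
g(0)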

Now we can run it, first for \(\lambda = 1\):

# setup variables
m = length(mydata$u) # samples
x = hi.features(mydata$u, mydata$v,6)
n = ncol(x) # features
y = matrix(mydata$y, ncol=1)
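# note: with degree 6, hi.features gives (6+1)*(6+2)/2 = 28 columns
# (all monomials in u and v up to total degree 6), so n is 28 here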

# lambda = 1
# use the cost function to check it works
th1 = matrix(0,n)
la = 1
jiter = array(0,c(15,1))
for (i in 1:15) {
  jiter[i] = J(x,y,th1,m,la)
  th1 = th1 - solve(H(x,y,th1,m,la)) %*% grad(x,y,th1,m,la) 
}

Validate that it is converging properly by plotting the cost \(J\) against the number of iterations.

# check that it is converging correctly
plot(jiter, xlab="iterations", ylab="cost J")
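
A complementary numeric check, as a small sketch (it only assumes the jiter vector filled in the loop above): the change in J between successive iterations should shrink essentially to zero.

# successive differences of the cost; they should approach 0 after a few steps
diff(as.vector(jiter))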

http://al3xandr3.github.com/img/ml-ex52-j.png

It converges well and fast, as is typical of Newton's method.

Now we run the same iteration for \(\lambda = 0\) and \(\lambda = 10\), to compare the fits later:

# lambda = 0
th0 = matrix(0,n)
la = 0
for (i in 1:15) {
  th0 = th0 - solve(H(x,y,th0,m,la)) %*% grad(x,y,th0,m,la) 
}

# lambda = 10
th10 = matrix(0,n)
la = 10
for (i in 1:15) {
  th10 = th10 - solve(H(x,y,th10,m,la)) %*% grad(x,y,th10,m,la) 
}
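
The three loops above differ only in \(\lambda\). A minimal refactor sketch (newton.lr is a hypothetical helper, not part of the original exercise) that wraps the Newton iteration so each \(\lambda\) needs only one call:

newton.lr = function (x, y, la, iters = 15) {
  m = nrow(x)               # number of samples
  th = matrix(0, ncol(x))   # start from theta = 0
  for (i in 1:iters) {
    th = th - solve(H(x, y, th, m, la)) %*% grad(x, y, th, m, la)
  }
  return(th)
}
# e.g.: th0 = newton.lr(x, y, 0); th1 = newton.lr(x, y, 1); th10 = newton.lr(x, y, 10)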

Finally, calculate the decision boundary and visualize it. The boundary is the set of points where \(\theta^T x = 0\) (i.e. \(h_\theta(x) = 0.5\)), so we evaluate \(\theta^T x\) on a grid of (u, v) values and draw its zero-level contour:

# calculate the decision boundary line
# create many points
u = seq(-1, 1.2, len=200)
v = seq(-1, 1.2, len=200)
z0 = matrix(0, length(u), length(v))
z1 = matrix(0, length(u), length(v))
z10 = matrix(0, length(u), length(v))
for (i in 1:length(u)) {
  for (j in 1:length(v)) {
    # contour() expects z[i,j] to hold the value at (u[i], v[j])
    z0[i,j]  = hi.features(u[i],v[j],6) %*% th0
    z1[i,j]  = hi.features(u[i],v[j],6) %*% th1
    z10[i,j] = hi.features(u[i],v[j],6) %*% th10
  }
}

# plots
contour(u, v, z0, levels=0, drawlabels=FALSE, xlab="u", ylab="v", col="black", lty=2)
contour(u, v, z1, levels=0, drawlabels=FALSE, col="red", lty=2, add=TRUE)
contour(u, v, z10, levels=0, drawlabels=FALSE, col="green3", lty=2, add=TRUE)
points(mydata$u[mydata$y == 0], mydata$v[mydata$y == 0])
points(mydata$u[mydata$y == 1], mydata$v[mydata$y == 1], col="blue", pch=3)
legend("topright",  c(expression(lambda==0), expression(lambda==1),expression(lambda==10)), lty=1, col=c("black", "red","green3"),bty="n" )

http://al3xandr3.github.com/img/ml-ex52-fit.png

Note that the black line (\(\lambda = 0\)) fits the crosses most tightly, and as we increase \(\lambda\) the boundary becomes looser (and more generalized).
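
With a fitted \(\theta\) we can also do what was promised at the start: estimate the probability that a new (u, v) point is a cross. A minimal sketch, assuming the objects defined above (hi.features, h, th1) are still in the workspace; the point (0.2, 0.5) is just an arbitrary example:

# probability that u=0.2, v=0.5 is a cross (y=1), under the lambda=1 fit
p = h(hi.features(0.2, 0.5, 6), th1)
p  # a value between 0 and 1; > 0.5 would be classified as a cross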
