Machine Learning Ex 5.2 – Regularized Logistic Regression

October 25, 2011

Now we move on to the second part of Exercise 5.2, which requires implementing regularized logistic regression using Newton's method.

Plot the data:

x <- read.csv("ex5Logx.dat", header=F)
y <- read.csv("ex5Logy.dat", header=F)
y <- y[,1]

d <- data.frame(x1=x[,1],x2=x[,2],y=factor(y))
require(ggplot2)
p <- ggplot(d, aes(x=x1, y=x2)) +
    geom_point(aes(colour=y, shape=y))
p



We will now fit a regularized logistic regression model to this data.

The hypothesis function in logistic regression is:
\(h_\theta(x) = g(\theta^T x) = \frac{1}{ 1 + e ^{- \theta^T x} }=P(y=1\vert x;\theta)\)

In this exercise, the feature vector \(x\) in \(\theta^T x\) consists of all monomials of \(u\) and \(v\) up to the sixth power:
\( x=\left[\begin{array}{c} 1\\ u\\ v\\ u^2\\ uv\\ v^2\\ u^3\\ \vdots\\ uv^5\\ v^6\end{array}\right] \)

where \(x_0 = 1,\ x_1 = u,\ x_2 = v,\ \ldots,\ x_{27} = v^6\); there are 28 features in total.

I defined the function mapFeature, which maps the original inputs to this feature vector.

## map (u, v) to the 28 monomials u^(i-j) * v^j of total degree i <= degree
mapFeature <- function(u, v, degree=6) {
    out <- sapply(0:degree, function(i)
                  sapply(0:i, function(j)
                         u^(i-j) * v^j
                         )
                  )
    out <- unlist(out)
    return(out)
}
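As a quick check (this snippet is my own addition, not part of the exercise), mapping a single point should yield 28 features, starting with the intercept term:

f <- mapFeature(0.5, -0.2)
length(f)    ## 28
head(f, 6)   ## 1, 0.5, -0.2, 0.25, -0.1, 0.04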

Regularized Logistic Regression:

The cost function of regularized logistic regression is defined as:
\( J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log\big(h_\theta(x^{(i)})\big) + (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2} \)

Notice that this function works for both regularized (lambda > 0) and unregularized (lambda = 0) logistic regression. The regularization term at the end shrinks the entries of \(\theta\) toward zero, giving a smoother, more general fit that is more likely to perform well on new data (for prediction).

Newton's Method:

Newton's method update rule is:
\(\theta^{(t+1)} = \theta^{(t)} - H^{-1} \nabla_{\theta}J\)

In the regularized version of logistic regression, the gradient \(\nabla_{\theta}J\) and the Hessian \(H\) take the following forms:

\(\nabla_{\theta}J = \frac{1}{m} \sum_{i=1}^m \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x^{(i)} + \frac{\lambda}{m} \theta\)

(with the first component of the regularization term set to zero, since \(\theta_0\) is not regularized)

\(H = \frac{1}{m} \sum_{i=1}^m \left[h_\theta(x^{(i)})\,\big(1 - h_\theta(x^{(i)})\big)\, x^{(i)} (x^{(i)})^T\right] + \frac{\lambda}{m} \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}\)

Also notice that when lambda = 0, these reduce to the familiar formulas for unregularized logistic regression.

Here is my implementation:

## sigmoid function
g <- function(z) {
    toReturn <- 1/(1+exp(-z))
    return(toReturn)
}

##hypothesis function
h <- function(theta, x) {
    g(x %*% theta)
}

## cost function
J <- function(theta, x,y,lambda=1) {
    m <- length(y)
    j <- -1/m * (
                 y %*% log( h(theta,x) ) +
                 (1-y) %*% log( 1- h(theta,x) )
                 )

    r <- theta^2
    r[1] <- 0                      ## exclude theta_0 from the penalty
    j <- j + lambda/(2*m) * sum(r)
    return(j)
}
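As a quick sanity check (my own addition, using a small made-up design matrix rather than the exercise data), the cost at \(\theta = 0\) should equal \(\log 2 \approx 0.693\) for any lambda, since every \(h_\theta(x)\) is then 0.5 and the penalty term vanishes:

set.seed(1)
xx <- cbind(1, matrix(rnorm(20), ncol=2))   ## 10 hypothetical examples
yy <- rbinom(10, 1, 0.5)
J(rep(0, 3), xx, yy, lambda=1)              ## ~0.6931 = log(2)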



## gradient
grad <- function(theta, x, y, lambda=1) {
    m <- length(y)
    r <- lambda/m * theta
    r[1] <- 0                      ## theta_0 is not regularized
    g <- 1/m * t(x) %*% (h(theta,x) - y) + r
    return(g)
}

## Hessian
Hessian <- function(theta, x, lambda=1) {
    m <- nrow(x)
    n <- ncol(x)
    r <- lambda/m * diag(n)
    r[1,1] <- 0                    ## do not regularize theta_0
    hx <- as.vector(h(theta, x))
    ## H = 1/m * X' diag(h*(1-h)) X + regularization term
    H <- 1/m * t(x) %*% diag(hx * (1 - hx)) %*% x + r
    return(H)
}
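To gain some confidence in these derivatives (this check is my own addition, again on small random data rather than the exercise data), grad() can be compared against central finite differences of J(), and Hessian() against finite differences of grad():

set.seed(2)
xx <- cbind(1, matrix(rnorm(40), ncol=4))    ## 10 hypothetical examples, 5 features
yy <- rbinom(10, 1, 0.5)
tt <- rnorm(5) * 0.1
eps <- 1e-5
numGrad <- sapply(1:5, function(k) {
    e <- rep(0, 5); e[k] <- eps
    (J(tt + e, xx, yy) - J(tt - e, xx, yy)) / (2 * eps)
})
max(abs(grad(tt, xx, yy) - numGrad))         ## should be close to zero
numHess <- sapply(1:5, function(k) {
    e <- rep(0, 5); e[k] <- eps
    (grad(tt + e, xx, yy) - grad(tt - e, xx, yy)) / (2 * eps)
})
max(abs(Hessian(tt, xx) - numHess))          ## should also be close to zero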

First, I map the inputs to the feature matrix and calculate theta for lambda = 1, running ten Newton iterations.

require(plyr)                    ## for mdply
colnames(x) <- c("u", "v")
x <- mdply(x, mapFeature)        ## apply mapFeature to each (u, v) row
x <- x[, c(-1, -2)]              ## drop the original u, v columns
x <- as.matrix(x)

theta <- matrix(rep(0, ncol(x)), ncol=1)
lambda <- 1
j <- rep(0,10)
for (i in 1:10) {
    theta <- theta - solve(Hessian(theta,x, lambda)) %*% grad(theta,x,y,lambda)
    j[i] <- J(theta,x,y, lambda)
}
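As a rough check of the fit (my own addition, not part of the exercise), the fitted theta can be used to compute the classification accuracy on the training set:

pred <- as.numeric(h(theta, x) >= 0.5)   ## predict y = 1 when h >= 0.5
mean(pred == y)                          ## fraction classified correctly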

To verify that Newton's method is converging properly, we plot the value of the cost function against the iteration number.

ggplot()+
    aes(x=1:10,y=j)+
    geom_point(colour="red")+
    geom_line()+xlab("Iteration")+
    ylab("Cost J")


Converging fast.

Now we run the same Newton iterations for lambda = 0 and lambda = 10 to compare the fitted models.

theta0 <- matrix(rep(0, ncol(x)), ncol=1)
for (i in 1:10) {
  theta0 <- theta0 - solve(Hessian(theta0,x, lambda=0)) %*% grad(theta0,x,y,lambda=0)
}

theta10 <- matrix(rep(0, ncol(x)), ncol=1)
for (i in 1:10) {
  theta10 <- theta10 - solve(Hessian(theta10,x, lambda=10)) %*% grad(theta10,x,y,lambda=10)
}
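The shrinkage effect of the penalty can also be seen numerically (this comparison is my own addition): a larger lambda should give a smaller L2 norm of the fitted coefficients.

c(lambda0 = sqrt(sum(theta0^2)),
  lambda1 = sqrt(sum(theta^2)),
  lambda10 = sqrt(sum(theta10^2)))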

Finally, calculate \(\theta^T x\) on a grid of (u, v) values and visualize the decision boundaries; the boundary is the contour where \(\theta^T x = 0\), i.e. where \(h_\theta(x) = 0.5\).

u <- seq(-1,1.2, len=200)
v <- seq(-1,1.2, len=200)

z0 <- matrix(0, length(u), length(v))
z1 <- matrix(0, length(u), length(v))
z10 <- matrix(0, length(u), length(v))

for (i in 1:length(u)) {
    for (j in 1:length(v)) {
        features <- mapFeature(u[i],v[j])
        z0[i,j] <- features %*% theta0
        z1[i,j] <- features %*% theta
        z10[i,j] <- features %*% theta10
    }
}


rownames(z0) <- rownames(z1) <- rownames(z10) <- as.character(u)
colnames(z0) <- colnames(z1) <- colnames(z10) <- as.character(v)

require(reshape2)                ## for melt()
z0.melted <- melt(z0)
z1.melted <- melt(z1)
z10.melted <- melt(z10)

z0.melted <- data.frame(z0.melted, lambda=0)
z1.melted <- data.frame(z1.melted, lambda=1)
z10.melted <- data.frame(z10.melted, lambda=10)

zz <- rbind(z0.melted, z1.melted, z10.melted)
zz$lambda <- factor(zz$lambda)
colnames(zz) <- c("u", "v", "z", "lambda")

p+geom_contour(data=zz, aes(x=u, y=v, z=z, group=lambda, colour=lambda),bins=1)


The red line (lambda = 0) fits the crosses most tightly.
As lambda increases, the fit becomes looser and more general.

PS: it's odd that the legend in the figure above is not shown properly.
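Since the decision boundary is exactly the \(\theta^T x = 0\) contour, an alternative worth trying (my own suggestion, not verified in this post) is to pass breaks = 0 to geom_contour so the line is drawn exactly at the boundary instead of relying on bins = 1:

p + geom_contour(data=zz, aes(x=u, y=v, z=z, group=lambda, colour=lambda), breaks=0)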
