# Machine Learning Ex 5.1 – Regularized Linear Regression

October 25, 2011
By

(This article was first published on YGC » R, and kindly contributed to R-bloggers)

The first part of the Exercise 5.1 requires to implement a regularized version of linear regression.

Adding regularization parameter can prevent the problem of over-fitting when fitting a high-order polynomial.

Plot the data:

?View Code RSPLUS
 1 2 3 4 5 6 7 8 9  x <- read.table("ex5Linx.dat") y <- read.table("ex5Liny.dat")   x <- x[,1] y <- y[,1]   require(ggplot2) d <- data.frame">data.frame(x=x,y=y) p <- ggplot(d, aes(x,y)) + geom_point(colour="red", size=3)

I will fit a 5th order polynomial, the hypothesis is:
$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2^2 + \theta_3 x_3^3 + \theta_4 x_4^4 + \theta_5 x_5^5$

The idea of regularization is to impose Occam's razor on the solution, by scaling down the $\theta$ which will lead to the tiny contribution of the higher order features.

For that, the cost function was defined as:
$J(\theta) = \frac{1}{2m} [\sum_{i=1}^m ((h_\theta(x^{(i)}) - y^{(i)})^2) + \lambda \sum_{i=1}^n \theta^2]$

Firstly, I implemented a gradient descent algorithm to find the theta.

?View Code RSPLUS
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30  mapFeature <- function(x, degree=5) { return(sapply(0:degree, function(i) x^i)) }   ## hypothesis function h <- function(theta, x) { #sapply(1:m, function(i) theta %*% x[i,]) toReturn <- x %*% t(theta) return(toReturn) }   ## cost function J <- function(theta, x, y, lambda=1) { m <- length(y) r <- theta^2 r[1] <- 0 j <- 1/(2*m) * sum((h(theta, x)-y)^2) + lambda*sum(r) return(j) }   gradDescent <- function(theta, x, y, alpha=0.1, niter=1000, lambda=1) { m <- length(y) for (i in 1:niter) { tt <- theta tt[1] <- 0 dj <- 1/m * (t(h(theta,x)-y) %*% x + lambda * tt) theta <- theta - alpha * dj } return(theta) }

fitting the data with gradient descent algorithm:

?View Code RSPLUS
 1 2 3 4 5 6 7 8  x <- mapFeature(x) theta <- matrix(rep(0,6), nrow=1) theta <- gradDescent(theta, x, y)   x.test <- seq(-1,1, 0.001) y.test <- mapFeature(x.test) %*% t(theta)   p+geom_line(aes(x=x.test, y=y.test), colour="blue")

As shown above, the fitting model fits the data well.

Normal Equation with regularization parameters:
The Exercise requires implementing Normal Equation with the regularization parameters added.
that is:
$\theta = (X^T X + \lambda \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & ? & \\ & & & 1 \end{bmatrix} )^{-1} (X^T y)$

?View Code RSPLUS
 1 2 3 4 5 6 7 8 9  ## normal equations normEq <- function(x,y, lambda) { n <- ncol(x) ## extra regularizatin terms r <- lambda * diag(n) r[1,1] <- 0 theta <- solve(t(x) %*% x + r) %*% t(x) %*% y return(theta) }

I try 3 different lambda values to see how it influences the fit.

?View Code RSPLUS
 1 2 3 4 5 6 7 8 9  lambda <- c(0,1,10) theta <- sapply(lambda, normEq, x=x, y=y) x.test <- seq(-1,1, 0.001) yy <- sapply(1:3, function(i) mapFeature(x.test) %*% theta[,i]) yy <- melt(yy) yy[,1] <- rep(x.test, 3) colnames(yy) <- c("X", expression(lambda), "Y") yy$lambda=factor(yy$lambda, labels=unique(lambda)) p+geom_line(data=yy,aes(X,Y, group=lambda, colour=lambda))

With lambda=0, the fit is very tight to the original points (the red line), and of course it is over-fitting.

As lambda increase, the model gets less tight and more generalized, and therefore preventing over-fitting.

This figure can also lead to a conclusion, that when lambda is too large, the model will under-fitting.

Reference:
Machine Learning Course
Exercise 5