**YGC » R**, and kindly contributed to R-bloggers)

The first part of the Exercise 5.1 requires to implement a regularized version of linear regression.

Adding regularization parameter can prevent the problem of over-fitting when fitting a high-order polynomial.

Plot the data:

^{?}View Code RSPLUS

1 2 3 4 5 6 7 8 9 | x <- read.table("ex5Linx.dat") y <- read.table("ex5Liny.dat") x <- x[,1] y <- y[,1] require(ggplot2) d <- data.frame">data.frame(x=x,y=y) p <- ggplot(d, aes(x,y)) + geom_point(colour="red", size=3) |

I will fit a 5th order polynomial, the hypothesis is:

\( h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2^2 + \theta_3 x_3^3 + \theta_4 x_4^4 + \theta_5 x_5^5 \)

The idea of regularization is to impose Occam's razor on the solution, by scaling down the \( \theta \) which will lead to the tiny contribution of the higher order features.

For that, the cost function was defined as:

\( J(\theta) = \frac{1}{2m} [\sum_{i=1}^m ((h_\theta(x^{(i)}) - y^{(i)})^2) + \lambda \sum_{i=1}^n \theta^2] \)

**Gradient Descent with regularization parameters:**

Firstly, I implemented a gradient descent algorithm to find the theta.

^{?}View Code RSPLUS

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | mapFeature <- function(x, degree=5) { return(sapply(0:degree, function(i) x^i)) } ## hypothesis function h <- function(theta, x) { #sapply(1:m, function(i) theta %*% x[i,]) toReturn <- x %*% t(theta) return(toReturn) } ## cost function J <- function(theta, x, y, lambda=1) { m <- length(y) r <- theta^2 r[1] <- 0 j <- 1/(2*m) * sum((h(theta, x)-y)^2) + lambda*sum(r) return(j) } gradDescent <- function(theta, x, y, alpha=0.1, niter=1000, lambda=1) { m <- length(y) for (i in 1:niter) { tt <- theta tt[1] <- 0 dj <- 1/m * (t(h(theta,x)-y) %*% x + lambda * tt) theta <- theta - alpha * dj } return(theta) } |

**fitting the data with gradient descent algorithm:**

^{?}View Code RSPLUS

1 2 3 4 5 6 7 8 | x <- mapFeature(x) theta <- matrix(rep(0,6), nrow=1) theta <- gradDescent(theta, x, y) x.test <- seq(-1,1, 0.001) y.test <- mapFeature(x.test) %*% t(theta) p+geom_line(aes(x=x.test, y=y.test), colour="blue") |

As shown above, the fitting model fits the data well.

**Normal Equation with regularization parameters:**

The Exercise requires implementing Normal Equation with the regularization parameters added.

that is:

\( \theta = (X^T X + \lambda \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & ? & \\ & & & 1 \end{bmatrix} )^{-1} (X^T y) \)

^{?}View Code RSPLUS

1 2 3 4 5 6 7 8 9 | ## normal equations normEq <- function(x,y, lambda) { n <- ncol(x) ## extra regularizatin terms r <- lambda * diag(n) r[1,1] <- 0 theta <- solve(t(x) %*% x + r) %*% t(x) %*% y return(theta) } |

I try 3 different lambda values to see how it influences the fit.

^{?}View Code RSPLUS

1 2 3 4 5 6 7 8 9 | lambda <- c(0,1,10) theta <- sapply(lambda, normEq, x=x, y=y) x.test <- seq(-1,1, 0.001) yy <- sapply(1:3, function(i) mapFeature(x.test) %*% theta[,i]) yy <- melt(yy) yy[,1] <- rep(x.test, 3) colnames(yy) <- c("X", expression(lambda), "Y") yy$lambda=factor(yy$lambda, labels=unique(lambda)) p+geom_line(data=yy,aes(X,Y, group=lambda, colour=lambda)) |

With lambda=0, the fit is very tight to the original points (the red line), and of course it is over-fitting.

As lambda increase, the model gets less tight and more generalized, and therefore preventing over-fitting.

This figure can also lead to a conclusion, that when lambda is too large, the model will under-fitting.

Reference:

Machine Learning Course

Exercise 5

#### Related Posts

**leave a comment**for the author, please follow the link and comment on his blog:

**YGC » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...