Machine Learning Ex3 – multivariate linear regression

Posted on March 8, 2011 by al3xandr3 in R bloggers | 0 Comments

[This article was first published on al3xandr3, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Exercise 3 is about multivariate linear regression. First part is about finding a good learning rate (alpha) and 2nd part is about implementing linear regression using normal equations instead of the gradient descent algorithm.

Data

As usual hosted in google docs:

mydata = read.csv("http://spreadsheets.google.com/pub?key=0AnypY27pPCJydExfUzdtVXZuUWphM19vdVBidnFFSWc&output=csv", header = TRUE)

# show last 5 rows
tail(mydata, 5)


   area bedrooms  price
43 2567        4 314000
44 1200        3 299000
45  852        2 179900
46 1852        4 299900
47 1203        3 239500

Feature Scaling

When applying the gradient descent, we need to make sure that features values are in the same order of magnitudes, otherwise it will not converge well, so here’s a helper function to scale features:

# given a data frame and the column names i want to scale
# creates new columns: feature.scale = (feature - mean)/std
feature.scale = function (dta, cols) {
  for (col in cols) {
    sigma = sd(dta[col])
    mu = mean(dta[col])
    dta[paste(names(dta[col]), ".scale", sep = "")] = (dta[col] - mu)/sigma
  }
  return(dta)
}

dta = feature.scale(mydata, c("area", "bedrooms"))
tail(dta, 5)


   area bedrooms  price area.scale bedrooms.scale
43 2567        4 314000   0.712618      1.0904165
44 1200        3 299000  -1.007523     -0.2236752
45  852        2 179900  -1.445423     -1.5377669
46 1852        4 299900  -0.187090      1.0904165
47 1203        3 239500  -1.003748     -0.2236752

Finding Alpha((alpha))

Recall from ex2 that the gradient descent equation for the updates of theta is:

[ theta := theta – alpha frac{1}{m} x^T (xtheta^T – y) ]

For finding a good alpha((alpha)) we will use a trial and error approach. The idea is look at how the Cost value (J(alpha)) drops with the number of iterations, the fastest the drop the better, but if goes up then the alpha value is already too large.

The Cost is given by(in vectorized form):

[ J(theta) = frac{1}{2m} (Xtheta – y)^T (Xtheta – y) ]

See the lessons on details how to reach that equation.

Implementing:

# lets try out a few alpha's
alpha = c(0.03, 0.1, 0.3, 1, 1.3, 2)

# store the J values over the iterations
J = array(0,c(50,length(alpha)))
m = length(dta$price)
theta = matrix(c(0,0,0), nrow=1)
x = matrix(c(rep(1,m), dta$area.scale, dta$bedrooms.scale), ncol=3)
y = matrix(dta$price, ncol=1)

# the delta updates
delta = function(x,y,th) {
  delta = (t(x) %*% ((x %*% t(th)) - y))
  return(t(delta))
}

# the cost for a given theta
cost = function(x,y,th,m) {
  prt = ((x %*% t(th)) - y)
  return(1/m * (t(prt) %*% prt))
}

# run J for 50x, on each alpha
for (j in 1:length(alpha)) {
  for (i in 1:50) {
    J[i,j] = cost(x,y,theta,m) # capture the Cost
    theta = theta - alpha[j] * 1/m * delta(x,y,theta)
  }
}

# lets have a look
par(mfrow=c(3,2))
for (j in 1:length(alpha)) {
  plot(J[,j], type="l", xlab=paste("alpha", alpha[j]), ylab=expression(J(theta)))
}

alpha 1 seems to be the best.

Setting (alpha=1) and running until convergence:

# running until convergence
for (i in 1:50000) {
  theta = theta - 1 * 1/m * delta(x,y,theta)
  if (abs(delta(x,y,theta)[2]) < 0.0000001) {  
    break # to interrupt updates
  }
}

# 1. The final values of theta
print("Theta:")
print(theta)

# 2. The predicted price of a house with 1650 square feet and 3 bedrooms.
# Don't forget to scale your features when you make this prediction!
print("Prediction for a house with 1650 square feet and 3 bedrooms:")
print(theta %*% c(1, (1650 - mean(dta["area"]))/sd(dta["area"]), (3 - mean(dta["bedrooms"]))/sd(dta["bedrooms"])))


[1] "Theta:"
         [,1]     [,2]      [,3]
[1,] 340412.7 110631.1 -6649.474
[1] "Prediction for a house with 1650 square feet and 3 bedrooms:"
         [,1]
[1,] 293081.5

Normal Equations

Given the cost function:

[ J(theta) = frac{1}{2m} sum_{i=1}^m (h_theta(x^{(i)}) - y^{(i)})^2 ]

Recall that this function returns how big is the error of our model vs the data. Thus our goal is to minimize it. And in order to find its minimum there is also a more direct approach (instead of using gradient descent) we can just calculate its derivative set it to 0 and find the value of theta:

[ frac{delta}{delta theta_j} J(theta_j) = 0 ]

thats for (theta_j). We need of course to account for every j.

If we write it down into matrix notation, calculate its derivatives and set it to 0, then the value of theta will be obtained with:

[ theta = (X^T X)^{-1} (X^T y) ]

That can be easily implemented like so:

x = matrix(c(rep(1,m), mydata$area, mydata$bedrooms), ncol=3)
y = matrix(mydata$price, ncol=1)
theta.normal = solve(t(x) %*% x) %*% (t(x) %*% y)

# 1. In your program, use the formula above to calculate. Remember that
# while you don't need to scale your features, you still need to add 
# an intercept term.
print("Theta:")
print(theta.normal)

# 2. Once you have found  from this method, use it to make a price prediction 
# for a 1650-square-foot house with 3 bedrooms. Did you get the same price 
# that you found through gradient descent?
print("Price prediction for a 1650-square-foot house with 3 bedrooms")
t(theta.normal) %*%  c(1, 1650, 3)


[1] "Theta:"
           [,1]
[1,] 89597.9095
[2,]   139.2107
[3,] -8738.0191
[1] "Price prediction for a 1650-square-foot house with 3 bedrooms"
         [,1]
[1,] 293081.5

Normal equations are more direct but also more costly than gradient descent to run, so depending on situation you might need to choose one or the other.

References

To leave a comment for the author, please follow the link and comment on their blog: al3xandr3.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Machine Learning Ex3 – multivariate linear regression

Data

Feature Scaling

Finding Alpha((alpha))

Normal Equations

References

Related

Data

Feature Scaling

Finding Alpha((alpha))

Normal Equations

References

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)