# Machine Learning Ex3 – multivariate linear regression

March 8, 2011
By

(This article was first published on al3xandr3, and kindly contributed to R-bloggers)

Exercise 3 is about multivariate linear regression. First part is about finding a good learning rate (alpha) and 2nd part is about implementing linear regression using normal equations instead of the gradient descent algorithm.

## Data

As usual hosted in google docs:

mydata = read.csv("http://spreadsheets.google.com/pub?key=0AnypY27pPCJydExfUzdtVXZuUWphM19vdVBidnFFSWc&output=csv", header = TRUE)

# show last 5 rows
tail(mydata, 5)

   area bedrooms  price
43 2567        4 314000
44 1200        3 299000
45  852        2 179900
46 1852        4 299900
47 1203        3 239500


### Feature Scaling

When applying the gradient descent, we need to make sure that features values are in the same order of magnitudes, otherwise it will not converge well, so here’s a helper function to scale features:

# given a data frame and the column names i want to scale
# creates new columns: feature.scale = (feature - mean)/std
feature.scale = function (dta, cols) {
for (col in cols) {
sigma = sd(dta[col])
mu = mean(dta[col])
dta[paste(names(dta[col]), ".scale", sep = "")] = (dta[col] - mu)/sigma
}
return(dta)
}

dta = feature.scale(mydata, c("area", "bedrooms"))
tail(dta, 5)

   area bedrooms  price area.scale bedrooms.scale
43 2567        4 314000   0.712618      1.0904165
44 1200        3 299000  -1.007523     -0.2236752
45  852        2 179900  -1.445423     -1.5377669
46 1852        4 299900  -0.187090      1.0904165
47 1203        3 239500  -1.003748     -0.2236752


## Finding Alpha((alpha))

Recall from ex2 that the gradient descent equation for the updates of theta is:

[
theta := theta – alpha frac{1}{m} x^T (xtheta^T – y)
]

For finding a good alpha((alpha)) we will use a trial and error approach. The idea is look at how the Cost value (J(alpha)) drops with the number of iterations, the fastest the drop the better, but if goes up then the alpha value is already too large.

The Cost is given by(in vectorized form):

[
J(theta) = frac{1}{2m} (Xtheta – y)^T (Xtheta – y)
]

See the lessons on details how to reach that equation.

Implementing:

# lets try out a few alpha's
alpha = c(0.03, 0.1, 0.3, 1, 1.3, 2)

# store the J values over the iterations
J = array(0,c(50,length(alpha)))
m = length(dta$price) theta = matrix(c(0,0,0), nrow=1) x = matrix(c(rep(1,m), dta$area.scale, dta$bedrooms.scale), ncol=3) y = matrix(dta$price, ncol=1)

delta = function(x,y,th) {
delta = (t(x) %*% ((x %*% t(th)) - y))
return(t(delta))
}

# the cost for a given theta
cost = function(x,y,th,m) {
prt = ((x %*% t(th)) - y)
return(1/m * (t(prt) %*% prt))
}

# run J for 50x, on each alpha
for (j in 1:length(alpha)) {
for (i in 1:50) {
J[i,j] = cost(x,y,theta,m) # capture the Cost
theta = theta - alpha[j] * 1/m * delta(x,y,theta)
}
}

# lets have a look
par(mfrow=c(3,2))
for (j in 1:length(alpha)) {
plot(J[,j], type="l", xlab=paste("alpha", alpha[j]), ylab=expression(J(theta)))
}


alpha 1 seems to be the best.

Setting (alpha=1) and running until convergence:

# running until convergence
for (i in 1:50000) {
theta = theta - 1 * 1/m * delta(x,y,theta)
if (abs(delta(x,y,theta)[2]) < 0.0000001) {
}
}

# 1. The final values of theta
print("Theta:")
print(theta)

# 2. The predicted price of a house with 1650 square feet and 3 bedrooms.
# Don't forget to scale your features when you make this prediction!
print("Prediction for a house with 1650 square feet and 3 bedrooms:")
print(theta %*% c(1, (1650 - mean(dta["area"]))/sd(dta["area"]), (3 - mean(dta["bedrooms"]))/sd(dta["bedrooms"])))

[1] "Theta:"
[,1]     [,2]      [,3]
[1,] 340412.7 110631.1 -6649.474
[1] "Prediction for a house with 1650 square feet and 3 bedrooms:"
[,1]
[1,] 293081.5


## Normal Equations

Given the cost function:

[
J(theta) = frac{1}{2m} sum_{i=1}^m (h_theta(x^{(i)}) – y^{(i)})^2
]

Recall that this function returns how big is the error of our model vs the data. Thus our goal is to minimize it. And in order to find its minimum there is also a more direct approach (instead of using gradient descent) we can just calculate its derivative set it to 0 and find the value of theta:

[
frac{delta}{delta theta_j} J(theta_j) = 0
]

thats for (theta_j). We need of course to account for every j.

If we write it down into matrix notation, calculate its derivatives and set it to 0, then the value of theta will be obtained with:

[
theta = (X^T X)^{-1} (X^T y)
]

That can be easily implemented like so:

x = matrix(c(rep(1,m), mydata$area, mydata$bedrooms), ncol=3)
y = matrix(mydata\$price, ncol=1)
theta.normal = solve(t(x) %*% x) %*% (t(x) %*% y)

# 1. In your program, use the formula above to calculate. Remember that
# while you don't need to scale your features, you still need to add
# an intercept term.
print("Theta:")
print(theta.normal)

# 2. Once you have found  from this method, use it to make a price prediction
# for a 1650-square-foot house with 3 bedrooms. Did you get the same price
# that you found through gradient descent?
print("Price prediction for a 1650-square-foot house with 3 bedrooms")
t(theta.normal) %*%  c(1, 1650, 3)

[1] "Theta:"
[,1]
[1,] 89597.9095
[2,]   139.2107
[3,] -8738.0191
[1] "Price prediction for a 1650-square-foot house with 3 bedrooms"
[,1]
[1,] 293081.5


Normal equations are more direct but also more costly than gradient descent to run, so depending on situation you might need to choose one or the other.

#### References

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...