Machine Learning Ex4 – Logistic Regression


Exercise 4 required implementing Logistic Regression using Newton’s Method.

The dataset contains the scores of 80 students on two exams; 40 of the students were admitted to college and the other 40 were not. We need to implement a binary classification model that estimates college admission from a student's scores on these two exams.

Plot the data

## read the exam scores and add an intercept column of ones
x <- read.table("ex4x.dat", header=F, stringsAsFactors=F)
x <- cbind(rep(1, nrow(x)), x)
colnames(x) <- c("X0", "Exam1", "Exam2")
x <- as.matrix(x)
 
## read the admission labels (1 = admitted, 0 = not admitted)
y <- read.table("ex4y.dat", header=F, stringsAsFactors=F)
y <- y[,1]
 
## plotting data
d <- data.frame(x,
                y = factor(y,
                levels=c(0,1),
                labels=c("Not admitted","Admitted" )
                )
                )
 
require(ggplot2)
p <- ggplot(d, aes(x=Exam1, y=Exam2)) +
    geom_point(aes(shape=y, colour=y)) +
    xlab("Exam 1 score") +
    ylab("Exam 2 score")
p   # display the scatter plot


Logistic Regression


We first need to define our hypothesis function, which returns values between 0 and 1, making it suitable for binary classification.

\(h_\theta(x) = g(\theta^T x) = \frac{1}{ 1 + e ^{- \theta^T x} }\)

The function g is the sigmoid function, and h returns the probability that y = 1:

\( h_\theta(x) = P (y=1 | x; \theta) \)
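Since \( h_\theta(x) \) is a probability, the natural decision rule is to predict admission whenever \( h_\theta(x) \geq 0.5 \). Because the sigmoid satisfies \( g(z) \geq 0.5 \) exactly when \( z \geq 0 \), this is equivalent to:

\( h_\theta(x) \geq 0.5 \iff \theta^T x \geq 0 \)

so the decision boundary between the two classes is the line \( \theta^T x = 0 \), which we will plot at the end of this post.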

What we need is to compute \( \theta \) in order to determine the proper hypothesis function.

As in linear regression, we define a cost function, which measures how well the hypothesis function fits the sample data at a given \( \theta \).

The cost function was defined as:
\(J(\theta) = \frac{1}{m} \sum_{i=1}^m \left[ -y^{(i)} \log(h_\theta(x^{(i)})) - (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]\)

To determine the most suitable hypothesis function, we need to find the \( \theta \) value that minimizes \( J(\theta) \). This can be achieved with Newton's method, by finding the root of the derivative of the cost function.
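To see where this comes from, recall that Newton's method finds a root of a function \( f \) by iterating:

\( \theta^{(t+1)} = \theta^{(t)} - \frac{f(\theta^{(t)})}{f'(\theta^{(t)})} \)

Here we apply it to \( f = \nabla_{\theta} J \), whose root is the minimizer of \( J \); the derivative of the gradient is the Hessian \( H \).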

At each iteration, \( \theta \) is updated by:
\(\theta^{(t+1)} = \theta^{(t)} - H^{-1} \nabla_{\theta}J\)

The gradient and Hessian are defined as:
\(\nabla_{\theta}J = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \, x^{(i)}\)
\(H = \frac{1}{m} \sum_{i=1}^m \left[ h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})) \, x^{(i)} (x^{(i)})^T \right]\)
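
In matrix form, with \( X \) the \( m \times 3 \) design matrix (one row per student, including the intercept column), \( y \) the vector of labels, and \( h = g(X\theta) \) the vector of predicted probabilities, these become:

\( \nabla_{\theta}J = \frac{1}{m} X^T (h - y) \)
\( H = \frac{1}{m} X^T \, \mathrm{diag}\big(h \odot (1 - h)\big) \, X \)

where \( \odot \) denotes element-wise multiplication.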

The above equations were implemented using R:

### Newton's Method
## sigmoid function
g <- function(z) {
    1/(1+exp(-z))
}
 
## hypothesis function
h <- function(theta, x) {
    g(x %*% theta)
}
 
## cost function
J <- function(theta, x, y) {
    m <- length(y)
    s <- sapply(1:m, function(i)
                y[i]*log(h(theta,x[i,])) + (1-y[i])*log(1-h(theta,x[i,]))
                )
    j <- -1/m * sum(s)
    return(j)
}
 
 
## gradient
grad <- function(theta, x, y) {
    m <- length(y)
    g <- 1/m * t(x) %*% (h(theta,x)-y)
    return(g)
}
 
## Hessian: H = 1/m * sum_i [ h(x_i) (1 - h(x_i)) x_i x_i^T ]
Hessian <- function(theta, x) {
    m <- nrow(x)
    hx <- as.vector(h(theta, x))
    H <- 1/m * t(x) %*% diag(hx * (1 - hx)) %*% x
    return(H)
}
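
As a quick sanity check (a small addition, not part of the original exercise), the cost at \( \theta = 0 \) should equal \( \log(2) \approx 0.693 \), since the hypothesis then predicts 0.5 for every student regardless of the data:

theta0 <- rep(0, ncol(x))    # all-zero starting point
J(theta0, x, y)              # should equal log(2)
log(2)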

The first question asks how many iterations are needed until convergence.

theta <- rep(0, ncol(x))   # start from theta = 0
j <- rep(0, 10)            # cost after each iteration
for (i in 1:10) {
    ## Newton update: theta <- theta - H^{-1} %*% gradient
    theta <- theta - solve(Hessian(theta,x)) %*% grad(theta,x,y)
    j[i] <- J(theta,x,y)
}
 
ggplot()+
    aes(x=1:10,y=j)+
    geom_point(colour="red")+
    geom_path()+xlab("Iteration")+
    ylab("Cost J")


As illustrated in the figure above, Newton's method converges very fast; only 4-5 iterations were needed.
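
Instead of hard-coding 10 iterations, one could stop once the Newton step no longer changes \( \theta \) appreciably. A minimal sketch, using the same Hessian, grad, and J functions defined above and an arbitrary tolerance of 1e-8:

theta <- rep(0, ncol(x))
repeat {
    delta <- solve(Hessian(theta, x)) %*% grad(theta, x, y)
    theta <- theta - delta
    if (max(abs(delta)) < 1e-8) break    # stop once the Newton step is negligible
}
J(theta, x, y)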

The second question: what is the probability that a student with a score of 20 on Exam 1 and a score of 80 on Exam 2 will not be admitted?

> (1 - g(c(1, 20, 80) %*% theta))* 100
         [,1]
[1,] 64.24722

Based on our model, the predicted probability that this student will not be admitted is about 64%.
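
The same calculation can be wrapped in a small helper for any pair of exam scores (a hypothetical convenience function, not part of the original exercise):

## admission probability for arbitrary exam scores, using the fitted theta
admit_prob <- function(exam1, exam2, theta) {
    as.numeric(g(c(1, exam1, exam2) %*% theta))
}
admit_prob(20, 80, theta)        # P(admitted)
1 - admit_prob(20, 80, theta)    # P(not admitted)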

Finally, we compute the decision boundary of our classification model, defined by \( \theta^T x = 0 \), and visualize the fit as below:
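
Writing \( \theta = (\theta_0, \theta_1, \theta_2) \) for the intercept and the two exam-score coefficients, the boundary condition can be solved for the Exam 2 score:

\( \theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0 \;\Rightarrow\; x_2 = -\frac{\theta_1 x_1 + \theta_0}{\theta_2} \)

which is the straight line drawn in the code below.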

## decision boundary: theta0 + theta1*x1 + theta2*x2 = 0
x1 <- c(min(x[,2]), max(x[,2]))
x2 <- -1/theta[3,] * (theta[2,]*x1 + theta[1,])
## slope and intercept of the boundary line
a <- (x2[2]-x2[1])/(x1[2]-x1[1])
b <- x2[2]-a*x1[2]
p + geom_abline(slope=a, intercept=b)

Fantastic.

References:
Machine Learning Course
Exercise 4
