A very warm welcome to the first part of my series of blog posts. In the previous blog post, we discussed the concept of linear regression and its mathematical model representation. We also implemented linear regression in R step by step. In this post I will discuss logistic regression and how to implement it in R step by step. I hope readers will enjoy it. Again, many thanks to Andrew Ng for his fabulous explanation of logistic regression in the Coursera Machine Learning class.
So, let's get rolling! Let us start by understanding logistic regression. You can also view the video lecture from the Machine Learning class. (You can skip this part if you know the basics of logistic regression and jump to the second part, in which I discuss how to convert the mathematical formulas of logistic regression into R code.)
Logistic regression is a type of statistical classification model used to predict a binary response. It measures the relationship between a categorical dependent variable and one or more predictor variables. The categorical variable may be binomial or multinomial. A binomial categorical variable has only two categories (e.g. "yes" and "no", or "good" and "bad"), whereas a multinomial categorical variable has more than two categories (e.g. "average", "good" and "best"). Here, we will focus only on the binomial dependent variable (source: Wikipedia).
Let us consider the case of a spam detector, which is a classification problem. The detector identifies whether a given mail is spam or not spam, so our dependent variable contains only two values, "yes" or "no". In other words, it is represented as a positive class and a negative class. We can write this in the following mathematical notation.
y ∈ {0, 1}, where 0 is the negative class (not spam) and 1 is the positive class (spam)
0 ≤ h(x) ≤ 1
This indicates that our hypothesis value will be in the range 0 to 1.
We want predictions in the range 0 to 1, so let us interpret the result of h(x) as a probability. For example, if the hypothesis of our spam detector outputs 0.7 for a given email, it represents a 70% probability of that mail being spam. Finally, we need a threshold for deciding whether a given mail is spam or not. Generally, if the probability is greater than 0.5 the mail is classified as spam, otherwise as not spam.
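As a small illustration of the thresholding rule, here is a minimal R sketch. The probability values and the variable names (spam_probability, threshold, predicted_class) are made up for this example and do not come from any real data.

```r
# Hypothetical hypothesis outputs h(x) for four emails (illustrative values only)
spam_probability <- c(0.70, 0.20, 0.50, 0.93)

# Classify as spam only when the probability is strictly greater than the threshold
threshold <- 0.5
predicted_class <- ifelse(spam_probability > threshold, "spam", "not spam")

print(predicted_class)
```

Note that a probability of exactly 0.5 falls on the "not spam" side here, because the rule uses "greater than" rather than "greater than or equal to".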
The total probability of a mail being spam or not spam equals 1. We can write this in the following form.
P(Y=0) + P(Y=1) = 1
So, P(Y=0) = 1 – P(Y=1)
Let us now discuss the sigmoid function, which is the central part of logistic regression. It is also called the logistic function, hence the name logistic regression. The sigmoid function is defined as below.
g(z) = 1 / (1 + e^(-z))
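A minimal R sketch of the sigmoid function, assuming the standard definition g(z) = 1 / (1 + e^(-z)). The function name sigmoid is my own choice for illustration.

```r
# Sigmoid (logistic) function; works element-wise on vectors
sigmoid <- function(z) {
  1 / (1 + exp(-z))
}

sigmoid(0)            # 0.5, the midpoint of the curve
sigmoid(c(-10, 10))   # close to 0 and close to 1 respectively
```

Whatever value of z we pass in, the output stays strictly between 0 and 1, which is exactly what we need in order to read it as a probability.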
Using the sigmoid function g, we define our new hypothesis as below, where θ is the vector of parameters.
h(x) = g(θ^T x) = 1 / (1 + e^(-θ^T x))
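The hypothesis can be sketched in R as the sigmoid applied to a linear combination of the inputs, assuming the usual form h(x) = g(θ^T x). The function name hypothesis, the toy matrix X (with a leading column of ones for the intercept) and the values of theta are assumptions made only for this illustration.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# X: m x (n + 1) matrix whose first column is all ones (intercept term)
# theta: parameter vector of length n + 1
hypothesis <- function(X, theta) {
  as.vector(sigmoid(X %*% theta))
}

# Hypothetical inputs: three examples with a single feature
X <- cbind(1, c(-2, 0, 3))
theta <- c(0, 1)

h <- hypothesis(X, theta)
print(h)
```

Each element of h is the predicted probability for one row of X, so every value lies between 0 and 1.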
Let us now define the cost function for logistic regression. Recall the cost function for linear regression, where m is the number of training examples.
J(θ) = (1 / (2m)) Σ_{i=1..m} (h(x^(i)) - y^(i))^2
In the case of logistic regression, the cost function is defined slightly differently. We will not discuss it further here, otherwise the post would become too long. You can refer to the video of the Machine Learning class, where Andrew Ng discusses the cost function in detail. The cost function for logistic regression is defined as below.
J(θ) = -(1 / m) Σ_{i=1..m} [ y^(i) log(h(x^(i))) + (1 - y^(i)) log(1 - h(x^(i))) ]
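Here is a minimal R sketch of the logistic regression cost, assuming the standard cross-entropy form J(θ) = -(1/m) Σ [ y log(h(x)) + (1 - y) log(1 - h(x)) ]. The function name cost and the toy data are made up for illustration.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Average cross-entropy cost over all m training examples
cost <- function(theta, X, y) {
  m <- length(y)
  h <- as.vector(sigmoid(X %*% theta))
  -(1 / m) * sum(y * log(h) + (1 - y) * log(1 - h))
}

# Hypothetical toy data: intercept column plus one feature
X <- cbind(1, c(-2, -1, 1, 2))
y <- c(0, 0, 1, 1)

cost(c(0, 0), X, y)   # log(2), about 0.693, because every h is 0.5
cost(c(0, 1), X, y)   # smaller, because these thetas fit the data better
```

With all thetas at zero the model predicts 0.5 for every example, so the cost is log(2). Parameters that push the predictions toward the true labels give a lower cost.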
Again, we will use gradient descent to find the optimal values of the thetas. So far, we have covered the basics of logistic regression: the hypothesis representation, the sigmoid function and the cost function. In the next part, we will implement these things in R step by step and obtain the best fitting parameters.
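To give a rough idea of what gradient descent looks like for this model, here is a minimal R sketch. It assumes the standard gradient of the logistic cost, (1/m) X^T (h - y), and the update rule theta := theta - alpha * gradient. The function name gradient_descent, the learning rate alpha, the number of iterations and the toy data are all assumptions chosen for illustration, not the implementation from the next part.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Batch gradient descent for logistic regression
# alpha (learning rate) and iterations are illustrative defaults
gradient_descent <- function(X, y, alpha = 0.1, iterations = 1000) {
  m <- length(y)
  theta <- rep(0, ncol(X))                       # start with all thetas at zero
  for (i in seq_len(iterations)) {
    h <- as.vector(sigmoid(X %*% theta))         # current predictions
    gradient <- as.vector(t(X) %*% (h - y)) / m  # gradient of the cost
    theta <- theta - alpha * gradient            # step against the gradient
  }
  theta
}

# Hypothetical toy data: intercept column plus one feature
X <- cbind(1, c(-2, -1, 0, 1, 2, 3))
y <- c(0, 0, 1, 0, 1, 1)

theta <- gradient_descent(X, y)
print(theta)
```

On this toy data the learned weight for the feature comes out positive, meaning larger feature values are associated with a higher predicted probability of the positive class.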
The post Logistic Regression with R: step by step implementation part-1 appeared first on Pingax.