# Linear Regression using R

September 26, 2012

(This article was first published on Tatvic Blog » R, and kindly contributed to R-bloggers)

## Regression

In this post I explain how linear regression works. Let us start with what regression is. Regression is widely used for prediction and forecasting in machine learning. It focuses on the relationship between a dependent variable and one or more independent variables. The “dependent variable” represents the output or effect, or is tested to see if it is the effect. The “independent variables” represent the inputs or causes, or are tested to see if they are the cause. Regression analysis helps us understand how the value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed. The dependent variable is estimated as a function of the independent variables, which is called the regression function. A regression model involves the following variables.

• Independent variables X
• Dependent variable Y
• Unknown parameters θ

In the regression model, Y is a function of (X, θ). There are many techniques for regression analysis, but here we consider linear regression.

## Linear regression

In linear regression, the dependent variable (Y) is modeled as a linear combination of the independent variables (X). The regression function is known as the hypothesis, which is defined as below.

hθ(X) = f(X,θ)

Suppose we have only one independent variable (x). Then the hypothesis is defined as below.

hθ(x) = θ0 + θ1x

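The goal is to find values of θ (known as coefficients) that minimize the difference between the real and predicted values of the dependent variable (y). If all values of θ are zero, every predicted value is zero. The cost function measures how well a linear regression model fits: it calculates the average squared error over m observations (conventionally halved). The cost function is denoted by J(θ) and is commonly defined as below.

J(θ) = (1/(2m)) Σ (hθ(x(i)) − y(i))², summed over i = 1, …, m

Here x(i) and y(i) are the values of the i-th observation. The factor 1/2 is a convention and does not change which θ minimizes the cost.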

As the formula shows, a large cost means the predicted values are far from the real values, and a small cost means they are close. Therefore, we have to minimize the cost to get more accurate predictions.
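To make the cost concrete, here is a minimal sketch in R. It assumes the single-variable hypothesis hθ(x) = θ0 + θ1x and the common half-mean-squared-error form of the cost; the function names `hypothesis` and `cost` and the toy data are made up for illustration and are not part of the original example.

```r
# Hypothesis for one independent variable: h_theta(x) = theta0 + theta1 * x
hypothesis <- function(theta, x) {
  theta[1] + theta[2] * x
}

# Cost: half the mean squared difference between predicted and real values
cost <- function(theta, x, y) {
  m <- length(y)
  sum((hypothesis(theta, x) - y)^2) / (2 * m)
}

# Toy data (made up for illustration) that follows y = 1 + 2x exactly
x <- c(1, 2, 3, 4)
y <- c(3, 5, 7, 9)

cost_zero  <- cost(c(0, 0), x, y)  # all theta zero: every prediction is 0
cost_close <- cost(c(0, 2), x, y)  # right slope, wrong intercept
cost_exact <- cost(c(1, 2), x, y)  # the coefficients that generated the data

print(c(zero = cost_zero, close = cost_close, exact = cost_exact))
```

On this toy data the three costs are 20.5, 0.5 and 0, so the cost shrinks as the predictions move closer to the real values.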

## Linear regression in R

R is a language and environment for statistical computing, with powerful and comprehensive features for fitting regression models. Let us look at how linear regression works in R. The basic function for fitting a linear model is lm(). The format is

fit <- lm(formula, data)

where formula describes the model (in our case a linear model) and data is the data frame containing the data used to fit the model. The resulting object (fit in this case) is a list that contains information about the fitted model. The formula is typically written as

Y ~ x1 + x2 + … + xk

where ~ separates the dependent variable (Y) on the left from the independent variables (x1, x2, …, xk) on the right, and the independent variables are separated by + signs. Let’s look at a simple regression example (taken from the book R in Action). The dataset women contains the height and weight of 15 women aged 30 to 39. We want to predict weight from height. The R code to fit this model is as below.

> fit <- lm(weight ~ height, data = women)
> summary(fit)

The summary function prints information about the object fit. The output is as below.

Call:
lm(formula = weight ~ height, data = women)
Residuals:
    Min      1Q  Median      3Q     Max
-1.7333 -1.1333 -0.3833  0.7417  3.1167
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
height        3.45000    0.09114   37.85 1.09e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared: 0.991,	Adjusted R-squared: 0.9903
F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

Let’s understand the output. The coefficient values (θs) are -87.51667 and 3.45000, so the prediction equation for the model is as below.

Weight = -87.52 + 3.45*height
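As a small sketch of how to use this equation in R, the coefficients can be pulled out of the fitted object with `coef()` and applied by hand, or the same prediction can be requested from `predict()`. The height of 65 is chosen here only because it is one of the heights in the women dataset; the variable names `by_hand` and `by_predict` are illustrative.

```r
# Refit the model from the article on R's built-in women dataset
fit <- lm(weight ~ height, data = women)

# coef() returns the estimated coefficients as a named vector
theta <- coef(fit)
print(theta)

# Predict the weight for a height of 65 by applying the equation by hand ...
by_hand <- unname(theta["(Intercept)"] + theta["height"] * 65)

# ... and with predict(), which applies the same equation
by_predict <- unname(predict(fit, newdata = data.frame(height = 65)))

print(c(by_hand = by_hand, by_predict = by_predict))
```

Both approaches give about 136.73, which is -87.51667 + 3.45 * 65.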

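In the output, the residual standard error, 1.525, plays the role of the cost. It is not J(θ) itself but a closely related quantity: the square root of the sum of squared residuals divided by the residual degrees of freedom (13 here), so it measures the typical size of a prediction error. Now we will look at the real values of weight for the 15 women and then at the predicted values. The actual values of weight are as below.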

> women$weight
Output
[1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164

The predicted values for the 15 women are as below.

> fitted(fit)
Output
       1        2        3        4        5        6        7        8        9
112.5833 116.0333 119.4833 122.9333 126.3833 129.8333 133.2833 136.7333 140.1833
      10       11       12       13       14       15
143.6333 147.0833 150.5333 153.9833 157.4333 160.8833

We can see that the predicted values are close to the actual values. We have now seen what regression is, how it works, and how to run it in R.
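To check how close the predictions are, one option is to put the actual values, predicted values and residuals side by side and recompute the residual standard error from the residuals. This is a minimal sketch; the names `res`, `comparison` and `rse` are illustrative.

```r
# Refit the model from the article on R's built-in women dataset
fit <- lm(weight ~ height, data = women)

# Residual = actual value minus predicted value
res <- women$weight - fitted(fit)

# Side-by-side view of actual values, predictions, and residuals
comparison <- data.frame(
  height    = women$height,
  actual    = women$weight,
  predicted = round(fitted(fit), 2),
  residual  = round(res, 2)
)
print(comparison)

# Residual standard error: sqrt(sum of squared residuals / degrees of freedom)
rse <- sqrt(sum(res^2) / df.residual(fit))
print(rse)
```

The recomputed value matches the 1.525 on 13 degrees of freedom reported by `summary(fit)`.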

## Caveat

Here I want to warn against a common misunderstanding about correlation and causation. In regression, the dependent variable is correlated with the independent variable: as the value of the independent variable changes, the value of the dependent variable also changes. But this does not mean that the independent variable causes the change in the dependent variable. Causation implies correlation, but the reverse is not true. For example, smoking causes lung cancer, whereas smoking is only correlated with alcoholism. There is much discussion on this topic, and one blog post is not enough to cover it in depth. For now, keep in mind that regression considers the correlation between the dependent and independent variables.
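The point can be illustrated with a small simulation. This is only a sketch on made-up data: a hypothetical hidden variable `z` drives both `x` and `y`, while `x` has no effect on `y` at all. A regression of `y` on `x` alone still finds a strong relationship.

```r
set.seed(42)  # fixed seed so the simulation is reproducible
n <- 1000

# z is a hidden common cause; x and y each depend on z but not on each other
z <- rnorm(n)
x <- z + rnorm(n, sd = 0.5)
y <- 2 * z + rnorm(n, sd = 0.5)

# Regressing y on x alone gives a clearly non-zero slope ...
slope_alone <- unname(coef(lm(y ~ x))["x"])

# ... but once z is in the model, the coefficient of x is close to zero
slope_with_z <- unname(coef(lm(y ~ x + z))["x"])

print(c(alone = slope_alone, with_z = slope_with_z))
```

Under these assumptions the slope of `y` on `x` alone should come out near 2 / 1.25 = 1.6 even though changing `x` would not change `y`, which is exactly the trap of reading a regression coefficient as a causal effect.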

In the next post, I will discuss a real-world business problem and how to apply regression to it.

### Amar Gondaliya

Amar is a data modeling engineer at Tatvic. He focuses on building predictive models from available data using R, Hadoop and the Google Prediction API. Google Plus Profile: Amar Gondaliya