[This article was first published on R Statistics Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Linear regression is a supervised machine learning algorithm that is used to predict the continuous variable. The algorithm assumes that the relation between the dependent variable(Y) and independent variables(X), is linear and is represented by a line of best fit. In this chapter, we will learn how to execute linear regression in R using some select functions and test its assumptions before we use it for a final prediction on test data.

# Overview – Linear Regression

In statistics, linear regression is used to model a relationship between a continuous dependent variable and one or more independent variables. The independent variable can be either categorical or numerical. The case when we have only one independent variable then it is called as simple linear regression. If we have more than one independent variable, then it is called as multivariate regression.

A mathematical representation of a linear regression model is as give below:

Y = β_0 + β_1X_1 + β_2X_2 + β_3X_3 + ….. + β_nX_n + error

In the above equation, β_0 coefficient represents intercept and β_i coefficient represents slope. Here we will be using a case study approach to help you understand the linear regression algorithm.

In the below case study, we will be using USA housing data to predict the price. Let us look at the top six observations of USA housing data.

# Reading data
header = TRUE, sep = ",")
# Print top 6 observations


 AreaIncome AreaHouse AreaNumberofRooms AreaNumberofBedrooms AreaPopulation     Price
1   79545.46  5.682861          7.009188                 4.09       23086.80 1059033.6
2   79248.64  6.002900          6.730821                 3.09       40173.07 1505890.9
3   61287.07  5.865890          8.512727                 5.13       36882.16 1058988.0
4   63345.24  7.188236          5.586729                 3.26       34310.24 1260616.8
5   59982.20  5.040555          7.839388                 4.23       26354.11  630943.5
6   80175.75  4.988408          6.104512                 4.04       26748.43 1068138.1


# Exploratory Data Analysis

Exploratory data analysis exercise is critical to any project related to Machine Learning. It is an approach to understand and summarize the main characteristics of a given data. Mostly, this involves slicing and dicing of data at different levels, and results are often presented with visual methods. If done correctly, it can reveal many aspects of the data, which will surely help you build better models.

Every dataset is different, and thus, it isn’t easy to list down steps one should perform as part of data exploration. However, the key to a successful EDA is to keep asking the questions which one believes helps in solving the business problem or put across all sorts of hypothesis and then testing them using appropriate statistical tests. Read

In other words, try to figure if there is a statistically significant relationship between the target and independent variables. What are the things which derive target variables?

Below are few things which we should consider exploring from the statistical point of view:

1. Checking distribution of target variable – First, you should always try to understand the nature of your target variable. To achieve this, we will be drawing a histogram with a density plot.

library(ggplot2)
# Building histogram
ggplot(data=housing, aes(housing$Price)) + geom_histogram(aes(y =..density..), fill = "orange") + geom_density()  The price variable follows normal distribution and It is good that the target variable follows a normal distribution from linear regressions perspective. If you are wondering why so? Then don’t worry we got that covered in coming sections. 2. Analyzing Summary Statistics – Here, we will simply create summary statistics for all the variables to understand the behavior of all the independent variables. It will also provide information about missing values or outliers if any. For more information and functions which you can use read beginner’s guide to exploratory data analysis. Both missing values and outliers are of concern for Machine Learning models as they tend to push the result towards extreme values. # loading psych package library(psych) psych::describe(housing)   vars n mean sd median trimmed mad min max range AreaIncome 1 5000 68583.11 10657.99 68804.29 68611.84 10598.27 17796.63 107701.75 89905.12 AreaHouse 2 5000 5.98 0.99 5.97 5.98 0.99 2.64 9.52 6.87 AreaNumberofRooms 3 5000 6.99 1.01 7.00 6.99 1.01 3.24 10.76 7.52 AreaNumberofBedrooms 4 5000 3.98 1.23 4.05 3.92 1.33 2.00 6.50 4.50 AreaPopulation 5 5000 36163.52 9925.65 36199.41 36112.49 9997.21 172.61 69621.71 69449.10 Price 6 5000 1232072.65 353117.63 1232669.38 1232159.69 350330.42 15938.66 2469065.59 2453126.94 skew kurtosis se AreaIncome -0.03 0.04 150.73 AreaHouse -0.01 -0.09 0.01 AreaNumberofRooms -0.04 -0.08 0.01 AreaNumberofBedrooms 0.38 -0.70 0.02 AreaPopulation 0.05 -0.01 140.37 Price 0.00 -0.06 4993.84  3. Checking Outliers Using Boxplots – To learn more about outliers and how to identify, please read – How To Identify & Treat Outliers Using Uni-variate Or Multivariate Methods. Here are using a boxplot for plotting the distribution of each numerical variable to check for outliers. If points lie beyond whispers, then we have outlier values present. For now, we are just going by univariate outlier analysis. But I encourage you to check for outliers at a multivariate level as well. If outliers are present, then you must either remove or do a proper treatment before moving forward. library(reshape) meltData <- melt(housing) p <- ggplot(meltData, aes(factor(variable), value)) p + geom_boxplot() + facet_wrap(~variable, scale="free")  Apart from Area of Number of Bedrooms all other variables seem to have outliers 4. Correlation Matrix Visualization We will use corrgram package to visualize and analyze the correlation matrix. To learn more about how to check the significance of correlation and different ways of visualizing the correlation matrix, please read Correlation In R – A Brief Introduction. In theory, the correlation between the independent variables should be zero. In practice, we expect and are okay with weak to no correlation between independent variables. We also expect that independent variables reflect a high correlation with the target variable. require(corrgram) corrgram(housing, order=TRUE)  # Training Regression Model To build a linear regression, we will be using lm() function. The function takes two main arguments. • Formula stating the dependent and independent variables separated by ~(tilder). • The dataset name. • There are other useful arguments and thus would request you to use help(lm) to read more from the documentation. ## Diving data into train and test subsets The housing data is divided into 70:30 split of train and test. The 70:30 split is the most common and is mostly used during the training phase. 70% of the data is used for training, and the rest 30% is for testing how good we were able to learn the data behavior. library(caret) # Split data into train and test index <- createDataPartition(housing$Price, p = .70, list = FALSE)
train <- housing[index, ]
test <- housing[-index, ]

# Checking the dim of train
dim(train)


# Output
 3500    6


You can see we have 70% of the random observations in the training dataset.

## Building Model

# Taining model
lmModel <- lm(Price ~ . , data = train)
# Printing the model object
print(lmModel)


# Output
Call:
lm(formula = Price ~ ., data = housing)

Coefficients:
(Intercept)            AreaIncome             AreaHouse
-2637299.03                 21.58             165637.03
AreaNumberofRooms  AreaNumberofBedrooms        AreaPopulation
120659.95               1651.14                 15.20


## Interpreting Regression Coefficients

In the above output, Intercept represents that the minimum value of Price that will be received, if all the variables are constant or absent.

Intercept may not always make sense in business terms.

Slope(represented by independent variables) tells us about the rate of change that the Price variable will witness, with every one unit change in the independent variable. For example – if AreaHouse of house increases by one more unit, the Price of the house will increase by 165,637.

# Validating Regression Coefficients and Models

We must ensure that the value of each beta coefficient is significant and has not come by chance. In R, the lm function runs a one-sample t-test against each beta coefficient to ensure that they are significant and have not come by chance. Similarly, we need to validate the overall model. Just like a one-sample t-test, lm function also generates three statistics, which help data scientists to validate the model. These statistics include R-Square, Adjusted R-Square, and F-test, also known as global testing.

To view these statistics, we need to pass the lmModel object to the summary() function.

# Checking model statistics
summary(lmModel)


# Output
Call:
lm(formula = Price ~ ., data = housing)

Residuals:
Min      1Q  Median      3Q     Max
-337020  -70236     320   69175  361870

Coefficients:
Estimate    Std. Error  t value            Pr(>|t|)
(Intercept)          -2637299.0333    17157.8092 -153.708 <0.0000000000000002 ***
AreaIncome                 21.5780        0.1343  160.656 <0.0000000000000002 ***
AreaHouse              165637.0269     1443.4130  114.754 <0.0000000000000002 ***
AreaNumberofRooms      120659.9488     1605.1604   75.170 <0.0000000000000002 ***
AreaNumberofBedrooms     1651.1391     1308.6712    1.262               0.207
AreaPopulation             15.2007        0.1442  105.393 <0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 101200 on 4994 degrees of freedom
Multiple R-squared:  0.918,    Adjusted R-squared:  0.9179
F-statistic: 1.119e+04 on 5 and 4994 DF,  p-value: < 0.00000000000000022


In the above output, Pr(>|t|) represents the p-value, which can be compared against the alpha value of 0.05 to ensure if the corresponding beta coefficient is significant or not. The lm function here lends a helping hand. All values in the output that have (.) period or (*) astric against the variable names indicates that these values are significant. Based upon this, we now know that all variables are statistically significant except AreaNumberofBedrooms.

For overall model accuracy, let’s discuss statistics generated by lm function one by one.

1. Multiple R-squared: 0.918 – The R-squared value is formally called a coefficient of determination. Here, 0.918 indicates that the intercept, AreaIncome, AreaHouse, AreaNumberofRooms, and AreaPopulation variables, when put together, are able to explain 91.8% of the variance in the Price variable. The value of R-squared lies between 0 to 1. In practical applications, if the R2 value is higher than 0.70, we consider it a good model.

2. Adjusted R-squared: 0.9179 – The Adjusted R-squared value tells if the addition of new information ( variable ) brings significant improvement to the model or not. So as of now, this value does not provide much information. However, the increase in the adjusted R-squared value with the addition of a new variable will indicate that the variable is useful and brings significant improvement to the model.

A large difference between the R-Squared and Adjusted R-squared is not appreciated and generally indicates that multicollinearity exists within the data.

3. F-statistic: 1.119e+04 on 5 and 4994 DF, p-value: < 0.00000000000000022 – This line talks about the global testing of the model. The lm function runs an ANOVA test to check the significance of the overall model. Here the null hypothesis is that the model is not significant, and the alternative is that the model is significant. According to the p-values < 0.05, our model is significant.

Albeit, looking at these statistics is enough to take a call on the model significance. But there are other validation methods for linear regression that can be of help while deciding how good or bad the model is. Some of them are mentioned below:

4. AIC and BIC values – The AIC(Akaike’s information criterion, 1974) and BIC(Bayesian information criterion, 1978) are penalized-likelihood criteria. Both these measures use a “measure of fit + complexity penalty” to get the final values.

AIC = – 2 * ln(likelihood) + 2 * p

BIC = – 2 * ln(likelihood) + ln(N) * p

Here p = number of estimated parameters and N = sample size.

The AIC and BIC values can be used for choosing the best predictor subsets in regression and for comparing different models. When comparing different models, the model with minimum AIC and BIC values is considered the best model.

Note
AIC is likely to overfit the data, whereas BIC is susceptible to underfit the data.

# Using AIC function
AIC(lmModel)
# Using BIC function
BIC(lmModel)


# Output AIC
 129441.3

# Output BIC
 129486.9


5. Root Mean Square Error(RMSE) – By comparing the RMSE statistics of different models, we can decide which is a better model. The model with the lower RMSE value is considered a better model. There are other similar functions like MAE, MAPE, MSE, and so on that can be used. These functions can be found in Metrics R Package. These functions take majorly two arguments: One is the actual value and second, predicted values. So let’s see how we can get these values. The actuals can be 100% found from the original dataset or the training data in our case. However, to find the fitted values, we need to explore the model object.

# Checking model object for actual and predicted values
names(lmModel)


# Output
 "coefficients"  "residuals"
 "effects"       "rank"
 "fitted.values" "assign"
 "qr"            "df.residual"
 "xlevels"       "call"
 "terms"         "model"


The above vector presents the names of the object that constitute the model object. Here, fitted values are the predicted values. Now, we will use these values to generate the rmse values.

library(Metrics)
rmse(actual = train$Price, predicted = lmModel$fitted.values)


# Output
 493370.4


# Checking Assumptions of Linear Regression

Linear regression is parametric, which means the algorithm makes some assumptions about the data. A linear regression model is only deemed fit is these assumptions are met. There are about four assumptions and are mentioned below. If the model fails to meet these assumptions, then we simply cannot use this model.

1. Errors should follow normal distribution – This can be checked by drawing a histogram of residuals or by using plot() function. The plot function creates 4 different charts. One of which is an NPP plot. The chart confirms if the errors follow a normal distribution or not.

## Generating histogram

# Histogram to check the distribution of errors
hist(lmModel$residuals, color = "grey")  The above histogram of errors clearly states that errors are normally distributed. ## Generating NPP plot We except the points to be very close to the dotted line in an NPP plot. Points being close to the line means that errors follow a normal distribution. plot(lmModel)  2. There should be no heteroscedasticity – This means that the variance of error terms should be constant. We shall not see any patterns when we draw a plot between residuals and fitted values. And the mean line should be close to Zero. ## Generating the scatterplot between residuals and fitted values # Using plot function plot(lmModel) checking heteroskedasticity in r A straight red line closer to the zero value represents that we do not have heteroscedasticity problem in our data. 3. There should be no multicollinearity – The linear model assumes that the predictor variables do not correlate with each other. If they exhibit high correlation, it is a problem and is called multicollinearity. A variation inflation factor test can help check for the multicollinearity assumption. VIF = 1/(1-R2) The R implementation of the below function can be found here. VIF is an iterative process. The function will remove one variable at a time, which is cause for multicollinearity and repeats the process until all problem causing variables are removed. So, finally, we are left with the list of variables that have no or very weak correlation between them. vif_func(housing[, 1:5]) var vif AreaIncome 1.00115868968647 AreaHouse 1.00057658981485 AreaNumberofRooms 1.27353508823836 AreaNumberofBedrooms 1.27441273719468 AreaPopulation 1.00126579728799 All variables have VIF < 10, max VIF 1.27  "AreaIncome"  "AreaHouse"  "AreaNumberofRooms"  "AreaNumberofBedrooms"  "AreaPopulation"  There is no multicollinearity problem in the dataset. Generally, VIF values which are greater than 5 or 7 are the cause of multicollinearity. 3. There should be no auto serial correlation – The autocorrelation means that error terms should not be correlated with each other. To check this, we can run the Durbin-Watson test(dw test). The test returns a value between 0 and 4. If the value is two, we say there is no auto serial correlation. However, a value higher than 2 represents (-) ve correlation and value lower than 2 represents (+) ve correlation. library("lmtest") dwtest(lmModel) # Output Durbin-Watson test data: lmModel DW = 1.9867, p-value = 0.3477 alternative hypothesis: true autocorrelation is greater than 0  We got a value of 1.9867 which suggests that there is no auto serial correlation. Our model met all the four assumptions of linear regression. ## Predicting Dependent Variable(Y) in Test Dataset We test the model performance on test data set to ensure that our model is stable, and we get the same or closer enough results to use this trained model to predict and forecast future values of dependent variables. To predict, we use predict function, and then we generate R-Squared value to see if we get the same result as we got in the training dataset or not. # Predicting Price in test dataset test$PreditedPrice <- predict(lmModel, test)
# Priting top 6 rows of actual and predited price
head(test[ , c("Price", "PreditedPrice")])

# Output
Price   PreditedPrice
2  1505890.9     1494997.3
3  1058988.0     1253539.2
8  1573936.6     1572965.3
12  663732.4      629153.8
23  718887.2      727497.8
24  743999.8      991034.7


## Generating R-Squared Value for the test dataset

We are using a user-defined formula to generate the R-Squared value here.

actual <- test$Price preds <- test$PreditedPrice
rss <- sum((preds - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
rsq

# Output
 0.9140436


Our model is performing fantastic.

In the test dataset, we got an accuracy of 0.9140436 and a training data set, we got an accuracy of 0.918.

In this chapter, We learned many things related to linear regression from a practical and theoretical point of view. We learned when to use linear regression, how to use it, how to check the assumptions of linear regression, how to predict the target variable in test dataset using trained model object, and we also learned how to validate the linear regression model using different statistical methods. In the next chapter, we will learn about an advanced linear regression model called ridge regression.