[This article was first published on r – Appsilon | End to End Data Science Solutions, and kindly contributed to R-bloggers].

## Linear Regression with R

Chances are you've had some prior exposure to machine learning and statistics. Basically, that’s all linear regression is – a simple statistics problem.

Today you’ll learn the different types of linear regression and how to implement all of them in R.

## Introduction to Linear Regression

Linear regression is a simple algorithm developed in the field of statistics. As the name suggests, linear regression assumes a linear relationship between the input variable(s) and a single output variable. Needless to say, the output variable (what you’re predicting) has to be continuous. The output variable can be calculated as a linear combination of the input variables.

There are two types of linear regression:

• Simple linear regression – only one input variable
• Multiple linear regression – multiple input variables

You’ll implement both today – simple linear regression from scratch and multiple linear regression with built-in R functions.

You can use a linear regression model to gauge which features are important by examining the coefficients. Provided the features are on comparable scales, a coefficient close to zero suggests the corresponding feature matters less than one with a large positive or negative coefficient.

That’s how the linear regression model generates the output. Coefficients are multiplied with corresponding input variables, and in the end, the bias (intercept) term is added.
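
In symbols, the description above corresponds to the usual linear model. The notation here is an assumption chosen to match the Beta0/Beta1 naming used later in the article, with n input variables:

$$
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n
$$

Here each coefficient multiplies its input variable, and the intercept term is added at the end.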

There’s still one thing we should cover before diving into the code – assumptions of a linear regression model:

• Linear assumption – the model assumes that the relationship between variables is linear
• No noise – the model assumes that the input and output variables are not noisy, so remove outliers where possible
• No collinearity – the model will overfit when you have highly correlated input variables
• Normal distribution – strictly speaking, this assumption concerns the residuals, but the model tends to make more reliable predictions if your input and output variables are roughly normally distributed. If that’s not the case, try transforming your variables to make them more normal-looking
• Rescaled inputs – use scalers or normalizers to make more reliable predictions

You should be aware of these assumptions every time you create linear models. We’ll ignore most of them for the purpose of this article, as the goal is to show you the general syntax you can copy-paste between projects.

## Simple Linear Regression from Scratch

If you have a single input variable, you’re dealing with simple linear regression. It won’t be the case most of the time, but it can’t hurt to know. A simple linear regression can be expressed as:

Image 1 – Simple linear regression formula (line equation)

As you can see, there are two coefficients you need to calculate beforehand: Beta0 and Beta1.

You’ll first see how to calculate Beta1, as Beta0 depends on it. Here’s the formula:

Image 2 – Beta1 equation

And here’s the formula for Beta0:

Image 3 – Beta0 equation

These x’s and y’s with the bar over them represent the mean (average) of the corresponding variables.
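
For reference, the three equations can be written out as follows. These are the standard least-squares forms for a single input variable; the exact notation is an assumption, since only the figure captions are reproduced here:

$$
\begin{aligned}
\hat{y} &= \beta_0 + \beta_1 x \\
\beta_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\
\beta_0 &= \bar{y} - \beta_1 \bar{x}
\end{aligned}
$$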

Let’s see how all of this works in action. The code snippet below generates X with 300 linearly spaced numbers between 1 and 300, and generates Y as a value from the normal distribution centered just above the corresponding X value with a bit of noise added. Both X and Y are then combined into a single data frame and visualized as a scatter plot with the ggplot2 package:
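
A minimal sketch of that snippet might look like the following. The seed, the offset of 2 above X, the standard deviation of 25, and the styling are all assumptions made for illustration, not values confirmed by the article:

```r
library(ggplot2)

set.seed(42)  # assumed seed, only so the sketch is reproducible

# 300 linearly spaced inputs: 1, 2, ..., 300
x <- seq(from = 1, to = 300)
# One draw per x from a normal distribution centred just above x
# (offset 2 and sd 25 are illustrative assumptions)
y <- rnorm(n = length(x), mean = x + 2, sd = 25)

# Combine both vectors into a single data frame
simple_lr_data <- data.frame(x, y)

# Scatter plot of the generated data
p <- ggplot(simple_lr_data, aes(x = x, y = y)) +
  geom_point(color = "steelblue") +
  labs(title = "Dataset for simple linear regression")
print(p)
```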

Image 4 – Input data as a scatter plot

Onto the coefficient calculation now. The coefficients for Beta0 and Beta1 are obtained first, and then wrapped into a simple_lr_predict() function that implements the line equation.
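
As a sketch, the calculation could be written like this. The data generation at the top repeats the assumed parameters (seed, offset, and standard deviation are illustrative), and the variable names `b0` and `b1` are hypothetical:

```r
set.seed(42)  # assumed seed for reproducibility
x <- seq(from = 1, to = 300)
y <- rnorm(n = length(x), mean = x + 2, sd = 25)

# Slope: sum of cross-deviations over the sum of squared deviations of x
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept: makes the line pass through (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

# The line equation wrapped in a function
simple_lr_predict <- function(x) {
  b0 + b1 * x
}

# Sanity check: the built-in lm() should give the same two coefficients
print(c(b0 = b0, b1 = b1))
print(coef(lm(y ~ x)))
```

Comparing against `lm()` is a quick way to confirm that the from-scratch formulas were typed in correctly.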

The predictions can then be obtained by applying the simple_lr_predict() function to the vector X – they should all lie on a single straight line. Finally, input data and predictions are visualized with the ggplot2 package:
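
One way to sketch this step is shown below. It rebuilds the data and coefficients so the block runs on its own; the seed, noise parameters, and the `y_pred` column name are assumptions for illustration:

```r
library(ggplot2)

set.seed(42)  # assumed seed for reproducibility
x <- seq(from = 1, to = 300)
y <- rnorm(n = length(x), mean = x + 2, sd = 25)

# Coefficients from the closed-form formulas
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
simple_lr_predict <- function(x) b0 + b1 * x

# Arithmetic in R is vectorised, so the function applies to the whole vector
simple_lr_data <- data.frame(x, y, y_pred = simple_lr_predict(x))

# Scatter plot of the inputs with the fitted line drawn on top
p <- ggplot(simple_lr_data, aes(x = x, y = y)) +
  geom_point(color = "steelblue") +
  geom_line(aes(y = y_pred), color = "black") +
  labs(title = "Applying simple linear regression to data")
print(p)
```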

Image 5 – Input data as a scatter plot with predictions (best-fit line)

And that’s how you can implement simple linear regression in R from scratch! Next, you’ll learn how to handle situations when there are multiple input variables.

## Multiple Linear Regression with R

You’ll use the Fish Market dataset to build your model. To start, the goal is to load in the dataset and check if some of the assumptions hold. Normal distribution and outlier assumptions can be checked with boxplots.

The code snippet below loads in the dataset and visualizes box plots for every feature (not the target):
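
The Fish Market CSV is not bundled with R, so the sketch below uses a few numeric columns of the built-in `mtcars` data as a stand-in, with `mpg` playing the role of the target. The column selection and the base-graphics layout are assumptions; with the fish data you would read the file with `read.csv()` and exclude the weight column instead:

```r
# Stand-in data: one target (mpg) plus five numeric features
df <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
target <- "mpg"
features <- setdiff(names(df), target)

# One boxplot per feature, each on its own scale
op <- par(mfrow = c(1, length(features)), mar = c(2, 2, 2, 1))
for (f in features) {
  boxplot(df[[f]], main = f)
}
par(op)  # restore the previous graphics settings
```

Points drawn beyond the whiskers are candidate outliers, and a median line far from the centre of the box hints at skew.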

Image 6 – Boxplots of the input features

A degree of skew seems to be present in all input variables, and the first three contain a couple of outliers. We’ll keep this article strictly machine learning based, so we won’t do any data preparation and cleaning.

Train/test split is the obvious next step once you’re done with preparation. The caTools package is the perfect candidate for the job.
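
A minimal sketch of the split with `caTools::sample.split()` follows. The 70:30 ratio and the seed are assumptions, and the built-in `mtcars` data stands in for the Fish Market dataset, which is not bundled with R:

```r
library(caTools)

set.seed(42)  # assumed seed for a reproducible split

# Stand-in data: mpg is the target here; use the weight column for the fish data
df <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]

# sample.split() returns a logical vector; TRUE marks rows for the training set
is_train <- sample.split(df$mpg, SplitRatio = 0.7)
train <- df[is_train, ]
test <- df[!is_train, ]

print(c(train = nrow(train), test = nrow(test)))
```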

You can train the model on the training set after the split. R has the lm function built-in, and it is used to train linear models. Inside the lm function, you’ll need to write the target variable on the left and input features on the right, separated by the ~ sign. If you put a dot instead of feature names, it means you want to train the model on all features.

After the model is trained, you can call the summary() function to see how well it performed on the training set. Here’s a code snippet for everything discussed so far:
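
The steps so far can be sketched as below. This is an illustration under stated assumptions: `mtcars` stands in for the Fish Market data (so `mpg` replaces the weight target), and a base-R `sample()` split is used to keep the sketch free of extra packages:

```r
set.seed(42)  # assumed seed for a reproducible split

# Stand-in data: one target (mpg) plus five numeric features
df <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]

# 70:30 train/test split (ratio is an assumption)
train_idx <- sample(nrow(df), size = floor(0.7 * nrow(df)))
train <- df[train_idx, ]
test <- df[-train_idx, ]

# Target on the left of ~, features on the right; "." means all other columns
model <- lm(mpg ~ ., data = train)

# Coefficients, standard errors, p-values, and R-squared on the training set
print(summary(model))
```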

Image 7 – Summary statistics of a multiple linear regression model

The most interesting thing here is the P-values, displayed in the Pr(>|t|) column. Each value is the probability of seeing a t-statistic at least this extreme if the variable's true coefficient were zero. It’s common to use a 5% significance threshold, so if a P-value is 0.05 or below, we can treat the variable as statistically significant for the analysis.

Let’s make a residuals plot next. As a general rule, if a histogram of residuals looks normally distributed and centered on zero, the linear model is capturing the structure in the data well. If not, it's a sign the model can be improved somehow. Here’s the code for visualizing residuals:
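
A sketch of such a plot with base graphics is given below. As assumptions, `mtcars` stands in for the Fish Market data, the split uses base-R `sample()`, and the number of bins is arbitrary:

```r
set.seed(42)  # assumed seed for a reproducible split

# Stand-in data and a 70:30 split
df <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
train_idx <- sample(nrow(df), size = floor(0.7 * nrow(df)))
train <- df[train_idx, ]

model <- lm(mpg ~ ., data = train)

# Residuals: actual minus fitted values on the training set
res <- residuals(model)

# Histogram of the residuals; a roughly bell-shaped, zero-centred plot is the goal
hist(res, breaks = 15, main = "Histogram of residuals", xlab = "Residual")
```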

Image 8 – Residuals plot of a multiple linear regression model

As you can see, there’s a bit of skew present due to a large error on the far right.

And now it’s time to make predictions on the test set. You can use the predict() function to apply the model to the test set. As an additional step, you can combine actual values and predictions into a single data frame, just so the evaluation becomes easier. Here’s how:
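
This step can be sketched as follows, again with `mtcars` as an assumed stand-in for the Fish Market data and hypothetical names (`eval_df`, `actual`, `predicted`) for the comparison table:

```r
set.seed(42)  # assumed seed for a reproducible split

# Stand-in data and a 70:30 split
df <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
train_idx <- sample(nrow(df), size = floor(0.7 * nrow(df)))
train <- df[train_idx, ]
test <- df[-train_idx, ]

model <- lm(mpg ~ ., data = train)

# Apply the trained model to rows it has not seen
preds <- predict(model, newdata = test)

# Actual values and predictions side by side
eval_df <- data.frame(actual = test$mpg, predicted = preds)
print(head(eval_df))
```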

Image 9 – Dataset comparing actual values and predictions for the test set

If you want a more concrete way of evaluating your regression models, look no further than RMSE (Root Mean Squared Error). It is the square root of the average squared difference between actual and predicted values, so it tells you roughly how wrong your model is on average, in the units of the target. In this case, it reports the typical error in units of weight:
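
A minimal sketch of the calculation is shown below. It runs on `mtcars` as an assumed stand-in for the Fish Market data, so the number it produces is in miles per gallon and will not match the weight-based figure quoted in this article:

```r
set.seed(42)  # assumed seed for a reproducible split

# Stand-in data and a 70:30 split
df <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
train_idx <- sample(nrow(df), size = floor(0.7 * nrow(df)))
train <- df[train_idx, ]
test <- df[-train_idx, ]

model <- lm(mpg ~ ., data = train)
preds <- predict(model, newdata = test)

# Mean squared error on the test set, then its square root
mse <- mean((test$mpg - preds)^2)
rmse <- sqrt(mse)
print(rmse)
```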

The rmse variable holds the value of 83.60, indicating the model is on average wrong by 83.60 units of weight.

## Conclusion

Today you’ve learned how to train linear regression models in R. You’ve implemented a simple linear regression model entirely from scratch, and a multiple linear regression model with built-in functions on a real dataset.

You’ve also learned how to evaluate the model through summary functions, residuals plots, and various metrics such as MSE and RMSE.