Simple linear regression in R: we want to build models that investigate and forecast the relationship between variables, and the most basic relationship we can think of is a straight line.
Let’s take a look at the first linear relationship that we are going to create.
Simple Linear Regression in R
Let’s load the Boston Housing data from the mlbench package.
library(mlbench)
data("BostonHousing2")
head(BostonHousing2)
dim(BostonHousing2)
The data set contains 506 rows and 19 columns.
Now we can check the association between the average number of rooms in a house and the median house price from this data set.
We can use ggplot2 to draw a scatterplot.
library(ggplot2)
ggplot(BostonHousing2, mapping = aes(y = medv, x = rm)) +
  geom_point() +
  xlab("Average number of Rooms") +
  ylab("Median House Price")
The median house price and the average number of rooms show a clear positive, roughly linear relationship.
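To put a number on the relationship seen in the scatterplot, one option is to compute the correlation and fit a straight line. The sketch below is an illustration rather than part of the original walkthrough; the object names `r` and `fit_rooms` are made up for this example.

```r
library(mlbench)
data("BostonHousing2")

# Pearson correlation between average rooms (rm) and median value (medv)
r <- cor(BostonHousing2$rm, BostonHousing2$medv)
r

# Simple linear regression of median value on average number of rooms
# (fit_rooms is a hypothetical name chosen for this sketch)
fit_rooms <- lm(medv ~ rm, data = BostonHousing2)
coef(fit_rooms)
```

A correlation close to 1 and a positive slope for rm would support what the plot suggests.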
Now let’s look at another example: the relationship between a diamond’s price and its carat weight, shown with a hexbin plot.
Let’s look at the dataset first.
head(diamonds)
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

library(hexbin)
ggplot(diamonds, mapping = aes(x = carat, y = price)) +
  geom_hex(bins = 50)
Looking at the plot, the relationship doesn’t appear to be linear, but we can make it linear with a small change: taking the log of both variables.
ggplot(diamonds, mapping = aes(x = log10(carat), y = log10(price))) +
  geom_hex(bins = 50)
We’ll use the log of carat to forecast the log of a diamond’s price. The plot used log10 while the model below uses the natural log; the slope is the same in either base, and only the intercept changes.
lm <- lm(log(price) ~ log(carat), data = diamonds)
summary(lm)
Call:
lm(formula = log(price) ~ log(carat), data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max
-1.50833 -0.16951 -0.00591  0.16637  1.33793

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.448661   0.001365  6190.9   <2e-16 ***
log(carat)  1.675817   0.001934   866.6   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2627 on 53938 degrees of freedom
Multiple R-squared:  0.933,	Adjusted R-squared:  0.933
F-statistic: 7.51e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
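One way to read the slope of a log-log model: since log(price) = a + b * log(carat) is the same as price = exp(a) * carat^b, multiplying carat by some factor k multiplies the predicted price by k^b. The short sketch below works this out using the slope from the summary() output above; the 10% increase is just a hypothetical value picked for illustration.

```r
# Slope copied from the summary() output above
slope <- 1.675817

# Hypothetical scenario: carat weight increases by 10%
carat_ratio <- 1.10

# In a log-log model the predicted price is multiplied by carat_ratio^slope
price_ratio <- carat_ratio^slope
pct_change  <- (price_ratio - 1) * 100
round(pct_change, 1)
```

Under this model, a 10% heavier diamond is predicted to cost roughly 17% more, which reflects a slope greater than 1.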
For the linear model we have the following assumptions:
- Linearity (a straight-line relationship between log price and log carat)
- Homoscedasticity (the noise terms have the same variance)
- Normality (the noise terms are normally distributed)
- Independence (the noise terms are independent)
To check linearity, plot the residuals versus the fitted values.
plot(lm, which = 1)
The red line helps us spot any pattern in the residuals. It is essentially straight in this example, which indicates no trend in the residuals, so the linearity assumption appears to be satisfied.
Next, let’s look at the spread of the residuals to check homoscedasticity.
plot(lm, which = 3)
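Here we’d like to see an even spread of points as we move from left to right. There are no obvious trends here.
To check normality, we use which = 2, which draws a normal Q-Q plot of the standardized residuals. Points lying close to the reference line suggest the residuals are approximately normal.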
plot(lm, which = 2)
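To see what the three diagnostic plots above are built from, here is a rough by-hand sketch. It uses simulated data as a stand-in for the diamonds data (an assumption, so that it runs without extra packages), and the names `fit`, `res`, `std_res`, and `fitted_vals` are chosen only for this example. The smoothed lines are drawn with lowess() and may differ in detail from the ones plot() draws.

```r
set.seed(1)

# Simulated stand-in for the diamonds data (an assumption for illustration;
# the intercept, slope, and noise level only roughly mimic the fit above)
n <- 500
log_carat <- log(runif(n, min = 0.2, max = 3))
log_price <- 8.45 + 1.68 * log_carat + rnorm(n, sd = 0.26)
fit <- lm(log_price ~ log_carat)

fitted_vals <- fitted(fit)      # predicted log prices
res         <- residuals(fit)   # raw residuals
std_res     <- rstandard(fit)   # standardized residuals

# Roughly what which = 1 shows: residuals against fitted values
plot(fitted_vals, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
lines(lowess(fitted_vals, res), col = "red")

# Roughly what which = 3 shows: sqrt(|standardized residuals|) against fitted values
plot(fitted_vals, sqrt(abs(std_res)),
     xlab = "Fitted values", ylab = "sqrt(|standardized residuals|)")
lines(lowess(fitted_vals, sqrt(abs(std_res))), col = "red")

# Roughly what which = 2 shows: normal Q-Q plot of the standardized residuals
qqnorm(std_res)
qqline(std_res)
```

Because the simulated noise is normal with constant variance, all three plots should look well behaved here; real data will usually be messier.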
The last assumption is independence. It cannot be read off a plot like the others; it requires some understanding of the data’s origins, meaning, and collection methods. So there are no shortcuts.
With all of the assumptions met, we can now use the formula below to forecast log(price).
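Based on the coefficients reported by summary() earlier, the fitted line would be log(price) = 8.448661 + 1.675817 * log(carat). The sketch below applies it to a few carat values and converts the result back to the price scale; the helper `predict_log_price` and the carat values are hypothetical and used only for illustration.

```r
# Coefficients copied from the summary() output earlier in the article
intercept <- 8.448661
slope     <- 1.675817

# Hypothetical helper (not from the original article): predicted log(price)
predict_log_price <- function(carat) {
  intercept + slope * log(carat)
}

# Hypothetical carat values to forecast for
carats <- c(0.5, 1, 2)

log_price_hat <- predict_log_price(carats)
price_hat     <- exp(log_price_hat)   # back-transform to the price scale

data.frame(carat = carats, log_price = log_price_hat, price = price_hat)
```

With the fitted model still in memory, predict(lm, newdata = data.frame(carat = carats)) should give the same log prices. Note that exponentiating a predicted log price is better read as an estimate of the typical (median) price than of the mean price.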