Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Welcome to Introduction to R for Data Science Session 6: Linear Regression + EDA, and Normality tests [Linear Regression in R: Exploratory Data Analysis, assumptions of the simple linear model, correlation, and visualization. Predictions from the linear model. Confidence Intervals and Residuals. Inspecting the basic linear model. Infulential cases and the Influence Plot.]

The course is co-organized by Data Science Serbia and Startit. You will find all course material (R scripts, data sets, SlideShare presentations, readings) on these pages.

Check out the Course Overview to acess the learning material presented thus far.

Data Science Serbia Course Pages [in Serbian]

Startit Course Pages [in Serbian]

## Summary of Session 6, 02. June 2016 :: Linear Regression + EDA, and Normality tests.

Linear Regression + EDA and Normality tests. Linear Regression in R: Exploratory Data Analysis, assumptions of the simple linear model, correlation, and visualization. Predictions from the linear model. Confidence Intervals and Residuals. Inspecting the basic linear model. Infulential cases and the Influence Plot.

## R script :: Session 6

```########################################################
# Introduction to R for Data Science
# SESSION 6 :: 2 June, 2016
# Simple Linear Regression in R
# Data Science Community Serbia + Startit
# :: Goran S. Milovanović and Branko Kovač ::
########################################################

# clear
rm(list=ls())

library(datasets)
data(iris)

### iris data set description:
# https://stat.ethz.ch/R-manual/R-devel/library/iriss/html/iris.html

### Exploratory Data Analysis (EDA)
str(iris)
summary(iris)

### EDA plots
# plot layout: 2 x 2
par(mfcol = c(2,2))
# boxplot iris\$Sepal.Length
boxplot(iris\$Sepal.Length,
horizontal = TRUE,
xlab="Sepal Length")
# histogram: iris\$Sepal.Length
hist(iris\$Sepal.Length,
main="",
xlab="Sepal.Length",
prob=T)
# overlay iris\$Sepal.Length density function over the empirical distribution
lines(density(iris\$Sepal.Length),
lty="dashed",
lwd=2.5,
col="red")
# boxplot iris\$Petal.Length
boxplot(iris\$Petal.Length,
horizontal = TRUE,
xlab="Petal Length")
# histogram: iris\$Petal.Length,
hist(iris\$Petal.Length,
main="",
xlab="Petal Length",
prob=T)
# overlay iris\$Petal.Length density function over the empirical distribution
lines(density(iris\$Petal.Length),
lty="dashed",
lwd=2.5,
col="red")```

```# NOTE: Boxplot "fences" and outlier detection
# Boxplot in R recognizes as outliers those data points that are found beyond OUTTER fences
# Source: http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
# Q3 = 75 percentile, Q1 = 25 percentile
# IQ = Q3 - Q1; Interquartile range
# lower inner fence: Q1 - 1.5*IQ
# upper inner fence: Q3 + 1.5*IQ
# lower outer fence: Q1 - 3*IQ
# upper outer fence: Q3 + 3*IQ
# A point beyond an inner fence on either side is considered a mild outlier
# A point beyond an outer fence is considered an extreme outlier

# plot variable density in general: Sepal Width
# plot layout
par(mfcol = c(1,2))
# NOTE: this is kernel density estimation in R. You are not testing any distribution yet.
PLengthDensity <- density(iris\$Petal.Length)
plot(PLengthDensity, main="Petal Length Probability Density",
xlab = "Petal Length",cex.main = .7)
polygon(PLengthDensity,
col="red", border="black")

### Step 1: Pearson Correlation
## check whether the assumption of normality holds: Sepal Widht
# NOTE: check the documentation of the ks.test()
# URL: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/ks.test.html
ksPLength <- ks.test(iris\$Petal.Length,
"pnorm",
mean(iris\$Petal.Length),
sd(iris\$Petal.Length),
alternative = "two.sided",
exact = NULL)
ksPLength # According to KS, Petal.Length is not Gaussian
# NOTE: If a single-sample test is used, the parameters specified in ...
# must be pre-specified and not estimated from the data. {TRUE}
# Thus what we're doing here is not acceptable. Remember how many times have you seen
# the KS test applied in this manner? Imagine...

# NOTE: Gaussian Density in R
# All R distribution and density functions are similar
# dnorm(x, mean = 0, sd = 1, log = FALSE) for density (PDF)
# pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) for distribution function (CDF)
# NOTE: lower.tail = FALSE --> P[X>x], survivor (ie. decumulative probability) function
# qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) (quantiles)
# rnorm(n, mean = 0, sd = 1) # random deviates

# plot variable density in general: Sepal Length
SLengthDensity <- density(iris\$Sepal.Length)
plot(SLengthDensity, main="Sepal Length Probability Density",
xlab = "Sepal Length", cex.main = .7)
polygon(SLengthDensity,

```## check whether the assumption of normality holds: Sepal Length
ksSLength <- ks.test(iris\$Sepal.Length,
"pnorm",
mean(iris\$Sepal.Length),
sd(iris\$Sepal.Length),
alternative = "two.sided",
exact = NULL)
ksSLength # KS test says this is a normal distribution

## AGAIN: check whether the assumption of normality holds: Petal.Lenght
# Shapiro-Wilk Test :: better for small samples + for Normal distribution only!
# NOTE: The most powerful statistical test given for a given alpha and compared
# to other tests similar in purpose!
swPLength <- shapiro.test(iris\$Petal.Length)
swPLength
# Null hypothesis: the data are normally distributed...
swPLength\$statistic
swPLength\$p # ... - while SW test says they are not normally distributed.
## Again: check whether the assumption of normality holds: Sepal Length
swSLength <- shapiro.test(iris\$Sepal.Length)
swSLength
# Null hypothesis: the data are normally distributed...
swSLength\$statistic
swSLength\$p # ... - and to SW test they do not seem to to be.

# Take-home message: good luck with normality tests

# more plots to check the normality assumption
theoNormSLength <- qnorm(ppoints(iris\$Sepal.Length),
mean(iris\$Sepal.Length),
sd(iris\$Sepal.Length))
# ?ppoints - produces evenly spaced probability points
qqplot(theoNormSLength,iris\$Sepal.Length,
main = "Q-Q Plot: Sepal Length",
xlab = "Theoretical Quantiles",
ylab = "Empirical Quantiles",
xlim = c(min(theoNormSLength),max(theoNormSLength)),
ylim = c(min(theoNormSLength),max(theoNormSLength)),
cex.main = .7
)
abline(0,1,col="red")```

```# ... and more plots to check the normality assumption
theoNormPLength <- qnorm(ppoints(iris\$Petal.Length),
mean(iris\$Petal.Length),
sd(iris\$Petal.Length))
qqplot(theoNormPLength,iris\$Petal.Length,
main = "Q-Q Plot: Petal Length",
xlab = "Theoretical Quantiles",
ylab = "Empirical Quantiles",
xlim = c(min(theoNormPLength),max(theoNormPLength)),
ylim = c(min(theoNormPLength),max(theoNormPLength)),
cex.main = .7
)

```## Pearson correlation in R {base}
cor1 <- cor(iris\$Sepal.Length, iris\$Petal.Length,
method="pearson")
cor1
par(mfcol = c(1,1))
plot(iris\$Sepal.Length, iris\$Petal.Length,
main = "Sepal Length vs Petal Length",
xlab = "Sepal Length", ylab = "Petal Length")

## Correlation matrix and treatment of missing data
dSet <- iris
# Remove one discrete variable
dSet\$Species <- NULL
# introduce NA in dSet\$Sepal.Length
dSet\$Sepal.Length <- NA

# Pairwise and Listwise Deletion:
cor1a <- cor(dSet,use="complete.obs") # listwise deletion
cor1a
cor1a <- cor(dSet,use="pairwise.complete.obs") # pairwise deletion
cor1a
cor1a <- cor(dSet,use="all.obs") # all observations - error
cor1a

# A fact about R: cor{base} does not test for statistical significance. True.
library(Hmisc)
cor2 <- rcorr(iris\$Sepal.Length,
iris\$Petal.Length,
type="pearson")
cor2\$r # correlations
cor2\$r[1,2] # that's what you need, right
cor2\$P # significant at
cor2\$n # num. observations
# NOTE: rcorr uses Pairwise deletion!

# Correlation matrix
cor2a <- rcorr(as.matrix(dSet),
type="pearson") # NOTE: as.matrix
# select significant at alpha == .05
w <- which(!(cor2a\$P<.05),arr.ind = T)
cor2a\$r[w] <- NA
cor2a\$P # compare w.
cor2a\$r

# But the assumption of normality seems to fail for one variable!
# Use Spearman's rho (essentially, Pearson's R over ranks):
cor2a <- rcorr(as.matrix(dSet),
type="spearman") # NOTE: as.matrix
cor2a

# Linear Regression: lm()
# Predicting: Petal Length from Sepal Length
reg <- lm(Petal.Length ~ Sepal.Length, data=iris)
class(reg)
summary(reg)
coefsReg <- coefficients(reg)
coefsReg
slopeReg <- coefsReg
interceptReg <- coefsReg

# Prediction from this model
newSLength <- data.frame(Sepal.Length = runif(100,
min(iris\$Sepal.Length),
max(iris\$Sepal.Length))
) # watch the variable names in the new data.frame!
predictPLength <- predict(reg, newSLength)
predictPLength

# Results, more thoroughly
regSum <- summary(reg)
regSum

# Coefficient of determination
regSum\$r.squared

# Residuals
regSum\$residuals

# Coefficients
regSum\$coefficients

# F-statistic
regSum\$fstatistic

# Confidence Intervals
confint(reg, level=.95) # 95% CI
confint(reg, level=.99) # 99% CI
# 95% CI for slope only
confint(reg, "price", level=.95)

# Standardized regression coefficients {QuantPsych}
library(QuantPsyc)
lm.beta(reg)

# Reminder: standardized regression coefficients are...
# What you would obtain upon performing linear regression over standardized variables
# z-score in R
zSLength <- scale(iris\$Sepal.Length, center = T, scale = T); # computes z-score
zPLength <- scale(iris\$Petal.Length, center = T, scale = T); # again; ?scale
# new dSet w. standardized variables
dSet <- data.frame(Sepal.Length <- zSLength,
Petal.Length <- zPLength);
# Linear Regression w. lm() over standardized variables
reg1 <- lm(Petal.Length ~ Sepal.Length, data=dSet)
summary(reg1)
# compare
coefficients(reg1) # beta from reg1
lm.beta(reg) # standardized beta w. QuantPscy lm.beta from reg

# plots w. {base} and {ggplot2}
library(ggplot2)

# Predictor vs Criterion {base}
plot(iris\$Sepal.Length, iris\$Petal.Length,
main = "Petal Length vs Sepal Length",
xlab = "Sepal Length",
ylab = "Petal Length"
)
abline(reg,col="red")```

```# Predictor vs Criterion {ggplot2}
ggplot(data = iris,
aes(x = Sepal.Length, y = Petal.Length)) +
geom_point(size = 2, colour = "black") +
geom_point(size = 1, colour = "white") +
geom_smooth(aes(colour = "red"),
method='lm') +
ggtitle("Sepal Length vs Petal Length") +
xlab("Sepal Length") + ylab("Petal Length") +
theme(legend.position = "none")```

# check normality of residuals

```resStReg <- rstandard(reg) # get *standardized* residuals from reg
qqnorm(resStReg)
qqline(resStReg, col="red")```

```# Predicted vs. residuals {ggplot2}
predReg <- predict(reg) # get predictions from reg
resReg <- residuals(reg) # get residuals from reg
# resStReg <- rstandard(reg) # get residuals from reg
plotFrame <- data.frame(predicted = predReg,
residual = resReg);
ggplot(data = plotFrame,
aes(x = predicted, y = residual)) +
geom_point(size = 2, colour = "black") +
geom_point(size = 1, colour = "white") +
geom_smooth(aes(colour = "blue"),
method='lm',
se=F) +
ggtitle("Predicted vs Residual Lengths") +
xlab("Predicted Lengths") + ylab("Residual") +
theme(legend.position = "none")```

```## Detect influential cases
infReg <- as.data.frame(influence.measures(reg)\$infmat)

# Cook's Distance: Cook and Weisberg (1982):
# values greater than 1 are troublesome
wCook <- which(infReg\$cook.d>1) # we're fine here

# Average Leverage = (k+1)/n, k - num. of predictors, n - num. observations
# Also termed: hat values, range: 0 - 1
# see: https://en.wikipedia.org/wiki/Leverage_%28statistics%29
# Various criteria (twice the leverage, three times the average...)
wLev <- which(infReg\$hat>2*(2/length(iris\$price))) # we seem to be fine here too...

## Influence plot
infReg <- as.data.frame(influence.measures(reg)\$infmat)
plotFrame <- data.frame(residual = resStReg,
leverage = infReg\$hat,
cookD = infReg\$cook.d)
ggplot(plotFrame,
aes(y = residual,
x = leverage)) +
geom_point(size = plotFrame\$cookD*100, shape = 1) +
ggtitle("Influence Plot\nSize of the circle corresponds to Cook's distance") +
theme(plot.title = element_text(size=8, face="bold")) +
ylab("Standardized Residual") + xlab("Leverage")```

```# {broom} tools to access the linear regression model
library(broom)
tidy(reg)
glance(reg)
augment(reg)
# deviance is another measure of goodness-of-fit - used w. MLE
# more on deviance in Session 8 on Logistic Regression```