Introduction to R for Data Science :: Session 7 [Multiple Linear Regression Model in R + Categorical Predictors, Partial and Part Correlation]

[This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers.]


Welcome to Introduction to R for Data Science Session 7: Multiple Regression + Dummy Coding, Partial and Part Correlations [Multiple Linear Regression in R. Dummy coding: various ways to do it in R. Factors. Inspecting the multiple regression model: regression coefficients and their interpretation, confidence intervals, predictions. Introducing {lattice} plots + ggplot2. Assumptions: multicollinearity and testing it with the {car} package. Predictive models with categorical and continuous predictors. Influence plot. Partial and part (semi-partial) correlation in R.]

The course is co-organized by Data Science Serbia and Startit. You will find all course material (R scripts, data sets, SlideShare presentations, readings) on these pages.


Check out the Course Overview to access the learning material presented thus far.

Data Science Serbia Course Pages [in Serbian]

Startit Course Pages [in Serbian]

Lecturers

Summary of Session 7, 09. June 2016 :: Multiple Regression + Dummy Coding, Partial and Part Correlations.

Multiple Regression + Dummy Coding, Partial and Part Correlations. Multiple Linear Regression in R. Dummy coding: various ways to do it in R. Factors. Inspecting the multiple regression model: regression coefficients and their interpretation, confidence intervals, predictions. Introducing {lattice} plots + ggplot2. Assumptions: multicollinearity and testing it with the {car} package. Predictive models with categorical and continuous predictors. Influence plot. Partial and part (semi-partial) correlation in R.

Intro to R for Data Science SlideShare :: Session 7


R script :: Session 7

########################################################
# Introduction to R for Data Science
# SESSION 7 :: 9 June, 2016
# Multiple Linear Regression in R
# Data Science Community Serbia + Startit
# :: Goran S. Milovanović and Branko Kovač ::
########################################################
 
# clear
rm(list=ls())
 
#### libraries
library(datasets)
library(broom)
library(ggplot2)
library(lattice)
library(QuantPsyc)
 
#### load
data(iris)
str(iris)
 
#### simple linear regression: Sepal Length vs Petal Length
# Predictor vs Criterion {ggplot2}
ggplot(data = iris,
       aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(size = 2, colour = "black") +
  geom_point(size = 1, colour = "white") +
  geom_smooth(method = 'lm', colour = "black") +
  ggtitle("Sepal Length vs Petal Length") +
  xlab("Sepal Length") + ylab("Petal Length") +
  theme(legend.position = "none")

# What is wrong here?
# let's see...
reg <- lm(Petal.Length ~ Sepal.Length, data=iris) 
summary(reg)
# Hm, everything seems fine to me...
 
# And now for something completely different (but in R)...
 
#### Problems with linear regression in iris
# Predictor vs Criterion {ggplot2} - group separation
ggplot(data = iris, 
       aes(x = Sepal.Length,
           y = Petal.Length,
           color = Species)) + 
  geom_point(size = 2) +
  ggtitle("Sepal Length vs Petal Length") +
  xlab("Sepal Length") + ylab("Petal Length")

# Predictor vs Criterion {ggplot2} - separate regression lines
ggplot(data = iris, 
       aes(x = Sepal.Length,
           y = Petal.Length,
           colour=Species)) + 
  geom_smooth(method=lm) + 
  geom_point(size = 2) +
  ggtitle("Sepal Length vs Petal Length") +
  xlab("Sepal Length") + ylab("Petal Length")

### Ooops...
### overview and considerations
plot(iris[,c(1,3,5)],
     main = "Inspect: Sepal vs. Petal Length \nfollowing the discovery of the Species...",
     cex.main = .75,
     cex = .6)

### better... {lattice}
xyplot(Petal.Length ~ Sepal.Length | Species, # {lattice} xyplot
       data = iris,
       xlab = "Sepal Length", ylab = "Petal Length"
       )

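# An equivalent conditioned scatterplot with ggplot2 facets (a quick sketch):
ggplot(data = iris,
       aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(size = 1.5) +
  facet_wrap(~ Species) +
  xlab("Sepal Length") + ylab("Petal Length")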
### Petal.Length and Sepal.Length: EDA and distributions
par(mfcol = c(2,2))
# boxplot Petal.Length
boxplot(iris$Petal.Length,
        horizontal = TRUE, 
        xlab="Petal Length")
# histogram: Petal.Length
hist(iris$Petal.Length, 
     main="", 
     xlab="Petal Length", 
     prob=T)
lines(density(iris$Petal.Length),
      lty="dashed", 
      lwd=2.5, 
      col="red")
# boxplot Sepal.Length
boxplot(iris$Sepal.Length,
        horizontal = TRUE, 
        xlab="Sepal Length")
# histogram: Sepal.Length
hist(iris$Sepal.Length, 
     main="", 
     xlab="Sepal Length", 
     prob=T)
lines(density(iris$Sepal.Length),
      lty="dashed", 
      lwd=2.5, 
      col="blue")

# Petal Length and Sepal Length: Conditional Densities
densityplot(~ Petal.Length | Species, # {lattice} densityplot
       data = iris,
       plot.points=FALSE,
       xlab = "Petal Length", ylab = "Density",
       main = "P(Petal Length|Species)",
       col.line = 'red'
)

densityplot(~ Sepal.Length | Species, # {lattice} densityplot
            data = iris,
            plot.points=FALSE,
            xlab = "Sepal Length", ylab = "Density",
            main = "P(Sepal Length|Species)",
            col.line = 'blue'
)

# Linear regression in subgroups
species <- unique(iris$Species)
w1 <- which(iris$Species == species[1]) # setosa
reg <- lm(Petal.Length ~ Sepal.Length, data=iris[w1,]) 
tidy(reg)
w2 <- which(iris$Species == species[2]) # versicolor
reg <- lm(Petal.Length ~ Sepal.Length, data=iris[w2,]) 
tidy(reg)
w3 <- which(iris$Species == species[3]) # virginica
reg <- lm(Petal.Length ~ Sepal.Length, data=iris[w3,]) 
tidy(reg)
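# A more compact way to fit the same per-species regressions (a sketch with split() + lapply()):
bySpecies <- lapply(split(iris, iris$Species),
                    function(d) tidy(lm(Petal.Length ~ Sepal.Length, data = d)))
bySpecies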
 
#### Dummy Coding: Species in the iris dataset
is.factor(iris$Species)
levels(iris$Species)
reg <- lm(Petal.Length ~ Species, data=iris) 
tidy(reg)
glance(reg)
# Never forget what the regression coefficient for a dummy variable means:
# It tells us the effect of moving from the baseline (reference) level to the respective level!
# Here: baseline = setosa (cf. levels(iris$Species) vs. the output of tidy(reg))
# NOTE: watch for the order of levels!
levels(iris$Species) # Levels: setosa versicolor virginica
iris$Species <- factor(iris$Species, 
                       levels = c("versicolor", 
                                  "virginica",
                                  "setosa"))
levels(iris$Species)
# baseline is now: versicolor
reg <- lm(Petal.Length ~ Species, data=iris) 
tidy(reg) # The regression coefficients (!): figure out what has happened!
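# Quick check of what the dummy coefficients mean: they are differences of group means
# (the baseline is versicolor at this point)
groupMeans <- tapply(iris$Petal.Length, iris$Species, mean)
groupMeans
groupMeans["virginica"] - groupMeans["versicolor"]  # = coefficient of the virginica dummy
groupMeans["setosa"] - groupMeans["versicolor"]     # = coefficient of the setosa dummy
# ... and the intercept is the mean Petal.Length of the baseline group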
 
### another way to do dummy coding
rm(iris); data(iris) # ...just to fix the order of Species back to default
levels(iris$Species)
contrasts(iris$Species) = contr.treatment(3, base = 1)
contrasts(iris$Species) # this is probably what you remember from your stats class...
iris$Species <- factor(iris$Species, 
                       levels = c ("virginica","versicolor","setosa"))
levels(iris$Species)
contrasts(iris$Species) = contr.treatment(3, base = 1)
# baseline is now: virginica
contrasts(iris$Species) # consider carefully what you need to do
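# To see exactly which dummy columns R will build from the factor, inspect the design matrix:
head(model.matrix(~ Species, data = iris))
# one column per non-baseline level; the baseline (virginica here) is absorbed into the intercept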
 
### Petal.Length ~ Species (Dummy Coding) + Sepal.Length 
rm(iris); data(iris) # ...just to fix the order of Species back to default
reg <- lm(Petal.Length ~ Species + Sepal.Length, data=iris)
# BTW: since is.factor(iris$Species)==T, R does the dummy coding in lm() for you
regSum <- summary(reg)
regSum$r.squared
regSum$coefficients
# compare w. Simple Linear Regression
reg <- lm(Petal.Length ~ Sepal.Length, data=iris) 
regSum <- summary(reg)
regSum$r.squared
regSum$coefficients
 
### Comparing nested models
reg1 <- lm(Petal.Length ~ Sepal.Length, data=iris)
reg2 <- lm(Petal.Length ~ Species + Sepal.Length, data=iris) # reg1 is nested under reg2
# terminology: reg2 is a "full model"
# this terminology will be used quite often in Logistic Regression
 
# NOTE: Nested models
# reg1 is nested under reg2 because there is a choice of coefficients for the
# full model (reg2) that reduces it to the nested model (reg1); in our case it is simple
# HOMEWORK: figure it out.
 
anova(reg1, reg2) # partial F-test; Species certainly has an effect beyond Sepal.Length
# NOTE: for partial F-test, see:
# http://pages.stern.nyu.edu/~gsimon/B902301Page/CLASS02_24FEB10/PartialFtest.pdf
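# The partial F-test by hand (a sketch; should reproduce the anova() output above):
rss1 <- sum(residuals(reg1)^2)  # residual sum of squares, nested model
rss2 <- sum(residuals(reg2)^2)  # residual sum of squares, full model
df1 <- reg1$df.residual
df2 <- reg2$df.residual
Fstat <- ((rss1 - rss2)/(df1 - df2)) / (rss2/df2)
Fstat
pf(Fstat, df1 - df2, df2, lower.tail = FALSE)  # p-value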
 
# Influence Plot
regFrame <- augment(reg2)
## Influence plot
# influence data (for the full model, reg2)
infReg <- as.data.frame(influence.measures(reg2)$infmat)
# data.frame for ggplot2
plotFrame <- data.frame(residual = regFrame$.std.resid,
                        leverage = regFrame$.hat,
                        cookD = regFrame$.cooksd)
 
ggplot(plotFrame,
       aes(y = residual,
           x = leverage)) +
  geom_point(size = plotFrame$cookD*100, shape = 1) +
  ggtitle("Influence Plot\nSize of the circle corresponds to Cook's distance") +
  theme(plot.title = element_text(size=8, face="bold")) +
  ylab("Standardized Residual") + xlab("Leverage")

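# {car} also ships a ready-made influence plot (circles again scaled by Cook's distance);
# a one-line alternative to the ggplot2 construction above:
library(car)
influencePlot(reg2)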
#### Multiple Regression - by the book
# Following: http://www.r-tutor.com/elementary-statistics/multiple-linear-regression
# (that's from your reading list, to remind you...)
data(stackloss)
str(stackloss)
# Data set description
# URL: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/stackloss.html
# Air Flow represents the rate of operation of the plant. 
# Water Temp is the temperature of cooling water circulated through coils in the absorption tower. 
# Acid Conc. is the concentration of the acid circulating.
# stack.loss (the dependent variable) is 10 times the percentage of the ingoing ammonia to 
# the plant that escapes from the absorption column unabsorbed;
# that is, an (inverse) measure of the over-all efficiency of the plant.
stacklossModel = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., 
                    data=stackloss)
 
# let's see:
summary(stacklossModel)
glance(stacklossModel) # {broom}
tidy(stacklossModel) # {broom}
 
# predict new data
obs = data.frame(Air.Flow=72, Water.Temp=20, Acid.Conc.=85)
predict(stacklossModel, obs)
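# Interval estimates for that prediction (not to be confused with the coefficient CIs below):
predict(stacklossModel, obs, interval = "confidence")  # CI for the mean response
predict(stacklossModel, obs, interval = "prediction")  # wider interval for a single new case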
 
# confidence intervals
confint(stacklossModel, level=.95) # 95% CI
confint(stacklossModel, level=.99) # 99% CI
# 95% CI for Acid.Conc. only
confint(stacklossModel, "Acid.Conc.", level=.95)
 
# default regression plots in R
plot(stacklossModel)

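# To keep all four default diagnostic plots on a single device (a small convenience):
par(mfrow = c(2, 2))
plot(stacklossModel)
par(mfrow = c(1, 1))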
# multicollinearity
library(car) # John Fox's car package
VIF <- vif(stacklossModel)
VIF
sqrt(VIF)
# Variance Inflation Factor (VIF)
# The increase in the ***variance*** of a regression coefficient due to collinearity
# NOTE: sqrt(VIF) = how much larger the ***SE*** of a reg.coeff. vs. what it would be
# if there were no correlations with the other predictors in the model
# NOTE: lower_bound(VIF) = 1; no upper bound; VIF > 2 --> (Concerned == TRUE)
Tolerance <- 1/VIF # obviously, tolerance and VIF are redundant
Tolerance
# NOTE: you can inspect multicollinearity in the multiple regression model
# by conducting a Principal Component Analysis over the predictors;
# when the time is right.
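# A minimal sketch of that idea: PCA over the standardized predictors
predPCA <- prcomp(stackloss[, c("Air.Flow", "Water.Temp", "Acid.Conc.")], scale. = TRUE)
summary(predPCA)  # one dominant component would signal strongly correlated predictors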
 
#### R for partial and part (semi-partial) correlations
library(ppcor) # a good one; there are many ways to do this in R
 
#### partial correlation in R
dataSet <- iris
str(dataSet)
dataSet$Species <- NULL
irisPCor <- pcor(dataSet, method="pearson")
irisPCor$estimate # partial correlations
irisPCor$p.value # results of significance tests
irisPCor$statistic # t-test on n-2-k degrees of freedom; k = number of conditioning variables
# partial correlation between x and y while controlling for z
partialCor <- pcor.test(dataSet$Sepal.Length, dataSet$Petal.Length,
                     dataSet$Sepal.Width,
                     method = "pearson")
partialCor$estimate
partialCor$p.value
partialCor$statistic
 
# NOTE:
# Formally, the partial correlation between X and Y given a set of n 
# controlling variables Z = {Z1, Z2, ..., Zn}, written ρXY·Z, is the 
# correlation between the residuals RX and RY resulting from the 
# linear regression of X with Z and of Y with Z, respectively. 
# The first-order partial correlation (i.e. when n=1) is the difference 
# between a correlation and the product of the removable correlations 
# divided by the product of the coefficients of alienation of the 
# removable correlations. [source: https://en.wikipedia.org/wiki/Partial_correlation]
# NOTE: coefficient of alienation = sqrt(1 - r^2); 1 - r^2 itself is the
# proportion of variance "unaccounted for" (the coefficient of non-determination)
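# A quick check of the definition above: the partial correlation of Sepal.Length and
# Petal.Length given Sepal.Width equals the correlation of the two sets of residuals
resSL <- residuals(lm(Sepal.Length ~ Sepal.Width, data = dataSet))
resPL <- residuals(lm(Petal.Length ~ Sepal.Width, data = dataSet))
cor(resSL, resPL)  # compare with partialCor$estimate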
 
#### semi-partial correlation in R
# NOTE: ... Semi-partial correlation is the correlation of two variables 
# with variation from a third or more other variables removed only 
# from the ***second variable***
# NOTE: The first variable <- rows, the second variable <- columns
# cf. ppcor: An R Package for a Fast Calculation to Semi-partial Correlation Coefficients (2015)
# Seongho Kim, Biostatistics Core, Karmanos Cancer Institute, Wayne State University
# URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4681537/
irisSPCor <- spcor(dataSet, method = "pearson")
irisSPCor$estimate
irisSPCor$p.value
irisSPCor$statistic
 
# NOTE: ... Semi-partial correlation is the correlation of two variables 
# with variation from a third or more other variables removed only 
# from the ***second variable***
partCor <- spcor.test(dataSet$Sepal.Length, dataSet$Petal.Length,
                       dataSet$Sepal.Width,
                       method = "pearson")
# NOTE: this is a correlation of dataSet$Sepal.Length w. dataSet$Petal.Length
# when the variance of dataSet$Petal.Length (2nd variable) due to dataSet$Sepal.Width
# is removed!
partCor$estimate
partCor$p.value
partCor$statistic
 
# NOTE: In multiple regression, this is the semi-partial (or part) correlation
# that you need to inspect:
# assume a model with X1, X2, X3 as predictors, and Y as a criterion
# You need the semi-partial correlation of Y and X1 following the removal of X2 and X3 from X1
# It goes like this: in Step 1, you regress X1 on the other predictors: X1 ~ X2 + X3;
# in Step 2, you take the residuals of X1, call them RX1; in Step 3, you correlate
# Y with RX1: the correlation coefficient that you get from Step 3 is the part correlation
# that you're looking for; its square is the increase in R^2 due to adding X1 to Y ~ X2 + X3.
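# A minimal sketch of those three steps, using the stackloss model fitted above:
# part correlation of stack.loss with Air.Flow, removing Water.Temp and Acid.Conc. from Air.Flow
resAF <- residuals(lm(Air.Flow ~ Water.Temp + Acid.Conc., data = stackloss))
partAF <- cor(stackloss$stack.loss, resAF)
partAF
partAF^2  # = increase in R^2 when Air.Flow is added to stack.loss ~ Water.Temp + Acid.Conc.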
 
# Give a thought to the following discussion on categorical predictors:
# http://stats.stackexchange.com/questions/133203/partial-correlation-and-multiple-regression-controlling-for-categorical-variable
# What's your take on this?


Readings :: Session 8: Logistic Regression [16. June, 2016, @Startit.rs, 19h CET]
