Welcome to Introduction to R for Data Science Session 6: Linear Regression + EDA, and Normality tests [Linear Regression in R: Exploratory Data Analysis, assumptions of the simple linear model, correlation, and visualization. Predictions from the linear model. Confidence Intervals and Residuals. Inspecting the basic linear model. Infulential cases and the Influence Plot.]
The course is co-organized by Data Science Serbia and Startit. You will find all course material (R scripts, data sets, SlideShare presentations, readings) on these pages.
Check out the Course Overview to acess the learning material presented thus far.
Data Science Serbia Course Pages [in Serbian]
Startit Course Pages [in Serbian]
Lecturers
- dipl. ing Branko Kovač, Data Analyst at CUBE, Data Science Mentor at Springboard, Data Science Serbia
- Goran S. Milovanović, Phd, [email protected], Data Science Serbia
Summary of Session 6, 02. June 2016 :: Linear Regression + EDA, and Normality tests.
Linear Regression + EDA and Normality tests. Linear Regression in R: Exploratory Data Analysis, assumptions of the simple linear model, correlation, and visualization. Predictions from the linear model. Confidence Intervals and Residuals. Inspecting the basic linear model. Infulential cases and the Influence Plot.
Intro to R for Data Science SlideShare :: Session 6
R script :: Session 6
######################################################## # Introduction to R for Data Science # SESSION 6 :: 2 June, 2016 # Simple Linear Regression in R # Data Science Community Serbia + Startit # :: Goran S. Milovanović and Branko Kovač :: ######################################################## # clear rm(list=ls()) #### read data library(datasets) data(iris) ### iris data set description: # https://stat.ethz.ch/R-manual/R-devel/library/iriss/html/iris.html ### Exploratory Data Analysis (EDA) str(iris) summary(iris) ### EDA plots # plot layout: 2 x 2 par(mfcol = c(2,2)) # boxplot iris$Sepal.Length boxplot(iris$Sepal.Length, horizontal = TRUE, xlab="Sepal Length") # histogram: iris$Sepal.Length hist(iris$Sepal.Length, main="", xlab="Sepal.Length", prob=T) # overlay iris$Sepal.Length density function over the empirical distribution lines(density(iris$Sepal.Length), lty="dashed", lwd=2.5, col="red") # boxplot iris$Petal.Length boxplot(iris$Petal.Length, horizontal = TRUE, xlab="Petal Length") # histogram: iris$Petal.Length, hist(iris$Petal.Length, main="", xlab="Petal Length", prob=T) # overlay iris$Petal.Length density function over the empirical distribution lines(density(iris$Petal.Length), lty="dashed", lwd=2.5, col="red")
# NOTE: Boxplot "fences" and outlier detection # Boxplot in R recognizes as outliers those data points that are found beyond OUTTER fences # Source: http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm # Q3 = 75 percentile, Q1 = 25 percentile # IQ = Q3 - Q1; Interquartile range # lower inner fence: Q1 - 1.5*IQ # upper inner fence: Q3 + 1.5*IQ # lower outer fence: Q1 - 3*IQ # upper outer fence: Q3 + 3*IQ # A point beyond an inner fence on either side is considered a mild outlier # A point beyond an outer fence is considered an extreme outlier # plot variable density in general: Sepal Width # plot layout par(mfcol = c(1,2)) # NOTE: this is kernel density estimation in R. You are not testing any distribution yet. PLengthDensity <- density(iris$Petal.Length) plot(PLengthDensity, main="Petal Length Probability Density", xlab = "Petal Length",cex.main = .7) polygon(PLengthDensity, col="red", border="black") ### Step 1: Pearson Correlation ## check whether the assumption of normality holds: Sepal Widht # NOTE: check the documentation of the ks.test() # URL: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/ks.test.html ksPLength <- ks.test(iris$Petal.Length, "pnorm", mean(iris$Petal.Length), sd(iris$Petal.Length), alternative = "two.sided", exact = NULL) ksPLength # According to KS, Petal.Length is not Gaussian # NOTE: If a single-sample test is used, the parameters specified in ... # must be pre-specified and not estimated from the data. {TRUE} # Thus what we're doing here is not acceptable. Remember how many times have you seen # the KS test applied in this manner? Imagine... # NOTE: Gaussian Density in R # All R distribution and density functions are similar # dnorm(x, mean = 0, sd = 1, log = FALSE) for density (PDF) # pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) for distribution function (CDF) # NOTE: lower.tail = FALSE --> P[X>x], survivor (ie. decumulative probability) function # qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) (quantiles) # rnorm(n, mean = 0, sd = 1) # random deviates # plot variable density in general: Sepal Length SLengthDensity <- density(iris$Sepal.Length) plot(SLengthDensity, main="Sepal Length Probability Density", xlab = "Sepal Length", cex.main = .7) polygon(SLengthDensity, col="cadetblue", border="black")
## check whether the assumption of normality holds: Sepal Length ksSLength <- ks.test(iris$Sepal.Length, "pnorm", mean(iris$Sepal.Length), sd(iris$Sepal.Length), alternative = "two.sided", exact = NULL) ksSLength # KS test says this is a normal distribution ## AGAIN: check whether the assumption of normality holds: Petal.Lenght # Shapiro-Wilk Test :: better for small samples + for Normal distribution only! # NOTE: The most powerful statistical test given for a given alpha and compared # to other tests similar in purpose! swPLength <- shapiro.test(iris$Petal.Length) swPLength # Null hypothesis: the data are normally distributed... swPLength$statistic swPLength$p # ... - while SW test says they are not normally distributed. ## Again: check whether the assumption of normality holds: Sepal Length swSLength <- shapiro.test(iris$Sepal.Length) swSLength # Null hypothesis: the data are normally distributed... swSLength$statistic swSLength$p # ... - and to SW test they do not seem to to be. # Take-home message: good luck with normality tests # more plots to check the normality assumption theoNormSLength <- qnorm(ppoints(iris$Sepal.Length), mean(iris$Sepal.Length), sd(iris$Sepal.Length)) # ?ppoints - produces evenly spaced probability points qqplot(theoNormSLength,iris$Sepal.Length, main = "Q-Q Plot: Sepal Length", xlab = "Theoretical Quantiles", ylab = "Empirical Quantiles", xlim = c(min(theoNormSLength),max(theoNormSLength)), ylim = c(min(theoNormSLength),max(theoNormSLength)), cex.main = .7 ) abline(0,1,col="red")
# ... and more plots to check the normality assumption theoNormPLength <- qnorm(ppoints(iris$Petal.Length), mean(iris$Petal.Length), sd(iris$Petal.Length)) qqplot(theoNormPLength,iris$Petal.Length, main = "Q-Q Plot: Petal Length", xlab = "Theoretical Quantiles", ylab = "Empirical Quantiles", xlim = c(min(theoNormPLength),max(theoNormPLength)), ylim = c(min(theoNormPLength),max(theoNormPLength)), cex.main = .7 ) abline(0,1,col="cadetblue")
## Pearson correlation in R {base} cor1 <- cor(iris$Sepal.Length, iris$Petal.Length, method="pearson") cor1 par(mfcol = c(1,1)) plot(iris$Sepal.Length, iris$Petal.Length, main = "Sepal Length vs Petal Length", xlab = "Sepal Length", ylab = "Petal Length") ## Correlation matrix and treatment of missing data dSet <- iris # Remove one discrete variable dSet$Species <- NULL # introduce NA in dSet$Sepal.Length[5] dSet$Sepal.Length[5] <- NA # Pairwise and Listwise Deletion: cor1a <- cor(dSet,use="complete.obs") # listwise deletion cor1a cor1a <- cor(dSet,use="pairwise.complete.obs") # pairwise deletion cor1a cor1a <- cor(dSet,use="all.obs") # all observations - error cor1a # A fact about R: cor{base} does not test for statistical significance. True. library(Hmisc) cor2 <- rcorr(iris$Sepal.Length, iris$Petal.Length, type="pearson") cor2$r # correlations cor2$r[1,2] # that's what you need, right cor2$P # significant at cor2$n # num. observations # NOTE: rcorr uses Pairwise deletion! # Correlation matrix cor2a <- rcorr(as.matrix(dSet), type="pearson") # NOTE: as.matrix # select significant at alpha == .05 w <- which(!(cor2a$P<.05),arr.ind = T) cor2a$r[w] <- NA cor2a$P # compare w. cor2a$r # But the assumption of normality seems to fail for one variable! # Use Spearman's rho (essentially, Pearson's R over ranks): cor2a <- rcorr(as.matrix(dSet), type="spearman") # NOTE: as.matrix cor2a # Linear Regression: lm() # Predicting: Petal Length from Sepal Length reg <- lm(Petal.Length ~ Sepal.Length, data=iris) class(reg) summary(reg) coefsReg <- coefficients(reg) coefsReg slopeReg <- coefsReg[2] interceptReg <- coefsReg[1] # Prediction from this model newSLength <- data.frame(Sepal.Length = runif(100, min(iris$Sepal.Length), max(iris$Sepal.Length)) ) # watch the variable names in the new data.frame! predictPLength <- predict(reg, newSLength) predictPLength # Results, more thoroughly regSum <- summary(reg) regSum # Coefficient of determination regSum$r.squared # Residuals regSum$residuals # Coefficients regSum$coefficients # F-statistic regSum$fstatistic # Confidence Intervals confint(reg, level=.95) # 95% CI confint(reg, level=.99) # 99% CI # 95% CI for slope only confint(reg, "price", level=.95) # Standardized regression coefficients {QuantPsych} library(QuantPsyc) lm.beta(reg) # Reminder: standardized regression coefficients are... # What you would obtain upon performing linear regression over standardized variables # z-score in R zSLength <- scale(iris$Sepal.Length, center = T, scale = T); # computes z-score zPLength <- scale(iris$Petal.Length, center = T, scale = T); # again; ?scale # new dSet w. standardized variables dSet <- data.frame(Sepal.Length <- zSLength, Petal.Length <- zPLength); # Linear Regression w. lm() over standardized variables reg1 <- lm(Petal.Length ~ Sepal.Length, data=dSet) summary(reg1) # compare coefficients(reg1)[2] # beta from reg1 lm.beta(reg) # standardized beta w. QuantPscy lm.beta from reg # plots w. {base} and {ggplot2} library(ggplot2) # Predictor vs Criterion {base} plot(iris$Sepal.Length, iris$Petal.Length, main = "Petal Length vs Sepal Length", xlab = "Sepal Length", ylab = "Petal Length" ) abline(reg,col="red")
# Predictor vs Criterion {ggplot2} ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length)) + geom_point(size = 2, colour = "black") + geom_point(size = 1, colour = "white") + geom_smooth(aes(colour = "red"), method='lm') + ggtitle("Sepal Length vs Petal Length") + xlab("Sepal Length") + ylab("Petal Length") + theme(legend.position = "none")
# check normality of residuals
resStReg <- rstandard(reg) # get *standardized* residuals from reg qqnorm(resStReg) qqline(resStReg, col="red")
# Predicted vs. residuals {ggplot2} predReg <- predict(reg) # get predictions from reg resReg <- residuals(reg) # get residuals from reg # resStReg <- rstandard(reg) # get residuals from reg plotFrame <- data.frame(predicted = predReg, residual = resReg); ggplot(data = plotFrame, aes(x = predicted, y = residual)) + geom_point(size = 2, colour = "black") + geom_point(size = 1, colour = "white") + geom_smooth(aes(colour = "blue"), method='lm', se=F) + ggtitle("Predicted vs Residual Lengths") + xlab("Predicted Lengths") + ylab("Residual") + theme(legend.position = "none")
## Detect influential cases infReg <- as.data.frame(influence.measures(reg)$infmat) # Cook's Distance: Cook and Weisberg (1982): # values greater than 1 are troublesome wCook <- which(infReg$cook.d>1) # we're fine here # Average Leverage = (k+1)/n, k - num. of predictors, n - num. observations # Also termed: hat values, range: 0 - 1 # see: https://en.wikipedia.org/wiki/Leverage_%28statistics%29 # Various criteria (twice the leverage, three times the average...) wLev <- which(infReg$hat>2*(2/length(iris$price))) # we seem to be fine here too... ## Influence plot infReg <- as.data.frame(influence.measures(reg)$infmat) plotFrame <- data.frame(residual = resStReg, leverage = infReg$hat, cookD = infReg$cook.d) ggplot(plotFrame, aes(y = residual, x = leverage)) + geom_point(size = plotFrame$cookD*100, shape = 1) + ggtitle("Influence Plot\nSize of the circle corresponds to Cook's distance") + theme(plot.title = element_text(size=8, face="bold")) + ylab("Standardized Residual") + xlab("Leverage")
# {broom} tools to access the linear regression model library(broom) tidy(reg) glance(reg) augment(reg) # deviance is another measure of goodness-of-fit - used w. MLE # more on deviance in Session 8 on Logistic Regression
Readings :: Session 7: Multiple Regression [9. June, 2016, @Startit.rs, 19h CET]
- R Tutorial, by Chi Yau: Chapter: Multiple Linear Regression
- More readings will be announced via the course mailing list.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...