Site icon R-bloggers

LR02: SD line, GoA, Regression

[This article was first published on intubate <||>, XBRL, bioPN, sbioPN, and stats with R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This posts continues the discussion of correlation started on LR01: Correlation. We will try to answer the following questions: Should correlation be used for any pair of data? Does association mean causation? What are ecological correlations? What happens with the scatter diagram when we change the standard deviations of x and y? What is the SD line? What is the Graph of averages? What is the Regression line. What is the Regression function?



What we know so far

[ r = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})} {\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}\sqrt{\sum_{i=1}^n (y_i-\bar{y})^2}} ]

library(UsingR)

## Pearson's data
## Father and son data
data(father.son)
x <- father.son$fheight
y <- father.son$sheight
## Summary statistics
(meanx <- mean(x))
## [1] 67.6871
sd(x)
## [1] 2.744868
(meany <- mean(y))
## [1] 68.68407
sd(y)
## [1] 2.814702
(r <- cor(x, y))
## [1] 0.5013383


Some situations where correlation should not be used


Association is not causation


Ecological correlations (misleading use of correlation)

eco <- read.csv(file = "../data/LR/eco_corre.csv", stringsAsFactors = FALSE)

par(mfrow = c(1,2))

## Left panel
r <- round(cor(eco$math, eco$verbal), 2)
plot(eco$math, eco$verbal, col = eco$group,
     xlim = c(400, 800), ylim = c(400, 800), main = paste0("r1= ", r))

## Right panel
byg <- with(eco, data.frame(math = tapply(math, group, mean),
                            verbal = tapply(verbal, group, mean)))
rg <- round(cor(byg$math, byg$verbal), 2)
plot(byg$math, byg$verbal, col = row.names(byg), pch = 19, cex = 2,
     xlim = c(400, 800), ylim = c(400, 800), main = paste0("r2= ", rg))

The same, using magrittr pipes, dplyr and intubate (See intubate <||> R stat functions in data science pipelines for an introduction to intubate):

library(magrittr)
library(dplyr)
library(intubate)

eco <- read.csv(file = "../data/LR/eco_corre.csv", stringsAsFactors = FALSE)

par(mfrow = c(1,2))

## Left panel
eco %>% ntbt(cor, math, verbal) %>% round(2) -> r
eco %>% 
  ntbt(plot, math, verbal, col = group, main = paste0("r1= ", r),
       xlim = c(400, 800), ylim = c(400, 800))

## Right panel
eco %>%
  group_by(group) %>%
  summarise(math = mean(math),
            verbal = mean(verbal)) -> byg
byg %>% ntbt(cor, math, verbal) %>% round(2) -> rg
byg %>% 
  ntbt(plot, math, verbal, col = group, main = paste0("r2= ", rg),
       xlim = c(400, 800), ylim = c(400, 800), pch = 19, cex = 2)

eco %>% 
  ntbt(plot, math, verbal, col = group,
       main = paste0("r1= ", r, "; r2= ", rg),
       xlim = c(400, 800), ylim = c(400, 800))
byg %>% 
  ntbt(points, math, verbal, col = group, pch = 19, cex = 2)



Changing SDs


Discussion on calculation of the standard deviation

For simplicity, FPP use, unless it was a very small sample, SD instead of sd. SD is obtained by dividing by $n$ instead of by $n – 1$ (R has the function sd for the latter, and no function for the former). Anyway, FPP clearly note (later for not confusing people with technicalities on a book with just a few formulas), that for small samples one should divide by $n – 1$.

To satisfy people who considers this an important issue, let us mention that the correct one is the one obtained by dividing by $n-1$ (because the associated sample variance is an unbiased estimator of the population variance, while the other is biased). On the other hand, if the points are the population (not a sample from it), the correct is the one dividing by $n$.

However, the difference is, on cases say with about 40-50 data points or more, negligible (if you are using the right kind of data, which is what you should be doing), and in the case of Pearson’s data (1,078) the terrible error committed by using SD instead of sd is:

## FPP version of sample standard deviation
SD <- function(x) sqrt(mean((x-mean(x))^2))

## "Error" of using SD instead of "correct" sd
SD(x) - sd(x)
## [1] -0.001273425
SD(y) - sd(y)
## [1] -0.001305823

Is that 1/1000 of an inch?

Perhaps, there are far more worse errors one can commit (using data unsuitable to the statistical procedure comes to mind, or the other way around, using an unsuitable statistical procedure for the data that, after all, should drive the analysis), and insisting too much in technicalities such as this can make statistics rely too much in mathematical niceties, and too little on substance (just in case, know that I do not expect that you agree with me in any kind of remark I make. In fact, many of my remarks are just a consequence of my lack of certainty about which should be the correct answer).

By the way, the maximum likelihood estimator of the variance also divides by $n$ instead of $n-1$, so it is biased (and used).

We will show both approaches, that lead mostly to the same final results, pointing out when they disagree.


The SD line

Finally, the SD line using $\text{SD}$ is:

[ y=\frac{\text{SD}_y}{\text{SD}_x} x+\bar{y}-\frac{\text{SD}_y}{\text{SD}_x}\bar{x} ]

where $\frac{\text{SD}_y}{\text{SD}_x}$ is the slope and $\bar{y}-\frac{\text{SD}_y}{\text{SD}_x}\bar{x}$ is the intercept.

Of course, the same line is obtained using $\text{sd}$ instead of $\text{SD}$ (why?):

[ y=\frac{\text{sd}_y}{\text{sd}_x} x+\bar{y}-\frac{\text{sd}_y}{\text{sd}_x}\bar{x} ]

where $\frac{\text{sd}_y}{\text{sd}_x}$ is the slope and $\bar{y}-\frac{\text{sd}_y}{\text{sd}_x}\bar{x}$ is the intercept.

## SD line using equation and FFP SD
abline(a = meany - SD(y)/SD(x)*meanx,
       b = SD(y)/SD(x), col = "blue", lwd = 4)

## SD line using equation and sd
abline(a = meany - sd(y)/sd(x)*meanx,
       b = sd(y)/sd(x), col = "blue", lwd = 4)


The graph of averages

## Graph of averages.
sgav <- with(father.son, tapply(sheight, round(fheight, 0), mean))
sgavnum <- with(father.son, tapply(sheight, round(fheight, 0), length))

points(as.numeric(names(sgav)), sgav, col = "red", pch = 16)
text(as.numeric(names(sgav)), sgav, sgavnum, pos = 3)


The Regression line and the Regression function

[ y = \frac{\text{SD}_y}{\text{SD}_x}x + \bar{y} – \frac{\text{SD}_y}{\text{SD}_x}\bar{x} ]

While you think we will put everything in one plot, showing the equation of the regression line:

abline(a = meany - r*sd(y)/sd(x)*meanx, b = r*sd(y)/sd(x),
       lwd = 4, col = "red")


Appendix

Complete code for ecological correlations

## Generating correlated Normal data
diffr <- function(n, rho, SDx = 1, SDy = 1) {
  meanx <- 3; meany <- 3
  
  x1 <- rnorm(n = n)
  x2 <- rnorm(n = n)
  x3 <- rho*x1 + sqrt(1-rho^2)*x2
  
  x <- meanx + SDx*x1
  y <- meany + SDy*x3
  
  r <- round(cor(x, y), 3)
  
  plot(x, y, xlim = c(0,6), ylim = c(0,6),
       xaxs = "i", yaxs = "i")
}
set.seed(1)
par(mai = c(.2, .2, .2, .2), mgp = c(1.5, 0.2, 0),
    tck = -.01, mfrow = c(1,2))
diffr(rho = 0.70, n = 50, SDx = 1.1, SDy = .3)
diffr(rho = 0.70, n = 50, SDx = 1, SDy = 1.3)

Complete code to produce the SD line plot

plot(x, y,
     xlim = c(58, 80), ylim = c(58, 80),
     xaxt = "n", yaxt = "n", xaxs = "i", yaxs = "i",
     main = "Pearson's data. SD line",
     xlab = "Father height in inches", ylab = "Son height in inches")
axp <- seq(58, 80, by = 2)
axis(1, at = axp, labels = axp)
axis(2, at = axp, labels = axp)

## Spread of the cloud
abline(v = meanx + c(-3, -2, -1, 1, 2, 3)*sd(x), col = "blue", lty = 3)
abline(h = meany + c(-3, -2, -1, 1, 2, 3)*sd(y), col = "blue", lty = 3)

## Point of averages (center of the cloud)
abline(v = meanx, col = "green")
abline(h = meany, col = "green")

## SD line using equation and FFP SD
abline(a = meany - SD(y)/SD(x)*meanx,
       b = SD(y)/SD(x), col = "blue", lwd = 4)

## SD line using equation and sd
abline(a = meany - sd(y)/sd(x)*meanx,
       b = sd(y)/sd(x), col = "blue", lwd = 4)

Complete code to produce points in relation to the SD line plot

plot(x, y,
     xlim = c(58, 80), ylim = c(58, 80), col = "lightgrey",
     xaxt = "n", yaxt = "n", xaxs = "i", yaxs = "i",
     main = "Pearson's data. SD line",
     xlab = "Father height in inches", ylab = "Son height in inches")
axp <- seq(58, 80, by = 2)
axis(1, at = axp, labels = axp)
axis(2, at = axp, labels = axp)

## Spread of the cloud
abline(v = meanx + c(-2, 2)*sd(x), col = "blue", lty = 3)
abline(h = meany + c(-2, 2)*sd(y), col = "blue", lty = 3)

## Point of averages (center of the cloud)
abline(v = meanx, col = "green")
abline(h = meany, col = "green")

## SD line using equation and sd
abline(a = meany - sd(y)/sd(x)*meanx,
       b = sd(y)/sd(x), col = "blue", lwd = 4)

fround <- 72
abline(v = fround + c(-.5,.5), lty = 3)
with(subset(father.son, round(fheight,0) == fround),
     points(fheight, sheight, pch = 16,
            col = ifelse(sheight > meany + (fheight - meanx)/sd(x)*sd(y),
                         "darkgreen", "blue")))

fround <- 64
abline(v = fround+c(-.5, .5), lty = 3)
with(subset(father.son, round(fheight,0) == fround),
     points(fheight, sheight, pch = 16,
            col = ifelse(sheight > meany + (fheight - meanx)/sd(x)*sd(y),
                         "darkgreen", "blue")))

Complete code to produce the Graph of Averages plot

plot(x, y,
     xlim = c(58, 80), ylim = c(58, 80), col = "lightgrey",
     xaxt = "n", yaxt = "n", xaxs = "i", yaxs = "i",
     main = "Pearson's data. GoA",
     xlab = "Father height in inches", ylab = "Son height in inches")
axp <- seq(58, 80, by = 2)
axis(1, at = axp, labels = axp)
axis(2, at = axp, labels = axp)

## Point of averages (center of the cloud)
abline(v = meanx, col = "green")
abline(h = meany, col = "green")

## Graph of averages.
sgav <- with(father.son, tapply(sheight, round(fheight, 0), mean))
sgavnum <- with(father.son, tapply(sheight, round(fheight, 0), length))

points(as.numeric(names(sgav)), sgav, col = "red", pch = 16)
text(as.numeric(names(sgav)), sgav, sgavnum, pos = 3)

Complete code to produce the Graph of Averages + SD line plot

plot(x, y,
     xlim = c(58, 80), ylim = c(58, 80), col = "lightgrey",
     xaxt = "n", yaxt = "n", xaxs = "i", yaxs = "i",
     main = "Pearson's data. SD line and GoA",
     xlab = "Father height in inches", ylab = "Son height in inches")
axp <- seq(58, 80, by = 2)
axis(1, at = axp, labels = axp)
axis(2, at = axp, labels = axp)

## Point of averages (center of the cloud)
abline(v=meanx, col="green")
abline(h=meany, col="green")

## SD line using equation and sd
abline(a = meany - sd(y)/sd(x)*meanx,
       b = sd(y)/sd(x), col = "blue", lwd = 4)

## Graph of averages.
sgav <- with(father.son, tapply(sheight, round(fheight, 0), mean))
points(as.numeric(names(sgav)), sgav, col = "red", pch = 16)

Complete code to produce the Regression line plot

plot(x, y,
     xlim = c(58, 80), ylim = c(58, 80), col = "lightgrey",
     xaxt = "n", yaxt = "n", xaxs = "i", yaxs = "i",
     main = "Pearson's data. Regression line and GoA",
     xlab = "Father height in inches", ylab = "Son height in inches")
axp <- seq(58, 80, by = 2)
axis(1, at = axp, labels = axp)
axis(2, at = axp, labels = axp)

## Point of averages (center of the cloud)
abline(v = meanx, col = "green")
abline(h = meany, col = "green")

## Graph of averages.
sgav <- with(father.son, tapply(sheight, round(fheight,0), mean))
points(as.numeric(names(sgav)), sgav, col = "red", pch = 16)

## Regression line
abline(a = meany - r*sd(y)/sd(x)*meanx, b = r*sd(y)/sd(x),
       lwd = 4, col = "red")

fround <- 72
abline(v = fround + c(-.5, .5), lty = 3)
with(subset(father.son, round(fheight, 0) == fround),
     points(fheight, sheight,  pch = 16,
            col = ifelse(sheight > meany + (fheight - meanx)/sd(x)*sd(y)*r,
                         "darkgreen", "blue")))

fround <- 64
abline(v = fround + c(-.5, .5), lty = 3)
with(subset(father.son, round(fheight, 0) == fround),
     points(fheight, sheight, pch = 16,
            col = ifelse(sheight > meany + (fheight - meanx)/sd(x)*sd(y)*r,
                         "darkgreen", "blue")))

Complete code to produce the plot with everything

plot(x, y,
     xlim = c(58, 80), ylim = c(58, 80), col = "lightgrey",
     xaxt = "n", yaxt = "n", xaxs = "i", yaxs = "i",
     main = "Pearson's data. SD line, regression line and GoA",
     xlab = "Father height in inches", ylab = "Son height in inches")
axp <- seq(58, 80, by = 2)
axis(1, at = axp, labels = axp)
axis(2, at = axp, labels = axp)

## Point of averages (center of the cloud)
abline(v = meanx, col = "green")
abline(h = meany, col = "green")

## SD line using equation and sd
abline(a = meany - sd(y)/sd(x)*meanx,
       b = sd(y)/sd(x), col = "blue", lwd = 4)

## Graph of averages.
sgav <- with(father.son, tapply(sheight, round(fheight, 0), mean))
points(as.numeric(names(sgav)), sgav, col = "red", pch = 16)

## Regression line
abline(a = meany - r*sd(y)/sd(x)*meanx, b = r*sd(y)/sd(x),
       lwd = 4, col = "red")

fround <- 72
abline(v = fround + c(-.5, .5), lty = 3)
with(subset(father.son, round(fheight, 0) == fround),
     points(fheight, sheight, pch = 16,
            col = ifelse(sheight > meany + (fheight - meanx)/sd(x)*sd(y)*r,
                         "darkgreen", "blue")))

fround <- 64
abline(v = fround + c(-.5, .5), lty = 3)
with(subset(father.son, round(fheight, 0) == fround),
     points(fheight, sheight, pch = 16,
            col = ifelse(sheight > meany + (fheight - meanx)/sd(x)*sd(y)*r,
                         "darkgreen", "blue")))

Previous

LR01: Correlation

Related

intubate <||> R stat functions in data science pipelines

To leave a comment for the author, please follow the link and comment on their blog: intubate <||>, XBRL, bioPN, sbioPN, and stats with R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.