Visually weighted regression in R (à la Solomon Hsiang)

[This article was first published on Nicebread » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

[Update 1: Sep 5, 2012: Explore the Magical Data Enhancer by IRES, using this visualization technique]

[Update 2: Sep 6, 2012: See new improved plots, and new R code!

Solomon Hsiang proposed an appealing method for visually displaying the uncertainty in regressions (see his blog [1][2], and also the discussions on the Statistical Modeling, Causal Inference, and Social Science Blog [1][2]).

I implemented the method in R (using ggplot2), and used an additional method of determining the shading (especially concerning Andrew Gelman’s comment that traditional statistical summaries (such as 95% intervals) give too much weight to the edges. In the following I will show how to produce plots like that:

I used following procedure:

  1. Compute smoothers from 1000 bootstrap samples of the original sample (this results in a spaghetti plot)
  2. Calculate a density estimate for each vertical cut through the bootstrapped smoothers. The area under the density curve always is 1, so the ink is constant for each y-slice.
  3. Shade the figure according to these density estimates.

 

Now let’s construct some plots!

The basic scatter plot:

 

 

 

No we show the bootstrapped smoothers (a “spaghetti plot”). Each spaghetti has a low alpha. That means that overlapping spaghettis produce a darker color and already give weight to highly populated regions.

Here is the shading according to the smoother’s density:

 

Now, we can overplot the median smoother estimate for each x value (the “median smoother”):

Or, a visually weighted smoother:

 

 

Finally, we can add the plain linear regression line (which obviously does not refelct the data points very well):

 

At the end of this post is the function that produces all of these plots. The function returns a ggplot object, so you can modify it afterwards, e.g.:

?View Code RSPLUS

vwReg(y~x, df, shade=FALSE, spag=TRUE) + xlab("Implicit power motive") + ylab("Corrugator activity during preparation")

 

Here are two plots with actual data I am working on:

The correlation of both variables is .22 (p = .003).

A) As a heat map (note: the vertical breaks at the left and right end occur due to single data points that get either sampled or not during the bootstrap):

 

 

B) As a spaghetti plot:

 

 

 

Finally, here’s the code (sometimes the code box is collapsed – click the arrow on the top right of the box to open it). Comments and additions are welcome.

?View Code RSPLUS

[Update: I removed the code, as an updated version has been published here (see at the end of the post)]

To leave a comment for the author, please follow the link and comment on their blog: Nicebread » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)