Visually weighted regression in R (à la Solomon Hsiang)

August 30, 2012

(This article was first published on Nicebread » R, and kindly contributed to R-bloggers)

[Update 1: Sep 5, 2012: Explore the Magical Data Enhancer by IRES, using this visualization technique]

[Update 2: Sep 6, 2012: See new improved plots, and new R code!

Solomon Hsiang proposed an appealing method for visually displaying the uncertainty in regressions (see his blog [1][2], and also the discussions on the Statistical Modeling, Causal Inference, and Social Science Blog [1][2]).

I implemented the method in R (using ggplot2), and used an additional method of determining the shading (especially concerning Andrew Gelman’s comment that traditional statistical summaries (such as 95% intervals) give too much weight to the edges. In the following I will show how to produce plots like that:

I used following procedure:

  1. Compute smoothers from 1000 bootstrap samples of the original sample (this results in a spaghetti plot)
  2. Calculate a density estimate for each vertical cut through the bootstrapped smoothers. The area under the density curve always is 1, so the ink is constant for each y-slice.
  3. Shade the figure according to these density estimates.


Now let’s construct some plots!

The basic scatter plot:




No we show the bootstrapped smoothers (a “spaghetti plot”). Each spaghetti has a low alpha. That means that overlapping spaghettis produce a darker color and already give weight to highly populated regions.

Here is the shading according to the smoother’s density:


Now, we can overplot the median smoother estimate for each x value (the “median smoother”):

Or, a visually weighted smoother:



Finally, we can add the plain linear regression line (which obviously does not refelct the data points very well):


At the end of this post is the function that produces all of these plots. The function returns a ggplot object, so you can modify it afterwards, e.g.:

?View Code RSPLUS
vwReg(y~x, df, shade=FALSE, spag=TRUE) + xlab("Implicit power motive") + ylab("Corrugator activity during preparation")


Here are two plots with actual data I am working on:

The correlation of both variables is .22 (p = .003).

A) As a heat map (note: the vertical breaks at the left and right end occur due to single data points that get either sampled or not during the bootstrap):



B) As a spaghetti plot:




Finally, here’s the code (sometimes the code box is collapsed – click the arrow on the top right of the box to open it). Comments and additions are welcome.

?View Code RSPLUS
[Update: I removed the code, as an updated version has been published <a title="Visually weighted/ Watercolor Plots, new variants: Please vote!" href="">here</a> (see at the end of the post)]

To leave a comment for the author, please follow the link and comment on their blog: Nicebread » R. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Mango solutions

RStudio homepage

Zero Inflated Models and Generalized Linear Mixed Models with R

Dommino data lab

Quantide: statistical consulting and training



CRC R books series

Six Sigma Online Training

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)