Boxplots or raw data graphs?

October 14, 2010

(This article was first published on Social data blog, and kindly contributed to R-bloggers)

We recently faced a dilemma over the graph design for an OSI publication. There will be dozens of these graphs, each showing the mean score on a given variable for nearly 11,000 parents from 10 countries. This example is for household wealth, which takes values from 0 to 16. These are the three alternative designs we considered, all constructed with the wonderful ggplot2.

My personal favourite is the first, as each of the nearly 11,000 people in the database is represented by a dot. No information is lost. The means are shown by larger dots.
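A minimal sketch of how a raw-data graph like the first design could be built in ggplot2. The original code is not shown in this post, so the data here are simulated and the names `parents`, `country` and `wealth` are assumptions for illustration only. The `fun` argument of `stat_summary()` is the current spelling; older ggplot2 releases used `fun.y`.

```r
library(ggplot2)

set.seed(42)
# Simulated stand-in for the survey data (hypothetical names and values)
parents <- data.frame(
  country = rep(paste("Country", LETTERS[1:10]), each = 300),
  wealth  = sample(0:16, 3000, replace = TRUE)
)

raw_plot <- ggplot(parents, aes(x = country, y = wealth)) +
  # one small, semi-transparent dot per parent; jitter separates overlapping integer scores
  geom_jitter(width = 0.3, height = 0.3, size = 0.6, alpha = 0.3) +
  # one larger dot per country, placed at the mean
  stat_summary(fun = mean, geom = "point", size = 4, colour = "red")
```

Calling `print(raw_plot)` draws the graph. The jitter and transparency settings are a matter of taste and would need tuning for the real data.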

The second option was preferred by many because it looks more familiar. However, I had to disallow it: although the boxes look like boxplots, the centre line is actually the mean and the height of the box is two standard deviations, whereas in a true boxplot the centre line is the median and the box spans the interquartile range.
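To make the difference concrete, here is a rough sketch contrasting a mean-and-SD box with a true boxplot. The helper `mean_sd()` and the simulated `parents` data are assumptions for illustration, not the code used for the publication. The box runs from one SD below the mean to one SD above it, so its height is two SDs.

```r
library(ggplot2)

# Mean and mean +/- 1 SD, in the y/ymin/ymax form that
# stat_summary(fun.data = ...) expects
mean_sd <- function(y) {
  m <- mean(y)
  s <- sd(y)
  data.frame(y = m, ymin = m - s, ymax = m + s)
}

set.seed(42)
# Simulated stand-in for the survey data (hypothetical names and values)
parents <- data.frame(
  country = rep(paste("Country", LETTERS[1:10]), each = 300),
  wealth  = sample(0:16, 3000, replace = TRUE)
)

# Looks like a boxplot, but shows the mean and a box two SDs tall
sd_plot <- ggplot(parents, aes(x = country, y = wealth)) +
  stat_summary(fun.data = mean_sd, geom = "crossbar", width = 0.6)

# A true boxplot: median line, box from the first to the third quartile
box_plot <- ggplot(parents, aes(x = country, y = wealth)) +
  geom_boxplot()

# On skewed scores the two sets of summaries diverge
skewed <- c(0, 0, 1, 1, 2, 2, 3, 16)
mean_sd(skewed)
quantile(skewed, c(0.25, 0.5, 0.75))
```

With skewed scores the mean sits well above the median, which is why a reader who takes the first kind of box for a boxplot would be misled.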

So we settled on the third option, though I had to tinker a bit with the code because some of the standard-deviation ranges actually extend beyond the range of the y-axis, which is the kind of problem you wouldn't have with the first option.
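The tinkering itself is not described in the post. One possible approach, sketched here under that caveat, is to clamp the ends of the SD range to the limits of the scale before plotting. The helper `mean_sd_clamped()`, the simulated data and the choice of a pointrange geom are all assumptions for illustration.

```r
library(ggplot2)

wealth_range <- c(0, 16)  # range of the wealth score given in the text

# Mean +/- 1 SD, with both ends clamped to the range of the scale
mean_sd_clamped <- function(y, limits = wealth_range) {
  m <- mean(y)
  s <- sd(y)
  data.frame(
    y    = m,
    ymin = max(m - s, limits[1]),
    ymax = min(m + s, limits[2])
  )
}

set.seed(42)
# Simulated stand-in for the survey data (hypothetical names and values)
parents <- data.frame(
  country = rep(paste("Country", LETTERS[1:10]), each = 300),
  wealth  = sample(0:16, 3000, replace = TRUE)
)

clamped_plot <- ggplot(parents, aes(x = country, y = wealth)) +
  stat_summary(fun.data = mean_sd_clamped, geom = "pointrange") +
  # zoom to the score range without discarding any data
  coord_cartesian(ylim = wealth_range)
```

Clamping keeps every range inside the plotting area, at the cost of understating the spread for the countries affected, so it is worth flagging in the caption.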

See the full gallery on posterous

