ggplot2: Cheatsheet for Visualizing Distributions

February 18, 2014
By

(This article was first published on R for Public Health, and kindly contributed to R-bloggers)

In the third and last of the ggplot series, this post will go over interesting ways to visualize the distribution of your data. I will make up some data, and make sure to set the seed.
library(ggplot2)
library(gridExtra)
set.seed(10005)

xvar <- c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5))
yvar <- c(rnorm(1500, mean = 1), rnorm(1500, mean = 1.5))
zvar <- as.factor(c(rep(1, 1500), rep(2, 1500)))
xy <- data.frame(xvar, yvar, zvar)

>> Histograms

I've already done a post on histograms using base R, so I won't spend too much time on them. Here are the basics of doing them in ggplot. More on all options for histograms here. The R cookbook has a nice page about it too: http://www.cookbook-r.com/Graphs/Plotting_distributions_(ggplot2)/ Also, I found this really great aggregation of all of the possible geom layers and options you can add to a plot. In general the site is a great reference for all things ggplot.
#counts on y-axis
g1<-ggplot(xy, aes(xvar)) + geom_histogram()                                      #horribly ugly default
g2<-ggplot(xy, aes(xvar)) + geom_histogram(binwidth=1)                            #change binwidth
g3<-ggplot(xy, aes(xvar)) + geom_histogram(fill=NA, color="black") + theme_bw()   #nicer looking

#density on y-axis
g4<-ggplot(xy, aes(x=xvar)) + geom_histogram(aes(y = ..density..), color="black", fill=NA) + theme_bw()

grid.arrange(g1, g2, g3, g4, nrow=1)

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this. stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to
## adjust this. stat_bin: binwidth defaulted to range/30. Use 'binwidth = x'
## to adjust this.
plot of chunk unnamed-chunk-2 Notice the warnings about the default binwidth that always is reported unless you specify it yourself. I will remove the warnings from all plots that follow to conserve space.

>> Density plots

We can do basic density plots as well. Note that the default for the smoothing kernel is gaussian, and you can change it to a number of different options, including kernel=“epanechnikov” and kernel=“rectangular” or whatever you want. You can find all of those options here.
#basic density
p1<-ggplot(xy, aes(xvar)) + geom_density()

#histogram with density line overlaid
p2<-ggplot(xy, aes(x=xvar)) + 
  geom_histogram(aes(y = ..density..), color="black", fill=NA) +
  geom_density(color="blue")

#split and color by third variable, alpha fades the color a bit
p3<-ggplot(xy, aes(xvar, fill = zvar)) + geom_density(alpha = 0.2)

grid.arrange(p1, p2, p3, nrow=1)
plot of chunk unnamed-chunk-3

>> Boxplots and more

We can also look at other ways to visualize our distributions. Boxplots are probably the most useful in order to describe the statistics of a distribution, but sometimes other visualizations are nice. I show a jitter plot and a volcano plot. More on boxplots here. Note that I removed the legend from each one because it is redundant.
#boxplot
b1<-ggplot(xy, aes(zvar, xvar)) + 
  geom_boxplot(aes(fill = zvar)) +
  theme(legend.position = "none")

#jitter plot
b2<-ggplot(xy, aes(zvar, xvar)) + 
  geom_jitter(alpha=I(1/4), aes(color=zvar)) +
  theme(legend.position = "none")

#volcano plot
b3<-ggplot(xy, aes(x = xvar)) +
  stat_density(aes(ymax = ..density..,  ymin = -..density..,
               fill = zvar, color = zvar),
               geom = "ribbon", position = "identity") +
  facet_grid(. ~ zvar) +
  coord_flip() +
  theme(legend.position = "none")

grid.arrange(b1, b2, b3, nrow=1)
plot of chunk unnamed-chunk-4

>> Putting multiple plots together

Finally, it's nice to put different plots together to get a real sense of the data. We can make a scatterplot of the data, and add marginal density plots to each side. Most of the code below I adapted from this StackOverflow page. One way to do this is to add distribution information to a scatterplot as a “rug plot”. It adds a little tick mark for every point in your data projected onto the axis.
#rug plot
ggplot(xy,aes(xvar,yvar))  + geom_point() + geom_rug(col="darkred",alpha=.1)
plot of chunk unnamed-chunk-5 Another way to do this is to add histograms or density plots or boxplots to the sides of a scatterplot. I followed the stackoverflow page, but let me know if you have suggestions on a better way to do this, especially without the use of the empty plot as a place-holder. I do the density plots by the zvar variable to highlight the differences in the two groups.
#placeholder plot - prints nothing at all
empty <- ggplot()+geom_point(aes(1,1), colour="white") +
     theme(                              
       plot.background = element_blank(), 
       panel.grid.major = element_blank(), 
       panel.grid.minor = element_blank(), 
       panel.border = element_blank(), 
       panel.background = element_blank(),
       axis.title.x = element_blank(),
       axis.title.y = element_blank(),
       axis.text.x = element_blank(),
       axis.text.y = element_blank(),
       axis.ticks = element_blank()
     )

#scatterplot of x and y variables
scatter <- ggplot(xy,aes(xvar, yvar)) + 
  geom_point(aes(color=zvar)) + 
  scale_color_manual(values = c("orange", "purple")) + 
  theme(legend.position=c(1,1),legend.justification=c(1,1)) 

#marginal density of x - plot on top
plot_top <- ggplot(xy, aes(xvar, fill=zvar)) + 
  geom_density(alpha=.5) + 
  scale_fill_manual(values = c("orange", "purple")) + 
  theme(legend.position = "none")

#marginal density of y - plot on the right
plot_right <- ggplot(xy, aes(yvar, fill=zvar)) + 
  geom_density(alpha=.5) + 
  coord_flip() + 
  scale_fill_manual(values = c("orange", "purple")) + 
  theme(legend.position = "none") 

#arrange the plots together, with appropriate height and width for each row and column
grid.arrange(plot_top, empty, scatter, plot_right, ncol=2, nrow=2, widths=c(4, 1), heights=c(1, 4))
plot of chunk unnamed-chunk-6 It's really nice that grid.arrange() clips the plots together so that the scales are automatically the same. You could get rid of the redundant axis labels by adding in theme(axis.title.x = element_blank()) in the density plot code. I think it comes out looking very nice, with not a ton of effort. You could also add linear regression lines and confidence intervals to the scatterplot. Check out my first ggplot2 cheatsheet for scatterplots if you need a refresher.

To leave a comment for the author, please follow the link and comment on his blog: R for Public Health.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.