(This article was first published on **R for Public Health**, and kindly contributed to R-bloggers)

In the third and last of the ggplot series, this post will go over interesting ways to visualize the distribution of your data. I will make up some data, and set the seed so that the results are reproducible.

```
library(ggplot2)
library(gridExtra)
set.seed(10005)
xvar <- c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5))
yvar <- c(rnorm(1500, mean = 1), rnorm(1500, mean = 1.5))
zvar <- as.factor(c(rep(1, 1500), rep(2, 1500)))
xy <- data.frame(xvar, yvar, zvar)
```

### >> Histograms

I’ve already done a post on histograms using base R, so I won’t spend too much time on them. Here are the basics of doing them in ggplot. More on all options for histograms here.

The R cookbook has a nice page about it too: **http://www.cookbook-r.com/Graphs/Plotting_distributions_(ggplot2)/**

Also, I found this really great aggregation of all of the possible geom layers and options you can add to a plot. In general the site is a great reference for all things ggplot.

```
#counts on y-axis
g1<-ggplot(xy, aes(xvar)) + geom_histogram() #horribly ugly default
g2<-ggplot(xy, aes(xvar)) + geom_histogram(binwidth=1) #change binwidth
g3<-ggplot(xy, aes(xvar)) + geom_histogram(fill=NA, color="black") + theme_bw() #nicer looking
#density on y-axis
g4<-ggplot(xy, aes(x=xvar)) +
  geom_histogram(aes(y = ..density..), color="black", fill=NA) +
  theme_bw()
grid.arrange(g1, g2, g3, g4, nrow=1)
```

```
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
```

Notice the warning about the default binwidth, which is reported for every histogram unless you specify the binwidth yourself. I will remove the warnings from all plots that follow to conserve space.
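One way to avoid the message is to choose the binwidth deliberately. The sketch below uses the Freedman-Diaconis rule, which is my choice for illustration and not something this post prescribes; `fd_binwidth` is a hypothetical helper name, and the data is regenerated so the block runs on its own.

```r
library(ggplot2)

# Same x variable as in the setup code above
set.seed(10005)
xy <- data.frame(xvar = c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5)))

# Hypothetical helper: Freedman-Diaconis binwidth, 2 * IQR / n^(1/3)
fd_binwidth <- function(x) {
  2 * IQR(x) / length(x)^(1/3)
}

bw <- fd_binwidth(xy$xvar)

# Passing binwidth explicitly means ggplot2 no longer falls back to its default
g_fd <- ggplot(xy, aes(xvar)) +
  geom_histogram(binwidth = bw, fill = NA, color = "black") +
  theme_bw()
g_fd
```

Any other rule of thumb would work the same way; the point is only that `binwidth` receives a number you picked on purpose.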

### >> Density plots

We can do basic density plots as well. Note that the default smoothing kernel is gaussian, and you can change it to a number of other options, including **kernel="epanechnikov"** and **kernel="rectangular"**. You can find all of those options here.

```
#basic density
p1<-ggplot(xy, aes(xvar)) + geom_density()
#histogram with density line overlaid
p2<-ggplot(xy, aes(x=xvar)) +
  geom_histogram(aes(y = ..density..), color="black", fill=NA) +
  geom_density(color="blue")
#split and color by third variable, alpha fades the color a bit
p3<-ggplot(xy, aes(xvar, fill = zvar)) + geom_density(alpha = 0.2)
grid.arrange(p1, p2, p3, nrow=1)
```
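To see what switching the kernel does, here is a minimal sketch that draws the same variable with three kernels. The object names (`k1`, `k2`, `k3`, `d`) and the `adjust = 0.5` value are illustrative choices of mine; `adjust` multiplies the default bandwidth, so values below 1 give a wigglier curve.

```r
library(ggplot2)
library(gridExtra)

# Same x variable as in the setup code above
set.seed(10005)
xy <- data.frame(xvar = c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5)))

# Same data, three kernels
k1 <- ggplot(xy, aes(xvar)) + geom_density(kernel = "gaussian")
k2 <- ggplot(xy, aes(xvar)) + geom_density(kernel = "epanechnikov")
k3 <- ggplot(xy, aes(xvar)) + geom_density(kernel = "rectangular", adjust = 0.5)
grid.arrange(k1, k2, k3, nrow = 1)

# Base R's density() accepts the same kernel names, which is handy
# if you want the estimated values themselves rather than a plot
d <- density(xy$xvar, kernel = "epanechnikov")
```

With 3000 points the gaussian and epanechnikov curves will likely look very similar; the bandwidth usually matters more than the kernel.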

### >> Boxplots and more

We can also look at other ways to visualize our distributions. Boxplots are probably the most useful for describing the summary statistics of a distribution, but sometimes other visualizations are nice. I show a jitter plot and a volcano plot (a mirrored density, similar to a violin plot). More on boxplots here. Note that I removed the legend from each one because it is redundant.

```
#boxplot
b1<-ggplot(xy, aes(zvar, xvar)) +
  geom_boxplot(aes(fill = zvar)) +
  theme(legend.position = "none")
#jitter plot
b2<-ggplot(xy, aes(zvar, xvar)) +
  geom_jitter(alpha=I(1/4), aes(color=zvar)) +
  theme(legend.position = "none")
#volcano plot
b3<-ggplot(xy, aes(x = xvar)) +
  stat_density(aes(ymax = ..density.., ymin = -..density..,
                   fill = zvar, color = zvar),
               geom = "ribbon", position = "identity") +
  facet_grid(. ~ zvar) +
  coord_flip() +
  theme(legend.position = "none")
grid.arrange(b1, b2, b3, nrow=1)
```
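Two small additions may help here, offered as a sketch and not as part of the original code. First, `fivenum()` from base R gives roughly the numbers a boxplot draws (its hinges can differ slightly from the quartiles ggplot2 computes). Second, ggplot2 ships `geom_violin()`, which produces a mirrored density per group without building the ribbon by hand. The names `five_num` and `v1` are mine.

```r
library(ggplot2)

# Same x variable and grouping factor as in the setup code above
set.seed(10005)
xy <- data.frame(
  xvar = c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5)),
  zvar = as.factor(c(rep(1, 1500), rep(2, 1500)))
)

# Five-number summary (min, lower hinge, median, upper hinge, max) per group
five_num <- tapply(xy$xvar, xy$zvar, fivenum)
five_num

# Mirrored density per group using the built-in violin geom
v1 <- ggplot(xy, aes(zvar, xvar)) +
  geom_violin(aes(fill = zvar)) +
  theme(legend.position = "none")
v1
```

The violin version keeps both groups on a shared axis, so no faceting or `coord_flip()` is needed.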

### >> Putting multiple plots together

Finally, it’s nice to put different plots together to get a real sense of the data. We can make a scatterplot of the data, and add marginal density plots to each side. Most of the code below I adapted from this StackOverflow page.

One way to do this is to add distribution information to a scatterplot as a “rug plot”. It adds a little tick mark for every point in your data projected onto the axis.

```
#rug plot
ggplot(xy,aes(xvar,yvar)) + geom_point() + geom_rug(col="darkred",alpha=.1)
```

Another way to do this is to add histograms, density plots, or boxplots to the sides of a scatterplot. I followed the StackOverflow page, but let me know if you have suggestions on a better way to do this, especially without the use of the empty plot as a placeholder.

I do the density plots by the zvar variable to highlight the differences in the two groups.

```
#placeholder plot - prints nothing at all
empty <- ggplot()+geom_point(aes(1,1), colour="white") +
  theme(
    plot.background = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank()
  )
#scatterplot of x and y variables
scatter <- ggplot(xy,aes(xvar, yvar)) +
  geom_point(aes(color=zvar)) +
  scale_color_manual(values = c("orange", "purple")) +
  theme(legend.position=c(1,1),legend.justification=c(1,1))
#marginal density of x - plot on top
plot_top <- ggplot(xy, aes(xvar, fill=zvar)) +
  geom_density(alpha=.5) +
  scale_fill_manual(values = c("orange", "purple")) +
  theme(legend.position = "none")
#marginal density of y - plot on the right
plot_right <- ggplot(xy, aes(yvar, fill=zvar)) +
  geom_density(alpha=.5) +
  coord_flip() +
  scale_fill_manual(values = c("orange", "purple")) +
  theme(legend.position = "none")
#arrange the plots together, with appropriate height and width for each row and column
grid.arrange(plot_top, empty, scatter, plot_right, ncol=2, nrow=2, widths=c(4, 1), heights=c(1, 4))
```
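One possible way to drop the placeholder plot, sketched here under the assumption that your gridExtra version supports the `layout_matrix` argument (version 2.0 or later, as far as I know): an `NA` in the matrix leaves that cell blank. The plots are rebuilt in simplified form so the block runs on its own, and `lay` and `combined` are names I made up.

```r
library(ggplot2)
library(gridExtra)

# Same data as in the setup code above
set.seed(10005)
xy <- data.frame(
  xvar = c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5)),
  yvar = c(rnorm(1500, mean = 1), rnorm(1500, mean = 1.5)),
  zvar = as.factor(c(rep(1, 1500), rep(2, 1500)))
)

# Simplified versions of the three plots (legends dropped for brevity)
scatter <- ggplot(xy, aes(xvar, yvar)) +
  geom_point(aes(color = zvar)) +
  theme(legend.position = "none")
plot_top <- ggplot(xy, aes(xvar, fill = zvar)) +
  geom_density(alpha = .5) +
  theme(legend.position = "none")
plot_right <- ggplot(xy, aes(yvar, fill = zvar)) +
  geom_density(alpha = .5) +
  coord_flip() +
  theme(legend.position = "none")

# Numbers index the plots in the order they are passed; NA leaves a cell blank
lay <- rbind(c(1, NA),
             c(2, 3))

# arrangeGrob() builds the layout without drawing it
combined <- arrangeGrob(plot_top, scatter, plot_right,
                        layout_matrix = lay,
                        widths = c(4, 1), heights = c(1, 4))

# Draw the combined layout on a fresh page
grid::grid.newpage()
grid::grid.draw(combined)
```

Calling `grid.arrange()` with the same arguments would draw the layout directly; `arrangeGrob()` is used here only so the result can be stored and reused.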

It’s really nice that grid.arrange() places the plots side by side; because the marginal plots use the same variables as the scatterplot, their scales match it closely. You could get rid of the redundant axis labels by adding **theme(axis.title.x = element_blank())** to the density plot code. I think it comes out looking very nice without a ton of effort. You could also add linear regression lines and confidence intervals to the scatterplot. Check out my first ggplot2 cheatsheet for scatterplots if you need a refresher.
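As a rough sketch of those two tweaks: the block below adds one linear fit per group with a confidence band to the scatterplot, and removes the x-axis title from the top density plot. Fitting per group (instead of one overall line) and the `alpha = 0.3` value are my own choices for illustration, and `scatter_lm` is a hypothetical name.

```r
library(ggplot2)

# Same data as in the setup code above
set.seed(10005)
xy <- data.frame(
  xvar = c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5)),
  yvar = c(rnorm(1500, mean = 1), rnorm(1500, mean = 1.5)),
  zvar = as.factor(c(rep(1, 1500), rep(2, 1500)))
)

# Scatterplot with a linear fit per group; se = TRUE draws the confidence band
scatter_lm <- ggplot(xy, aes(xvar, yvar, color = zvar)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  scale_color_manual(values = c("orange", "purple"))
scatter_lm

# Marginal density with the redundant x-axis title removed
plot_top <- ggplot(xy, aes(xvar, fill = zvar)) +
  geom_density(alpha = .5) +
  scale_fill_manual(values = c("orange", "purple")) +
  theme(legend.position = "none", axis.title.x = element_blank())
plot_top
```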
