My Commonly Done ggplot2 graphs: Part 2

[This article was first published on A HopStat and Jump Away » Rbloggers, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In my last post I described some of my commonly done ggplot2 graphs. It seems as though some people are interested in these, so I was going to follow this up with other plots I make frequently.

Scatterplot colored by continuous variable

The setup of the data for the scatterplots will be the same as the previous post, one x variable and one y variable.

library(ggplot2)
set.seed(20141106)
data = data.frame(x = rnorm(1000, mean=6), 
                  batch = factor(rbinom(1000, size=4, prob = 0.5)))
data$group1 = 1-rbeta(1000, 10, 2)
mat = model.matrix(~ batch, data=data)
mat = mat[, !colnames(mat) %in% "(Intercept)"]
betas = rbinom(ncol(mat), size=20, prob = 0.5)
data$quality = rowSums(t(t(mat) * sample(-2:2)))
data$dec.quality = cut(data$quality, 
                       breaks = unique(quantile(data$quality, 
                                         probs = seq(0, 1, by=0.1))),
                       include.lowest = TRUE)

batch.effect = t(t(mat) * betas)
batch.effect = rowSums(batch.effect)
data$y = data$x * 5 + rnorm(1000) + batch.effect  + 
  data$quality * rnorm(1000, sd = 2)

data$group2 = runif(1000)

I have added 2 important new variables, quality and batch. The motivation for these variables is akin to an RNAseq analysis set where you have a quality measure like read depth, and where the data were processed in different batches. The y variable is based both on the batch effect and the quality.

We construct the ggplot2 object for plotting x against y as follows:

g = ggplot(data, aes(x = x, y=y)) + geom_point()
print(g)

plot of chunk g_create

Coloring by a 3rd Variable (Discrete)

Let's plot the x and y data by the different batches:

print({ g + aes(colour=batch)})

plot of chunk gbatch

We can see from this example how to color by another third discrete variable. In this example, we see that the relationship is different by each batch (each are shifted).

Coloring by a 3rd Variable (Continuous)

Let's color by quality, which is continuous:

print({ gcol = g + aes(colour=quality)})

plot of chunk quality

The default option is to use black as a low value and blue to be a high value. I don't always want this option, as I cannot always see differences clearly. Let's change the gradient of low to high values using scale_colour_gradient:

print({ gcol + scale_colour_gradient(low = "red", high="blue") })

plot of chunk qual_col

This isn't much better. Let's call the middle quality gray and see if we can see better separation:

print({ gcol_grad = gcol + 
          scale_colour_gradient2(low = "red", mid = "gray", high="blue") })

plot of chunk qual_col_mid

Scatterplot with Coloring by a 3rd Variable (Continuous broken into Discrete)

Another option is to break the quality into deciles (before plotting) and then coloring by these as a discrete variable:

print({ gcol_dec = g + aes(colour=dec.quality) })

plot of chunk decqual

Scatterplot with Coloring by 3rd Continuous Variable Faceted by a 4th Discrete Variable

We can combine these to show each batch in different facets and coloring by quality:

print({ gcol_grad + facet_wrap(~ batch )})

plot of chunk decqual_batch

We can compound all these operations by passing transformations to scale_colour_gradient such as scale_colour_gradient(trans = "sqrt"). But enough with scatterplots.

Distributions! Lots of them.

One of the gaping holes in my last post was that I did not do any plots of distributions/densities of data. I ran the same code from the last post to get the longitudinal data set named dat.

Histograms

Let's say I want a distribution of my yij variable for each person across times:

library(plyr)
g = ggplot(data=dat, aes(x=yij, fill=factor(id))) +   guides(fill=FALSE)
ghist = g+ geom_histogram(binwidth = 3) 
print(ghist)

plot of chunk unnamed-chunk-1

Hmm, that's not too informative. By default, the histograms stack on top of each other. We can change this by setting position to be identity:

ghist = g+ geom_histogram(binwidth = 3, position ="identity") 
print(ghist)

plot of chunk ghist_ident

There are still too many histograms. Let's plot a subset.

Aside: Using the %+% operator

The %+% operator allows you to reset what dataset is in the ggplot2 object. The data must have the same components (e.g. variable names); I think this is most useful for plotting subsets of data.

nobs = 10
npick = 5

Let's plot the density of (5) people people with (10) or more observations both using geom_density and geom_line(stat = "density"). We will also change the binwidth:

tab = table(dat$id)
ids = names(tab)[tab >= nobs]
ids = sample(ids, npick)
sub = dat[ dat$id %in% ids, ]
ghist = g+ geom_histogram(binwidth = 5, position ="identity") 
ghist %+% sub

plot of chunk sub

Overlaid Histograms with Alpha Blending

Let's alpha blend these histograms to see the differences:

ggroup = ggplot(data=sub, aes(x=yij, fill=factor(id))) + guides(fill=FALSE)
grouphist = ggroup+ geom_histogram(binwidth = 5, position ="identity", alpha = 0.33) 
grouphist

plot of chunk unnamed-chunk-2

Similarly, we can plot over the 3 groups in our data:

ggroup = ggplot(data=dat, aes(x=yij, fill=factor(group))) + guides(fill=FALSE)
grouphist = ggroup+ geom_histogram(binwidth = 5, position ="identity", alpha = 0.33) 
grouphist

plot of chunk unnamed-chunk-3

These histograms are something I commonly do when I want overlay the data in some way. More commonly though, espeically with MANY distributions, I plot densities.

Densities

We can again plot the distribution of (y_{ij}) for each person by using kernel density estimates, filled a different color for each person:

g = ggplot(data=dat, aes(x=yij, fill=factor(id))) +   guides(fill=FALSE)
print({gdens = g+ geom_density() })

plot of chunk gdens

As the filling overlaps a lot and blocks out other densities, we can use just different colors per person/id/group:

g = ggplot(data=dat, aes(x=yij, colour=factor(id))) +   guides(colour=FALSE)
print({gdens = g+ geom_density() })

plot of chunk colourdens

I'm not a fan that the default for stat_density is that the geom = "area". This creates a line on the x-axis that closes the object. This is very important if you want to fill the density with different colors. Most times though, I want simply the line of the density with no bottom line. We can achieve this with:

print({gdens2 = g+ geom_line(stat = "density")})

plot of chunk gdens2

What if we set the option to fill the lines now? Well lines don't take fill, so it will not colour each line differently.

gdens3 = ggplot(data=dat, aes(x=yij, fill=factor(id))) + geom_line(stat = "density") +  guides(colour=FALSE)
print({gdens3})

plot of chunk gdens3

Now, regardless of the coloring, you can't really see the difference in people since there are so many. What if we want to do the plot with a subset of the data and the object is already constructed? Again, use the %+% operator.

Overlaid Densities with Alpha Blending

Let's take just different subsets of groups, not people, and plot the densities, with alpha blending:

print({ group_dens = ggroup+ geom_density(alpha = 0.3) })

plot of chunk group_dens

That looks much better than the histogram example for groups. Now let's show these with lines:

print({group_dens2 = ggroup+ geom_line(stat = "density")})

plot of chunk group_dens2

What happened? Again, lines don't take fill, they take colour:

print({group_dens2 = ggroup+ geom_line(aes(colour=group), stat = "density")})

plot of chunk group_dens3

I'm a firm believer of legends begin IN plots, so let's push that in there and make it blend in:

print({
  group_dens3 =
  group_dens2 +  theme(legend.position = c(.75, .75),
        legend.background = element_rect(fill="transparent"),
        legend.key = element_rect(fill="transparent", 
                                  color="transparent"))
})

plot of chunk group_dens_legend

Lastly, I'll create a dataset of the means of the datasets and put vertical lines for the mean:

gmeans = ddply(dat, .(group), summarise,
              mean = mean(yij))
group_dens3 + geom_vline(data=gmeans, 
                         aes(colour = group, xintercept = mean))

plot of chunk group_mean

Conclusion

Overall, this post should give you a few ways to play around with densities and such for plotting. All the same changes as the previous examples with scatterplots, such as facetting, can be used with these distribution plots. Many times, you want to look at the data in very different ways. Histograms can allow you to see outliers in some ways that densities do not because they smooth over the data. Either way, the mixture of alpha blending, coloring, and filling (though less useful for many distributions) can give you a nice description of what's going on a distributional level in your data.

PS: Boxplots

You can also do boxplots for each group, but these tend to separate well and colour relatively well using defaults, so I wil not discuss them here. My only note is that you can (and should) overlay points on the boxplot rather than just plot the histogram. You may need to jitter the points, alpha blend them, subsample the number of points, or clean it up a bit, but I try to display more DATA whenever possible.


To leave a comment for the author, please follow the link and comment on their blog: A HopStat and Jump Away » Rbloggers.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)