My Commonly Done ggplot2 graphs: Part 2

Posted on December 18, 2014 by strictlystat in R bloggers | 0 Comments

[This article was first published on A HopStat and Jump Away » Rbloggers, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In my last post I described some of my commonly done ggplot2 graphs. It seems as though some people are interested in these, so I was going to follow this up with other plots I make frequently.

Scatterplot colored by continuous variable

The setup of the data for the scatterplots will be the same as the previous post, one x variable and one y variable.

library(ggplot2)
set.seed(20141106)
data = data.frame(x = rnorm(1000, mean=6), 
                  batch = factor(rbinom(1000, size=4, prob = 0.5)))
data$group1 = 1-rbeta(1000, 10, 2)
mat = model.matrix(~ batch, data=data)
mat = mat[, !colnames(mat) %in% "(Intercept)"]
betas = rbinom(ncol(mat), size=20, prob = 0.5)
data$quality = rowSums(t(t(mat) * sample(-2:2)))
data$dec.quality = cut(data$quality, 
                       breaks = unique(quantile(data$quality, 
                                         probs = seq(0, 1, by=0.1))),
                       include.lowest = TRUE)

batch.effect = t(t(mat) * betas)
batch.effect = rowSums(batch.effect)
data$y = data$x * 5 + rnorm(1000) + batch.effect  + 
  data$quality * rnorm(1000, sd = 2)

data$group2 = runif(1000)

I have added 2 important new variables, quality and batch. The motivation for these variables is akin to an RNAseq analysis set where you have a quality measure like read depth, and where the data were processed in different batches. The y variable is based both on the batch effect and the quality.

We construct the ggplot2 object for plotting x against y as follows:

g = ggplot(data, aes(x = x, y=y)) + geom_point()
print(g)

Coloring by a 3rd Variable (Discrete)

Let's plot the x and y data by the different batches:

print({ g + aes(colour=batch)})

We can see from this example how to color by another third discrete variable. In this example, we see that the relationship is different by each batch (each are shifted).

Coloring by a 3rd Variable (Continuous)

Let's color by quality, which is continuous:

print({ gcol = g + aes(colour=quality)})

The default option is to use black as a low value and blue to be a high value. I don't always want this option, as I cannot always see differences clearly. Let's change the gradient of low to high values using scale_colour_gradient:

print({ gcol + scale_colour_gradient(low = "red", high="blue") })

This isn't much better. Let's call the middle quality gray and see if we can see better separation:

print({ gcol_grad = gcol + 
          scale_colour_gradient2(low = "red", mid = "gray", high="blue") })

Scatterplot with Coloring by a 3rd Variable (Continuous broken into Discrete)

Another option is to break the quality into deciles (before plotting) and then coloring by these as a discrete variable:

print({ gcol_dec = g + aes(colour=dec.quality) })

Scatterplot with Coloring by 3rd Continuous Variable Faceted by a 4th Discrete Variable

We can combine these to show each batch in different facets and coloring by quality:

print({ gcol_grad + facet_wrap(~ batch )})

We can compound all these operations by passing transformations to scale_colour_gradient such as scale_colour_gradient(trans = "sqrt"). But enough with scatterplots.

Distributions! Lots of them.

One of the gaping holes in my last post was that I did not do any plots of distributions/densities of data. I ran the same code from the last post to get the longitudinal data set named dat.

Histograms

Let's say I want a distribution of my yij variable for each person across times:

library(plyr)
g = ggplot(data=dat, aes(x=yij, fill=factor(id))) +   guides(fill=FALSE)
ghist = g+ geom_histogram(binwidth = 3) 
print(ghist)

Hmm, that's not too informative. By default, the histograms stack on top of each other. We can change this by setting position to be identity:

ghist = g+ geom_histogram(binwidth = 3, position ="identity") 
print(ghist)

There are still too many histograms. Let's plot a subset.

Aside: Using the %+% operator

The %+% operator allows you to reset what dataset is in the ggplot2 object. The data must have the same components (e.g. variable names); I think this is most useful for plotting subsets of data.

nobs = 10
npick = 5

Let's plot the density of (5) people people with (10) or more observations both using geom_density and geom_line(stat = "density"). We will also change the binwidth:

tab = table(dat$id)
ids = names(tab)[tab >= nobs]
ids = sample(ids, npick)
sub = dat[ dat$id %in% ids, ]
ghist = g+ geom_histogram(binwidth = 5, position ="identity") 
ghist %+% sub

Overlaid Histograms with Alpha Blending

Let's alpha blend these histograms to see the differences:

ggroup = ggplot(data=sub, aes(x=yij, fill=factor(id))) + guides(fill=FALSE)
grouphist = ggroup+ geom_histogram(binwidth = 5, position ="identity", alpha = 0.33) 
grouphist

Similarly, we can plot over the 3 groups in our data:

ggroup = ggplot(data=dat, aes(x=yij, fill=factor(group))) + guides(fill=FALSE)
grouphist = ggroup+ geom_histogram(binwidth = 5, position ="identity", alpha = 0.33) 
grouphist

These histograms are something I commonly do when I want overlay the data in some way. More commonly though, espeically with MANY distributions, I plot densities.

Densities

We can again plot the distribution of (y_{ij}) for each person by using kernel density estimates, filled a different color for each person:

g = ggplot(data=dat, aes(x=yij, fill=factor(id))) +   guides(fill=FALSE)
print({gdens = g+ geom_density() })

As the filling overlaps a lot and blocks out other densities, we can use just different colors per person/id/group:

g = ggplot(data=dat, aes(x=yij, colour=factor(id))) +   guides(colour=FALSE)
print({gdens = g+ geom_density() })

I'm not a fan that the default for stat_density is that the geom = "area". This creates a line on the x-axis that closes the object. This is very important if you want to fill the density with different colors. Most times though, I want simply the line of the density with no bottom line. We can achieve this with:

print({gdens2 = g+ geom_line(stat = "density")})

What if we set the option to fill the lines now? Well lines don't take fill, so it will not colour each line differently.

gdens3 = ggplot(data=dat, aes(x=yij, fill=factor(id))) + geom_line(stat = "density") +  guides(colour=FALSE)
print({gdens3})

Now, regardless of the coloring, you can't really see the difference in people since there are so many. What if we want to do the plot with a subset of the data and the object is already constructed? Again, use the %+% operator.

Overlaid Densities with Alpha Blending

Let's take just different subsets of groups, not people, and plot the densities, with alpha blending:

print({ group_dens = ggroup+ geom_density(alpha = 0.3) })

That looks much better than the histogram example for groups. Now let's show these with lines:

print({group_dens2 = ggroup+ geom_line(stat = "density")})

What happened? Again, lines don't take fill, they take colour:

print({group_dens2 = ggroup+ geom_line(aes(colour=group), stat = "density")})

I'm a firm believer of legends begin IN plots, so let's push that in there and make it blend in:

print({
  group_dens3 =
  group_dens2 +  theme(legend.position = c(.75, .75),
        legend.background = element_rect(fill="transparent"),
        legend.key = element_rect(fill="transparent", 
                                  color="transparent"))
})

Lastly, I'll create a dataset of the means of the datasets and put vertical lines for the mean:

gmeans = ddply(dat, .(group), summarise,
              mean = mean(yij))
group_dens3 + geom_vline(data=gmeans, 
                         aes(colour = group, xintercept = mean))

Conclusion

Overall, this post should give you a few ways to play around with densities and such for plotting. All the same changes as the previous examples with scatterplots, such as facetting, can be used with these distribution plots. Many times, you want to look at the data in very different ways. Histograms can allow you to see outliers in some ways that densities do not because they smooth over the data. Either way, the mixture of alpha blending, coloring, and filling (though less useful for many distributions) can give you a nice description of what's going on a distributional level in your data.

PS: Boxplots

You can also do boxplots for each group, but these tend to separate well and colour relatively well using defaults, so I wil not discuss them here. My only note is that you can (and should) overlay points on the boxplot rather than just plot the histogram. You may need to jitter the points, alpha blend them, subsample the number of points, or clean it up a bit, but I try to display more DATA whenever possible.

To leave a comment for the author, please follow the link and comment on their blog: A HopStat and Jump Away » Rbloggers.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

My Commonly Done ggplot2 graphs: Part 2

Scatterplot colored by continuous variable

Coloring by a 3rd Variable (Discrete)

Coloring by a 3rd Variable (Continuous)

Scatterplot with Coloring by a 3rd Variable (Continuous broken into Discrete)

Scatterplot with Coloring by 3rd Continuous Variable Faceted by a 4th Discrete Variable

Distributions! Lots of them.

Histograms

Aside: Using the %+% operator

Overlaid Histograms with Alpha Blending

Densities

Overlaid Densities with Alpha Blending

Conclusion

PS: Boxplots

Related

Scatterplot colored by continuous variable

Coloring by a 3rd Variable (Discrete)

Coloring by a 3rd Variable (Continuous)

Scatterplot with Coloring by a 3rd Variable (Continuous broken into Discrete)

Scatterplot with Coloring by 3rd Continuous Variable Faceted by a 4th Discrete Variable

Distributions! Lots of them.

Histograms

Aside: Using the %+% operator

Overlaid Histograms with Alpha Blending

Densities

Overlaid Densities with Alpha Blending

Conclusion

PS: Boxplots

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)