Density Plot with ggplot

December 18, 2012

(This article was first published on Shifting sands, and kindly contributed to R-bloggers)

This is a follow-on from the post Using apply, sapply and lapply in R.

The dataset we are using was created like so:


m <- matrix(data=cbind(rnorm(30, 0), rnorm(30, 2), rnorm(30, 5)), nrow=30, ncol=3)

Three columns of 30 observations, normally distributed with means of 0, 2 and 5 (and rnorm's default standard deviation of 1). We want a density plot to compare the distributions of the three columns using ggplot.
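If you want the same numbers on every run, here is a minimal sketch of the same setup with a seed. The set.seed call and the seed value 42 are my additions, not part of the original code, so the values you get will not match the output printed in this post. It also shows that cbind alone already returns a 30 x 3 matrix, so the matrix() wrapper is optional.

```r
# set.seed() is an addition for reproducibility; the value 42 is arbitrary.
set.seed(42)

# cbind() on three vectors of length 30 already yields a 30 x 3 matrix.
m <- cbind(rnorm(30, mean = 0), rnorm(30, mean = 2), rnorm(30, mean = 5))

dim(m)       # number of rows and columns
colMeans(m)  # sample means, which should land near 0, 2 and 5
```

With only 30 observations per column the sample means will wobble a little around the true means, which is exactly what the density plot will show later.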

First let's give our matrix some column names:

colnames(m) <- c('method1', 'method2', 'method3')
head(m)
#         method1    method2  method3
#[1,]  0.06288358  2.7413567 4.420209
#[2,] -0.11240501  3.4126550 4.827725
#[3,]  0.02467713  1.0868087 4.044101

ggplot has a nice function for just what we are after, geom_density, and its counterpart stat_density, which has more examples.

ggplot likes to work on data frames and we have a matrix, so let's fix that first:

df <- as.data.frame(m)
df
#       method1    method2  method3
#1   0.06288358  2.7413567 4.420209
#2  -0.11240501  3.4126550 4.827725
#3   0.02467713  1.0868087 4.044101
#4  -0.73854932 -0.4618973 3.668004

Enter stack


What we would really like is to have our data in 2 columns, where the first column contains the data values, and the second column contains the method name. 

Enter the base function stack, which is a great little function giving just what we need:

dfs <- stack(df)
dfs
#        values     ind
#1   0.06288358 method1
#2  -0.11240501 method1
#…
#88  5.55704736 method3
#89  6.40128267 method3
#90  3.18269138 method3

We can see the values are in one column named values, and the method names (the previous column names) are in the second column named ind. We can confirm they have been turned into a factor as well:

is.factor(dfs[,2])
#[1] TRUE
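To see what stack is doing, here is a rough by-hand equivalent. This is only a sketch to illustrate the reshaping, not how stack is implemented; the name by_hand and the seed are made up for the example.

```r
# Arbitrary seed and a small stand-in for the data frame used in this post.
set.seed(1)
df <- data.frame(method1 = rnorm(30, 0),
                 method2 = rnorm(30, 2),
                 method3 = rnorm(30, 5))

dfs <- stack(df)

# By hand: lay the columns end to end, and repeat each column name
# once per row so every value keeps its method label.
by_hand <- data.frame(
  values = unlist(df, use.names = FALSE),
  ind    = factor(rep(names(df), each = nrow(df)), levels = names(df))
)

identical(dfs$values, by_hand$values)
identical(as.character(dfs$ind), as.character(by_hand$ind))
```

Both comparisons should come back TRUE, which is a handy way to convince yourself that nothing clever is going on: stack simply trades three short columns for one long one plus a label.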

stack has a partner in crime, unstack, which does the opposite:

unstack(dfs)
#       method1    method2  method3
#1   0.06288358  2.7413567 4.420209
#2  -0.11240501  3.4126550 4.827725
#3   0.02467713  1.0868087 4.044101
#4  -0.73854932 -0.4618973 3.668004
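A quick way to check that unstack really undoes stack is a round trip. The sketch below rebuilds a small data frame like ours (the seed and the name back are my own choices) and compares each column before and after.

```r
# Arbitrary seed; df stands in for the data frame used in this post.
set.seed(1)
df <- data.frame(method1 = rnorm(30, 0),
                 method2 = rnorm(30, 2),
                 method3 = rnorm(30, 5))

# Stack into long form, then unstack back into one column per method.
back <- unstack(stack(df))

# Same column names, and the same values in every column.
identical(names(back), names(df))
sapply(names(df), function(nm) isTRUE(all.equal(back[[nm]], df[[nm]])))
```

Note that the round trip only gives a data frame back when every group has the same number of values; with unequal group sizes unstack returns a list instead.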

Back to ggplot


So, let's try plotting our densities with ggplot:

library(ggplot2)
ggplot(dfs, aes(x=values)) + geom_density()

The first argument is our stacked data frame, and the second is a call to the aes function which tells ggplot the 'values' column should be used on the x-axis.

However, our plot is not quite looking how we wish: with no grouping, ggplot treats all 90 values as one sample and draws a single density curve.

Hmm.

We want to group the values by each method used. To do this we will use the 'ind' column, and we tell ggplot about this by using aes in the geom_density call:

ggplot(dfs, aes(x=values)) + geom_density(aes(group=ind))
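Grouping means that one density is estimated per method instead of one for all the values. As a rough illustration in base R, and tying back to the apply post, we can split the values by ind and call density on each piece. I am assuming here that this mirrors what ggplot does per group closely enough to build intuition; the exact bandwidth and grid ggplot uses may differ. The seed and the name dens are made up for the example.

```r
# Arbitrary seed; dfs stands in for the stacked data frame from this post.
set.seed(1)
dfs <- stack(data.frame(method1 = rnorm(30, 0),
                        method2 = rnorm(30, 2),
                        method3 = rnorm(30, 5)))

# One kernel density estimate per method.
dens <- lapply(split(dfs$values, dfs$ind), density)

# Each element holds a grid of x positions and the estimated heights y.
names(dens)

# Where each curve peaks; these should sit roughly near 0, 2 and 5.
sapply(dens, function(d) d$x[which.max(d$y)])
```

Three separate estimates, three separate curves: that is all the group aesthetic is asking for.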


This is getting closer, but it's not easy to tell each one apart. Let's try colouring the different methods, based on the ind column in our data frame.

ggplot(dfs, aes(x=values)) + geom_density(aes(group=ind, colour=ind))



Looking better. I'd like to have the density regions stand out some more, so I will use fill and an alpha value of 0.3 to make them transparent.

ggplot(dfs, aes(x=values)) + geom_density(aes(group=ind, colour=ind, fill=ind), alpha=0.3)



That is much more in line with what I wanted to see. Note that the alpha argument is passed to geom_density() rather than aes(), because it is a fixed setting for the whole layer rather than a mapping to a column in our data.
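For reference, here is one way the finished plot could be written as a self-contained snippet. It is a sketch rather than the only way to do it: the seed, the object name p and the axis and legend labels are my own choices. It also leans on the fact that mapping colour and fill to a factor already splits the data into groups, so the explicit group=ind can be left out.

```r
library(ggplot2)

# Arbitrary seed; dfs stands in for the stacked data frame from this post.
set.seed(1)
dfs <- stack(data.frame(method1 = rnorm(30, 0),
                        method2 = rnorm(30, 2),
                        method3 = rnorm(30, 5)))

# colour and fill are mapped to the ind column inside aes();
# alpha is a constant, so it is set outside aes().
p <- ggplot(dfs, aes(x = values, colour = ind, fill = ind)) +
  geom_density(alpha = 0.3) +
  labs(x = "Value", y = "Density", colour = "Method", fill = "Method")

print(p)  # draw the plot
```

Storing the plot in a variable like p makes it easy to add further layers or labels later without retyping the whole call.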

That's all for now.



