Functions ddply and melt make plotting summary stats in R more tolerable

[This article was first published on Data and Analysis with R, at Work, and kindly contributed to R-bloggers.]

The main reason I have usually used Excel to make my plots at work is that I had difficulty feeding summary stats from R into a plotting function.  This week I learned how to turn summary stats into a data frame suitable for plotting, which makes the whole process of plotting in R more tolerable for me.  Below I show the process using the ever-popular iris dataset.  I use the functions ddply (from the plyr package) and melt (from the reshape2 package) to summarize the data and restructure it into a form amenable to plotting.

> library(plyr)      # provides ddply
> library(reshape2)  # provides melt
> # 25th, 50th and 75th percentiles of Sepal.Length within each Species
> length.by.species = ddply(iris, "Species", function (x) quantile(x$Sepal.Length, c(.25,.5,.75)))
> length.by.species
     Species   25% 50% 75%
1     setosa 4.800 5.0 5.2
2 versicolor 5.600 5.9 6.3
3  virginica 6.225 6.5 6.9
> # Stack the three quantile columns into Quantile / Sepal.Length pairs
> length.by.species = melt(length.by.species, id.vars="Species", variable.name="Quantile", value.name="Sepal.Length")
> length.by.species
     Species Quantile Sepal.Length
1     setosa      25%        4.800
2 versicolor      25%        5.600
3  virginica      25%        6.225
4     setosa      50%        5.000
5 versicolor      50%        5.900
6  virginica      50%        6.500
7     setosa      75%        5.200
8 versicolor      75%        6.300
9  virginica      75%        6.900

One thing you can see in my call to ddply is that the main qualitative variable, whose values are used to subset your data frame, is referred to using quotes.  Somehow I find that a bit weird (I’m used to referring to variables without quotes, I suppose!).  Other than that, the syntax for the ddply command is similar enough to the apply family of functions, so no more complaints here.  You can also see that once I call the function, it gives me a nice neat data frame where the quantiles I asked for are columns, and the values of the Species variable represent different rows (or subsets of the data frame).
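If the quotes bother you too, here is a small sketch of the other ways plyr lets you name the grouping variable: the `.()` helper and a one-sided formula.  The helper name `quartiles` is my own invention for this example, not something from plyr.

```r
library(plyr)

# Hypothetical helper (name chosen for this example only):
# the three quartiles of Sepal.Length for one subset of iris
quartiles <- function(x) quantile(x$Sepal.Length, c(.25, .5, .75))

by.string  <- ddply(iris, "Species", quartiles)   # quoted name, as in the post
by.quoted  <- ddply(iris, .(Species), quartiles)  # plyr's .() helper, no quotes
by.formula <- ddply(iris, ~ Species, quartiles)   # one-sided formula

# The three results should hold the same values
all.equal(by.string, by.quoted, check.attributes = FALSE)
all.equal(by.string, by.formula, check.attributes = FALSE)
```

All three forms split iris on the same column, so which one you use is mostly a matter of taste.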

The melt command is easy enough: it simply wants to know what to call the column that will hold the old column titles (Quantile!) and what to call the column that will hold the numeric values (Sepal.Length).
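To make the renaming concrete, here is a minimal sketch that melts a hand-typed copy of the summary table and then casts it back to wide form with dcast.  The names `wide`, `long` and `wide.again` are just placeholders for this example, and naming `id.vars` explicitly is my own choice so that melt does not have to guess which column identifies the rows.

```r
library(reshape2)

# Hand-typed copy of the wide summary table shown above
wide <- data.frame(
  Species = c("setosa", "versicolor", "virginica"),
  `25%` = c(4.800, 5.600, 6.225),
  `50%` = c(5.0, 5.9, 6.5),
  `75%` = c(5.2, 6.3, 6.9),
  check.names = FALSE  # keep "25%" etc. as column names
)

# Wide to long: old column titles go into Quantile, numbers into Sepal.Length
long <- melt(wide, id.vars = "Species",
             variable.name = "Quantile", value.name = "Sepal.Length")

# Long back to wide: one row per Species, one column per Quantile
wide.again <- dcast(long, Species ~ Quantile, value.var = "Sepal.Length")
```

Going back and forth like this is a handy way to check that the long table really holds the same numbers as the wide one.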

Now that the summary stats are in a “long” form data frame, with one column holding the numbers and two columns holding labels, it’s a simple one-liner to create a graph (here done in ggplot).  Below I show one line to create a dodged bar graph, and another line to create a dot plot, both showing the 1st to 3rd quartiles of Sepal.Length by Species.

library(ggplot2)
# stat="identity" tells geom_bar to use the values as bar heights rather than counting rows
ggplot(length.by.species, aes(y=Sepal.Length, x=Species, fill=Quantile)) + geom_bar(stat="identity", position="dodge")
ggplot(length.by.species, aes(x=Sepal.Length, y=Species, colour=Quantile)) + geom_point(size=4)
[Figure: bar graph from melted data frame]
[Figure: dot plot from melted data frame]
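For anyone who wants to run the whole thing in one go, here is a self-contained sketch of the summarize, melt and plot steps.  It assumes ggplot2 2.2.0 or later, where geom_col is available as a shorthand for a bar layer that plots the values as given; the object name `p` is just a placeholder.

```r
library(plyr)
library(reshape2)
library(ggplot2)

# Summarize: quartiles of Sepal.Length within each Species
length.by.species <- ddply(iris, "Species",
  function(x) quantile(x$Sepal.Length, c(.25, .5, .75)))

# Restructure: one row per Species/Quantile pair
length.by.species <- melt(length.by.species, id.vars = "Species",
  variable.name = "Quantile", value.name = "Sepal.Length")

# Plot: dodged bars, with bar heights taken directly from Sepal.Length
p <- ggplot(length.by.species,
            aes(x = Species, y = Sepal.Length, fill = Quantile)) +
  geom_col(position = "dodge")
print(p)
```

Storing the plot in an object first also makes it easy to add layers or save it later.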

Thank you ddply and melt!

