Functions ddply and melt make plotting summary stats in R more tolerable

May 15, 2012

(This article was first published on Data and Analysis with R, at Work, and kindly contributed to R-bloggers)

The main reason I have usually chosen Excel for my plots at work is that I had difficulty feeding summary stats computed in R into a plotting function.  This week I learned how to turn summary stats into a data frame suitable for plotting, which makes the whole process of plotting in R more tolerable for me.  Below I show the process using the ever-popular iris dataset.  I use the functions ddply (from the plyr package) and melt (from the reshape2 package) to summarize the data and restructure it into a form amenable to plotting.

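library(plyr)      # provides ddply
library(reshape2)  # provides melt
# 25th, 50th and 75th percentiles of Sepal.Length, computed separately for each Species
length.by.species = ddply(iris, "Species", function (x) quantile(x$Sepal.Length, c(.25,.5,.75)))
length.by.species
     Species   25% 50% 75%
1     setosa 4.800 5.0 5.2
2 versicolor 5.600 5.9 6.3
3  virginica 6.225 6.5 6.9
# Stack the three quantile columns into Quantile/Sepal.Length pairs, keeping Species as the id column
length.by.species = melt(length.by.species, id.vars="Species", variable.name="Quantile", value.name="Sepal.Length")
length.by.species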
     Species Quantile Sepal.Length
1     setosa      25%        4.800
2 versicolor      25%        5.600
3  virginica      25%        6.225
4     setosa      50%        5.000
5 versicolor      50%        5.900
6  virginica      50%        6.500
7     setosa      75%        5.200
8 versicolor      75%        6.300
9  virginica      75%        6.900

One thing you can see in my call to ddply is that the grouping variable, whose values are used to split the data frame into subsets, is passed as a quoted string.  I find that a bit weird (I’m used to referring to variables without quotes, I suppose!).  Other than that, the syntax of ddply is similar enough to the apply family of functions, so no more complaints here.  You can also see that the call returns a nice neat data frame in which the quantiles I asked for are columns and each value of Species gets its own row (one per subset of the data frame).
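For anyone who shares the discomfort with quoting the grouping variable, plyr also accepts it unquoted through its .() helper, or as a one-sided formula.  What follows is only a minimal sketch, assuming plyr is installed; the names quartiles, by.string, by.dot and by.formula are made up for illustration and are not part of the original code.

```r
library(plyr)  # provides ddply and the .() quoting helper

# Hypothetical helper: the three quartiles of Sepal.Length for one subset
quartiles <- function(x) quantile(x$Sepal.Length, c(.25, .5, .75))

by.string  <- ddply(iris, "Species", quartiles)   # quoted string, as in the post
by.dot     <- ddply(iris, .(Species), quartiles)  # unquoted, wrapped in .()
by.formula <- ddply(iris, ~ Species, quartiles)   # one-sided formula

# Compare the three spellings; all.equal reports any differences it finds
all.equal(by.string, by.dot)
all.equal(by.string, by.formula)
```

Which spelling to use is mostly a matter of taste; the quoted string is handy when the column name is held in a variable.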

The melt command is easy enough: it simply wants to know what to call the column that will hold the former column titles (Quantile!) and what to call the column that will hold the numeric values (Sepal.Length).
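To make the reshape easier to picture, here is a small sketch that melts the summary and then casts it back to the wide layout with dcast.  It assumes the reshape2 version of melt (the one that takes variable.name and value.name); the object names wide, long and wide.again are my own, chosen only for this illustration.

```r
library(plyr)      # ddply
library(reshape2)  # melt and dcast

# Wide summary: one row per Species, one column per quantile
wide <- ddply(iris, "Species",
              function(x) quantile(x$Sepal.Length, c(.25, .5, .75)))

# Long form: naming the id column explicitly avoids relying on melt's guess
long <- melt(wide, id.vars = "Species",
             variable.name = "Quantile", value.name = "Sepal.Length")

# dcast reverses the reshape: Species down the rows, Quantile across the columns
wide.again <- dcast(long, Species ~ Quantile, value.var = "Sepal.Length")

dim(long)          # 9 rows (3 species x 3 quantiles) and 3 columns
names(wide.again)  # same column names as the wide summary
```

Going back and forth like this is a quick way to convince yourself that melt only rearranges the numbers and does not change them.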

Now that the summary stats are in a “long” form data frame, with one column holding the numbers and two columns holding the categories, it’s a simple one-liner to create a graph (here done in ggplot2).  Below I show one line to create a dodged bar graph and another to create a dot plot, both showing the 1st to 3rd quartiles of Sepal.Length by Species.

library(ggplot2)
# stat="identity" belongs in geom_bar, not in aes: it tells the bars to use the y values as given
ggplot(length.by.species, aes(y=Sepal.Length, x=Species, fill=Quantile)) + geom_bar(stat="identity", position="dodge")
ggplot(length.by.species, aes(x=Sepal.Length, y=Species, colour=Quantile)) + geom_point(size=4)
[Figure: bar graph from melted data frame]
[Figure: dot plot from melted data frame]

Thank you ddply and melt!

