# Plotting grouped data vs time with error bars in R

October 31, 2011
By

(This article was first published on Daniel Hocking, Ecologist » Blog, and kindly contributed to R-bloggers)

This is my first blog since joining R-bloggers. I’m quite excited to be part of this group and apologize if I bore any experienced R users with my basic blogs for learning R or offend programmers with my inefficient, sloppy coding. Hopefully writing for this most excellent community will help improve my R skills while helping other novice R users.

I have a dataset from my dissertation research were I repeatedly counted salamanders from some plots and removed salamanders from other plots. I was interested in plotting the captures over time (sensu Hairston 1987). As with all statistics and graphs in R, there are a variety of ways to create the same or similar output. One challenge I always have is dealing with dates. Another challenge is plotting error bars on plots. Now I had the challenge of plotting the average captures per night or month ± SE for two groups (reference and depletion plots – 5 of each on each night) vs. time. This is the type of thing that could be done in 5 minutes on graph paper by hand but took me a while in R and I’m still tweaking various plots. Below I explore a variety of ways to handle and plot this data. I hope it’s helpful for others. Chime in with comments if you have any suggestions or similar experiences.
> ####################################################################
> # Summary of Count and Removal Data
> # Dissertation Project 2011
> # October 25, 2011
> # Daniel J. Hocking
> ####################################################################
>
> Data <- read.table(‘/Users/Dan/…/AllCountsR.txt’, header = TRUE, na.strings = “NA”)
>
> str(Data)
‘data.frame’:            910 obs. of  32 variables:
\$ dates            : Factor w/ 91 levels “04/06/09″,”04/13/11″,..: 12 14 15 16 17 18 19 21 22 23 …
You’ll notice that when importing a text file created in excel with the default date format, R treats the date variable as a Factor within the data frame. We need to convert it to a date form that R can recognize. Two such built-in functions are as.Date and as.POSIXct. The latter is a more common format and the one I choose to use (both are very similar but not fully interchangeable). To get the data in the POSIXct format in this case I use the strptime function as seen below. I also create a couple new columns of day, month, and year in case they become useful for aggregating or summarizing the data later.
>
> dates <-strptime(as.character(Data\$dates), “%m/%d/%y”)  # Change date from excel storage to internal R format
>
> dim(Data)
[1] 910  32
> Data = Data[,2:32]               # remove
> Data = data.frame(date = dates, Data)      #this is now the date in useful fashion
>
> Data\$mo <- strftime(Data\$date, “%m”)
> Data\$mon <- strftime(Data\$date, “%b”)
> Data\$yr <- strftime(Data\$date, “%Y”)
>
> monyr <- function(x)
+ {
+     x <- as.POSIXlt(x)
+     x\$mday <- 1
+     as.POSIXct(x)
+ }
>
> Data\$mo.yr <- monyr(Data\$date)
>
> str(Data)
‘data.frame’:            910 obs. of  36 variables:
\$ date             : POSIXct, format: “2008-05-17″ “2008-05-18″ …
\$ mo               : chr  “05″ “05″ “05″ “05″ …
\$ mon              : chr  “May” “May” “May” “May” …
\$ yr               : chr  “2008″ “2008″ “2008″ “2008″ …
\$ mo.yr            : POSIXct, format: “2008-05-01 00:00:00″ “2008-05-01 00:00:00″ …
>
As you can see the date is now in the internal R date form of POSIXct (YYYY-MM-DD).
Now I use a custom function to summarize each night of counts and removals. I forgot offhand how to call to a custom function stored elsewhere to I lazily pasted it in my script. I found this nice little function online but I apologize to the author because I don’t remember were I found it.
> library(ggplot2)
> library(doBy)
>
> ## Summarizes data.
> ## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
> ## If there are within-subject variables, calculate adjusted values using method from Morey (2008).
> ##   measurevar: the name of a column that contains the variable to be summariezed
> ##   groupvars: a vector containing names of columns that contain grouping variables
> ##   na.rm: a boolean that indicates whether to ignore NA’s
> ##   conf.interval: the percent range of the confidence interval (default is 95%)
> summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE, conf.interval=.95) {
+     require(doBy)
+
+     # New version of length which can handle NA’s: if na.rm==T, don’t count them
+     length2 <- function (x, na.rm=FALSE) {
+         if (na.rm) sum(!is.na(x))
+         else       length(x)
+     }
+
+     # Collapse the data
+     formula <- as.formula(paste(measurevar, paste(groupvars, collapse=” + “), sep=” ~ “))
+     datac <- summaryBy(formula, data=data, FUN=c(length2,mean,sd), na.rm=na.rm)
+
+     # Rename columns
+     names(datac)[ names(datac) == paste(measurevar, “.mean”,    sep=””) ] <- measurevar
+     names(datac)[ names(datac) == paste(measurevar, “.sd”,      sep=””) ] <- “sd”
+     names(datac)[ names(datac) == paste(measurevar, “.length2″, sep=””) ] <- “N”
+
+     datac\$se <- datac\$sd / sqrt(datac\$N)  # Calculate standard error of the mean
+
+     # Confidence interval multiplier for standard error
+     # Calculate t-statistic for confidence interval:
+     # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
+     ciMult <- qt(conf.interval/2 + .5, datac\$N-1)
+     datac\$ci <- datac\$se * ciMult
+
+     return(datac)
+ }
>
> # summarySE provides the standard deviation, standard error of the mean, and a (default 95%) confidence interval
> gCount <- summarySE(Count, measurevar=”count”, groupvars=c(“date”,”trt”))
Now I’m ready to plot. I’ll start with a line graph of the two treatment plot types.
> ### Line Plot ###
> # Ref: Quick-R
>
> # convert factor to numeric for convenience
> gCount\$trtcode <- as.numeric(gCount\$trt)
> ntrt <- max(gCount\$trtcode)
>
> # ranges for x and y axes
> xrange <- range(gCount\$date)
> yrange <- range(gCount\$count)
>
> # Set up blank plot
> plot(gCount\$date, gCount\$count, type = “n”,
+      xlab = “Date”,
+      ylab = “Mean number of salamanders per night”)
>
> # add lines
> color <- seq(1:ntrt)
> line <- seq(1:ntrt)
> for (i in 1:ntrt){
+   Treatment <- subset(gCount, gCount\$trtcode==i)
+   lines(Treatment\$date, Treatment\$count, col = color[i])#, lty = line[i])
+ }

As you can see this is not attractive but it does show that I generally caught more salamander in the reference (red line) plots. This makes it apparent that a line graph is probably a bit messy for this data and might even be a bit misleading because data was not collected continuously or at even intervals (weather and season dependent collection). So, I tried a plot of just the points using the scatterplot function from the “car” package.
> ### package car scatterplot by groups ###
> library(car)
>
> # Plot
> scatterplot(count ~ date + trt, data = gCount,
+             smooth = FALSE, grid = FALSE, reg.line = FALSE,
+    xlab=”Date”, ylab=”Mean number of salamanders per night”)

This was nice because it was very simple to code. It includes points from every night but I would still like to summarize it more. Before I get to that, I would like to try having breaks between the years. The lattice package should be useful for this.
> library(lattice)
>
> # Add year, month, and day to dataframe
> chardate <- as.character(gCount\$date)
> splitdate <- strsplit(chardate, split = “-”)
> gCount\$year <- as.numeric(unlist(lapply(splitdate, “[“, 1)))
> gCount\$month <- as.numeric(unlist(lapply(splitdate, “[“, 2)))
> gCount\$day <- as.numeric(unlist(lapply(splitdate, “[“, 3)))
>
>
> # Plot
> xyplot(count ~ trt + date | year,
+ data = gCount,
+ ylab=”Daily salamander captures”, xlab=”date”,
+ pch = seq(1:ntrt),
+ scales=list(x=list(alternating=c(1, 1, 1))),
+ between=list(y=1),
+ par.strip.text=list(cex=0.7),
+ par.settings=list(axis.text=list(cex=0.7)))

Obviously there is a problem with this. I am not getting proper overlaying of the two treatments. I tried adjusting the equation (e.g. count ~ month | year*trt), but nothing was that enticing and I decided to go back to other plotting functions. The lattice package is great for trellis plots and visualizing mixed effects models.
I now decided to summarize the data by month rather than by day and add standard error bars. This goes back to using the base plot function.
> ### Line Plot ###
> # Ref: http://personality-project.org/r/r.plottingdates.html
>
> # summarySE provides the standard deviation, standard error of the mean, and a (default 95%) confidence interval
> mCount <- summarySE(Count, measurevar=”count”, groupvars=c(“mo.yr”,”trt”))
> refmCount <- subset(mCount, mCount\$trt == “Reference”)
> depmCount <- subset(mCount, mCount\$trt == “Depletion”)
>
> daterange=c(as.POSIXct(min(mCount\$mo.yr)),as.POSIXct(max(mCount\$mo.yr)))
>
> # determine the lowest and highest months
> ylims <- c(0, max(mCount\$count + mCount\$se))
> r <- as.POSIXct(range(refmCount\$mo.yr), “month”)
>
> plot(refmCount\$mo.yr, refmCount\$count, type = “n”, xaxt = “n”,
+      xlab = “Date”,
+      ylab = “Mean number of salamanders per night”,
+      xlim = c(r[1], r[2]),
+      ylim = ylims)
> axis.POSIXct(1, at = seq(r[1], r[2], by = “month”), format = “%b”)
> points(refmCount\$mo.yr, refmCount\$count, type = “p”, pch = 19)
> points(depmCount\$mo.yr, depmCount\$count, type = “p”, pch = 24)
> arrows(refmCount\$mo.yr, refmCount\$count+refmCount\$se, refmCount\$mo.yr, refmCount\$count-refmCount\$se, angle=90, code=3, length=0)
> arrows(depmCount\$mo.yr, depmCount\$count+depmCount\$se, depmCount\$mo.yr, depmCount\$count-depmCount\$se, angle=90, code=3, length=0)

Now that’s a much better visualization of the data and that’s the whole goal of a figure for publication. The only thing I might change would be I might plot by year with the labels of Month-Year (format = %b \$Y). I might add a legend but with only two treatments I might just include the info in the figure description.
Although that is probably going to be my final product for my current purposes, I wanted to explore a few other graphing options for visualizing this data. The first is to use box plots. I use the add = TRUE option to add a second group after subsetting the data.
> ### Boxplot ###
> # Ref: http://personality-project.org/r/r.plottingdates.html
>
> #    as.POSIXlt(date)\$mon     #gives the months in numeric order mod 12 with January = 0 and December = 11
>
> refboxplot <- boxplot(count ~ date, data = Count, subset = trt == “Reference”,
+                       ylab = “Mean number of salamanders per night”,
+                       xlab = “Date”)   #show the graph and save the data
> depboxplot <- boxplot(count ~ date, data = Count, subset = trt == “Depletion”, col = 2, add = TRUE)
>

Clearly this is a mess and not useful. But you can imagine that with some work and summarizing by month or season it could be a useful and efficient way to present the data. Next I tried the popular package ggplot2.
> ### ggplot ###
> # Refs: http://learnr.wordpress.com/2010/02/25/ggplot2-plotting-dates-hours-and-minutes/
> # http://had.co.nz/ggplot2/
> # http://wiki.stdout.org/rcookbook/Graphs/Plotting%20means%20and%20error%20bars%20%28ggplot2%29
>
> library(ggplot2)
>
> ggplot(data = gCount, aes(x = date, y = count, group = trt)) +
+     #geom_point(aes(shape = factor(trt))) +
+     geom_point(aes(colour = factor(trt), shape = factor(trt)), size = 3) +
+     #geom_line() +
+     geom_errorbar(aes(ymin=count-se, ymax=count+se), width=.1) +
+     #geom_line() +
+    # scale_shape_manual(values=c(24,21)) + # explicitly have sham=fillable triangle, ACCX=fillable circle
+     #scale_fill_manual(values=c(“white”,”black”)) + # explicitly have sham=white, ACCX=black
+     xlab(“Date”) +
+     ylab(“Mean number of salamander captures per night”) +
+     scale_colour_hue(name=”Treatment”, # Legend label, use darker colors
+                      l=40) +                  # Use darker colors, lightness=40
+     theme_bw() + # make the theme black-and-white rather than grey (do this before font changes, or it overrides them)
+
+     opts(legend.position=c(.2, .9), # Position legend inside This must go after theme_bw
+           panel.grid.major = theme_blank(), # switch off major gridlines
+           panel.grid.minor = theme_blank(), # switch off minor gridlines
+          legend.title = theme_blank(), # switch off the legend title
+          legend.key = theme_blank()) # switch off the rectangle around symbols in the legend
This plot could work with some fine tuning, especially with the legend(s) but you get the idea. It wasn’t as easy for me as the plot function but ggplot is quite versatile and probably a good package to have in your back pocket for complicated graphing.
Next up was the gplots package for the plotCI function.
> library(gplots)
>
> plotCI(
+   x = refmCount\$mo.yr,
+   y = refmCount\$count,
+   uiw = refmCount\$se, # error bar length (default is to put this much above and below point)
+   pch = 24, # symbol (plotting character) type: see help(pch); 24 = filled triangle pointing up
+   pt.bg = “white”, # fill colour for symbol
+   cex = 1.0, # symbol size multiplier
+   lty = 1, # error bar line type: see help(par) – not sure how to change plot lines
+   type = “p”, # p=points, l=lines, b=both, o=overplotted points/lines, etc.; see help(plot.default)
+   gap = 0, # distance from symbol to error bar
+   sfrac = 0.005, # width of error bar as proportion of x plotting region (default 0.01)
+   xlab = “Year”, # x axis label
+   ylab = “Mean number of salamanders per night”,
+   las = 1, # axis labels horizontal (default is 0 for always parallel to axis)
+   font.lab = 1, # 1 plain, 2 bold, 3 italic, 4 bold italic, 5 symbol
+   xaxt = “n”) # Don’t print x-axis
> )
>   axis.POSIXct(1, at = seq(r[1], r[2], by = “year”), format = “%b %Y”) # label the x axis by month-years
>
> plotCI(
+   x=depmCount\$mo.yr,
+   y = depmCount\$count,
+   uiw=depmCount\$se, # error bar length (default is to put this much above and below point)
+   pch=21, # symbol (plotting character) type: see help(pch); 21 = circle
+   pt.bg=”grey”, # fill colour for symbol
+   cex=1.0, # symbol size multiplier
+   lty=1, # line type: see help(par)
+   type=”p”, # p=points, l=lines, b=both, o=overplotted points/lines, etc.; see help(plot.default)
+   gap=0, # distance from symbol to error bar
+   sfrac=0.005, # width of error bar as proportion of x plotting region (default 0.01)
+   xaxt = “n”,
+   add=TRUE # ADD this plot to the previous one
+ )

Now this is a nice figure. I must say that I like this. It is very similar to the standard plot code and graph but it was a little easier to add the error bars.
So that’s it, what a fun weekend I had! Let me know what you think or if you have any suggestions. I’m new to this and love to learn new ways of coding things in R.

To leave a comment for the author, please follow the link and comment on their blog: Daniel Hocking, Ecologist » Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , , , , , , , , , , , , , , , , , ,

Comments are closed.

## Sponsors

Contact us if you wish to help support R-bloggers, and place your banner here.

# Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts.(You will not see this message again.)

Click here to close (This popup will not appear again)