[This article was first published on fishR » R, and kindly contributed to R-bloggers.]

This example has been updated in this post.

I came across a “problem” today where I needed to create catch data for individual nets from length measurements made on individual fish in those nets.  In other words, I had data that showed three individual length measurements for Brook Trout, two measurements for Lake Trout, and two measurements for Rainbow Trout in net #1 and I needed a data frame that showed these catch amounts (i.e., the three, two, and two).  Of course, the real problem had more fish and more nets.

The ddply() function from the plyr package works very well for this type of problem, as illustrated below.  Basically, this function breaks your original data frame into smaller groups (in this case, species within nets), applies some function to each group (in this case, computing the length of the fish length variable, which corresponds to the number of fish caught), and then combines the results from each group back into a single data frame.  Hadley Wickham, the author of plyr, calls this the Split-Apply-Combine strategy.
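To make the three steps concrete, here is a minimal base R sketch of the split-apply-combine idea described above. It is not how plyr works internally; the `fish` data frame and its values are made up purely for illustration.

```r
# Hypothetical toy data: one row per measured fish
fish <- data.frame(net     = c(1, 1, 1, 2, 2),
                   species = c("BKT", "BKT", "LKT", "BKT", "LKT"),
                   tl      = c(101, 95, 110, 99, 104))

# Split: one data frame per net-by-species combination
# (drop = TRUE omits combinations that have no rows)
pieces <- split(fish, list(fish$net, fish$species), drop = TRUE)

# Apply: summarize each piece as a one-row data frame
counts <- lapply(pieces, function(d) {
  data.frame(net = d$net[1], species = d$species[1], catch = length(d$tl))
})

# Combine: stack the one-row results back into a single data frame
catch <- do.call(rbind, counts)
rownames(catch) <- NULL
catch
```

ddply() wraps these three steps in a single call, which is why it is convenient here.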

Here, ddply() takes the original data frame as the first argument, a formula listing the grouping variables as the second argument (more about this below), the summarize() function (written without the parentheses) as the third argument, and then the name of a new variable set equal to a function that computes a summary (length() of the fish length variable in this case).  The original data frame is split into groups based on unique combinations of the net and species variables.  The eff(ort) and temp(erature) variables are constant within a net, so including them in the formula does not create additional groups; it simply carries them along with net into the final data frame.

# make some toy data: one row per measured fish
# (net 1 has 7 fish, net 2 has 5, net 3 has 6)
lens <- data.frame(net=rep(c(1,2,3),c(7,5,6)),
                   eff=rep(c(1,2,2),c(7,5,6)),
                   temp=rep(c(17,15.5,16.5),c(7,5,6)),
                   species=c(rep(c("BKT","LKT","RBT"),c(3,2,2)),
                             rep(c("BKT","LKT"),c(2,3)),
                             rep(c("BKT","RBT"),c(4,2))),
                   tl=round(rnorm(18,mean=100,sd=10),0)
                  )
lens

# now turn it into catch data
library(plyr)
# count the fish (rows) in each net-by-species group
catch1 <- ddply(lens,~net+eff+temp+species,
                summarize,catch=length(tl))
catch1


A common problem with this type of data is that mean catch per net will not be computed properly because some species were not captured in some nets, but no zero was entered for those species in those nets.  The addZeroCatch() function in the FSA package can be used to enter these zeroes automatically (though not quickly).  This function requires the data frame with catches as the first argument, the name of the variable that identifies the net as the second argument, the name of the variable that identifies the species as the third argument, and a vector of names of variables that should be set to zero in the zerovar= argument.  This process is illustrated below.

# now add zeroes where needed
library(FSA)
catch2 <- addZeroCatch(catch1,"net","species",
                       zerovar="catch")
# check it out -- sorted by net then species
catch2[order(catch2$net,catch2$species),]
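To see why the zeroes matter, the sketch below fills in missing net-by-species combinations with base R and compares the mean catch before and after. This is only an illustration of the idea, not the FSA implementation; the `catch` data frame is hypothetical, and net-level variables such as eff and temp are left out to keep it short.

```r
# Hypothetical catch table: LKT was not caught in net 2, so it has no row
catch <- data.frame(net     = c(1, 1, 2),
                    species = c("BKT", "LKT", "BKT"),
                    catch   = c(3, 2, 4))

# Every net-by-species combination that could have occurred
full <- expand.grid(net = unique(catch$net),
                    species = unique(catch$species),
                    stringsAsFactors = FALSE)

# Join the observed catches onto the full grid; unmatched rows get NA
filled <- merge(full, catch, by = c("net", "species"), all.x = TRUE)

# Replace the NAs with zero catches
filled$catch[is.na(filled$catch)] <- 0

# Mean LKT catch per net, without and with the zero for net 2
mean.without <- mean(catch$catch[catch$species == "LKT"])
mean.with    <- mean(filled$catch[filled$species == "LKT"])
c(without = mean.without, with = mean.with)
```

Without the zero, the LKT mean is computed over one net only, so it overstates the catch per net.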


Now, for example, the mean and SD of catch-per-unit-effort (CPE) per species can be computed.

# illustrate compute mean/sd CPE
# CPE is catch divided by effort for each net-by-species row
catch2$cpe <- catch2$catch/catch2$eff
ddply(catch2,~species,
      summarize,mean.cpe=mean(cpe),sd.cpe=sd(cpe))
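For comparison, the same kind of per-species summary can be sketched in base R with tapply(). The `catch` data frame below is a hypothetical, already zero-filled table, and the name `cpe.summary` is chosen only for this illustration.

```r
# Hypothetical zero-filled catch table with effort recorded per net
catch <- data.frame(net     = c(1, 1, 2, 2),
                    eff     = c(1, 1, 2, 2),
                    species = c("BKT", "LKT", "BKT", "LKT"),
                    catch   = c(3, 2, 4, 0))

# CPE is catch divided by effort, computed row by row
catch$cpe <- catch$catch / catch$eff

# Mean and SD of CPE for each species
mean.cpe <- tapply(catch$cpe, catch$species, mean)
sd.cpe   <- tapply(catch$cpe, catch$species, sd)

# Collect the results in one data frame, one row per species
cpe.summary <- data.frame(species  = names(mean.cpe),
                          mean.cpe = as.vector(mean.cpe),
                          sd.cpe   = as.vector(sd.cpe))
cpe.summary
```

The ddply() call is more compact because it computes both summaries in one pass and returns the data frame directly.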


Obviously, this is a toy example, but it can be scaled up to larger projects.
