Modified lencat() — Increased Flexibility with dplyr

[This article was first published on fishR » R, and kindly contributed to R-bloggers].

One of the first functions in the FSA package was lencat(), which has served me well over the years. However, two things have bothered me: a formula and data= were required just to identify the single column to be “transformed,” and an “automatic” determination of startcat= was never coded. Additionally, lencat() did not work well with dplyr, a package that I recently discovered (see my introduction). Thus, I have reworked lencat() in the latest FSA to handle these issues while maintaining the original functionality.

The modified lencat() behaves slightly differently depending on how the user supplies the fish lengths. If the user provides a formula and data=, then lencat() will return a data.frame with the new variable appended. This is the exact same behavior as the original lencat(). However, if the user supplies a vector as the first argument, then lencat() will now return a single vector of the length categorization values. Additionally, in both uses, the user can leave startcat= blank and a reasonable starting value (i.e., a value just below the minimum observed value that “makes sense” given w=) will be used.

The new functionality of lencat() is demonstrated below. First, I loaded the FSA and dplyr packages.

library(FSA)
library(dplyr)

Smallmouth Bass length data from a lake in Minnesota will be used. For the sake of simplicity, all variables related to measurements on the scales of the fish (i.e., radcap and all variables containing “anu”) and the species and lake variables (which were constant at “SMB” and “WB”) were removed.

data(SMBassWB)
smb1 <- SMBassWB %>%
  select(-contains("anu"),-radcap,-species,-lake)
smb3 <- smb2 <- smb1 # copies for later use
str(smb1)

## 'data.frame':    445 obs. of  5 variables:
##  $ gear   : Factor w/ 2 levels "E","T": 1 1 1 1 1 1 1 1 1 1 ...
##  $ yearcap: int  1988 1988 1988 1988 1988 1988 1989 1990 1990 1990 ...
##  $ fish   : int  5 3 2 4 6 7 50 482 768 428 ...
##  $ agecap : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ lencap : int  71 64 57 68 72 80 55 75 75 71 ...

Note that the length measurements are in the lencap variable.

Introductory Example of New Functionality

As a foundational example, lencat() is used below to create a new vector of 10-mm length categories for the lengths. Only the first 12 length categories are shown (using head()) to save space.

tmp <- lencat(smb1$lencap,w=10)
head(tmp,n=12)

##  [1]  70  60  50  60  70  80  50  70  70  70 100  50
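
The arithmetic behind these categories can be sketched in base R. This is only an illustration of what a 10-mm category is (the lower bound of the 10-mm bin that contains each length), not the code that lencat() actually runs. The len vector holds the first ten lencap values copied from the str() output above.

```r
# First ten lencap values, copied from the str() output above
len <- c(71, 64, 57, 68, 72, 80, 55, 75, 75, 71)
w <- 10

# Lower bound of the w-mm bin that contains each length
cat10 <- floor(len / w) * w
cat10
```

These values agree with the first ten values returned by lencat() above.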

These length categories can be added to the data frame as follows.

smb1$LCat10 <- lencat(smb1$lencap,w=10)
head(smb1)

##   gear yearcap fish agecap lencap LCat10
## 1    E    1988    5      1     71     70
## 2    E    1988    3      1     64     60
## 3    E    1988    2      1     57     50
## 4    E    1988    4      1     68     60
## 5    E    1988    6      1     72     70
## 6    E    1988    7      1     80     80

The same variable can be added using mutate() from dplyr as follows.

smb1 <- mutate(smb1,LCat10=lencat(lencap,w=10))
head(smb1)

##   gear yearcap fish agecap lencap LCat10
## 1    E    1988    5      1     71     70
## 2    E    1988    3      1     64     60
## 3    E    1988    2      1     57     50
## 4    E    1988    4      1     68     60
## 5    E    1988    6      1     72     70
## 6    E    1988    7      1     80     80

The advantage of using dplyr in this way is that you can string together multiple data manipulations. For example, one could create the variable as above but then order the rows of the data.frame in ascending length category values as follows.

smb1 <- smb1 %>%
  mutate(LCat10=lencat(lencap,w=10)) %>%
  arrange(LCat10)
head(smb1)

##   gear yearcap fish agecap lencap LCat10
## 1    E    1988    2      1     57     50
## 2    E    1989   50      1     55     50
## 3    T    1988    2      1     57     50
## 4    E    1988    3      1     64     60
## 5    E    1988    4      1     68     60
## 6    T    1988    3      1     64     60

Extended Example of New Functionality

In the examples above, the 10-mm length categories were created without the use of startcat=. The lencat() function found the largest multiple of 10 mm (50) that did not exceed the minimum observed value (55) and created length categories from that value. One can still set the value for the starting category with startcat= as follows.

smb1 <- smb1 %>%
  mutate(LCat10=lencat(lencap,w=10,startcat=55)) %>%
  arrange(LCat10)
head(smb1)

##   gear yearcap fish agecap lencap LCat10
## 1    E    1988    2      1     57     55
## 2    E    1989   50      1     55     55
## 3    T    1988    2      1     57     55
## 4    E    1988    3      1     64     55
## 5    T    1988    3      1     64     55
## 6    E    1988    4      1     68     65

However, the automatic startcat= seems to be a useful feature for a wide variety of different values of w= as demonstrated below.

smb1 <- smb1 %>%
  mutate(LCat5=lencat(lencap,w=5)) %>%
  mutate(LCat10=lencat(lencap,w=10)) %>%
  mutate(LCat25=lencat(lencap,w=25)) %>%
  arrange(lencap)
head(smb1,n=10)

##    gear yearcap fish agecap lencap LCat10 LCat5 LCat25
## 1     E    1989   50      1     55     50    55     50
## 2     E    1988    2      1     57     50    55     50
## 3     T    1988    2      1     57     50    55     50
## 4     E    1988    3      1     64     60    60     50
## 5     T    1988    3      1     64     60    60     50
## 6     E    1988    4      1     68     60    65     50
## 7     T    1988    4      1     68     60    65     50
## 8     E    1988    5      1     71     70    70     50
## 9     E    1990  428      1     71     70    70     50
## 10    T    1988    5      1     71     70    70     50
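
One way to reason about the automatic starting category is sketched below in base R. The helper bin_lengths() is hypothetical and is not part of FSA. It assumes that the starting category is the largest multiple of w that does not exceed the minimum length, which is consistent with the output above but may not match every detail of lencat() (e.g., how decimal widths are handled).

```r
# Hypothetical helper for illustration only; not the FSA implementation
bin_lengths <- function(x, w, startcat = NULL) {
  # Assumed rule: start at the largest multiple of w at or below min(x)
  if (is.null(startcat)) startcat <- floor(min(x, na.rm = TRUE) / w) * w
  # Lower bound of the w-wide bin, counted from startcat
  startcat + floor((x - startcat) / w) * w
}

# The eight smallest lencap values from the output above
len <- c(55, 57, 57, 64, 64, 68, 68, 71)

bin_lengths(len, w = 5)                  # automatic start is 55
bin_lengths(len, w = 10)                 # automatic start is 50
bin_lengths(len, w = 25)                 # automatic start is 50
bin_lengths(len, w = 10, startcat = 55)  # user-supplied start
```

For these eight lengths the sketch gives the same categories as the LCat5, LCat10, and LCat25 columns above.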

The default type returned by lencat() is numeric. This can result in “missing categories” in length frequency distributions. For example, the length frequency distribution for 25-mm length categories shown below is missing the 375- and 400-mm categories.

xtabs(~LCat25,data=smb1)

## LCat25
##  50  75 100 125 150 175 200 225 250 275 300 325 350 425 
##  12  14  52  58  60  37  51  45  48  50   9   6   2   1

The problem with missing length categories can be corrected by having the values returned as a factor rather than as a numeric. The return values are forced to be a factor by including as.fact=TRUE in the call to lencat() as shown below.

smb1 <- smb1 %>%
  mutate(LCat25f=lencat(lencap,w=25,as.fact=TRUE))
xtabs(~LCat25f,data=smb1)

## LCat25f
##  50  75 100 125 150 175 200 225 250 275 300 325 350 375 400 425 
##  12  14  52  58  60  37  51  45  48  50   9   6   2   0   0   1
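
The reason a factor fills in the empty categories can be seen with a small base R sketch. The cats vector below is made up for illustration, and the way the levels are built (a sequence from the smallest to the largest category in steps of w) is an assumption about the general idea rather than a copy of the lencat() internals.

```r
# Made-up 25-mm categories with a gap between 350 and 425
cats <- c(300, 300, 325, 350, 425)
w <- 25

# Numeric values: only the categories that occur are tabulated
tab_num <- table(cats)
tab_num

# Factor with every possible category as a level: empty bins get a count of 0
lvls <- seq(min(cats), max(cats), by = w)
catsf <- factor(cats, levels = lvls)
tab_fac <- table(catsf)
tab_fac
```

The second table includes the 375 and 400 categories with a count of zero, which is the behavior that as.fact=TRUE provides.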

Finally, one can still use breaks= to set specific and potentially unequally spaced values for the length categories. The example below finds the Gabelhouse five-cell length categories for Smallmouth Bass and then creates two new variables from these values: one that shows the length values and one that shows the category names. To further exhibit the use of dplyr, I also used filter() to remove all fish that were less than “stock” size (i.e., those in the zero category).

( brks <- psdVal("Smallmouth Bass",units="mm") )

##      zero     stock   quality preferred memorable    trophy 
##         0       180       280       350       430       510

smb2 <- smb2 %>%
  mutate(LCatPSD1=lencat(lencap,breaks=brks)) %>%
  mutate(LCatPSD2=lencat(lencap,breaks=brks,use.names=TRUE)) %>%
  arrange(lencap) %>%
  filter(LCatPSD2 != "zero")
head(smb2,n=10)

##    gear yearcap fish agecap lencap LCatPSD1 LCatPSD2
## 1     E    1990  415      3    180      180    stock
## 2     E    1988   28      5    180      180    stock
## 3     E    1988   29      5    180      180    stock
## 4     T    1988   29      5    180      180    stock
## 5     T    1988   28      5    180      180    stock
## 6     E    1990  700      3    182      180    stock
## 7     T    1989   40      3    183      180    stock
## 8     T    1989   98      2    187      180    stock
## 9     E    1990  760      3    187      180    stock
## 10    E    1990  399      4    187      180    stock

xtabs(~LCatPSD1,data=smb2)

## LCatPSD1
## 180 280 350 430 
## 188  54   2   1

xtabs(~LCatPSD2,data=smb2)

## LCatPSD2
##      zero     stock   quality preferred memorable    trophy 
##         0       188        54         2         1         0

Note that the categories without any fish are still shown in the last table. This can be adjusted with droplevels() as follows.

smb2 <- droplevels(smb2)
xtabs(~LCatPSD2,data=smb2)

## LCatPSD2
##     stock   quality preferred memorable 
##       188        54         2         1
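
The mapping from lengths to unequally spaced, named categories can also be sketched in base R with findInterval(). This is a rough illustration under stated assumptions, not the method used inside lencat(): the break values are copied from the psdVal() output above, the lengths are made up, and the object names are hypothetical.

```r
# Break values copied from the psdVal() output above
brks <- c(zero = 0, stock = 180, quality = 280, preferred = 350,
          memorable = 430, trophy = 510)

# Made-up lengths (mm) for illustration
len <- c(71, 180, 187, 300, 352, 431)

# Index of the largest break that is less than or equal to each length
idx <- findInterval(len, brks)

psd_val  <- unname(brks[idx])                               # break values
psd_name <- factor(names(brks)[idx], levels = names(brks))  # break names
data.frame(len, psd_val, psd_name)

# Remove sub-stock fish, then drop the levels that no longer occur
keep <- droplevels(psd_name[psd_name != "zero"])
table(keep)
```

Because the names are stored as a factor with all six levels, unused categories persist after filtering until droplevels() is applied, just as in the tables above.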

The Old Functionality Is Still There

The “old” functionality of lencat() still exists so that your old code with lencat() is not broken (with the minor exception that use.catnames= is now use.names=).

smb3 <- lencat(~lencap,data=smb3,w=10)
smb3 <- lencat(~lencap,data=smb3,w=25,vname="LenCat25")
smb3 <- lencat(~lencap,data=smb3,breaks=psdVal("Smallmouth Bass"),
               vname="LenPsd")
smb3 <- lencat(~lencap,data=smb3,breaks=psdVal("Smallmouth Bass"),
               vname="LenPsd2",use.names=TRUE,drop.levels=TRUE)
head(smb3,n=10)

##    gear yearcap fish agecap lencap LCat LenCat25 LenPsd LenPsd2
## 1     E    1988    5      1     71   70       50      0    zero
## 2     E    1988    3      1     64   60       50      0    zero
## 3     E    1988    2      1     57   50       50      0    zero
## 4     E    1988    4      1     68   60       50      0    zero
## 5     E    1988    6      1     72   70       50      0    zero
## 6     E    1988    7      1     80   80       75      0    zero
## 7     E    1989   50      1     55   50       50      0    zero
## 8     E    1990  482      1     75   70       75      0    zero
## 9     E    1990  768      1     75   70       75      0    zero
## 10    E    1990  428      1     71   70       50      0    zero

This functionality is particularly useful when you want to create a new data.frame from the old data.frame but with the appended length category variable.


