Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

## A Question

Recently, a student, working on her Senior Thesis at Northland College, asked me the following question:

Attached is an Excel file with three “important to R” worksheets. The only thing that connects all 3 worksheets is the Lift.ID variable. Is there a way to tell R this so that I don’t have to paste all three worksheets onto one new worksheet and then manually format the data so that there is one Lift.ID that connects the Count sheet info to the Lift sheet info?

Or course, the best answer to this (type of) question would be to teach this student how to use a relational database and generate a query. Getting that into our curriculum is a discussion for another time (and venue).

However, as these data were fairly simple, her question can be completed in R with `merge()`. I demonstrate this below with two practice data frames that each have a Lift.ID variable. To test behavior, `df2` only had one species for `lift.ID==5` and neither species for `lift.ID==6`. The data frames were constructed with (again, these are just example data, real data would be imported from an external file)

```( df1 <- data.frame(Lift.ID=1:6,
eff=c(3,4,2,5,1,1),
loc=c("A","B","A","C","B","D")) )

## Lift.ID eff loc
## 1 1 3 A
## 2 2 4 B
## 3 3 2 A
## 4 4 5 C
## 5 5 1 B
## 6 6 1 D
( df2 <- data.frame(Lift.ID=c(rep(1:4,each=2),5),
species=c(rep(c("LKT","PWF"),4),"LKT"),
catch=round(rnorm(9,mean=20,sd=3),0)))

## Lift.ID species catch
## 1 1 LKT 14
## 2 1 PWF 21
## 3 2 LKT 14
## 4 2 PWF 20
## 5 3 LKT 15
## 6 3 PWF 19
## 7 4 LKT 21
## 8 4 PWF 18
## 9 5 LKT 23```

These two data frames can be joined to form one data frame with `merge()` where the first two arguments are the two data frames (the first one is considered to be the “x” data frame) and the `by=` argument contains the name of the variable that links the data frames. For example,

```( df.comb1 <- merge(df1,df2,by="Lift.ID") )

## Lift.ID eff loc species catch
## 1 1 3 A LKT 14
## 2 1 3 A PWF 21
## 3 2 4 B LKT 14
## 4 2 4 B PWF 20
## 5 3 2 A LKT 15
## 6 3 2 A PWF 19
## 7 4 5 C LKT 21
## 8 4 5 C PWF 18
## 9 5 1 B LKT 23```

Thus, for example, the catch-per-unit-effort (CPE) for each lift could be appended with

```df.comb1\$CPE <- df.comb1\$catch/df.comb1\$eff
df.comb1

## Lift.ID eff loc species catch CPE
## 1 1 3 A LKT 14 4.667
## 2 1 3 A PWF 21 7.000
## 3 2 4 B LKT 14 3.500
## 4 2 4 B PWF 20 5.000
## 5 3 2 A LKT 15 7.500
## 6 3 2 A PWF 19 9.500
## 7 4 5 C LKT 21 4.200
## 8 4 5 C PWF 18 3.600
## 9 5 1 B LKT 23 23.000```

## A Potential Problem

A potential problem exists if the student wanted to computed mean CPE by species. For `species=="LKT"` the mean will not include a zero CPE for `Lift.ID==6` and for `species=="PWF"` the mean will not include zeroes for each of the last two lifts.

This problem can be corrected by first including the `all.x=TRUE` argument to `merge()` such that each row (lift) in the “x” data frame will be included in the combined data frame whether or not there is a corresponding row in the “y” data frame. In other words, information about `Lift.ID==6` needs to be in the combined data frame. For example, compare the following result of `merge()` to the previous result.

```( df.comb2 <- merge(df1,df2,by="Lift.ID",all.x=TRUE) )

##    Lift.ID eff loc species catch
## 1        1   3   A     LKT    14
## 2        1   3   A     PWF    21
## 3        2   4   B     LKT    14
## 4        2   4   B     PWF    20
## 5        3   2   A     LKT    15
## 6        3   2   A     PWF    19
## 7        4   5   C     LKT    21
## 8        4   5   C     PWF    18
## 9        5   1   B     LKT    23
## 10       6   1   D     NA     NA```

However, this still does not completely solve the problem because there are no “zeroes” for catches. The “zeroes” can be added to the dataframe with `addZeroCatch()` from the FSA package. First, load the FSA package

```library(FSA)
```

The `addZeroCatch()` function requires the original data frame as the first argument, the name of the variable that defines the sampling event (i.e., a net lift) in the `eventvar=` argument, the variabile that identifies the species caught in the `speciesvar=` argument, and the variable that should receive the zeroes in the `zerovar=` argument. For example,

```( df.comb2 <- addZeroCatch(df.comb2,eventvar="Lift.ID",
specvar="species",zerovar="catch") )

##     Lift.ID eff loc species catch
## 1         1   3   A     LKT    14
## 2         1   3   A     PWF    21
## 3         2   4   B     LKT    14
## 4         2   4   B     PWF    20
## 5         3   2   A     LKT    15
## 6         3   2   A     PWF    19
## 7         4   5   C     LKT    21
## 8         4   5   C     PWF    18
## 9         5   1   B     LKT    23
## 10        6   1   D     NA     NA
## 102       6   1   D     LKT     0
## 91        5   1   B     PWF     0
## 101       6   1   D     PWF     0```

The last remaining problem is the row with the `NA`s that was created when the `all.x=TRUE=` argument was used (see row 10 above). This row (and if there were more rows like it) can be removed as follows

```( df.comb2 <- subset(df.comb2,!is.na(catch)) )

##     Lift.ID eff loc species catch
## 1         1   3   A     LKT    14
## 2         1   3   A     PWF    21
## 3         2   4   B     LKT    14
## 4         2   4   B     PWF    20
## 5         3   2   A     LKT    15
## 6         3   2   A     PWF    19
## 7         4   5   C     LKT    21
## 8         4   5   C     PWF    18
## 9         5   1   B     LKT    23
## 102       6   1   D     LKT     0
## 91        5   1   B     PWF     0
## 101       6   1   D     PWF     0```

Now the summary statistics for CPE for each species can be computed (note that `Summarize()` is from FSA).

```df.comb2\$CPE <- df.comb2\$catch/df.comb2\$eff
( Summarize(CPE~species,data=df.comb2,digits=2) )

##   species n mean   sd min   Q1 median   Q3  max percZero
## 1     LKT 6 7.14 8.13   0 3.68   4.43 6.79 23.0    16.67
## 2     PWF 6 4.18 3.80   0 0.90   4.30 6.50  9.5    33.33```

Compare these to the incorrect values from the first use of `merge()`.

```( Summarize(CPE~species,data=df.comb1,digits=2) )

##   species n mean   sd min   Q1 median   Q3  max percZero
## 1     LKT 5 8.57 8.21 3.5 4.20   4.67 7.50 23.0        0
## 2     PWF 4 6.28 2.56 3.6 4.65   6.00 7.62  9.5        0```

Filed under: Fisheries Science, R Tagged: Data Manipulation, R