Merge Relational Dataframes

November 11, 2013
By

(This article was first published on fishR » R, and kindly contributed to R-bloggers)

A Question

Recently, a student, working on her Senior Thesis at Northland College, asked me the following question:

Attached is an Excel file with three “important to R” worksheets. The only thing that connects all 3 worksheets is the Lift.ID variable. Is there a way to tell R this so that I don’t have to paste all three worksheets onto one new worksheet and then manually format the data so that there is one Lift.ID that connects the Count sheet info to the Lift sheet info?

An Answer

Or course, the best answer to this (type of) question would be to teach this student how to use a relational database and generate a query. Getting that into our curriculum is a discussion for another time (and venue).

However, as these data were fairly simple, her question can be completed in R with merge(). I demonstrate this below with two practice data frames that each have a Lift.ID variable. To test behavior, df2 only had one species for lift.ID==5 and neither species for lift.ID==6. The data frames were constructed with (again, these are just example data, real data would be imported from an external file)

( df1 <- data.frame(Lift.ID=1:6,
                    eff=c(3,4,2,5,1,1),
                    loc=c("A","B","A","C","B","D")) )
## Lift.ID eff loc
## 1 1 3 A
## 2 2 4 B
## 3 3 2 A
## 4 4 5 C
## 5 5 1 B
## 6 6 1 D
( df2 <- data.frame(Lift.ID=c(rep(1:4,each=2),5),
                    species=c(rep(c("LKT","PWF"),4),"LKT"),
                    catch=round(rnorm(9,mean=20,sd=3),0)))
## Lift.ID species catch
## 1 1 LKT 14
## 2 1 PWF 21
## 3 2 LKT 14
## 4 2 PWF 20
## 5 3 LKT 15
## 6 3 PWF 19
## 7 4 LKT 21
## 8 4 PWF 18
## 9 5 LKT 23

These two data frames can be joined to form one data frame with merge() where the first two arguments are the two data frames (the first one is considered to be the “x” data frame) and the by= argument contains the name of the variable that links the data frames. For example,

( df.comb1 <- merge(df1,df2,by="Lift.ID") )
## Lift.ID eff loc species catch
## 1 1 3 A LKT 14
## 2 1 3 A PWF 21
## 3 2 4 B LKT 14
## 4 2 4 B PWF 20
## 5 3 2 A LKT 15
## 6 3 2 A PWF 19
## 7 4 5 C LKT 21
## 8 4 5 C PWF 18
## 9 5 1 B LKT 23

Thus, for example, the catch-per-unit-effort (CPE) for each lift could be appended with

df.comb1$CPE <- df.comb1$catch/df.comb1$eff
df.comb1
## Lift.ID eff loc species catch CPE
## 1 1 3 A LKT 14 4.667
## 2 1 3 A PWF 21 7.000
## 3 2 4 B LKT 14 3.500
## 4 2 4 B PWF 20 5.000
## 5 3 2 A LKT 15 7.500
## 6 3 2 A PWF 19 9.500
## 7 4 5 C LKT 21 4.200
## 8 4 5 C PWF 18 3.600
## 9 5 1 B LKT 23 23.000

A Potential Problem

A potential problem exists if the student wanted to computed mean CPE by species. For species=="LKT" the mean will not include a zero CPE for Lift.ID==6 and for species=="PWF" the mean will not include zeroes for each of the last two lifts.

This problem can be corrected by first including the all.x=TRUE argument to merge() such that each row (lift) in the “x” data frame will be included in the combined data frame whether or not there is a corresponding row in the “y” data frame. In other words, information about Lift.ID==6 needs to be in the combined data frame. For example, compare the following result of merge() to the previous result.

( df.comb2 <- merge(df1,df2,by="Lift.ID",all.x=TRUE) )
##    Lift.ID eff loc species catch
## 1        1   3   A     LKT    14
## 2        1   3   A     PWF    21
## 3        2   4   B     LKT    14
## 4        2   4   B     PWF    20
## 5        3   2   A     LKT    15
## 6        3   2   A     PWF    19
## 7        4   5   C     LKT    21
## 8        4   5   C     PWF    18
## 9        5   1   B     LKT    23
## 10       6   1   D     NA     NA

However, this still does not completely solve the problem because there are no “zeroes” for catches. The “zeroes” can be added to the dataframe with addZeroCatch() from the FSA package. First, load the FSA package

library(FSA)

The addZeroCatch() function requires the original data frame as the first argument, the name of the variable that defines the sampling event (i.e., a net lift) in the eventvar= argument, the variabile that identifies the species caught in the speciesvar= argument, and the variable that should receive the zeroes in the zerovar= argument. For example,

( df.comb2 <- addZeroCatch(df.comb2,eventvar="Lift.ID",
                           specvar="species",zerovar="catch") )
##     Lift.ID eff loc species catch
## 1         1   3   A     LKT    14
## 2         1   3   A     PWF    21
## 3         2   4   B     LKT    14
## 4         2   4   B     PWF    20
## 5         3   2   A     LKT    15
## 6         3   2   A     PWF    19
## 7         4   5   C     LKT    21
## 8         4   5   C     PWF    18
## 9         5   1   B     LKT    23
## 10        6   1   D     NA     NA
## 102       6   1   D     LKT     0
## 91        5   1   B     PWF     0
## 101       6   1   D     PWF     0

The last remaining problem is the row with the NAs that was created when the all.x=TRUE= argument was used (see row 10 above). This row (and if there were more rows like it) can be removed as follows

( df.comb2 <- subset(df.comb2,!is.na(catch)) )
##     Lift.ID eff loc species catch
## 1         1   3   A     LKT    14
## 2         1   3   A     PWF    21
## 3         2   4   B     LKT    14
## 4         2   4   B     PWF    20
## 5         3   2   A     LKT    15
## 6         3   2   A     PWF    19
## 7         4   5   C     LKT    21
## 8         4   5   C     PWF    18
## 9         5   1   B     LKT    23
## 102       6   1   D     LKT     0
## 91        5   1   B     PWF     0
## 101       6   1   D     PWF     0

Now the summary statistics for CPE for each species can be computed (note that Summarize() is from FSA).

df.comb2$CPE <- df.comb2$catch/df.comb2$eff
( Summarize(CPE~species,data=df.comb2,digits=2) )
##   species n mean   sd min   Q1 median   Q3  max percZero
## 1     LKT 6 7.14 8.13   0 3.68   4.43 6.79 23.0    16.67
## 2     PWF 6 4.18 3.80   0 0.90   4.30 6.50  9.5    33.33

Compare these to the incorrect values from the first use of merge().

( Summarize(CPE~species,data=df.comb1,digits=2) )
##   species n mean   sd min   Q1 median   Q3  max percZero
## 1     LKT 5 8.57 8.21 3.5 4.20   4.67 7.50 23.0        0
## 2     PWF 4 6.28 2.56 3.6 4.65   6.00 7.62  9.5        0

Filed under: Fisheries Science, R Tagged: Data Manipulation, R

To leave a comment for the author, please follow the link and comment on their blog: fishR » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)