Posted Question for R Users

Posted on January 21, 2011 by Matt Bogard in R bloggers | 0 Comments

[This article was first published on Econometric Sense, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I recently undertook a project where a colleague had about 12 .csv files that they wanted to merge. Each file had a common (key) variable ‘Partner’ (which is trading partner) with differing columns (variables) except for the common key variable. Actually, the first column (Partner) was the same in each data set, with each data set having 211 rows. The remaining columns were unique to each data set. I required a combined data set.

In SAS (PROC SQL) this would be a matter of a series of left joins. Working in the R environment, I executed what was equivalent to left joins via the ‘merge’ function. However this was tedious, only being able to join 2 tables at a time. In SAS I could use the merge function, which would allow me to merge all 12 tables in 1 data step.

Does anyone know of a better way to do this in R, as opposed to my merge statements? (see sample code below)

R Merge Statements:

# -------------------------------------------------------------------
#|  merge data sets with R merge function              
# ------------------------------------------------------------------
 
#  left join b_dat onto a_dat on variable Partner
 
 ab <- merge(a_dat,b_dat, by.a_dat = Partner, by.b_dat =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
 
# left join c_dat onto ab on variable Partner
 
 a_c <- merge(ab,c_dat, by.ab = Partner, by.c_dat =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
 
# I have about 10 more data sets to left join with a_c, is there a better way to join these 
# in R as opposed to pairwise merges like above?

I found some help on r-wiki, it merges all 3 data sets at once, but gives me extra redundant columns with .x and .y appended to their names. I'm not sure about these results.

my.list <- list(a_dat, b_dat, c_dat)
 
DF <- a_dat
for ( .df in my.list ) {
  DF <-merge(DF,.df,by.x="Partner", by.y="Partner", all = FALSE, all.x = TRUE, all.y = FALSE)
 }
 
dim(DF)

SAS - similar code for 3 data sets DAT_A, DAT_B, DAT_C

DATA ALL_COUNTRIES;
MERGE DAT_A (IN=A) DAT_B (IN=B)DAT_C (IN=C);
BY PARTNER;
IF B AND C; /* Actually its been so long since */
RUN; /* I have used a data step vs PROC SQL I'm not sure this*/
/* statement gives me a left join, but I'm not sure I can */
/* do any better than 2 at a time left joins in PROC SQL */
/* so that would not be any better than my R code above */

To leave a comment for the author, please follow the link and comment on their blog: Econometric Sense.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Posted Question for R Users

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)