Posted Question for R Users

January 21, 2011

(This article was first published on Econometric Sense, and kindly contributed to R-bloggers)

I recently undertook a project where a colleague had about 12 .csv files that they wanted to merge. Each file had a common (key) variable ‘Partner’ (which is trading partner) with differing columns (variables) except for the common key variable. Actually, the first column (Partner) was the same in each data set, with each data set having 211 rows. The remaining columns were unique to each data set. I required a combined data set.

In SAS (PROC SQL) this would be a matter of a series of left joins. Working in the R environment, I executed what was equivalent to left joins via the ‘merge’ function. However this was tedious, only being able to join 2 tables at a time. In SAS I could use the merge function, which would allow me to merge all 12 tables in 1 data step.

Does anyone know of a better way to do this in R, as opposed to my merge statements? (see sample code below)

R Merge Statements:

# -------------------------------------------------------------------
#| merge data sets with R merge function
# ------------------------------------------------------------------

# left join b_dat onto a_dat on variable Partner
ab <- merge(a_dat,b_dat, by.a_dat = Partner, by.b_dat =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
# left join c_dat onto ab on variable Partner
a_c <- merge(ab,c_dat, by.ab = Partner, by.c_dat =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
# I have about 10 more data sets to left join with a_c, is there a better way to join these
# in R as opposed to pairwise merges like above?

I found some help on r-wiki, it merges all 3 data sets at once, but gives me extra redundant columns with .x and .y appended to their names. I’m not sure about these results.

my.list <- list(a_dat, b_dat, c_dat)

DF <- a_dat
for ( .df in my.list ) {
DF <-merge(DF,.df,by.x="Partner", by.y="Partner", all = FALSE, all.x = TRUE, all.y = FALSE)

SAS – similar code for 3 data sets DAT_A, DAT_B, DAT_C

IF B AND C; /* Actually its been so long since */
RUN; /* I have used a data step vs PROC SQL I’m not sure this*/
/* statement gives me a left join, but I’m not sure I can */
/* do any better than 2 at a time left joins in PROC SQL */
/* so that would not be any better than my R code above */

To leave a comment for the author, please follow the link and comment on their blog: Econometric Sense. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: ,

Comments are closed.


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)