Merging Multiple Data Frames in R

January 23, 2011
By

(This article was first published on Econometric Sense, and kindly contributed to R-bloggers)

Earlier I had a problem that required merging 3 years of trade data, with about 12 csv files per year. Merging all of these data sets with pairwise left joins using the R merge statement worked (especially after correcting some errors pointed out by Hadley Wickham However, in both my hobby hacking and on the job, I was curious if there might be a better way to do this than countless sets of merge statements (not to mention the multiple lines of code required for reading in the csv files)

So, I sent a tweet to the #rstats followers with a link to where I posted the problem on this blog to see if I could get a hint. (twitter has been a very valuable networking tool for me, I've learned a lot about data mining, machine learning, and R from the tweet-stream. Tweets and blog posts from people like Hadley Wickham, Drew Conway, and J.D. Long have been tremendously helpful to me as I've taken up R.

Back to the topic at hand, below is my new code, based on suggestions from Hadley Wickham and a comment (from Harlan) that lead me to some answers to a similar question on stack overflow. The code below requires he reshape library as well as plyr, which I should mention appears to have been created by Hadley Wickham himself.

# read all Y2000 csv files
 
filenames <- list.files(path="/Users/wkuuser/Desktop/R Data Sets/TRADE_DATA/TempData00", full.names=TRUE)
import.list <- llply(filenames, read.csv)
 
# left join all Y2000 csv files
 
AllData00 <- Reduce(function(x, y) merge(x, y, all=FALSE,by.x="Reporting.Countries.Partner.Countries",by.y="Reporting.Countries.Partner.Countries",all.x =TRUE, all.y =FALSE),import.list,accumulate=F)
 
dim(AllData00) # n = 211 211
 
# rename common key variable to something less awkward and change World to World00
 
AllData00 <- rename(AllData00, c(Reporting.Countries.Partner.Countries="Partner", World = "World00"))
names(AllData00) # list all variable names
 
# keep only the partner name variable and total world trade
 
AllData00 <-AllData00[c("Partner","World00")]
 
dim(AllData00) # data dimensions
names(AllData00) # variable names
fix(AllData00) # view in data editor
Created by Pretty R at inside-R.org

That  pretty well gives me the data set I need, for year 2000 data. I repeated the process for 2004 and 2008 data sets I had and then merged them with left joins to get the final data set. All I am after at this point is the total world trade for each of the countries/groups listed. This could probably be made even more efficient, but is is a lot less coding than what I initially used for the project. (see below- and this doesn't even include some of the renaming and sub-setting functions I performed above) And, this process would have to be repeated 2 more times for 2004 and 2008 data. To say that the above code is much more efficient is an understatement. (note the code below actually contains some mistakes as Hadley Wickham pointed out. For instance, in the merge statement, I have by.a and by.b, or by.'dataset', while in every case it should be by.x and by.y. I guess x and y are alias's for the data sets being merged, sort of like a and b would be in SQL, if you were to say:

create table newdataset as
select *
from dat1 a left join dat2 b
on a.partner=b.partner

So, I'm not sure why my code even worked to begin with. I do realize that instead of the merge statement in R I could have used the sqldf package in R, but I have had issues with my mac crashing when I try to load the library. Still, I don't think SQL would have made things any better, as I would still be doing a series of left joins vs. the more compact code using the reduce function in R. I've used sqldf in a windows environment before and it worked great by the way.

The code below first reads in each data file individually, and then executes the endless number of left joins to give me the same data set I got above with a fraction of the amount of required code. 

#  a
a <- read.csv("X_A.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
names(a)
 
a <- rename(a, c(Reporting.Countries.Partner.Countries="Partner"))
names(a)
dim(a)
 
# b
b <- read.csv("X_B.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
names(b)
 
b <- rename(b, c(Reporting.Countries.Partner.Countries="Partner"))
names(b)
dim(b)
 
# c
c <- read.csv("X_C.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
names(c)
 
c <- rename(c, c(Reporting.Countries.Partner.Countries="Partner"))
names(c)
dim(c)
 
# de
de <- read.csv("X_DE.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
names(de)
 
de <- rename(de, c(Reporting.Countries.Partner.Countries="Partner"))
names(de)
dim(de)
 
# fgh
fgh <- read.csv("X_FGH.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
names(fgh)
 
fgh <- rename(fgh, c(Reporting.Countries.Partner.Countries="Partner"))
names(fgh)
dim(fgh)
 
# ijk
ijk <- read.csv("X_IJK.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
names(c)
 
ijk <- rename(ijk, c(Reporting.Countries.Partner.Countries="Partner"))
names(ijk)
dim(ijk)
 
# lm
lm <- read.csv("X_LM.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
names(lm)
 
lm <- rename(lm, c(Reporting.Countries.Partner.Countries="Partner"))
names(lm)
dim(lm)
 
# nop
nop <- read.csv("X_NOP.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
names(nop)
 
nop <- rename(nop, c(Reporting.Countries.Partner.Countries="Partner"))
names(nop)
dim(nop)
 
# qr
qr <- read.csv("X_QR.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
names(qr)
 
qr <- rename(qr, c(Reporting.Countries.Partner.Countries="Partner"))
names(qr)
dim(qr)
 
 
# s odd name changed to 'SaloTomaPrincip' manaully in excel
s <- read.csv("X_S.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
names(s)
 
s <- rename(s, c(Reporting.Countries.Partner.Countries="Partner"))
names(s)
dim(s)
 
# tuv
tuv <- read.csv("EX_TUV.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
names(tuv)
 
tuv <- rename(tuv, c(Reporting.Countries.Partner.Countries="Partner"))
names(tuv)
dim(tuv)
 
 
# wxyz
wxyz <- read.csv("X_WXYZ.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
names(wxyz)
 
wxyz <- rename(wxyz, c(Reporting.Countries.Partner.Countries="Partner"))
names(wxyz)
dim(wxyz)
 
 
# ------------------------------------------------------------------
# sequentially left join data sets
# ------------------------------------------------------------------
 
 
# a & b
 
ab <- merge(a,b, by.a = Partner, by.b =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
dim(ab)
names(ab)
14 + 18 - 1 # r = 211 c = 31
 
# abc
abc <- merge(ab,c, by.ab = Partner, by.c =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
dim(abc)
names(abc)
31 + 24 -1 # n = 54
 
# a_e
a_e <- merge(abc,de, by.abc = Partner, by.de =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
dim(a_e)
names(a_e)
54 + 18 -1 # n = 71
 
# a_h
a_h <- merge(a_e,fgh, by.a_e = Partner, by.fgh =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
dim(a_h)
names(a_h)
71 + 23 -1 # n = 93
 
# a_k
a_k <- merge(a_h,ijk, by.a_h = Partner, by.ijk =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
dim(a_k)
names(a_k)
93 + 16 -1 # n = 108
 
# a_m
a_m <- merge(a_k,lm, by.a_k = Partner, by.lm =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
dim(a_m)
names(a_m)
108 + 26 - 1 # n = 133
 
# a_p
a_p <- merge(a_m,nop, by.a_m = Partner, by.nop =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
dim(a_p)
names(a_p)
133 + 20 - 1 # n = 152
 
# a_r
a_r <- merge(a_p,qr, by.a_p = Partner, by.qr =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
dim(a_r)
names(a_r)
152 + 6 -1 # n = 157
 
# a_s
a_s <- merge(a_r,s, by.a_r = Partner, by.s =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
dim(a_s)
names(a_s)
157 + 27 - 1 # n = 183
 
# a_v
a_v <- merge(a_s,tuv, by.a_s = Partner, by.tuv =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
dim(a_v)
names(a_v)
183 + 21 - 1 # n = 203
 
# a_z (complete data set after this merge)
a_z <- merge(a_v,wxyz, by.a_v = Partner, by.wxyz =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
dim(a_z) # n = 211
names(a_z)
Created by Pretty R at inside-R.org

To leave a comment for the author, please follow the link and comment on his blog: Econometric Sense.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: ,

Comments are closed.