Union Multiple Data.Frames with Different Column Names

September 22, 2018
By

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

On Friday, while working on a project that I needed to union multiple data.frames with different column names, I realized that the base::rbind() function doesn’t take data.frames with different columns names and therefore just quickly drafted a rbind2() function on the fly to get the job done based on the idea of MapReduce that I discussed before (https://statcompute.wordpress.com/2018/09/08/playing-map-and-reduce-in-r-subsetting).

rbind2 <- function(lst) {
  h <- unique(unlist(lapply(lst, names)))
  Reduce(rbind, parallel::mcMap(function(x) {x[, setdiff(h, names(x))] <- NA; return(x)}, lst, mc.cores = length(lst)))
}

On Saturday, when I revisited the problem, I found a very good thread on the stackoverflow (https://stackoverflow.com/questions/3402371/combine-two-data-frames-by-rows-rbind-when-they-have-different-sets-of-columns) discussing various approaches addressing my problem yesterday. Out of curiosity, I did a comparison between the rbind2() and discussed approaches by combining 8 data.frames each with a million records. As shown in the plot, my homebrew rbind2() function is only marginally faster than the gtools::smartbind() function and the dplyr::bind_rows function is the most efficient.

n <- 1000000
d1 <- data.frame(id = 1:n, x1 = 1)
d2 <- data.frame(id = 1:n, x2 = 2)
d3 <- data.frame(id = 1:n, x3 = 3)
d4 <- data.frame(id = 1:n, x4 = 4)
d5 <- data.frame(id = 1:n, x5 = 5)
d6 <- data.frame(id = 1:n, x6 = 6)
d7 <- data.frame(id = 1:n, x7 = 7)
d8 <- data.frame(id = 1:n, x8 = 8)
microbenchmark::microbenchmark(times = 10, 
  "homebrew::rbind2"      = {rbind2(list(d1, d2, d3, d4, d5, d6, d7, d8))},
  "gtools::smartbind"     = {gtools::smartbind(list = list(d1, d2, d3, d4, d5, d6, d7, d8))},
  "dplyr::bind_rows"      = {dplyr::bind_rows(d1, d2, d3, d4, d5, d6, d7, d8)},
  "plyr::rbind.fill"      = {plyr::rbind.fill(d1, d2, d3, d4, d5, d6, d7, d8)},
  "data.table::rbindlist" = {data.table::rbindlist(list(d1, d2, d3, d4, d5, d6, d7, d8), fill = T)}
)

Rplot

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)