Yesterday I launched my first question at Stackoverflow and apparently did a lot of things wrong as I managed to get my question closed wihtin hours
I had collected 9 different solutions to the problem and made the mistake to put it all within the original question space. So people complained and told me that such a collection belongs into a blog and not on Stackoverflow. Hence, I decided to go ahead and finally make my first blogging steps! I’ll probably still miss out on a lot of cool technical stuff on the blog, so please bear with me.
So here’s the thing:
Problem
How to identify records/rows in data frame x.1 that are not contained in data frame x.2 based on all attributes available (i.e. all columns) in the most efficient way?
Example Data
> x.1 <- data.frame(a=c(1,2,3,4,5), b=c(1,2,3,4,5)) > x.1 a b 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 > x.2 <- data.frame(a=c(1,1,2,3,4), b=c(1,1,99,3,4)) > x.2 a b 1 1 1 2 1 1 3 2 99 4 3 3 5 4 4 # BENCHMARK SETTINGS require(microbenchmark) to.sec <- 1000000000
Desired Result
a b 2 2 2 5 5 5
Best Solution So Far
by Prof. Brian Ripley and Gabor Grothendieck
> fun.12 <- function(x.1,x.2,...){
+ x.1p <- do.call("paste", x.1)
+ x.2p <- do.call("paste", x.2)
+ x.1[! x.1p %in% x.2p, ]
+ }
> fun.12(x.1,x.2)
a b
2 2 2
5 5 5
> sol.12 <- microbenchmark(fun.12(x.1,x.2))
> sol.12 <- median(sol.12$time)/to.sec
> sol.12
> [1] 0.0002059665
Efficiency Comparison
> sol.scope <- 1:13
> comp <- data.frame(lapply(sol.scope, function(x){
+ eval(substitute(get(SOL), list(SOL=paste("sol.", x, sep=""))))
+ }))
> names(comp) <- paste("solution", sol.scope)
> comp <- as.data.frame(t(comp[order(comp[1,])]))
> colnames(comp) <- "time"
> comp
time time.rel
solution 12 0.0002080150 1.000000
solution 3 0.0002548310 1.225061
solution 1 0.0004656785 2.238677
solution 10 0.0006398950 3.076196
solution 6 0.0007878430 3.787434
solution 8 0.0010459795 5.028385
solution 11 0.0021617355 10.392210
solution 9 0.0025755710 12.381660
solution 7 0.0103444610 49.729399
solution 13 0.0211265200 101.562483
solution 5 0.0225685395 108.494770
solution 2 NA NA
solution 4 NA NA
Here are all the solutions I collected so far:
Solution 1
by Chase
> fun.1 <- function(x.1,x.2,...){
+ expr <- paste("subset(x.1,", paste(sapply(names(x.1), function(x){
+ paste("!(", x, " %in% x.2$", x, ")", sep="")
+ }), collapse=" | "), ")")
+ eval(parse(text=expr))
+ }
> fun.1(x.1,x.2)
a b
2 2 2
5 5 5
> sol.1 <- microbenchmark(fun.1(x.1,x.2))
> sol.1 <- median(sol.1$time)/to.sec
> sol.1
[1] 0.0004656785
Solution 2 (rather just an approach)
by Ramnath
> setdiff(x.1$a, x.2$a) # elements in x.1$a NOT in x.2$a [1] 5 > setdiff(x.2$a, x.1$a) # elements in x.2$a NOT in x.1$a numeric(0) > sol.2 <- microbenchmark(setdiff(x.1$a, x.2$a)) > sol.2 <- median(sol.2$time)/to.sec > sol.2 [1] 4.03865e-05 # This helps, but as it does not directly provide the solution, we will set this to 'NA' sol.2 <- NA
Solution 3
by Prof. Brian Ripley
> fun.3 <- function(x.1, x.2, ...){
+ x.1.id <- do.call("paste", c(x.1, sep = "\r"))
+ x.2.id <- do.call("paste", c(x.2, sep = "\r"))
+ x.1[match(setdiff(x.1.id, x.2.id),x.1.id), ]
+ }
> fun.3(x.1,x.2)
a b
1 2 2
2 5 5
> sol.3 <- microbenchmark(fun.3(x.1,x.2))
> sol.3 <- median(sol.3$time)/to.sec
> sol.3
[1] 0.000254831
Solution 4 (rather just an approach)
by me
> fun.4 <- function(x.1, x.2, ...){
+ # Combine
+ df <- rbind(x.1,x.2)
+ df <- df[order(df[,1]),]
+ # Find duplicates
+ idx.1 <- duplicated(df, all=TRUE)
+ idx.2 <- duplicated(df, fromLast=TRUE)
+ idx <- cbind(idx.1, idx.2)
+ idx <- apply(idx, MARGIN=1, any)
+ # Index records
+ df[-which(idx),]
+ }
> fun.4(x.1,x.2)
a b
1 2 2
8 2 99
2 5 5
> sol.4 <- microbenchmark(fun.4(x.1,x.2))
> sol.4 <- median(sol.4$time)/to.sec
> sol.4
[1] 0.001062621
# As it does not match the desired records, we set this to 'NA'
sol.4 <- NA
Solution 5
base on solution by Gabor Grothendieck
> library(sqldf)
> fun.5 <- function(x1,x2,...){
+ out <- sqldf(
+ "SELECT * FROM x1
+ WHERE
+ x1.a NOT IN (SELECT x2.a FROM x2) OR
+ x1.b NOT IN (SELECT x2.b FROM x2)"
+ )
+ out
+ }
> fun.5(x1=x.1,x2=x.2)
a b
1 2 2
2 5 5
> sol.5 <- microbenchmark(fun.5(x1=x.1,x2=x.2))
> sol.5 <- median(sol.5$time)/to.sec
> sol.5
[1] 0.02256854
Solution 6
by Tal Galili
> fun.6 <- function(x.1,x.2){
+ x.1.vec <- apply(x.1, 1, paste, collapse = "")
+ x.2.vec <- apply(x.2, 1, paste, collapse = "")
+ x.1.without.x.2.rows <- x.1[!x.1.vec %in% x.2.vec,]
+ return(x.1.without.x.2.rows)
+ }
> fun.6(x.1,x.2)
a b
1 2 2
2 5 5
> sol.6 <- microbenchmark(fun.6(x.1,x.2))
> sol.6 <- median(sol.6$time)/to.sec
> sol.6
[1] 0.000787843
Solution 7
by nullglob
# COULD NOT REPRODUCE RESULTS WITH MY DATA
> fun.7 <- function(x.1,x.2,...){
+ a1 <- data.frame(a = 1:5, b = letters[1:5])
+ a2 <- data.frame(a = 1:3, b = letters[1:3])
+ comparison <- compare(a1,a2,allowAll=TRUE)
+ comparison$tM
+ difference <-
+ data.frame(lapply(1:ncol(a1),function(i)setdiff(a1[,i],comparison$tM[,i])))
+ colnames(difference) <- colnames(a1)
+ difference
+ }
> fun.7(x.1,x.2)
a b
1 4 d
2 5 e
> ### DISCUSSION ###
> # 1) Effectiveness
> # Could not reproduce results with my data frames.
> # 2) Efficiency
> sol.7 <- microbenchmark(fun.7(x.1,x.2))
> sol.7 <- median(sol.7$time)/to.sec
> sol.7
[1] 0.01034446
Solution 8
by Henrico
# Derived from src/library/base/R/merge.R
# Part of the R package, http://www.R-project.org
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# A copy of the GNU General Public License is available at
# http://www.r-project.org/Licenses/
> XinY <-
+ function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
+ notin = FALSE, incomparables = NULL,
+ ...)
+ {
+ fix.by <- function(by, df)
+ {
+ ## fix up 'by' to be a valid set of cols by number: 0 is row.names
+ if(is.null(by)) by <- numeric(0L)
+ by <- as.vector(by)
+ nc <- ncol(df)
+ if(is.character(by))
+ by <- match(by, c("row.names", names(df))) - 1L
+ else if(is.numeric(by)) {
+ if(any(by < 0L) || any(by > nc))
+ stop("'by' must match numbers of columns")
+ } else if(is.logical(by)) {
+ if(length(by) != nc) stop("'by' must match number of columns")
+ by <- seq_along(by)[by]
+ } else stop("'by' must specify column(s) as numbers, names or logical")
+ if(any(is.na(by))) stop("'by' must specify valid column(s)")
+ unique(by)
+ }
+ nx <- nrow(x <- as.data.frame(x)); ny <- nrow(y <- as.data.frame(y))
+ by.x <- fix.by(by.x, x)
+ by.y <- fix.by(by.y, y)
+ if((l.b <- length(by.x)) != length(by.y))
+ stop("'by.x' and 'by.y' specify different numbers of columns")
+ if(l.b == 0L) {
+ ## was: stop("no columns to match on")
+ ## returns x
+ x
+ }
+ else {
+ if(any(by.x == 0L)) {
+ x <- cbind(Row.names = I(row.names(x)), x)
+ by.x <- by.x + 1L
+ }
+ if(any(by.y == 0L)) {
+ y <- cbind(Row.names = I(row.names(y)), y)
+ by.y <- by.y + 1L
+ }
+ ## create keys from 'by' columns:
+ if(l.b == 1L) { # (be faster)
+ bx <- x[, by.x]; if(is.factor(bx)) bx <- as.character(bx)
+ by <- y[, by.y]; if(is.factor(by)) by <- as.character(by)
+ } else {
+ ## Do these together for consistency in as.character.
+ ## Use same set of names.
+ bx <- x[, by.x, drop=FALSE]; by <- y[, by.y, drop=FALSE]
+ names(bx) <- names(by) <- paste("V", seq_len(ncol(bx)), sep="")
+ bz <- do.call("paste", c(rbind(bx, by), sep = "\r"))
+ bx <- bz[seq_len(nx)]
+ by <- bz[nx + seq_len(ny)]
+ }
+ comm <- match(bx, by, 0L)
+ if (notin) {
+ res <- x[comm == 0,]
+ } else {
+ res <- x[comm > 0,]
+ }
+ }
+ ## avoid a copy
+ ## row.names(res) <- NULL
+ attr(res, "row.names") <- .set_row_names(nrow(res))
+ res
+ }
> XnotinY <-
+ function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
+ notin = TRUE, incomparables = NULL,
+ ...)
+ {
+ XinY(x,y,by,by.x,by.y,notin,incomparables)
+ }
> fun.8 <- XnotinY
> fun.8(x.1,x.2)
a b
1 2 2
2 5 5
> sol.8 <- microbenchmark(fun.8(x.1,x.2))
> sol.8 <- median(sol.8$time)/to.sec
> sol.8
[1] 0.00104598
Solution 9
by Tomas T.
> fun.9 <- function(x.1,x.2,...){
+ tmp = merge(x.1, cbind(x.2, q=1:nrow(x.2)), all.x = TRUE)
+ # provided that there's no column q in both dataframes
+ tmp[is.na(tmp$q), 1:ncol(x.1)] # the result
+ }
> fun.9(x.1,x.2)
a b
1 2 2
2 5 5
> sol.9 <- microbenchmark(fun.9(x.1,x.2))
> sol.9 <- median(sol.9$time)/to.sec
> sol.9
[1] 0.002575571
Solution 10
> fun.10 <- function(x.1,x.2,...){
+ x.1[!duplicated(rbind(x.2, x.1))[-(1:nrow(x.2))],]
+ }
> fun.10(x.1,x.2)
a b
1 2 2
2 5 5
> sol.10 <- microbenchmark(fun.10(x.1,x.2))
> sol.10 <- median(sol.10$time)/to.sec
> sol.10
[1] 0.000639895
Solution 11
> fun.11 <- function(x.1,x.2,...){
+ do.call("rbind", setdiff(split(x.1, rownames(x.1)), split(x.2, rownames(x.2))))
+ }
> fun.11(x.1,x.2)
a b
1 2 2
2 5 5
> sol.11 <- microbenchmark(fun.11(x.1,x.2))
> sol.11 <- median(sol.11$time)/to.sec
> sol.11
[1] 0.002161735
Solution 12
> fun.12 <- function(x.1,x.2,...){
+ x.1p <- do.call("paste", x.1)
+ x.2p <- do.call("paste", x.2)
+ x.1[! x.1p %in% x.2p, ]
+ }
> fun.12(x.1,x.2)
a b
1 2 2
2 5 5
> sol.12 <- microbenchmark(fun.12(x.1,x.2))
> sol.12 <- median(sol.12$time)/to.sec
> sol.12
[1] 0.000208015
Solution 13
> library(sqldf)
> fun.13 <- function(x.1,x.2,...){
+ sqldf("select * from `x.1` except select * from `x.2`")
+ }
> fun.13(x.1,x.2)
a b
1 2 2
2 5 5
> sol.13 <- microbenchmark(fun.13(x.1,x.2))
> sol.13 <- median(sol.13$time)/to.sec
> sol.13
[1] 0.02112652
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Zero Inflated Models and Generalized Linear Mixed Models with R.
Zuur, Saveliev, Ieno (2012).