**Jason.Bryer.org Blog - R**, and kindly contributed to R-bloggers)

I posted a question over on StackOverflow on an efficient way of comparing two data frames with the same column structure, but with different rows. What I would like to end up with is an *n* x *m* logical matrix where *n* and *m* are the number of rows in the first and second data frames, respectively; and the value at the *i*th row and *j*th column indicates whether all the values from row *i* from data frame one is equal to row *j* from data frame two. To provide some context, this will be used in a propensity score matching algorithm to identify candidate matches that match exactly on any number of covariates. In addition to the approaches I had, joran provided an approach using the `Vectorize`

function (thanks again as I learned another nice function). I decided to put three approaches to a race…

To understand what I need, I’ll start with a small example with two data frames, one with 4 rows, the other with 3, and each has two variables, one logical and the other numeric. As an aside, I only need this to work for integers, factors, characters, and logical types therefore avoiding issues of comparing numerics.

```
> df1 <- data.frame(row.names=1:4, var1=c(TRUE, TRUE, FALSE, FALSE), var2=c(1,2,3,4))
> df2 <- data.frame(row.names=5:7, var1=c(FALSE, TRUE, FALSE), var2=c(5,2,3))
> df1
var1 var2
1 TRUE 1
2 TRUE 2
3 FALSE 3
4 FALSE 4
> df2
var1 var2
5 FALSE 5
6 TRUE 2
7 FALSE 3
```

First, let’s consider the case when there is only one variable:

```
> system.time({
+ df3 <- sapply(df2$var1, FUN=function(x) { x == df1$var1 })
+ dimnames(df3) <- list(row.names(df1), row.names(df2))
+ })
user system elapsed
0 0 0
> df3
5 6 7
1 FALSE TRUE FALSE
2 FALSE TRUE FALSE
3 TRUE FALSE TRUE
4 TRUE FALSE TRUE
```

This is pretty straight forward. Now I want the same type of result, but to compare more than one column (in the final implementation I need to handle any number of columns so not necessarily limited to one or two).

The first approach uses nested apply functions.

```
> system.time({
+ m1 <- t(as.matrix(df1))
+ m2 <- as.matrix(df2)
+ df4 <- apply(m2, 1, FUN=function(x) { apply(m1, 2, FUN=function(y) { all(x == y) } ) })
+ })
user system elapsed
0.001 0.000 0.001
> df4
5 6 7
1 FALSE FALSE FALSE
2 FALSE TRUE FALSE
3 FALSE FALSE TRUE
4 FALSE FALSE FALSE
```

Secondly, using the `Vectorize`

and `outer`

functions.

```
> system.time({
+ foo <- Vectorize(function(x,y) { all(df1[x,] == df2[y,]) })
+ df5 <- outer(1:nrow(df1), 1:nrow(df2), FUN=foo)
+ })
user system elapsed
0.005 0.000 0.006
> df5
[,1] [,2] [,3]
[1,] FALSE FALSE FALSE
[2,] FALSE TRUE FALSE
[3,] FALSE FALSE TRUE
[4,] FALSE FALSE FALSE
```

Lastly, we’ll create a new character vector by pasting the other variables together.

```
> system.time({
+ df1$var3 <- apply(df1, 1, paste, collapse='.')
+ df2$var3 <- apply(df2, 1, paste, collapse='.')
+ df6 <- sapply(df2$var3, FUN=function(x) { x == df1$var3 })
+ dimnames(df6) <- list(row.names(df1), row.names(df2))
+ })
user system elapsed
0.000 0.000 0.001
> df6
5 6 7
1 FALSE FALSE FALSE
2 FALSE TRUE FALSE
3 FALSE FALSE TRUE
4 FALSE FALSE FALSE
```

We can already see with this small example that the `Vectorize`

approach is the slowest. However, let’s try a larger example. First we’ll create two data frames, one with 1,000 rows and the second with 1,500. The resulting matrix will be 1,000 x 1,500.

```
set.seed(2112)
df1 <- data.frame(row.names=1:1000,
var1=sample(c(TRUE,FALSE), 1000, replace=TRUE),
var2=sample(1:10, 1000, replace=TRUE) )
df2 <- data.frame(row.names=1001:2500,
var1=sample(c(TRUE,FALSE), 1500, replace=TRUE),
var2=sample(1:10, 1500, replace=TRUE))
```

Nested `apply`

functions approach:

```
> system.time({
+ m1 <- t(as.matrix(df1))
+ m2 <- as.matrix(df2)
+ df4 <- apply(m2, 1, FUN=function(x) { apply(m1, 2, FUN=function(y) { all(x == y) } ) })
+ })
user system elapsed
10.807 0.043 11.096
```

`Vectorize`

approach:

```
> system.time({
+ foo <- Vectorize(function(x,y) { all(df1[x,] == df2[y,]) })
+ df5 <- outer(1:nrow(df1), 1:nrow(df2), FUN=foo)
+ })
user system elapsed
390.904 0.808 392.134
```

Combined columns approach:

```
> system.time({
+ df1$var3 <- apply(df1, 1, paste, collapse='.')
+ df2$var3 <- apply(df2, 1, paste, collapse='.')
+ df6 <- sapply(df2$var3, FUN=function(x) { x == df1$var3 })
+ dimnames(df6) <- list(row.names(df1), row.names(df2))
+ })
user system elapsed
0.421 0.000 0.422
```

The combined column approach is by far the fasted way, and it makes good since. It is a bit surprising (at least to me), how much worse the `Vectorize`

and `outer`

functions are. Moreover, I am a bit concerned about potential issues with the `paste`

method and doing comparisons on those results. Please feel free to leave comments below if there are other approaches.

**leave a comment**for the author, please follow the link and comment on their blog:

**Jason.Bryer.org Blog - R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...