[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Base R has a `merge` function which does join operations on data frames. As the documentation says, the function

[merges] two data frames by common columns or row names, or do other versions of database join operations.

One thing that I realized which may not be obvious is that `merge` can have somewhat unexpected behavior regarding the ordering of rows in the result. Let’s see an example with the `mtcars` dataset:

```data(mtcars)
mtcars\$ID <- 1:nrow(mtcars)
head(mtcars)
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb ID
# Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4  1
# Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4  2
# Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1  3
# Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1  4
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2  5
# Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1  6
```

Let’s say we want to add a row `cyl_word` which is the value in the `cyl` column in words (e.g. `6 -> "Six"`). The code below creates a reference table and joins the two tables to get the desired result:

```cyl_df <- data.frame(cyl = c(4, 6, 8),
cyl_word = c("four", "six", "eight"))

joined_df <- merge(mtcars, cyl_df, by = "cyl")
head(joined_df)
#   cyl  mpg  disp hp drat    wt  qsec vs am gear carb ID cyl_word
# 1   4 22.8 140.8 95 3.92 3.150 22.90  1  0    4    2  9     four
# 2   4 22.8 108.0 93 3.85 2.320 18.61  1  1    4    1  3     four
# 3   4 24.4 146.7 62 3.69 3.190 20.00  1  0    4    2  8     four
# 4   4 21.5 120.1 97 3.70 2.465 20.01  1  0    3    1 21     four
# 5   4 30.4  75.7 52 4.93 1.615 18.52  1  1    4    2 19     four
# 6   4 33.9  71.1 65 4.22 1.835 19.90  1  1    4    1 20     four
```

Look at the `ID` column: the rows are not returned in the same order! Reading the documentation tells us that `merge` has an argument `sort` which has default value `TRUE`, meaning that the results are sorted by the values in the columns that we merged on (`cyl` in this case). That is why all the `cyl == 4` rows appear first in the return value.

(Notice also that the row names have disappeared: this may not be something you want!)

One might think that if we set `sort = FALSE`, the output would have the same row ordering as the input. Unexpectedly (at least to me), this is not the case:

```joined_df <- merge(mtcars, cyl_df, by = "cyl", sort = FALSE)
head(joined_df)
#   cyl  mpg  disp  hp drat    wt  qsec vs am gear carb ID cyl_word
# 1   6 21.0 160.0 110 3.90 2.620 16.46  0  1    4    4  1      six
# 2   6 21.0 160.0 110 3.90 2.875 17.02  0  1    4    4  2      six
# 3   6 17.8 167.6 123 3.92 3.440 18.90  1  0    4    4 11      six
# 4   6 21.4 258.0 110 3.08 3.215 19.44  1  0    3    1  4      six
# 5   6 18.1 225.0 105 2.76 3.460 20.22  1  0    3    1  6      six
# 6   6 19.2 167.6 123 3.92 3.440 18.30  1  0    4    4 10      six
```

The documentation says as much: under the “Value” section, it writes that (emphasis mine)

The rows are by default lexicographically sorted on the common columns, but for `sort = FALSE` are in an unspecified order.

How can we get the output back in the original row order? One way is to add an `ID` column like we did above, then sort the result by that column:

```joined_df <- joined_df[order(joined_df\$ID), ]
head(joined_df)
#    cyl  mpg disp  hp drat    wt  qsec vs am gear carb ID cyl_word
# 1    6 21.0  160 110 3.90 2.620 16.46  0  1    4    4  1      six
# 2    6 21.0  160 110 3.90 2.875 17.02  0  1    4    4  2      six
# 9    4 22.8  108  93 3.85 2.320 18.61  1  1    4    1  3     four
# 4    6 21.4  258 110 3.08 3.215 19.44  1  0    3    1  4      six
# 19   8 18.7  360 175 3.15 3.440 17.02  0  0    3    2  5    eight
# 5    6 18.1  225 105 2.76 3.460 20.22  1  0    3    1  6      six

testthat::expect_equal(joined_df\$ID, 1:nrow(joined_df))  # test passes
```

Alternatively, use functions in the `dplyr` package:

```library(dplyr)
joined_df <- left_join(mtcars, cyl_df, by = "cyl")
head(joined_df)
#    mpg cyl disp  hp drat    wt  qsec vs am gear carb ID cyl_word
# 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4  1      six
# 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4  2      six
# 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1  3     four
# 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1  4      six
# 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2  5    eight
# 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1  6      six

testthat::expect_equal(joined_df\$ID, 1:nrow(joined_df))  # test passes
```

To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts.(You will not see this message again.)

Click here to close (This popup will not appear again)