
This was just going to be a few Tweets but it ended up being a bit of a rollercoaster of learning for me, and I haven’t blogged in far too long, so I’m writing it up quickly as a ‘hey look at that’ example for newcomers.

I’ve been working on the ‘merging data’ part of my book and, as I do when I’m writing this stuff, I had a play around with some examples to see if there was anything funky going on if a reader was to try something slightly different. I’ve been using `dplyr` for the examples after being thoroughly convinced on Twitter to do so. It’s going well. Mostly.

```## if you haven't already done so, load dplyr
library(dplyr)```

My example involved joining together two `tibble`s containing text values. Nothing too surprising. I wondered though; do numbers behave the way I expect? Now, a big rule in programming is ‘thou shalt not compare numbers’, and it holds especially true when numbers aren’t exactly integers. This is because representing non-integers is hard, and what you see on the screen isn’t always what the computer sees internally.
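As a quick illustration of that last point, here's a minimal base R sketch. The particular values are just my choice of example and have nothing to do with the join data below.

```r
# Two ways of arriving at "0.3" that print identically at default precision
a <- 0.1 + 0.2
b <- 0.3

print(a)   # displays 0.3
print(b)   # displays 0.3

# ...but the stored doubles are not the same value
a == b                 # FALSE

# asking for more digits reveals what is hiding behind the printed "0.3"
sprintf("%.17f", a)
sprintf("%.17f", b)

# small whole numbers are stored exactly, so these compare as you'd hope
(1 + 2) == 3           # TRUE
```

The screen shows a rounded version of each number; `==` compares what is actually stored.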

If I had a `tibble` where the column I would use to `join` had integers

```dataA <- tribble(
~X, ~Y,
0L, 100L,
1L, 101L,
2L, 102L,
3L, 103L
)
dataA
## # A tibble: 4 x 2
##       X     Y
##   <int> <int>
## 1     0   100
## 2     1   101
## 3     2   102
## 4     3   103```

and another `tibble` with `numeric` in that column

```dataB <- tribble(
~X, ~Z,
0, 1000L,
1, 1001L,
2, 1002L,
3, 1003L
)
dataB
## # A tibble: 4 x 2
##       X     Z
##   <dbl> <int>
## 1     0  1000
## 2     1  1001
## 3     2  1002
## 4     3  1003```

would they still `join`?

```full_join(dataA, dataB)
## Joining, by = "X"
## # A tibble: 4 x 3
##       X     Y     Z
##   <dbl> <int> <int>
## 1     0   100  1000
## 2     1   101  1001
## 3     2   102  1002
## 4     3   103  1003```

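Okay, sure. R treats these as close enough to join (the joined `X` column comes out as `<dbl>`, so the integers were promoted to doubles along the way). I mean, maybe it shouldn’t, but we’ll work with what we have. R doesn’t always think these are equal: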

```identical(0L, 0)
## [1] FALSE
identical(2L, 2)
## [1] FALSE```

though sometimes it does

```0L == 0
## [1] TRUE
2L == 2
## [1] TRUE```

(`==` coerces types before comparing). Well, what if one of these just ‘looks like’ the other value (that is, it can be coerced to the same type)?

```dataC <- tribble(
~X, ~Z,
"0", 100L,
"1", 101L,
"2", 102L,
"3", 103L
)
dataC
## # A tibble: 4 x 2
##   X         Z
##   <chr> <int>
## 1 0       100
## 2 1       101
## 3 2       102
## 4 3       103
full_join(dataA, dataC)
## Joining, by = "X"
## Error: Can't join on 'X' x 'X' because of incompatible types (character / integer)```

That’s probably wise. Of course, R is perfectly happy with things like

```"2":"5"
## [1] 2 3 4 5```

and `==` thinks that’s fine

```"0" == 0L
## [1] TRUE
"2" == 2L
## [1] TRUE```

but who am I to argue?
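If the keys really are meant to match, one option is to do the conversion yourself before joining, so the intent is explicit. This is only a sketch: `ints`, `chars` and `chars_int` are names I've made up here, with the same shape as `dataA` and `dataC` above.

```r
library(dplyr)

# same shape as dataA: integer key
ints <- tribble(
  ~X, ~Y,
  0L, 100L,
  1L, 101L,
  2L, 102L,
  3L, 103L
)

# same shape as dataC: character key
chars <- tribble(
  ~X,  ~Z,
  "0", 100L,
  "1", 101L,
  "2", 102L,
  "3", 103L
)

# convert the character key to integer, then join on it explicitly
chars_int <- mutate(chars, X = as.integer(X))
joined <- full_join(ints, chars_int, by = "X")
joined
```

Note that `as.integer()` turns anything that doesn't look like a number into `NA` (with a warning), so it's worth checking the converted column before trusting the join.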

Anyway, how far apart can those integers and numerics be before they aren’t able to be joined? What if we shift the ‘numeric in name only’ values away from the integers just a teensy bit? `.Machine$double.eps` is the built-in value for the smallest positive number `x` such that `1 + x != 1` (often loosely described as ‘the tiniest number you can produce’). On this system it’s `2.220446e-16`.
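A quick check of what that value does and doesn't mean, as a base R sketch (`eps` and `tiny` are just local names for this example):

```r
eps <- .Machine$double.eps

# eps is defined relative to 1: adding it to 1 gives a different double...
(1 + eps) == 1        # FALSE
# ...while adding half of it is rounded away again
(1 + eps / 2) == 1    # TRUE

# much smaller positive doubles exist; this is the smallest normalised one
tiny <- .Machine$double.xmin
tiny > 0              # TRUE
tiny < eps            # TRUE
```

So `.Machine$double.eps` is about resolution near `1`, not about the smallest value a double can hold.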

```dataBeps <- tribble(
~X, ~Z,
0 + .Machine$double.eps, 1000L,
1 + .Machine$double.eps, 1001L,
2 + .Machine$double.eps, 1002L,
3 + .Machine$double.eps, 1003L
)
dataBeps
## # A tibble: 4 x 2
##          X     Z
##      <dbl> <int>
## 1 2.22e-16  1000
## 2 1.00e+ 0  1001
## 3 2.00e+ 0  1002
## 4 3.00e+ 0  1003
## join against the integer keys in dataA
full_join(dataA, dataBeps)```

Well, that’s… weirder. The values offset from `2` and `3` joined fine, but the `0` and `1` each got multiple copies since R thinks they’re different. What if we offset a little further?

```dataB2eps <- tribble(
~X, ~Z,
0 + 2 * .Machine$double.eps, 1000L,
1 + 2 * .Machine$double.eps, 1001L,
2 + 2 * .Machine$double.eps, 1002L,
3 + 2 * .Machine$double.eps, 1003L
)
dataB2eps
## # A tibble: 4 x 2
##          X     Z
##      <dbl> <int>
## 1 4.44e-16  1000
## 2 1.00e+ 0  1001
## 3 2.00e+ 0  1002
## 4 3.00e+ 0  1003
## join against the integer keys in dataA
full_join(dataA, dataB2eps)```

That’s what I’d expect. So, what’s going on? Why does R think those numbers are the same? Let’s check with a minimal example: for each of the values `0:4`, let’s compare that integer with the same value offset by `.Machine$double.eps`:

```library(purrr) ## for the 'thou shalt not for-loop' crowd
map_lgl(0:4, ~ as.integer(.x) == as.integer(.x) + .Machine$double.eps)
## [1] FALSE FALSE  TRUE  TRUE  TRUE```

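And there we have it. It isn’t a tolerance in the comparison so much as the numbers themselves: doubles get sparser as they get bigger, and from `2` upwards the gap between neighbouring doubles is at least `2 * .Machine$double.eps`, so adding a single `.Machine$double.eps` gets rounded away and the ‘offset’ value is stored as exactly the integer. In any case, the general rule to live by is to never compare floats for exact equality. Add this to the list of reasons why.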
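One way to probe this directly is to ask whether adding a small offset changes the stored value at all. This is a base R sketch; `changes_value()` is a hypothetical helper written for this example, not something from `dplyr` or `purrr`.

```r
eps <- .Machine$double.eps

# hypothetical helper: does adding `delta` to `x` produce a different double?
changes_value <- function(x, delta) (x + delta) != x

x <- c(0, 1, 2, 3, 4)

# one eps survives at 0 and 1 but is rounded away from 2 upwards
changes_value(x, eps)       # TRUE TRUE FALSE FALSE FALSE

# two eps is enough for 2 and 3, but not for 4
changes_value(x, 2 * eps)   # TRUE TRUE TRUE TRUE FALSE

# four eps is enough for everything up to 4
changes_value(x, 4 * eps)   # TRUE TRUE TRUE TRUE TRUE
```

The first result is the mirror image of the `map_lgl()` output above: wherever the offset is rounded away, `==` sees two identical numbers.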

For what it’s worth, I’m sure this is hardly a surprising detail to the `dplyr` team. They’ve dealt with things like this for a long time and I’m sure it was much worse before those changes.

Update: As noted in the comments, R does have a way to check if things are ‘nearly equal’ (within some specified tolerance) via `all.equal()`

```purrr::map_lgl(0:4, ~ all.equal(.x, .x + .Machine$double.eps))
## [1] TRUE TRUE TRUE TRUE TRUE```

However, this does require the user to either specify the exact tolerance under which they consider two numbers ‘equal’, or to use the default (which, judging by the source of `all.equal.numeric()`, is `sqrt(.Machine$double.eps)`, or around `1.490116e-08` on this system). This means that numbers can be ‘quite’ different (depending on what’s an important difference) and still be considered equal:

```purrr::map_lgl(0:4, ~ all.equal(.x, .x + 1e-8))
## [1] TRUE TRUE TRUE TRUE TRUE```
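Two details of `all.equal()` are worth a quick sketch in base R (the variable names and test values here are my own, chosen only for illustration). First, it returns either `TRUE` or a character description of the difference, never `FALSE`. Second, for non-zero targets the default tolerance is applied to the relative difference.

```r
# all.equal() gives TRUE, or a message describing the difference,
# so wrap it in isTRUE() before using it as a condition
close_enough <- all.equal(1, 1 + 1e-10)
not_close    <- all.equal(1, 1 + 1e-6)

isTRUE(close_enough)      # TRUE
is.character(not_close)   # TRUE: a message, not FALSE
isTRUE(not_close)         # FALSE

# the tolerance is relative here, so a gap of 1 still counts as 'equal' at 1e10
big_gap <- isTRUE(all.equal(1e10, 1e10 + 1))
big_gap                   # TRUE

# a tighter tolerance tells them apart
strict <- isTRUE(all.equal(1e10, 1e10 + 1, tolerance = 1e-12))
strict                    # FALSE
```

This is also why the `map_lgl()` calls above only work while every comparison passes: a failing `all.equal()` would hand back a string instead of a logical.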

```devtools::session_info()
## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value
##  version  R version 3.5.2 (2018-12-20)
##  os       Pop!_OS 19.04
##  system   x86_64, linux-gnu
##  ui       X11
##  language en_AU:en
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  date     2019-08-13
##
## ─ Packages ──────────────────────────────────────────────────────────────
##  package     * version date       lib source
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.5.2)
##  backports     1.1.4   2019-04-10 [1] CRAN (R 3.5.2)
##  blogdown      0.14.1  2019-08-11 [1] Github (rstudio/blogdown@be4e91c)
##  bookdown      0.12    2019-07-11 [1] CRAN (R 3.5.2)
##  callr         3.3.1   2019-07-18 [1] CRAN (R 3.5.2)
##  cli           1.1.0   2019-03-19 [1] CRAN (R 3.5.2)
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.1)
##  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.1)
##  devtools      2.1.0   2019-07-06 [1] CRAN (R 3.5.2)
##  digest        0.6.20  2019-07-04 [1] CRAN (R 3.5.2)
##  dplyr       * 0.8.3   2019-07-04 [1] CRAN (R 3.5.2)
##  evaluate      0.14    2019-05-28 [1] CRAN (R 3.5.2)
##  fansi         0.4.0   2018-10-05 [1] CRAN (R 3.5.1)
##  fs            1.3.1   2019-05-06 [1] CRAN (R 3.5.2)
##  glue          1.3.1   2019-03-12 [1] CRAN (R 3.5.2)
##  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.1)
##  knitr         1.24    2019-08-08 [1] CRAN (R 3.5.2)
##  magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.1)
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.1)
##  pillar        1.4.2   2019-06-29 [1] CRAN (R 3.5.2)
##  pkgbuild      1.0.4   2019-08-05 [1] CRAN (R 3.5.2)
##  pkgconfig     2.0.2   2018-08-16 [1] CRAN (R 3.5.1)
##  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.1)
##  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.1)
##  processx      3.4.1   2019-07-18 [1] CRAN (R 3.5.2)
##  ps            1.3.0   2018-12-21 [1] CRAN (R 3.5.1)
##  purrr       * 0.3.2   2019-03-15 [1] CRAN (R 3.5.2)
##  R6            2.4.0   2019-02-14 [1] CRAN (R 3.5.1)
##  Rcpp          1.0.2   2019-07-25 [1] CRAN (R 3.5.2)
##  remotes       2.1.0   2019-06-24 [1] CRAN (R 3.5.2)
##  rlang         0.4.0   2019-06-25 [1] CRAN (R 3.5.2)
##  rmarkdown     1.14    2019-07-12 [1] CRAN (R 3.5.2)
##  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.1)
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.1)
##  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.5.2)
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.5.1)
##  testthat      2.2.1   2019-07-25 [1] CRAN (R 3.5.2)
##  tibble      * 2.1.3   2019-06-06 [1] CRAN (R 3.5.2)
##  tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.5.1)
##  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.5.2)
##  utf8          1.1.4   2018-05-24 [1] CRAN (R 3.5.1)
##  vctrs         0.2.0   2019-07-05 [1] CRAN (R 3.5.2)
##  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.1)
##  xfun          0.8     2019-06-25 [1] CRAN (R 3.5.2)
##  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.1)
##  zeallot       0.1.0   2018-01-28 [1] CRAN (R 3.5.2)
##
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/3.5
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library```
