Adding strings in R

[This article was first published on R – Irregularly Scheduled Programming, and kindly contributed to R-bloggers.]

This started out as a “hey, I wonder…” sort of thing, but as usual, these things tend to end up as interesting voyages into the deepest depths of code, so I thought I’d write it up and share. Shoutout to @coolbutuseless for proving that a little curiosity can go a long way and inspiring me to keep digging into interesting topics.

This is what you get if you “glue” “strings”. Photo: https://craftwhack.com/cool-craft-string-easter-eggs/

This post came across my feed last week, referring to the roperators package on CRAN. In that post, the author introduces an infix operator from that package which ‘adds’ (concatenates/pastes) strings

"using infix (%) operators" %+% "R can do simple string addition"
#> [1] "using infix (%) operators R can do simple string addition"
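For a feel of how little is needed to get such an operator, here is a minimal sketch of a user-defined infix. The name %paste% is hypothetical and this is not the roperators source; it only assumes that plain paste0() semantics are what you want.

```r
## Hypothetical infix operator (not the roperators implementation):
## any function named %something% can be used between two arguments
`%paste%` <- function(e1, e2) paste0(e1, e2)

"string " %paste% "addition"
#> [1] "string addition"
```

Because the operator has its own name, the regular + is left completely untouched.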

This might be familiar if you use Python

>>> "python " + "adds " + "strings"
'python adds strings'

or JavaScript

"javascript " + "also adds " + "strings"
"javascript also adds strings"

or perhaps even Go

package main

import "fmt"

func main() {
  fmt.Println("go " + "even adds " + "strings")
}
> go even adds strings

but this is not something natively available in R

"this doesn't" + "work"
#> Error in "this doesn't" + "work" : 
#>  non-numeric argument to binary operator

Could we make it work, though? That got me wondering. My first guess was to just create a new + function which does allow for this. The normal addition operator is

`+`
#> function (e1, e2)  .Primitive("+")

so a first attempt might be

`+` <- function(e1, e2) {
  if (is.character(e1) || is.character(e2)) {
    paste0(e1, e2)
  } else {
    base::`+`(e1, e2)
  }
}

This checks to see if the left or right side of the operator is a character-classed object, and if either is, it pastes the two together. Otherwise it just uses the ‘regular’ addition operator between the two arguments. This works for simple cases, e.g.

"a" + "b"
#> [1] "ab"

"a" + 2
#> [1] "a2"

2 + 2
#> [1] 4

2 + "a"
#> [1] "2a"

But we hit an important snag if we try to add two character-represented numbers

"200" + "200"
#> [1] "200200"

That’s probably going to be an issue if we read in unformatted data (e.g. from a CSV) as characters and try to treat it like numbers. Normally this would throw the above error about not being numeric, but now we get a silent weird number-character. That’s no good.
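To make that scenario concrete, here is a small sketch using made-up CSV data (the column names x and y are purely illustrative). It calls base::`+` explicitly so that it shows the unmodified behaviour even if + has been overridden in the session.

```r
## Illustrative data only: a tiny CSV read with every column kept as character
df <- read.csv(text = "x,y\n200,300\n10,5", colClasses = "character")

## The unmodified base operator fails loudly on character columns
msg <- tryCatch(base::`+`(df$x, df$y), error = function(e) conditionMessage(e))
msg
#> [1] "non-numeric argument to binary operator"

## Converting explicitly gives the numeric result
total <- base::`+`(as.numeric(df$x), as.numeric(df$y))
total
#> [1] 500  15
```

With the overridden + from above, adding the two character columns would instead silently paste the values together.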

An extension to this checks whether or not we have the number-as-a-character situation and falls back to the correct interpretation in that case

`+` <- function(e1, e2) {
  ## unary
  if (missing(e2)) return(e1)
  ## no characters involved: leave numbers, dates, etc. to the regular operator
  if (!is.character(e1) && !is.character(e2)) return(base::`+`(e1, e2))
  ## try to interpret both sides as numbers (NA where that fails)
  n1 <- suppressWarnings(as.numeric(e1))
  n2 <- suppressWarnings(as.numeric(e2))
  if (!anyNA(n1) && !anyNA(n2)) {
    ## both arguments numeric-like, at least one stored as character
    return(base::`+`(n1, n2))
  }
  ## at least one true character
  paste0(e1, e2)
}

"a" + "b"
#> [1] "ab"

"a" + 2
#> [1] "a2"

2 + 2
#> [1] 4

2 + "a"
#> [1] "2a"

"2" + "2"
#> [1] 4

2 + "edgy" + 4 + "me"
#> [1] "2edgy4me"

So, that’s one option for string addition in R. Is it the right one? The idea of actually dispatching on a character class is inviting. Can we just add a +.character method (since there doesn’t seem to already be one)? Normally when we have S3 dispatch we need a generic function which calls UseMethod() with the generic’s own name, e.g. UseMethod("print"), but we don’t have that in this case. + is an internal generic, which is probably the first sign that we’re going to have trouble. If we try to define the method

`+.character` <- function(e1, e2) {
  paste0(e1, e2)
}
"a" + "b"
#> Error in "a" + "b" : non-numeric argument to binary operator

It seems to fail. What went wrong? Is dispatch not working?


We want to dispatch on “character” — is that what we have?

class("a")
#> [1] "character"

What if we explicitly create an object with that class?

structure("a", class = "character") + 2
#> [1] "a2"
2 + structure("a", class = "character")
#> [1] "2a"

What if we try to dispatch on some new class?

`+.foo` <- function(e1, e2) {
  paste0(e1, e2)
}
structure("a", class = "foo") + 2
#> [1] "a2"

That works for an object with an explicit class, but no dice for just a regular atomic character object. Time to revisit the help pages.
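The difference between the cases above can be seen with is.object(). As the ?InternalMethods help page describes, internal generics such as + only attempt dispatch on values for which is.object() is true, i.e. those carrying a class attribute. A short check, using only base R:

```r
## A bare string has an implicit class but no class attribute
is.object("a")
#> [1] FALSE
attributes("a")
#> NULL

## structure() attaches a real class attribute, so dispatch can happen
x <- structure("a", class = "character")
is.object(x)
#> [1] TRUE
attr(x, "class")
#> [1] "character"
```

So class("a") reports "character", but there is nothing on the object itself for the internal dispatch to look at.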

In R, addition is limited to particular classes of objects, defined by the Ops group (there are also Math, Summary, and Complex groups). The methods for the Ops group members describe which classes can be involved in operations involving any of the Ops group members:
"+", "-", "*", "/", "^", "%%", "%/%", "&", "|", "!", "==", "!=", "<", "<=", ">=", ">"

These methods are:

methods("Ops")
 [1] Ops,array,array-method              
 [2] Ops,array,structure-method          
 [3] Ops,nonStructure,nonStructure-method
 [4] Ops,nonStructure,vector-method      
 [5] Ops,structure,array-method          
 [6] Ops,structure,structure-method      
 [7] Ops,structure,vector-method         
 [8] Ops,vector,nonStructure-method      
 [9] Ops,vector,structure-method         
[10] Ops.data.frame                      
[11] Ops.data.table*                     
[12] Ops.Date                            
[13] Ops.difftime                        
[14] Ops.factor                          
[15] Ops.numeric_version                 
[16] Ops.ordered                         
[17] Ops.POSIXt                          
[18] Ops.raster*                         
[19] Ops.roman*                          
[20] Ops.ts*                             
[21] Ops.unit*             

What’s missing from this list, in order for us to be able to just use “string” + “string”, is a character method. What’s perhaps even more surprising is that there is a roman method! Whaaaat?

as.roman("1") + as.roman("5")
#> [1] VI
as.roman("2000") + as.roman("18")
#> [1] MMXVIII

Since the operations need to be defined for all the members of the Ops group, we would also need to define what to do with, say, * between strings. When one side is a string and the other is a number, a reasonable approach might be that which was taken in the original post (using a new infix %s*%)

"a" %s*% 3
#> [1] "aaa" 

There is, of course, a function to do this already

strrep("a", 3)
#> [1] "aaa" 

but I could see creating "a" * 3 as a shortcut to this. I don’t know what one would expect "a" * "b" to produce.
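As a sketch of what a group method could look like, here is an Ops method for a made-up S3 class. The class name "str" and the constructor as_str() are hypothetical, chosen only for illustration, and the choice to make everything except + and string-times-number an error is just one possible answer to the questions above.

```r
## Hypothetical constructor: tag a character vector with an S3 class "str"
as_str <- function(x) structure(as.character(x), class = "str")

## Group method: called for every Ops operator when an argument has class "str"
Ops.str <- function(e1, e2) {
  if (nargs() == 1L) stop("unary ", .Generic, " is not defined for <str>")
  switch(.Generic,
    "+" = as_str(paste0(unclass(e1), unclass(e2))),
    "*" = {
      ## only string-times-number is given a meaning here
      if (is.numeric(e2)) as_str(strrep(unclass(e1), e2))
      else if (is.numeric(e1)) as_str(strrep(unclass(e2), e1))
      else stop("<str> * <str> is not defined")
    },
    stop("operator ", .Generic, " is not defined for <str>")
  )
}

unclass(as_str("a") + as_str("b"))
#> [1] "ab"
unclass(as_str("a") * 3)
#> [1] "aaa"
```

This works because the objects carry an explicit class attribute; it does nothing for bare strings like "a".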

The problem with where this is heading is that we aren’t allowed to create the method for an atomic class, as Joris Meys and Brodie Gaslam point out on Twitter

setMethod("+", c("character", "character"), function(e1, e2) paste0(e1, e2))
#> Error in setMethod("+", c("character", "character"), function(e1, e2) paste0(e1,  : 
#>   the method for function ‘+’ and signature e1="character", e2="character" is sealed and cannot be re-defined

so no luck there. Brodie also links to a Stack Overflow discussion on this very topic, where Martin Mächler points out that this has been discussed on r-devel. That thread makes for some interesting historical weigh-ins on why this isn’t a thing in R. Incidentally, the small-world effect comes into play regarding that Stack Overflow post, as one of the three answers happens to have been written by a former work colleague of mine.

So, in the end, it seems the best we can do is the rather long-winded overwrite of + which checks if the arguments really are characters. I don’t mind this, and would probably use it if it was in base R or a package. The biggest issue that people seem to have with this is that it ‘looks like’ addition, but it’s not commutative. If that word is new to you, it just means that x + y should give the same answer as y + x. For numbers, the regular + satisfies this:

2 + 3
#> [1] 5
3 + 2
#> [1] 5

but when we try to do this with strings… not so much

"a" + "b"
#> [1] "ab"
"b" + "a"
#> [1] "ba"

This doesn’t particularly bother me, because I’m okay with this not actually being ‘mathematical addition’. The fun turn this then took was the suggestion from Joris Meys that Julia’s non-associative operators are a strength of the language. There, the way that you group values matters

a + b + c is parsed as +(a, b, c) not +(+(a, b), c).

I’ll eventually get around to learning more Julia, but this is already hurting my brain.

That distinction may be of interest, however, to Miles McBain, whose concern was more about repeated applications of + being a bottleneck

In that case, parsing as +("a", "b", "c") is exactly what would be desired.
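The two evaluation strategies can be mimicked in R. This is only a rough analogy, with paste0() standing in for string +: Reduce() applies it pairwise, the way a chain of binary operators would be evaluated, while do.call() hands every piece to a single call.

```r
parts <- c("a", "b", "c", "d")

## Pairwise, the way a binary `+` would be evaluated: ((a + b) + c) + d
pairwise <- Reduce(function(x, y) paste0(x, y), parts)

## One variadic call, analogous to +(a, b, c, d)
all_at_once <- do.call(paste0, as.list(parts))

identical(pairwise, all_at_once)
#> [1] TRUE
all_at_once
#> [1] "abcd"
```

The results are the same; the difference is that the pairwise version builds an intermediate string at every step.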

So, what’s the conclusion of all of this? I’ve learned (and re-learned) a heap more about how the Ops group works, I’ve played a lot with dispatch, and I’ve thought deeply about edge-cases for adding strings. I’ve also been exposed to a bit more Julia. All in all, a worthwhile dive into something potentially silly, but a lot of fun. If you have some thoughts on the matter, leave a comment here or reply on Twitter — I’d love to hear about another angle to this story.
