Testing for valid variable names

[This article was first published on 4D Pie Charts » R, and kindly contributed to R-bloggers.]

I have something of a fondness for ridiculous variable names, so it’s useful to be able to check whether my latest concoction is legitimate, all the more so if it is automatically generated.

Not having an is_valid_variable_name function is one of those odd omissions from R, and the assign function doesn’t check validity.

To recap, there are a few rules on what makes a valid variable name. From ?name:

Names are limited to 10,000 bytes (and were to 256 bytes in versions of R before 2.13.0).

The logic for this is pretty easy to deal with, but before I come to that, a note on the structure of is* type functions. In scalary languages (C and its descendants), these functions seem to be standardised along the lines of

is_something <- function(x)
{
  if(!some_condition) return(FALSE)
  if(!some_other_condition) return(FALSE)
  #etc.
  return(TRUE)
}

The advantage of this is that as soon as a condition fails, the function returns, so the function can be fast. In vectory languages like R, things aren’t quite as clean, since different elements can fail on different conditions. The nearest equivalent function structure that I’ve come up with is something like:

is_something <- function(x)
{
  ok <- rep.int(TRUE, length(x))
  ok[!some_condition] <- FALSE
  ok[!some_other_condition] <- FALSE
  #etc.
  ok
}
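
As a concrete illustration of that pattern, here is a minimal sketch using a made-up check, is_positive_even (a hypothetical name, not part of base R or any package). Each condition knocks out only the elements that fail it. Treating missing values as failures is an assumption of this sketch.

```r
# Hypothetical example of the vectorised is* pattern described above.
is_positive_even <- function(x)
{
  ok <- rep.int(TRUE, length(x))
  # assumption: a missing value never passes
  ok[is.na(x)] <- FALSE
  # each condition only touches the elements that fail it
  ok[!is.na(x) & x <= 0] <- FALSE
  ok[!is.na(x) & x %% 2 != 0] <- FALSE
  ok
}

is_positive_even(c(2, -4, 3, NA)) # TRUE FALSE FALSE FALSE
```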

So, back to our is_valid_variable_name function. The first condition is easy to implement.

is_valid_variable_name <- function(x)
{
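  ok <- rep.int(TRUE, length(x))
  #is name too long? (the limit is in bytes, not characters)
  max_name_length <- if(getRversion() < "2.13.0") 256L else 10000L
  ok[nchar(x, type = "bytes") > max_name_length] <- FALSE
  #More logic still to come
  ok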
}
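
A quick way to see the limit in action is to try turning an over-long string into a name with as.name. The sketch below assumes R 2.13.0 or later (so a 10,000 byte limit); can_be_a_name is a hypothetical helper invented for this illustration.

```r
# Hypothetical helper: TRUE if as.name() accepts the string, FALSE if it errors.
can_be_a_name <- function(x)
{
  tryCatch({as.name(x); TRUE}, error = function(e) FALSE)
}

at_limit   <- paste(rep.int("a", 10000L), collapse = "")
over_limit <- paste(rep.int("a", 10001L), collapse = "")

nchar(at_limit, type = "bytes") # 10000
can_be_a_name(at_limit)         # TRUE
can_be_a_name(over_limit)       # FALSE
```

Note that as.name only enforces the length rule here; it happily accepts strings such as "foo bar" that are not syntactically valid, which is why the remaining checks are still needed.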

Now it gets trickier. In ?make.names we have

A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as ".2way" are not valid, and neither are the reserved words.

When you read this, your first thought should be “regular expressions will save the day”. The trouble is, regular expressions that are that complicated are hard to write and hard to understand, which means that you need *lots* of testing to make sure that they are correct.

In the spirit of laziness I decided to see if someone else had done the legwork. It transpires that someone has (yey CRAN). The MSToolkit package contains a function validNames which tries to solve the problem with one big regex. Unfortunately (as of version 2.0) it doesn’t always work. Here’s the regex that that function uses.

"^[\\.]?[a-zA-Z][\\.0-9a-zA-Z]*$"

That translates as: start with (“^”) a dot (“\\.”) that is optional (“?”), followed by a letter (“[a-zA-Z]”), then zero or more (“*”) dots, letters or numbers (“[\\.0-9a-zA-Z]”), then finish (“$”).

The first thing that pops into my mind when I see this is “what do French R programmers do?”. That is, we can define variables with accented characters, áçöíþ <- 1, that the regex a-zA-Z won’t pick up. There’s an easy fix here that nearly always works. We swap 0-9a-zA-Z for [:alnum:] and voila! Locale-dependent letter and number matching. This isn’t quite perfect since, for example, in my UK English locale, I can define variables with Greek letters like µ, but the “alpha” regex won’t match them.

grepl("[[:alpha:]]", "µ") # FALSE

Glossing over the small letter matching issues for now, there are bigger problems with the MSToolkit regex.

Underscores aren't permitted...

validNames("foo_bar")  #throws error

and neither are names consisting only of dots...

validNames("..")       #throws error

but many of the reserved words (see ?Reserved for the list) are:

validNames("if")       #TRUE
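
The same behaviour can be explored without installing MSToolkit by applying the quoted regex directly with grepl. This is only a sketch of how the pattern matches: grepl returns FALSE where, per the examples above, validNames throws an error. The variable names mstoolkit_rx and candidates are just labels chosen for this illustration.

```r
# The regex quoted above, applied directly with base R.
mstoolkit_rx <- "^[\\.]?[a-zA-Z][\\.0-9a-zA-Z]*$"
candidates <- c("foo", ".foo", "foo_bar", "..", "if")

grepl(mstoolkit_rx, candidates) # TRUE TRUE FALSE FALSE TRUE
```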

I don't want to discredit the authors of MSToolkit – writing complex regexes is a difficult task. What we need is an easier approach. Lots of smaller regexes for individual cases are easier to understand. One other tiny complication: the ellipsis argument, ..., and two dots followed by a number (which refers to the elements of the ellipsis) are valid variable names, but are reserved, so sometimes you want to think of them as valid, and sometimes you don't.

is_valid_variable_name <- function(x, allow_reserved = TRUE)
{
  ok <- rep.int(TRUE, length(x))

  #is name too long?
  max_name_length <- if(getRversion() < "2.13.0") 256L else 10000L
  #the limit is in bytes, not characters
  ok[nchar(x, type = "bytes") > max_name_length] <- FALSE

  #is it a reserved variable, i.e.
  #an ellipsis or two dots then a number?
  if(!allow_reserved)
  {
    ok[x == "..."] <- FALSE
    ok[grepl("^\\.{2}[[:digit:]]+$", x)] <- FALSE
  }

  #is it a reserved word?
  reserved_words <- c("if", "else", "repeat", "while", "function", "for", "in", "next", "break", "TRUE", "FALSE", "NULL", "Inf", "NaN", "NA", "NA_integer_", "NA_real_", "NA_complex_", "NA_character_")
  #exact match, so names that merely contain a reserved word (e.g. "iffy") still pass
  ok[x %in% reserved_words] <- FALSE

  #are there any illegal characters?
  ok[!grepl("^[[:alnum:]_.]+$", x)] <- FALSE

  #does it start with underscore or a number?
  ok[grepl("^[_[:digit:]]", x)] <- FALSE

  #does it start with dot then a number?
  ok[grepl("^\\.[[:digit:]]", x)] <- FALSE

  ok
}
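
One detail worth checking in the reserved-word step is that the comparison needs to be an exact match. As a small sketch (using a shortened, purely illustrative list of reserved words), an unanchored alternation pattern also flags harmless names that merely contain a reserved word, whereas %in% does not.

```r
# Shortened list, for illustration only.
some_reserved_words <- c("if", "else", "for", "in")
x <- c("if", "iffy", "information", "x")

# substring matching: flags anything containing a reserved word
grepl(paste(some_reserved_words, collapse = "|"), x) # TRUE TRUE TRUE FALSE

# exact matching: flags only the reserved words themselves
x %in% some_reserved_words                           # TRUE FALSE FALSE FALSE
```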

So now we have lots of easier conditions to check. I was pretty pleased with myself after constructing this until I realised that the best way to solve this was to cheat. make.names, which I mentioned earlier, contains logic to check for valid variable names, so if a variable name is valid, then x will be the same as make.names(x). As a bonus, we can easily check for unique variable names.

is_valid_variable_name <- function(x, allow_reserved = TRUE, unique = FALSE)
{
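  ok <- rep.int(TRUE, length(x))
  #is name too long? (the limit is in bytes, not characters)
  max_name_length <- if(getRversion() < "2.13.0") 256L else 10000L
  ok[nchar(x, type = "bytes") > max_name_length] <- FALSE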

  #is it a reserved variable, i.e.
  #an ellipsis or two dots then a number?
  if(!allow_reserved)
  {
    ok[x == "..."] <- FALSE
    ok[grepl("^\\.{2}[[:digit:]]+$", x)] <- FALSE
  }

  #are names valid (and maybe unique)
  ok[x != make.names(x, unique = unique)] <- FALSE

  ok
}
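
To see why comparing x with make.names(x) works, it helps to look at what make.names does to a few inputs: valid names come back unchanged, while invalid ones are altered. A short sketch using only base R (the input vectors are arbitrary examples):

```r
x <- c("foo_bar", "if", ".2way", "a b")
make.names(x)      # "foo_bar" "if."     "X.2way"  "a.b"
x == make.names(x) # TRUE FALSE FALSE FALSE

# with unique = TRUE, a repeated name is altered too, so it gets flagged
dupes <- c("a", "a")
make.names(dupes, unique = TRUE)          # "a"   "a.1"
dupes == make.names(dupes, unique = TRUE) # TRUE FALSE
```

Note that with unique = TRUE only the second and later copies of a duplicated name are changed, so the first occurrence still counts as valid.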

While this answer isn't quite as satisfactory because you can't see what's going on, it has the advantages that the locale-dependent letter problem vanishes, and if the specification for variable names changes, then make.names will hopefully be updated to match it. And that makes it good enough for me.


Tagged: r, regex
