The future of R syntax?

February 14, 2016
(This article was first published on - rstats, and kindly contributed to R-bloggers)

Following Romain François's example, I spent last week playing with
the definition of the R grammar. I focused on four changes that I
think would improve existing R idioms:
creating lists with bare square brackets; a compact lambda notation;
labelled blocks of code; and of course implementing natively the pipe
operator. While none of these changes are strictly necessary, they
make the language more comfortable to use and nicer to look at. I
provide working implementations for all of them in the brackets,
brackets-lambda, labelled and pipe branches at
https://github.com/lionel-/r-source.

Bare Square Brackets

Advanced treatments of R programming stress that R is a functional
language. This essentially means that functions are first-class
citizens and that you can pass them as arguments to other
functions. This makes it possible to have the apply family of
functions in base R or the map family in purrr. By the same token,
this makes lists extremely useful in R. They can contain any kind of
objects and you can use functional programming techniques to
manipulate them with expressive idioms. In addition, since list
elements can be named, they can directly map to the
arguments of a function call via do.call() or purrr::invoke(),
another key idiom of functional programming in R.
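
As a minimal sketch of these two idioms in current base R (the
variable names are illustrative only), a named list can be spliced
into a call with do.call(), and a list of functions can be mapped over
like any other list:

```r
# Named list elements line up with the arguments of the target function.
args <- list(x = c(1, 5, NA, 10), na.rm = TRUE)
avg <- do.call(mean, args)  # same as mean(x = c(1, 5, NA, 10), na.rm = TRUE)

# Functions are ordinary values, so they can be stored in a list
# and applied one by one.
summaries <- list(lo = min, hi = max)
ranges <- lapply(summaries, function(f) f(mtcars$mpg))

avg
ranges
```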

Despite their importance in the R language, lists do not benefit from
as much syntactic sugar as in other languages. Hence my first change
to the R syntax, creating lists with bare square brackets:

[3, 4, letters]
#> [[1]]
#> [1] 3
#>
#> [[2]]
#> [1] 4
#>
#> [[3]]
#>  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
#> [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"

[3, 4, letters] %>% map_lgl(is.double)
#> [1]  TRUE  TRUE FALSE

This can greatly improve code clarity. Compare dense nested list
constructs such as

list(
  list(1, 2),
  list(3, 4)
)

to the much lighter and cleaner

[[1, 2], [3, 4]]

An important use case that would also benefit from this syntax is when
a function needs some additional arguments in the form of a
list. Think of the contrasts argument of lm() or the args
argument of ggplot2's stat_function(). They both involve passing a
list of arguments, which bloats the calls and makes scripts heavier to
read. The bare brackets notation is a bit lighter:

mtcars$cyl <- as.factor(mtcars$cyl)

# Specific contrast for the predictor `cyl`
lm(disp ~ cyl + am, data = mtcars, contrasts = [cyl = contr.sum])
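
For comparison, here is a sketch of how the same model is fitted with
today's syntax, where the contrasts have to be wrapped in list(). It
works on a copy of mtcars (the name cars_df is only for illustration)
so the original data set is left untouched:

```r
# Work on a copy so that mtcars itself is not modified.
cars_df <- transform(mtcars, cyl = as.factor(cyl))

# Current R: the per-predictor contrasts are passed as a named list.
fit <- lm(disp ~ cyl + am, data = cars_df,
          contrasts = list(cyl = contr.sum))

# The coefficient names show which contrast columns were generated.
names(coef(fit))
```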

I also have a feeling that bare brackets may be useful to come up with
clean creative syntax in DSLs. Like any syntax construct in R, the
square brackets are represented as a plain function with a textual
name. For example, instead of mtcars[["cyl"]], you can write
`[[`(mtcars, "cyl"). The function name for bare brackets is `[]`, and
redefining it changes what the brackets do:

`[]` <- function(...) "hello"

[3, 4]
#> [1] "hello"

By the same token, DSLs could capture bare brackets and give them some
specific meaning.
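
The mechanism that a DSL would rely on already exists for today's
operators. The sketch below uses only current R: it calls a few syntax
constructs in prefix form, then evaluates a quoted expression in a
context where `+` has been given another meaning. The name dsl_env is
hypothetical and only stands in for whatever a real DSL would set up.

```r
# Syntax constructs are functions that can be called by name.
same_column <- identical(mtcars[["cyl"]], `[[`(mtcars, "cyl"))
three <- `+`(1, 2)

# They can also be passed to other functions, here to extract the
# second element of each vector.
seconds <- sapply(list(1:3, 4:6), `[[`, 2)

# A DSL can capture code and evaluate it where an operator means
# something else. Here `+` pastes its operands together.
dsl_env <- list(`+` = function(e1, e2) paste(e1, e2))
pasted <- eval(quote("bare" + "brackets"), dsl_env)
pasted
#> [1] "bare brackets"
```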

Finally, an additional syntax rule could allow for list
comprehensions by looking up the for keyword inside bare
brackets. This would enable this kind of Python-style code:

# List comprehension:
[sum(x)^2 for x in mtcars]

# Equivalent to the following map:
mtcars %>% map(function(x) sum(x)^2)

However, I think that's going a step too far as the functional version
is much more R-like.

Lambda Notation

In R, functions can be created, given names and passed around. But a
common idiom involves creating anonymous functions (lambda functions)
on the fly. As the full syntax for defining a function can be
cumbersome in those situations, many languages such as Scala, Haskell,
F#, Python, and even C++ support a compact notation for creating
lambdas. Given the importance of lambda functions in R (as in the
apply family of functions), it would be particularly nice to provide
an elegant notation for creating them. The second syntax update relies
on the bare square brackets notation for that purpose.

The notation is based on the rightward assignment ->, an operator
that is barely used in practice because it's a bit confusing. Bare
square brackets followed by -> followed by any R expression create a
function in place:

[x] -> 3 * x
#> [x] -> 3 * x

([x] -> 3 * x)(5)
#> [1] 15

lapply(cars, [col] -> max(col / sum(col)))
#> $speed
#> [1] 0.03246753
#>
#> $dist
#> [1] 0.05583993

This notation supports variadic lambdas by supplying dots:

variadic <- [...] -> {
  sum <- ..1 + ..2
  sum * 3
}

variadic(3, 4)
#> [1] 21

variadic2 <- [x, y, ...] -> length(list(...))

variadic2("a", "b", 1, 2, 3)
#> [1] 3

Thanks to operator precedence and the left associativity of ->, the
usual R rules for assignment apply. The following snippet assigns the
lambda first to byproduct, then to fun.

fun <- [x] -> x -> byproduct
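
The same precedence rules can be checked in current R with plain
values in place of a lambda. This is only a sketch of the existing
behaviour of -> and <-, not of the proposed notation:

```r
# Rightward assignment groups from left to right, so the value is
# assigned to `a` first and the result of that is assigned to `b`.
chained <- quote(1 -> a -> b)
deparse(chained)
#> [1] "b <- a <- 1"

# -> binds more tightly than <-, so this reads as
# fun <- (3 -> byproduct): byproduct is assigned first, then fun.
fun <- 3 -> byproduct
c(fun, byproduct)
#> [1] 3 3
```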

Labelled Blocks

In R, code is data. When a function is called, its arguments are
usually evaluated and assigned to the parameters. But functions can
also request to see the code used to compute that value in the form of
a quoted expression. This capacity to capture code is invaluable for
creating intuitive sublanguages like dplyr or ggplot2. The third
change that I introduce to R's syntax focuses on the subset of DSLs
that manipulate blocks of code, such as the great testthat package.
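
To make the idea of capturing code concrete, here is a small sketch in
current R. The function describe() is a made-up example; it uses
substitute() to see the code of its argument as well as its value:

```r
# Hypothetical helper: reports both the code and the value of an argument.
describe <- function(x) {
  code <- substitute(x)          # the unevaluated expression
  paste(deparse(code), "=", x)   # using `x` evaluates the argument
}

described <- describe(2 * 21)
described
#> [1] "2 * 21 = 42"
```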

Currently, blocks of code are passed to a function via curly brackets:

test_that("my code works", {
  ...
})

Wouldn't it be nicer to have the same syntax as function definitions,
for loops and if-else branches? That's the purpose of this third syntax
change. It allows you to write:

test_that("my code works") {
  ...
}

That's a fairly cosmetic change and admittedly not earth-shattering.
However, it makes the language a bit nicer and more aesthetically
pleasing. This syntax would be particularly nice for alternative ways
of defining functions. For example, the type-checked functions of the
ensurer package would look a bit more natural:

type_checked <- function_(a ~ integer, b ~ character) {
  some_call(a)
  other_call(b)
}

To make this work in the most R-like possible way, I decided to let
the function call be any expression. This mirrors the syntax of
regular function calls which may be embedded in arbitrary ways. In the
following snippet, russian_dolls() returns a list whose first
element is a function that returns a function that returns 3:

russian_dolls()[[1]]()()
#> [1] 3

This kind of construct is also possible with labelled blocks:

my_block[[1]]()() {
  code
}

The only requirement is that the end result of the expression be a
function that accepts at least one argument (the block of code). This
means that test_that() would be implemented in this way:

test_that <- function(desc) {
  force(desc)

  function(code) {
    test_code(desc, substitute(code), env = parent.frame())
    invisible()
  }
}

Then,

test_that("my code works") {
  check_equal(A, B)
  check_identical(C, D)
}

# Is actually equivalent to
(function(code) {
    test_code(desc, substitute(code), env = parent.frame())
    invisible()
 })({
   check_equal(A, B)
   check_identical(C, D)
 })
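
The two-step pattern (a function that takes the description and
returns a function that takes the block) can be tried out in current R
by writing the second call explicitly. The sketch below is not
testthat's implementation: with_label() is a hypothetical stand-in
that simply evaluates the block and reports whether it raised an error.

```r
# Hypothetical stand-in for a labelled-block handler.
with_label <- function(desc) {
  force(desc)

  function(code) {
    expr <- substitute(code)  # the block, as unevaluated code
    env <- parent.frame()     # the environment the block was written in
    outcome <- tryCatch(
      {
        eval(expr, env)
        "passed"
      },
      error = function(e) paste("failed:", conditionMessage(e))
    )
    paste0(desc, ": ", outcome)
  }
}

# Without the new syntax, the block is passed through a second call.
ok <- with_label("arithmetic works")({
  stopifnot(1 + 1 == 2)
})
ok
#> [1] "arithmetic works: passed"

bad <- with_label("this one breaks")({
  stop("boom")
})
bad
#> [1] "this one breaks: failed: boom"
```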

In addition to expressions, simple labels are of course allowed:

label {
  line1
  line2
}

For instance, this would fit well with the NIMBLE DSL for specifying
BUGS models. Simple labels work a bit differently from expressions,
though. Here, instead of looking for a function named label, the
parser will look for label{}. This makes it possible to use the same
identifier for a regular function call and a labelled block:

label <- function() 3
`label{}` <- function(code) 4

label()
#> [1] 3

label {
  anything
}
#> [1] 4
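
Note that a name such as label{} is already legal in current R as long
as it is written between backticks. Only the dispatch from the block
syntax is new. As a sketch, the handler can be defined and called by
hand today:

```r
label <- function() 3
`label{}` <- function(code) 4

# The regular function and the block handler live side by side.
plain <- label()
block <- `label{}`(quote(anything))  # manual call, no new syntax needed

c(plain, block)
#> [1] 3 4
```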

Finally, note that unlike other labelled blocks such as function
definitions, the opening curly bracket must be on the same line as
its identifier. Otherwise it would be ambiguous whether we have a
labelled block or two expressions separated by a newline:

label
{code}

This slight inconsistency is the price to pay for that syntax
extension.
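
The ambiguity is easy to see with the current parser, which already
reads an identifier followed by a braced block on the next line as two
separate expressions. A quick sketch using parse():

```r
# An identifier, a newline, then a block: two top-level expressions.
split_lines <- parse(text = "label\n{code}")
length(split_lines)
#> [1] 2

# The first is a bare symbol, the second is a call to `{`.
is.name(split_lines[[1]])
#> [1] TRUE
class(split_lines[[2]])
#> [1] "{"
```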

Piping Operator

This is of course the syntax update that many people in the R
community are waiting for: a native piping operator. Some of the most
popular R packages are based on piped interfaces: dplyr, which
popularised magrittr, but also ggplot2. The latter uses a custom
non-functional pipeline by overloading the + operator, but its sequel
ggvis does rely on functional piping. For testing purposes, I provide
two versions of a native pipe operator, |> and >>.

Given the popularity of the pipe, having native support for it in R's
syntax would be huge progress. Besides the obvious aesthetic concern
(though you do get accustomed to %>% with time), native handling of
the pipe would improve error recovery. Here is what a traceback
currently looks like with magrittr's pipe:

fail <- function(...) stop("fail")
mtcars %>% lapply(fail) %>% unlist()
#> Error in FUN(X[[i]], ...) (from #1) : fail

traceback()
#> 15: stop("fail") at #1
#> 14: FUN(X[[i]], ...)
#> 13: lapply(., fail)
#> 12: function_list[[1L]](value)
#> 11: unlist(.)
#> 10: function_list[[1L]](value)
#> 9: withVisible(function_list[[1L]](value))
#> 8: freduce(value, `_function_list`)
#> 7: Recall(function_list[[1L]](value), function_list[-1L])
#> 6: freduce(value, `_function_list`)
#> 5: `_fseq`(`_lhs`)
#> 4: eval(expr, envir, enclos)
#> 3: eval(quote(`_fseq`(`_lhs`)), env, env)
#> 2: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
#> 1: mtcars %>% lapply(fail) %>% unlist()

This ugly traceback includes all the steps where magrittr manipulates
the unevaluated code. Here is the same traceback with native support:

mtcars |> lapply(fail) |> unlist()
#> Error in FUN(X[[i]], ...) : fail

traceback()
#> 4: stop("fail") at #1
#> 3: FUN(X[[i]], ...)
#> 2: lapply(mtcars, fail)
#> 1: unlist(mtcars |> lapply(fail))

The _ character is also legalised so it can become the placeholder
in pipelines. The same rules as with magrittr's placeholder apply:

mtcars |>
  list(_, _) |>
  identical(list(mtcars, mtcars))
#> [1] TRUE

mtcars |>
  list(list(_, _)) |>
  identical(list(mtcars, list(mtcars, mtcars)))
#> [1] TRUE
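
The placeholder rules illustrated above can be sketched as a rewrite
on quoted calls in current R. This is not the code from the branches:
pipe_rewrite() is a hypothetical helper, and it uses . as a stand-in
placeholder because _ cannot be typed as a bare symbol in standard R.
If the placeholder appears directly among the arguments, it is
replaced; otherwise the LHS is also inserted as first argument.

```r
# Hypothetical helper: turn `lhs |> rhs` into an ordinary call.
pipe_rewrite <- function(lhs, rhs, placeholder = quote(.)) {
  stopifnot(is.call(rhs))

  # Is the placeholder used directly as an argument of the call?
  args <- as.list(rhs)[-1]
  top_level <- any(vapply(args, identical, logical(1), placeholder))

  # Replace the placeholder with the LHS at every depth.
  env <- list(lhs)
  names(env) <- as.character(placeholder)
  rewritten <- do.call(substitute, list(rhs, env))

  if (top_level) {
    return(rewritten)
  }
  # Otherwise the LHS also becomes the first argument.
  as.call(append(as.list(rewritten), list(lhs), after = 1))
}

deparse(pipe_rewrite(quote(mtcars), quote(lapply(fail))))
#> [1] "lapply(mtcars, fail)"
deparse(pipe_rewrite(quote(mtcars), quote(list(., .))))
#> [1] "list(mtcars, mtcars)"
deparse(pipe_rewrite(quote(mtcars), quote(list(list(., .)))))
#> [1] "list(mtcars, list(mtcars, mtcars))"
```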

I actually provide two implementations of the pipe. The first creates
a classic binary operator that calls a special primitive
function. Special primitives are a class of core R functions that do
not evaluate their arguments, which allows them to manipulate quoted
code before evaluation.

The second implementation, called by the >> operator, directly
manipulates the parse tree. This means that you cannot redefine >>. R
will always transform the expression object >> call() to
call(object) and you'll never get a chance to call the operator
manually with prefix notation. Such syntax transformations apply to a
few operators in R, like the rightward assignment operator -> or the
double-starred exponentiation **. By contrast, the first operator,
|>, can be redefined and called with prefix notation.
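
The difference between the two kinds of operators can be observed in
current R. In the sketch below, ** is gone by the time the code is
parsed, whereas an ordinary binary operator (the made-up %+% here) is
a function that can be called in prefix form:

```r
# The parser rewrites ** into ^ before any function is looked up.
power_same <- identical(quote(2 ** 3), quote(2 ^ 3))
power_same
#> [1] TRUE

# Consequently there is no function called `**` or `->` to redefine.
has_power <- exists("**")
has_right_assign <- exists("->")
c(has_power, has_right_assign)
#> [1] FALSE FALSE

# An ordinary binary operator is just a function with a special name.
`%+%` <- function(lhs, rhs) paste(lhs, rhs)
infix <- "native" %+% "pipe"
prefix <- `%+%`("native", "pipe")
identical(infix, prefix)
#> [1] TRUE
```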

I think the first implementation is more natural in the R language and
consistent with most operators. On the other hand, manipulating the
parse tree ensures that the placeholder _ will always act
consistently as a shortcut for the LHS. This would avoid the conflicts
that arise with the . placeholder which is currently used for
different conflicting purposes in dplyr, magrittr and purrr. Thus
there are pros and cons for both approaches.

Could this get into R Core?

R Core has gotten the reputation of being a bit conservative, which is
only fair considering the responsibility that weighs on their
shoulders.

I think that, unlike proposals for integrating optional type
checking in the syntax, all four of these syntax changes clearly fit
the spirit of R as a dynamic, functional language. When it makes
sense, they can be manipulated like first class citizens through
prefix notation like other language constructs. They shouldn't disturb
any existing R code and they improve currently used R idioms rather
than invent new ones. So I think there is a chance that R core could
consider some of them.

More testing is needed to assess the consequences in terms of
performance and backward compatibility, though I didn't find any
problem in my limited testing. One point of contention might be that
the bare brackets and labelled blocks increase the number of
shift-reduce conflicts during parser generation. I guess many of those
could be fixed by refactoring the grammar a bit, or adding precedence
and associativity directives to some production rules. But core
members will probably feel a bit nervous about applying non-trivial
changes to that fundamental part of the R code, which has basically
not changed since the first available revision in 1997. It's probably
ok to ignore these conflicts, however. There are currently 81 of them
and Bison, the parser generator, seems to be doing a very good job of
automatically resolving the ambiguities.

My plan is to get community feedback on Twitter before proposing the
changes to R core. In case they are interested in some of them, I'll
run a comprehensive test on CRAN packages to make sure that the new
syntax doesn't break anything.

So, could R 4.0 look like this?

 test_that("new syntax works") {

   data <- list(mtcars, 1, 2, list(3, mtcars, 4))
   expected <- lapply(data, function(x) is.list(x) || is.double(x))

   mtcars |>
     [1, 2, [3, _, 4]] |>
     map([x] -> is.list(x) || is.double(x)) |>
     check_equal(expected)

 }
