# Data wrangling operations with quantities

[This article was first published on **r-spatial**, and kindly contributed to R-bloggers.]

This is the third blog post on `quantities`, an R-Consortium funded project for quantity calculus with R. It is aimed at providing integration of the ‘units’ and ‘errors’ packages for a complete quantity calculus system for R vectors, matrices and arrays, with automatic propagation, conversion, derivation and simplification of magnitudes and uncertainties. This article investigates the compatibility of common data wrangling operations with quantities. In previous articles, we discussed a first working prototype and units and errors parsing.

## Compatibility with different workflows

The bulk of this work can be found in a new vignette entitled *A Guide to Working with Quantities*. There, you may find a comprehensive set of examples of the main data wrangling operations (subsetting, ordering, transformations, aggregations, joining and pivoting) in two distinct workflows: R base and the *tidyverse*. Here, we provide a brief summary.

As we have discussed in previous articles, quantities are implemented as S3 objects with custom units and errors attributes. All the main operators that can be applied to vectors and arrays are properly defined so that they are forwarded to the attributes. This is important to preserve units (one unit for the entire vector/array), but is critical to correctly propagate errors (one error per vector/array element). If operations are not forwarded, object corruption occurs.
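
As a minimal sketch of what forwarding means, the snippet below uses the `errors` package on its own (an assumption made here for brevity; a `quantities` object carries a units attribute as well). Because `[` has a method for the class, subsetting reorders the per-element errors together with the values.

```r
library(errors)

# Three measurements, each with its own uncertainty.
x <- set_errors(c(1, 2, 3), c(0.1, 0.2, 0.3))

# `[` is forwarded to the errors attribute, so values and errors
# stay aligned after subsetting and reordering.
y <- x[c(3, 1)]

as.numeric(y)  # the values, in the new order
errors(y)      # the matching errors, in the same order
```

If the attribute were not subset along with the values, `y` would hold two values but three errors, which is the kind of corruption described above.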

### R base

Data wrangling operations on data frames map to R functions as follows:

- Row subsetting: `[` or `subset`.
- Row ordering: `[` with `order`.
- Column transformation: `within` or `transform`.
- Row aggregation: `aggregate`.
- Column joining: `merge`.
- (Un)pivoting: `reshape`.
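
The first two mappings can be sketched as follows. This is only an illustration with a made-up data frame (the column names `id` and `x` are hypothetical), and it uses the `errors` package alone rather than full quantities.

```r
library(errors)

# A small data frame with one column carrying per-row errors.
df <- data.frame(id = c("a", "b", "c"))
df$x <- set_errors(c(3, 1, 2), c(0.3, 0.1, 0.2))

# Row subsetting: `[.data.frame` subsets every column with `[`,
# so the errors of the dropped row are dropped as well.
sub <- df[df$id != "b", ]
errors(sub$x)

# Row ordering: `[` combined with `order`.
ord <- df[order(df$x), ]
errors(ord$x)
```

In both results the column is still an `errors` object, and each error travels with its value.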

R base functions make extensive use of the `[` generic. Therefore, as expected, all the operations work correctly with units and errors metadata. The only drawback is that aggregations will drop quantities metadata by default. The reason is that there is a family of functions (not only `aggregate`, but also `by` and the `apply` family) which hold intermediate results in lists, and these are finally simplified by calling `unlist`.

There is no workaround for this default behaviour, because it is not possible to define methods for *lists of something*. Fortunately, all these functions support a parameter called `simplify` (sometimes, `SIMPLIFY`) which, if set to `FALSE`, avoids the `unlist` call and returns the results in a list. Then, a call to `do.call(c, ...)` will unlist quantities without losing attributes or classes.
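
A minimal sketch of this pattern, using `sapply` as a representative of the `apply` family and the `errors` package alone (the grouping vector `g` is made up for illustration):

```r
library(errors)

x <- set_errors(c(1, 2, 3, 4), c(0.1, 0.2, 0.3, 0.4))
g <- c("a", "a", "b", "b")

# split() on a classed vector goes through `[`, so each group
# keeps its own errors.
groups <- split(x, g)

# Default: sapply() simplifies the per-group results with unlist(),
# which returns a plain numeric vector.
dropped <- sapply(groups, mean)
inherits(dropped, "errors")

# Workaround: keep the results in a list, then combine them with c(),
# which has a method for errors objects.
kept <- do.call(c, unname(sapply(groups, mean, simplify = FALSE)))
inherits(kept, "errors")
```

The same idea should carry over to `aggregate` and `by`, although the shape of the intermediate list differs for each function.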

### Tidyverse

Data wrangling operations on data frames map to tidyverse functions as follows:

- Row subsetting: `dplyr::filter` (and others).
- Row ordering: `dplyr::arrange`.
- Column transformation: `dplyr::transmute` and `dplyr::mutate`.
- Row aggregation: `dplyr::summarise` (and others) with `dplyr::group_by` for observation grouping.
- Column joining: `dplyr::*_join` family.
- (Un)pivoting: `tidyr::gather` and `tidyr::spread`.

The tidyverse handles quantities correctly for subsetting, ordering and transformations. It fails to do so for aggregations (grouped operations in general), column joining and (un)pivoting. Most of these incompatibilities are due to the same internal grouping mechanism, which is implemented in C and prevents the R subsetting operator from being called (which in turn would call the subsetting operator on the errors attribute). Interestingly, those operations still work for units alone, except for column gathering, which drops all classes and attributes. It seems, though, that there are long-term plans in `dplyr` for supporting vectorised attributes (see tidyverse/dplyr#2773 and tidyverse/dplyr#3691).

### A note on `data.table`

*Currently* (v1.11.4) `data.table` does not work well with vectorised attributes. The underlying problem is similar to `dplyr`’s issue, but unfortunately it affects more operations, including row subsetting and ordering. Only column transformation seems to work, and other operations generate corrupted objects. This issue was reported on GitHub (see Rdatatable/data.table#2948).

## Future directions of units and errors

A couple of weeks ago, I had the pleasure of visiting Edzer Pebesma at the Institute for Geoinformatics in Muenster, and we had a nice R-quantities summit.

> R-quantities summit with @Enchufa2: merging rigorous error and units propagation to enable quantity calculus for R vectors. Thanks to @RConsortium! https://t.co/1dJAnZCyIM pic.twitter.com/Wp6fRrn3WQ
>
> Edzer Pebesma (@edzerpebesma), June 13, 2018

We had a very productive discussion on the future directions of the `units` and `errors` packages. These are some of the ideas on the table:

- As a follow-up to the previous milestone, we found the idea of enhancing the `readr` package interesting, so that third-party packages could provide new column types and parsers that would work transparently. There are other interesting use cases, such as reading spatial data. We registered the proposal in the `readr` repository.
- We discussed a recent proposal by Bill Denney (and had a most interesting chat with him) in which he requests support for *mixed units* in R vectors and arrays. Bill works with data from clinical studies and deals with a very specific format. I refer to the issue at hand (and references therein) for specific examples and further discussion. Edzer already started to work on this, and there is a functional prototype in the `mixed` branch on GitHub.
- As a mid-term plan, we would also like to add support for other propagation methods to the `errors` package. More specifically, instead of storing a single value and an associated error (and applying the Taylor series method, TSM), we plan to provide support for full samples. Operations would work directly on these samples, so that every kind of correlation would be captured.

## Next steps

The R-quantities project is coming to an end. The next and final milestone will try to provide a proof-of-concept to wrap `lm` methods, where errors are used to define weights in the linear model and units propagate to the regression coefficient estimates and residuals. We will also complete the documentation with the prospect of a first release of the `quantities` package on CRAN.
