Data wrangling operations with quantities

June 26, 2018
(This article was first published on r-spatial, and kindly contributed to R-bloggers)

This is the third blog post on
quantities, an
R-Consortium funded project for quantity calculus with R. It is aimed at
providing integration of the ‘units’ and ‘errors’ packages for a
complete quantity calculus system for R vectors, matrices and arrays,
with automatic propagation, conversion, derivation and simplification of
magnitudes and uncertainties. This article investigates the
compatibility of common data wrangling operations with quantities. In
previous articles, we discussed a first working prototype and units and
errors parsing.

Compatibility with different workflows

The bulk of this work can be found in a new vignette entitled A Guide to
Working with Quantities. There, you may find a comprehensive set of
examples of the main data wrangling operations (subsetting, ordering,
transformations, aggregations, joining and pivoting) in two distinct
workflows: R base and the tidyverse. Here, we provide a brief summary.

As we have discussed in previous articles, quantities are implemented as
S3 objects with custom units and errors attributes. All the main
operators that can be applied to vectors and arrays are properly defined
so that they are forwarded to the attributes. This is important to
preserve units (one unit for the entire vector/array), but is critical
to correctly propagate errors (one error per vector/array element). If
operations are not forwarded, object corruption occurs.
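
The forwarding idea can be illustrated with a minimal sketch in R base.
The class toy_errors and its constructor are hypothetical stand-ins, not
the implementation used by the errors or quantities packages; they only
show why a subsetting method must index the per-element attribute too.

```r
# Hypothetical toy class (NOT the 'errors' package): a numeric vector that
# carries one uncertainty per element in an "errors" attribute.
toy_errors <- function(x, e) {
  stopifnot(length(x) == length(e))
  structure(x, errors = e, class = "toy_errors")
}

# Subsetting method: forward the index to the attribute as well as the data.
`[.toy_errors` <- function(x, i) {
  toy_errors(unclass(x)[i], attr(x, "errors")[i])
}

x <- toy_errors(c(10, 20, 30), c(0.1, 0.2, 0.3))

# Forwarded: values and errors stay aligned after subsetting.
y <- x[c(3, 1)]

# Not forwarded: code that subsets the raw data and then copies the
# attributes wholesale yields a corrupted object (2 values, 3 errors).
bad <- unclass(x)[c(3, 1)]
attributes(bad) <- attributes(x)
```

Here y holds two values with their two matching errors, whereas bad
still claims the toy_errors class but its errors no longer line up with
its values.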

R base

Data wrangling operations on data frames map to R functions as follows:

  • Row subsetting: [ or subset.
  • Row ordering: [ with order.
  • Column transformation: within or transform.
  • Row aggregation: aggregate.
  • Column joining: merge.
  • (Un)pivoting: reshape.

R base functions make extensive use of the [ generic. Therefore, as
expected, all the operations work correctly with units and errors
metadata. The only drawback is that aggregations drop quantities
metadata by default. The reason is that there is a family of functions
(not only aggregate, but also by and the apply family) that hold
intermediate results in lists, which are finally simplified by calling
unlist.

There is no workaround for this default behaviour, because it is not
possible to define methods for lists of objects of a given class.
Fortunately, all these functions support a parameter called simplify
(sometimes, SIMPLIFY) which, if set to FALSE, avoids the unlist call and
returns the results in a list. Then, a call to do.call(c, ...) will
combine the quantities without losing attributes or classes.
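
A minimal sketch of this pattern follows, again in R base. The
toy_errors class, its c method and the toy_mean helper are hypothetical
and only mimic an object with one error per element; the error formula
assumes independent errors, which is an assumption of this sketch.

```r
# Hypothetical toy class (NOT the 'errors' package): one error per element.
toy_errors <- function(x, e) structure(x, errors = e, class = "toy_errors")

`[.toy_errors` <- function(x, i) {
  toy_errors(unclass(x)[i], attr(x, "errors")[i])
}

# Combine method, so that do.call(c, ...) can rebuild a single object.
c.toy_errors <- function(...) {
  parts <- list(...)
  toy_errors(
    unlist(lapply(parts, as.numeric)),
    unlist(lapply(parts, attr, "errors"))
  )
}

# Group mean; errors are propagated assuming independence (an assumption).
toy_mean <- function(x) {
  e <- attr(x, "errors")
  toy_errors(mean(as.numeric(x)), sqrt(sum(e^2)) / length(e))
}

x <- toy_errors(c(1, 3, 10, 20), c(0.3, 0.4, 0.6, 0.8))
g <- c("a", "a", "b", "b")

# Default simplification goes through unlist(): class and errors are lost.
dropped <- sapply(split(x, g), toy_mean)

# simplify = FALSE keeps a list; do.call(c, ...) then dispatches on c().
kept <- do.call(c, sapply(split(x, g), toy_mean, simplify = FALSE))
```

The object dropped is a plain numeric vector, while kept is still a
toy_errors object with one error per group.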

Tidyverse

Data wrangling operations on data frames map to tidyverse functions as
follows:

  • Row subsetting: dplyr::filter (and others).
  • Row ordering: dplyr::arrange.
  • Column transformation: dplyr::transmute and dplyr::mutate.
  • Row aggregation: dplyr::summarise (and others) with
    dplyr::group_by for observation grouping.
  • Column joining: dplyr::*_join family.
  • (Un)pivoting: tidyr::gather and tidyr::spread.

The tidyverse handles quantities correctly for subsetting, ordering and
transformations. It fails to do so for aggregations (grouped operations
in general), column joining and (un)pivoting. Most of these
incompatibilities are due to the same internal grouping mechanism, which
is implemented in C and prevents the R subsetting operator from being
called (which in turn would call the subsetting operator on the errors
attribute). Interestingly, those operations still work for units alone,
except for column gathering, which drops all classes and attributes. It
seems, though, that there are long-term plans in dplyr for supporting
vectorised attributes (see tidyverse/dplyr#2773 and
tidyverse/dplyr#3691).

A note on data.table

Currently (v1.11.4) data.table does not work well with vectorised
attributes. The underlying problem is similar to dplyr’s issue, but
unfortunately it affects more operations, including row subsetting and
ordering. Only column transformation seems to work, and other operations
generate corrupted objects. This issue was reported on GitHub (see
Rdatatable/data.table#2948).

Future directions of units and errors

A couple of weeks ago, I had the pleasure of visiting Edzer Pebesma at
the Institute for Geoinformatics in Muenster, and we had a nice
R-quantities summit.

We had a very productive discussion on the future directions of the
units and errors packages. These are some of the ideas on the table:

  • As a follow-up to the previous milestone, we found the idea of
    enhancing the readr package interesting, so that third-party
    packages could provide new column types and parsers that would work
    transparently. There are other interesting use cases, such as
    reading spatial data. We registered the proposal in readr’s
    repository.
  • We discussed a recent proposal by Bill Denney (and had a most
    interesting chat with him) in which he requests support for mixed
    units in R vectors and arrays. Bill works with data from clinical
    studies and deals with a very specific format. I refer to the issue
    at hand (the proposal above and references therein) for specific
    examples and further discussion. Edzer has already started to work
    on this, and there is a functional prototype in the mixed branch on
    GitHub.
  • As a mid-term plan, we would also like to add support for other
    propagation methods to the errors package. More specifically,
    instead of storing a single value and an associated error (and
    applying the Taylor series method, TSM), we plan to provide support
    for full samples. Operations would work directly on these samples,
    so that every kind of correlation would be captured.
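
To illustrate the last point, here is a minimal sketch in R base that
contrasts first-order propagation under an independence assumption with
a sample-based representation, for the expression x - x. The variable
names, the normal distribution and the sample size are assumptions of
this sketch and do not reflect any planned API.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

x_val <- 10
x_err <- 0.1

# First-order propagation for d = a - b with a and b treated as independent:
# sd(d) = sqrt(sd(a)^2 + sd(b)^2). Applied naively to x - x, it reports a
# non-zero error even though the true difference is exactly zero.
tsm_err <- sqrt(x_err^2 + x_err^2)

# Sample-based representation: the quantity is stored as a full sample
# (normal draws here, an assumption) and operations act on the draws.
x_smp <- rnorm(1e4, mean = x_val, sd = x_err)
d_smp <- x_smp - x_smp
smp_err <- sd(d_smp)  # zero: x is perfectly correlated with itself
```

Because every draw is subtracted from itself, the sample-based result
carries no spread, whereas the independence-based formula does.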

Next steps

The R-quantities project is coming to an end. The next and final
milestone will try to provide a proof-of-concept to wrap lm methods,
where errors are used to define weights in the linear model and units
propagate to the regression coefficient estimates and residuals. We will
also complete the documentation with the prospect of a first release of
the quantities package on CRAN.
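
As a rough sketch of the weighting idea only, per-observation errors
could be passed to lm as weights. The inverse-variance choice and the
simulated data below are assumptions of this example, not the design of
the upcoming proof-of-concept, and unit propagation is not shown.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

x <- 1:10
e <- rep(c(0.5, 2), 5)              # made-up per-observation uncertainties
y <- 2 * x + 1 + rnorm(10, sd = e)  # simulated response with those errors

# Inverse-variance weights: observations with larger errors count less.
fit <- lm(y ~ x, weights = 1 / e^2)
coef(fit)
```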
