How cdata Control Table Data Transforms Work

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers.]

With all of the excitement surrounding cdata-style control-table-based data transforms (the cdata ideas being named as the “replacements” for tidyr’s current methodology, by the tidyr authors themselves!), I thought I would take a moment to describe how they work.

cdata defines two primary data manipulation operators: rowrecs_to_blocks() and blocks_to_rowrecs(). These are the fundamental transforms that convert between data representations. The two representations it converts between are:

  • A world where all facts about an instance or record are in a single row (“rowrecs”).
  • A world where all facts about an instance or record are in groups of rows (“blocks”).
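To make the two representations concrete, here is a small base R illustration. The column names (id, AUC, R2, measure, value) and the numbers are made up for this sketch and do not come from the cdata documentation.

```r
# "rowrecs": all facts about a record sit in a single row
# (hypothetical model-quality scores, one row per model id).
rowrecs <- data.frame(
  id  = c(1, 2),
  AUC = c(0.7, 0.8),
  R2  = c(0.4, 0.5)
)

# "blocks": the same facts, but each record now spans a group of rows.
# The id column says which rows belong together; the measure column says
# which fact each row carries.
blocks <- data.frame(
  id      = c(1, 1, 2, 2),
  measure = c("AUC", "R2", "AUC", "R2"),
  value   = c(0.7, 0.4, 0.8, 0.5),
  stringsAsFactors = FALSE
)

print(rowrecs)
print(blocks)
```

Both data frames carry the same four measurements; only the layout differs.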

It turns out that once you develop the idea of specifying the data transformation as explicit data (an application of Eric S. Raymond’s admonition: “fold knowledge into data, so program logic can be stupid and robust.”), you also have a great tool for reasoning about and teaching data transforms.

For example:

rowrecs_to_blocks() does the following. For each row record, make a replica of the control table with values filled in. In relational terms rowrecs_to_blocks() is therefore a join of the data to the control table. Conversely, blocks_to_rowrecs() combines groups of rows into single rows, so in relational terms it is an aggregation or projection. If each of these operations is faithful (keeps enough information around), they are inverses of each other.
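The join and aggregation reading can be sketched in base R alone. This is a rough illustration of the idea, not cdata's implementation; the data, the column names, and the names control_table, blocks, and recovered are all assumptions made for the example.

```r
# Hypothetical row records: one row per id.
rowrecs <- data.frame(id = c(1, 2), AUC = c(0.7, 0.8), R2 = c(0.4, 0.5))

# Control table: a picture of one block. The measure column holds the block
# keys; each value cell names the rowrecs column that supplies that value.
control_table <- data.frame(
  measure = c("AUC", "R2"),
  value   = c("AUC", "R2"),
  stringsAsFactors = FALSE
)

# rowrecs -> blocks as a join: cross join every record key with the control
# table (merge() with by = NULL gives the Cartesian product) ...
joined <- merge(rowrecs["id"], control_table, by = NULL)
blocks <- joined[order(joined$id, joined$measure), ]
# ... then fill each cell by looking up the column it names.
blocks$value <- mapply(
  function(i, col) rowrecs[[col]][rowrecs$id == i],
  blocks$id, blocks$value
)

# blocks -> rowrecs as an aggregation: for each id, collect the block's
# values back into the columns named by the control table.
recovered <- data.frame(id = sort(unique(blocks$id)))
for (k in seq_len(nrow(control_table))) {
  key <- control_table$measure[k]
  col <- control_table$value[k]
  sel <- blocks[blocks$measure == key, ]
  recovered[[col]] <- sel$value[match(recovered$id, sel$id)]
}

print(blocks)
print(recovered)
```

Because no information is dropped in either direction, recovered holds the same values as the original rowrecs, which is the "inverse" property described above.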

We share some nifty tutorials on the ideas here:

One can build fairly clever illustrations and animations to teach the above.

The most common special cases of the above have been popularized in R as unpivot/pivot (pivot invented by Pito Salas), stack/unstack, melt/cast, or gather/spread. These special cases are handled in cdata by convenience functions unpivot_to_blocks() and pivot_to_rowrecs(). A great example of a “higher order” transform that isn’t one of the common ones is given here.
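For the common unpivot/pivot case, a call to the convenience functions might look like the following. This is a minimal sketch that assumes the cdata package is installed; the data frame d and its column names are invented for the example.

```r
library(cdata)  # assumes cdata is installed

# Hypothetical row records: one row per id.
d <- data.frame(id = c(1, 2), AUC = c(0.7, 0.8), R2 = c(0.4, 0.5))

# Unpivot: move the AUC and R2 columns into key/value rows.
blocks <- unpivot_to_blocks(
  d,
  nameForNewKeyColumn   = "measure",
  nameForNewValueColumn = "value",
  columnsToTakeFrom     = c("AUC", "R2")
)

# Pivot: rebuild one row per id from the key/value rows.
rowrecs <- pivot_to_rowrecs(
  blocks,
  columnToTakeKeysFrom   = "measure",
  columnToTakeValuesFrom = "value",
  rowKeyColumns          = "id"
)

print(blocks)
print(rowrecs)
```

Here no control table is written by hand: the key and value column names describe the same simple two-column control table that the general operators would take explicitly.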

Note: the above theory and implementation are joint work of Nina Zumel and John Mount and can be found here. We would really appreciate any citations or credit you can send our way (or even politely correcting those who don’t attribute the work, or who attribute it to others, as there are already a lot of mentions without credit or citation).

citation("cdata")

To cite package ‘cdata’ in publications use:

  John Mount and Nina Zumel (2019). cdata: Fluid Data Transformations. https://github.com/WinVector/cdata/,
  https://winvector.github.io/cdata/.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {cdata: Fluid Data Transformations},
    author = {John Mount and Nina Zumel},
    year = {2019},
    note = {https://github.com/WinVector/cdata/, https://winvector.github.io/cdata/},
  }
