Arbitrary Data Transforms Using cdata

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We have been writing a lot on higher-order data transforms lately:

Cdata

What I want to do now is "write a bit more, so I finally feel I have been concise."

The cdata R package supplies general data transform operators.

  • The whole system is based on two primitives or operators cdata::moveValuesToRowsD() and cdata::moveValuesToColumnsD().
  • These operators have pivot, un-pivot, one-hot encode, transpose, moving multiple rows and columns, and many other transforms as simple special cases.
  • It is easy to write many different operations in terms of the cdata primitives.
  • These operators can work-in memory or at big data scale (with databases and Apache Spark; for big data we use the cdata::moveValuesToRowsN() and cdata::moveValuesToColumnsN() variants).
  • The transforms are controlled by a control table that itself is a diagram of or picture of the transform.

We will end with a quick example, centered on pivoting/un-pivoting values to/from more than one column at the same time.

Suppose we had some sales data supplied as the following table:

SalesPersonPeriodBookingsWestBookingsEast
a2017Q1100175
a2017Q2110180
b2017Q12500
b2017Q22450

Suppose we are interested in adding a derived column: which region the salesperson made most of their bookings in.

library("cdata")
## Loading required package: wrapr
library("seplyr")
d <- d  %.>% 
  dplyr::mutate(., BestRegion = ifelse(BookingsWest > BookingsEast, 
                                       "West",
                                       ifelse(BookingsEast > BookingsWest, 
                                              "East", 
                                              "Both")))

Our notional goal is (as part of a larger data processing plan) to reformat the data a thin/tall table or a RDF-triple like form. Further suppose we wanted to copy the derived column into every row of the transformed table (perhaps to make some other step involving this value easy).

We can use cdata::moveValuesToRowsD() to do this quickly and easily.

First we design what is called a transform control table.

cT1 <- data.frame(Region = c("West", "East"),
                  Bookings = c("BookingsWest", "BookingsEast"),
                  BestRegion = c("BestRegion", "BestRegion"),
                  stringsAsFactors = FALSE)
print(cT1)
##   Region     Bookings BestRegion
## 1   West BookingsWest BestRegion
## 2   East BookingsEast BestRegion

In a control table:

  • The column names specify new columns that will be formed by cdata::moveValuesToRowsD().
  • The values specify where to take values from.

This control table is called "non trivial" as it does not correspond to a simple pivot/un-pivot (those tables all have two columns). The control table is a picture of of the mapping we want to perform.

An interesting fact is cdata::moveValuesToColumnsD(cT1, cT1, keyColumns = NULL) is a picture of the control table as a one-row table (and this one row table can be mapped back to the original control table by cdata::moveValuesToRowsD(), these two operators work roughly as inverses of each other; though cdata::moveValuesToRowsD() operates on rows and cdata::moveValuesToColumnsD() operates on groups of rows specified by the keying columns).

The mnemonic is:

  • cdata::moveValuesToColumnsD() converts arbitrary grouped blocks of rows that look like the control table into many columns.
  • cdata::moveValuesToRowsD() converts each row into row blocks that have the same shape as the control table.

Because pivot and un-pivot are fairly common needs cdata also supplies functions that pre-populate the controls tables for these operations (buildPivotControlTableD() and buildUnPivotControlTable()).

To design any transform you draw out the control table and then apply one of these operators (you can pretty much move from any block structure to any block structure by chaining two or more of these steps).

We can now use the control table to supply the same transform for each row.

d  %.>% 
  dplyr::mutate(., 
                Quarter = substr(Period,5,6),
                Year = as.numeric(substr(Period,1,4)))  %.>% 
  dplyr::select(., -Period)  %.>% 
  moveValuesToRowsD(., 
                    controlTable = cT1, 
                    columnsToCopy = c('SalesPerson', 
                                      'Year', 
                                      'Quarter')) %.>% 
  arrange_se(., c('SalesPerson', 'Year', 'Quarter', 'Region'))  %.>% 
  knitr::kable(.)  
SalesPersonYearQuarterRegionBookingsBestRegion
a2017Q1East175East
a2017Q1West100East
a2017Q2East180East
a2017Q2West110East
b2017Q1East0West
b2017Q1West250West
b2017Q2East0West
b2017Q2West245West

Notice we were able to easily copy the extra BestRegion values into all the correct rows.

It can be hard to figure out how to specify such a transformation in terms of pivots and un-pivots. However, as we have said: by drawing control tables one can easily design and manage fairly arbitrary data transform sequences (often stepping through either a denormalized intermediate where all values per-instance are in a single row, or a thin intermediate like the triple-like structure we just moved into).

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)