Let’s Have Some Sympathy For The Part-time R User

August 4, 2017
By

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

When I started writing about methods for better "parametric programming" interfaces for dplyr for R dplyr users in December of 2016 I encountered three divisions in the audience:

  • dplyr users who had such a need, and wanted such extensions.
  • dplyr users who did not have such a need ("we always know the column names").
  • dplyr users who found the then-current fairly complex "underscore" and lazyeval system sufficient for the task.

Needing name substitution is a problem an advanced full-time R user can solve on their own. However a part-time R would greatly benefit from a simple, reliable, readable, documented, and comprehensible packaged solution.

Background

Roughly I suggested two possible methods for making the task easier:

  • Renaming views for data.frames. I have now implemented the idea as a call-scoped concept in replyr::replyr_apply_f_mapped() ("call-scoped", meaning the re-mapping lasts for the duration of a function call).
  • Symbol re-binding by a block-scoped command called let() (a common functional notation; and "block-scoped" meaning the re-mapping lasts for the span of a code-block). I released this solution to CRAN and publicly announced it on December 8 2016.

I mention dates to point out that this is something I have been inviting public comment on for some time.

Things change. Since the above time:

  • The development version of dplyr incorporated a new rlang/tidyeval package (probably around February 14th 2017).
  • rlang/tidyeval was released to CRAN on May 2017. Obviously rlang/tidyeval had been under development for some time, but I don’t think the parametric aspect of it was publicly discussed much before February 16, 2017 (notice that a formula centric interface was still being contemplated).
  • dplyr 0.7.0 was released, based on rlang/tidyeval (June 9th, 2017).
  • dplyr excised direct use of lazyeval.
  • The dplyr "underscore verbs" (or methods) were all deprecated (i.e., no longer advised).

The rlang/tidyeval strategy is to capture un-evaluated user expressions (as a new object called a "quosure") and evaluate them with new language rules (with new bindings and something called an "overscope"). Also note the rlang/tidyeval strategy is full integration or re-writing of packages in terms of rlang/tidyeval; this isn’t something you mix-in or turn on or off.

Some points I think that have been under-represented in previous discussions include:

  • Not all R users consider themselves to be expert programmers (many are happy calling themselves analysts or statisticians).
  • R is often used in collaborative projects where there are varying levels of programming expertise.

The second point I think is particularly interesting. It means:

An R user who does not consider themselves an expert programmer could be maintaining code that they understand, but could not be expected to create from scratch.

Or:

Let’s have some sympathy for the part-time R user.

This is the point we will emphasize in our new example.

The example

The design and discussion of substitution solutions should be driven from concrete realistic use cases. Working from larger examples gives us a taste of what working with each solution is like in practice. So, let’s pretend to discuss social science (instead of programming).

Suppose an analyst, psychologist, medical doctor, or scientist is building an assessment for some aspects of behavior and anxiety.

Often such assessments involve selecting moving through a multiple-choice questionnaire and collecting a number of points that depend on answers selected. One such assessment is the Generalized Anxiety Disorder 7 questionnaire (or GAD-7). It is a very simple system as can be seen below.

GAD708 19 08Cartwright

One can treat such a test score as a classifier and assess it in terms of sensitivity, specificity, and different correspondence measures.

An obvious extension of such tests is to give a different number of points in different categories for each multiple-choice answer. For example we could imagine such a test where each answer gave a varying number of points in one of two categories called "withdrawal behavior" and "positive re-framing" (both in the sense of coping behaviors).

For example, our scientist might record the results of two subjects taking a test as follows:

d <- data.frame(
  subjectID = c(1,                   
                1,
                2,                   
                2),
  surveyCategory = c(
    'withdrawal behavior',
    'positive re-framing',
    'withdrawal behavior',
    'positive re-framing'
  ),
  assessmentTotal = c(5,                 
                      2,
                      3,                  
                      4),
  stringsAsFactors = FALSE
)
  
print(d)
##   subjectID      surveyCategory assessmentTotal
## 1         1 withdrawal behavior               5
## 2         1 positive re-framing               2
## 3         2 withdrawal behavior               3
## 4         2 positive re-framing               4
# or in "wide form":
library("cdata")
moveValuesToColumns(d, 
                    columnToTakeKeysFrom = 'surveyCategory',
                    columnToTakeValuesFrom = 'assessmentTotal',
                    rowKeyColumns = 'subjectID')
##   subjectID positive re-framing withdrawal behavior
## 1         1                   2                   5
## 2         2                   4                   3

A natural question is: how does one assign weights to each answer? One way would be to administer the test to a number of people the experimenter has classified as having either of the above mentioned behaviors and then performing a logistic regression to map assessment answers to the probability of a given diagnosis for this population. By re-scaling the weights and rounding them to small integers we could have a test point system that is very close to performing a logistic regression classification. We may be able to use the same assessment questions in a much more decisive manner than assigning all questions the same number of points.

This sort of idea is what one would expect from a mixed and collaborating team that includes medical experts, statistics experts, and programmers. After some work our team might work out that scoring the assessment can be done by the simple R dplyr pipeline:

suppressPackageStartupMessages(library("dplyr"))

scale <- 0.237
  
d %>%
  group_by(subjectID) %>%
  mutate(probability =
           exp(assessmentTotal * scale)/
           sum(exp(assessmentTotal * scale)))
## # A tibble: 4 x 4
## # Groups:   subjectID [2]
##   subjectID      surveyCategory assessmentTotal probability
##                                        
## 1         1 withdrawal behavior               5   0.6706221
## 2         1 positive re-framing               2   0.3293779
## 3         2 withdrawal behavior               3   0.4410258
## 4         2 positive re-framing               4   0.5589742

For each subject we take the row with maximal probability as the diagnosis. The diagnosis was already obvious from the original scores, the main addition is the diagnosis confidence is now available as a probability estimate.

Each step of the above pipeline is learn-able:

  • The group_by() is arranging all rows associated with the same subject to work together in later calculations.
  • the exp(assessmentTotal * scale)/sum(exp(assessmentTotal * scale)) is the classic "sigmoid link" from logistic regression. It is the standard way (once you know it) of turning a free-score into a probability estimate.

Suppose this assessment is tested and works well. It is then plausible that the team might ask their R expert to help them construct a much more complicated dplyr pipeline that better formats the results. Under the Harlan Mills’ "Surgical Team" proposal (made famous in Frank Brook’s The Mythical Man Month) we expect effective data science teams to have a diversity of deep expertise (not everybody know everything, but a lot is known by the total team). We expect a well staffed research team to include the statistician who worked out the sigmoid transform above, and a programmer who works out the pipeline we give below.

d %>%
  group_by(subjectID) %>%
  mutate(probability =
           exp(assessmentTotal * scale)/
           sum(exp(assessmentTotal * scale))) %>%
  arrange(probability, surveyCategory) %>%
  mutate(isDiagnosis = row_number() == n()) %>%
  filter(isDiagnosis) %>%
  ungroup() %>%
  select(subjectID, surveyCategory, probability) %>%
  rename(diagnosis = surveyCategory) %>%
  arrange(subjectID)
## # A tibble: 2 x 3
##   subjectID           diagnosis probability
##                             
## 1         1 withdrawal behavior   0.6706221
## 2         2 positive re-framing   0.5589742

This is indeed a long (and expert-level) pipeline. But the principle is:

  • It does useful work (concentrates down to the rows we want and ensures good presentation column names and sorting).
  • While a part-time R user would not be expected to come up with it, they could (with cooperation from the pipeline author) understand all the steps and safely use the pipeline in their project.
  • The application (which we spent some time describing) is what the team cares about, the pipeline is a ends to a means (so even though it is long, it isn’t often the central subject of interest).
  • The longer pipeline is paying the bills, and helping patients. So some pain and cost are to be tolerated.

Let’s take this deliberately long (so as to be a strong test) example and see how hard the pipeline is to re-use under different methodologies.

Re-use

An issue that comes up is: can the team re-use the pipeline on another project? Suppose in their next project the ID column isn’t "subjectID" but it is "patientID" (and so on). Obviously they can copy and paste the original pipeline and change the names (which is not a bad practice for the first few re-uses).

But once this procedure is going to be used many times it is a good idea to wrap it up or genericize it so it can be safely re-adapted (so the users can’t accidentally forget to change one name one place).

I will now walk through a number of approaches to this in terms of how hard they are on the researcher. We are assuming their R expert does the wrapping for them, but then must explain the concepts to the part-time R user so they truly understand and can maintain the tools they are using.

For our example we assume all the column names are coming from variables set somewhere else (in another R script, or coming from a spreadsheet that is read into R, or some other source). The nature of the columns is constant from analysis to analysis, but the exact names used may vary. For our example the column names are:

idCol        <- "subjectID"
categoryCol  <- "surveyCategory"
linkScoreCol <- "assessmentTotal"
indicatorCol <- "isDiagnosis"
probScoreCol <- "probability"
outcomeCol   <- "diagnosis"

wrapr solution

In my opinion the easiest solution (in terms of cognitive load) is wrapr::let(). The R expert would share the following code:

library("wrapr")

let(
  c(
    IDCOL        = idCol,
    CATEGORYCOL  = categoryCol,
    LINKSCORECOL = linkScoreCol,
    INDICATORCOL = indicatorCol,
    PROBSCORECOL = probScoreCol,
    OUTCOMECOL   = outcomeCol
  ),
  
  d %>%
    group_by(IDCOL) %>%
    mutate(PROBSCORECOL =
             exp(LINKSCORECOL * scale)/
             sum(exp(LINKSCORECOL * scale))) %>%
    arrange(PROBSCORECOL, CATEGORYCOL) %>%
    mutate(INDICATORCOL = row_number() == n()) %>%
    filter(INDICATORCOL) %>%
    ungroup() %>%
    select(IDCOL, CATEGORYCOL, PROBSCORECOL) %>%
    rename(OUTCOMECOL = CATEGORYCOL) %>%
    arrange(IDCOL)
)
## # A tibble: 2 x 3
##   subjectID           diagnosis probability
##                             
## 1         1 withdrawal behavior   0.6706221
## 2         2 positive re-framing   0.5589742

The concept is:

"let() works as if you had written the code with the names substituted as shown in the c() block."

And there is ample documentation showing how this can be used. Notice creating this code is completely mechanical (replace concrete names with the all-caps place holders) and the execution has an easy mental model (the place-holders are replaced with names stored in the variables).

In this solution the adapted code looks like the original code.

replyr solution

The next easiest method in concept is replyr_apply_f_mapped().

The R expert would write the following, and the part-time R user (with some coaching) could maintain it.

library("replyr")

d %>%
  replyr_apply_f_mapped(
    nmap = c(
      IDCOL        = idCol,
      CATEGORYCOL  = categoryCol,
      LINKSCORECOL = linkScoreCol,
      INDICATORCOL = indicatorCol,
      PROBSCORECOL = probScoreCol,
      OUTCOMECOL   = outcomeCol
    ),
    
    f = . %>%
      group_by(IDCOL) %>%
      mutate(PROBSCORECOL =
               exp(LINKSCORECOL * scale)/
               sum(exp(LINKSCORECOL * scale))) %>%
      arrange(PROBSCORECOL, CATEGORYCOL) %>%
      mutate(INDICATORCOL = row_number() == n()) %>%
      filter(INDICATORCOL) %>%
      ungroup() %>%
      select(IDCOL, CATEGORYCOL, PROBSCORECOL) %>%
      rename(OUTCOMECOL = CATEGORYCOL) %>%
      arrange(IDCOL)
  )
## # A tibble: 2 x 3
##   subjectID           diagnosis probability
##                             
## 1         1 withdrawal behavior   0.6706221
## 2         2 positive re-framing   0.5589742

What the code does is exactly this:

  • It renames all of the columns in the data.frame to have the chosen names (in this case the all-caps names).
  • It then applies the user-supplied function f() to this data.frame.
  • The reverse of the name-mapping is applied to the result of f(), moving columns back to their original names.

The concept is:

replyr_apply_f_mapped() renames columns and back.

Below is an illustrative example showing the column names seen inside and outside the user supplied function.

print(colnames(d))
## [1] "subjectID"       "surveyCategory"  "assessmentTotal"
d %>%
  replyr_apply_f_mapped(
    nmap = c(
      IDCOL        = idCol,
      CATEGORYCOL  = categoryCol,
      LINKSCORECOL = linkScoreCol,
      INDICATORCOL = indicatorCol,
      PROBSCORECOL = probScoreCol,
      OUTCOMECOL   = outcomeCol
    ),
    
    f = function(df) {
      df$PROBSCORECOL <- 1
      print(colnames(df))
      return(df)
    }
  ) %>%
  colnames()
## [1] "IDCOL"        "CATEGORYCOL"  "LINKSCORECOL" "PROBSCORECOL"

## [1] "subjectID"       "surveyCategory"  "assessmentTotal" "probability"

This is teachable and something the part-time R user can correctly extend and maintain. Though the user may possibly need to learn about wrapping a pipeline as an anonymous function (the ". %>%" notation).

rlang/tidyeval solution

For the rlang/tidyeval solution the expert writes the following code:

IDSYM        <- rlang::sym(idCol)
CATEGORYSYM  <- rlang::sym(categoryCol)
LINKSCORESYM <- rlang::sym(linkScoreCol)
INDICATORSYM <- rlang::sym(indicatorCol)
PROBSCORESYM <- rlang::sym(probScoreCol)
OUTCOMESYM   <- rlang::sym(outcomeCol)

d %>%
  group_by(!!IDSYM) %>%
  mutate(!!PROBSCORESYM :=
           exp((!!LINKSCORESYM) * scale)/
           sum(exp((!!LINKSCORESYM) * scale))) %>%
  arrange(!!PROBSCORESYM, !!CATEGORYSYM) %>%
  mutate(!!INDICATORSYM := row_number() == n()) %>%
  filter(!!INDICATORSYM) %>%
  ungroup() %>%
  select(!!IDSYM, !!CATEGORYSYM, !!PROBSCORESYM) %>%
  rename(!!OUTCOMESYM := !!CATEGORYSYM) %>%
  arrange(!!IDSYM)
## # A tibble: 2 x 3
##   subjectID           diagnosis probability
##                             
## 1         1 withdrawal behavior   0.6706221
## 2         2 positive re-framing   0.5589742

Several points have to be taught to the part-time R user if this code is to be maintained:

  • The "!!" symbol does not have the same operator precedence as an assignment symbols such as "=" or ":=", so you must often place "!!"-expressions in extra parentheses.
  • In any assignment we must use ":=" for assignment when using "!!" on the left-hand side of the assignment.

The above are just some syntax edge-cases, we haven’t even gone into teaching rlang::sym(), "!!", and the theory and semantics of quasi-quotation.

seplyr solution

seplyr is an experiment to see what a referentially transparent (or completely value oriented) interface to dplyr would look like. Please don’t think of seplyr as an adapter (though it is, it sends all work to dplyr), but as an illustration of what a completely value-oriented dplyr might look like (i.e., one that did not capture un-evaluated user code through non-standard evaluation). Roughly seplyr is an experiment of the form: "what if one tried harder with something like the new dplyr::*_at() verbs."

Most of the seplyr methods are named *_se() and are designed to be very similar to their dplyr equivalents (and some are nearly identical to dplyr::*_at() methods, rename_se() being a notable exception).

library("seplyr")
suppressPackageStartupMessages(library("glue"))

d %>%
  group_by_se(idCol) %>%
  mutate_se(probScoreCol :=
           glue('exp({linkScoreCol} * scale)/
                  sum(exp({linkScoreCol} * scale))')) %>%
  arrange_se(c(probScoreCol, categoryCol)) %>% 
  mutate_se(indicatorCol := "row_number() == n()") %>%
  filter_se(indicatorCol) %>%
  ungroup() %>%
  select_se(c(idCol, categoryCol, probScoreCol)) %>%
  rename_se(outcomeCol := categoryCol) %>%
  arrange_se(idCol)
## # A tibble: 2 x 3
##   subjectID           diagnosis probability
##                             
## 1         1 withdrawal behavior   0.6706221
## 2         2 positive re-framing   0.5589742

The concept is:

"Only mutate needs non-standard evaluation."

seplyr accepts general expressions many more places, but with proper organization and using a few temp-columns you really only need the full generality in mutate().

seplyr has its own issues:

  • It also needs a ":=" operator for assignment.
  • It insists on multiple arguments coming in as vectors (hence the use of "c()" throughout).
  • It runs into a bit of trouble with verbs that take expressions (mutate_se() being the most complicated) in that it needs a helper to substitute in the name of the variable holding the column name, which is later substituted out for the actual column name by seplyr. In this example we used glue::glue() to perform the substitution, but we could also try paste0() or gsub().

The lesson from seplyr is the mutate() verb does indeed need some kind of expression manipulation tooling (direct string manipulation feeling too crude). However, for the rest of the verbs the value oriented notation is in fact quite natural, and really in no sense inferior to the dplyr originals.

Conclusion

Name substitution is a reasonable need that arises when re-using R work or when trying to iterate of column names. I have been publicly exploring variations of substitution systems so that R users can make an informed choice of one or more that most meets their needs and addresses their personal trade-offs between: power, safety, readability, and teachability. These sections are not each independent "yet another way of performing name substitution", but parts of a public conversation that should be had before name substitution is considered settled and fixed in stone.

A part-time R user will not have the background to quickly compare all of the available substitution systems. In fact such a user will only come to need a substitution system when they have a problem. So by definition they are in in the middle of some other task. It is up to expert partners to evaluate and explain alternatives.

There is a temptation that if you are going to only teach one system it might as well be rlang/tidyeval as "that is what now comes with dplyr". I feel this is a false savings as while rlang/tidyeval "is already in dplyr" the rlang/tidyeval concepts and details are not "already in the user" (and in fact include a fairly number of irregular exceptions, needing to be taught and memorized).

Our preference is: wrapr::let(). wrapr::let() delivers a lot of (safe) power for a modest amount of cognitive load. Each of the above systems involves different trade-offs and compromises, and we feel one must really try a few in production before having an expert opinion.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)