It is Needlessly Difficult to Count Rows Using dplyr
When trying to count rows using dplyr or dplyr-controlled data-structures (remote tbls such as sparklyr or dbplyr structures) one is sailing between Scylla and Charybdis. The task is to avoid dplyr corner-cases and irregularities (a few of which I attempt to document in this “dplyr inferno”).
Let’s take an example from sparklyr issue 973:
suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
## [1] '0.7.2.9000'
library("sparklyr") packageVersion("sparklyr")
## [1] '0.6.2'
sc <- spark_connect(master = "local")
## * Using Spark: 2.1.0
db_drop_table(sc, 'extab', force = TRUE)
## [1] 0
DBI::dbGetQuery(sc, "DROP TABLE IF EXISTS extab") DBI::dbGetQuery(sc, "CREATE TABLE extab (n TINYINT)") DBI::dbGetQuery(sc, "INSERT INTO extab VALUES (1), (2), (3)") dRemote <- tbl(sc, "extab") print(dRemote)
## # Source:   table<extab> [?? x 1]
## # Database: spark_connection
##       n
##   <raw>
## 1    01
## 2    02
## 3    03
dLocal <- data.frame(n = as.raw(1:3))
print(dLocal)
##    n
## 1 01
## 2 02
## 3 03
Many Apache Spark big data projects use the TINYINT type to save space. TINYINT behaves as a numeric type on the Spark side (you can run it through SparkML machine learning models correctly), and the translation of this type to R’s raw type (which is not an arithmetic or numerical type) is something that is likely to be fixed very soon. However, there are other reasons a table might have R raw columns in it, so we should expect our tools to work properly with such columns present.
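As a quick aside (independent of Spark), it is easy to demonstrate in base R why raw columns are trouble: R refuses to do arithmetic on the raw type.

r <- as.raw(1:3)
class(r)

## [1] "raw"

sum(r)

## Error in sum(r): invalid 'type' (raw) of argument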
Now let’s try to count the rows of this table:
nrow(dRemote)
## [1] NA
That doesn’t work (apparently by choice!). And I find myself in the odd position of having to defend expecting nrow() to return the number of rows.
There are a number of common legitimate uses of nrow() in user code and package code, including:

- Checking if a table is empty (see the sketch after this list).
- Checking the relative sizes of tables to re-order or optimize complicated joins (something our join planner might add one day).
- Confirming data size is the same as reported in other sources (Spark, database, and so on).
- Reporting amount of work performed or rows-per-second processed.
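For instance, a package author might reasonably write a guard such as the following (a hypothetical helper, shown only to illustrate the first use case; note the remote NA case has to be handled explicitly):

assert_nonempty <- function(d) {
  n <- nrow(d)
  # on remote tbls nrow() can come back NA, so guard for that first
  if (is.na(n)) {
    stop("could not determine row count")
  }
  if (n < 1) {
    stop("table is empty")
  }
  invisible(d)
}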
The obvious generic dplyr idiom would then be dplyr::tally() (our code won’t know to call the new sparklyr::sdf_nrow() function without writing code to check we are in fact looking at a sparklyr reference structure):
tally(dRemote)
## # Source:   lazy query [?? x 1]
## # Database: spark_connection
##      nn
##   <dbl>
## 1     3
That works for Spark, but not for local tables:
dLocal %>% tally
## Using `n` as weighting variable
## Error in summarise_impl(.data, dots): Evaluation error: invalid 'type' (raw) of argument.
The above issue is filed as dplyr issue 3070. When a column named n is present, tally() treats it as a weighting variable: the code either errors-out (if the column is raw, as here) or creates a new total column called nn holding the sum of the n column instead of the count:
data.frame(n=100) %>% tally
## Using `n` as weighting variable
##    nn
## 1 100
We could try adding a column and summing that:
dLocal %>%
  transmute(constant = 1.0) %>%
  summarize(n = sum(constant))
## Error in mutate_impl(.data, dots): Column `n` is of unsupported type raw vector
That fails due to dplyr issue 3069: local mutate() fails if there are any raw columns present (even if they are not the columns you are attempting to work with).
We can try removing the dangerous column prior to other steps:
dLocal %>% select(-n) %>% tally
## data frame with 0 columns and 3 rows
That does not work on local tables, as tally fails to count 0-column objects (dplyr issue 3071; probably the same issue exists for many dplyr verbs, as we saw a related issue for dplyr::distinct).
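For contrast, base R itself has no trouble counting a zero-column data.frame, which makes the tally failure above all the more surprising:

d0 <- dLocal[, integer(0), drop = FALSE]
nrow(d0)

## [1] 3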
And the method does not work on remote tables either (Spark, or database tables), as many of them do not appear to support 0-column results:
dRemote %>% select(-n) %>% tally
## Error: Query contains no columns
In fact we start to feel trapped here. For a data-object whose only column is of type raw, we can’t remove all the raw columns (as we would then form a zero-column result, which does not seem to always be legal), but we also can not add columns (as that is a current bug for local frames). We could try some other transforms (such as joins), but we don’t have safe columns to join on.
At best we can try something like this:
nrow2 <- function(d) {
  # try the direct method first; this works for local tables
  n <- nrow(d)
  if(!is.na(n)) {
    return(n)
  }
  # remote fall-back: replace all columns with a numeric constant,
  # then count by summing it
  d %>%
    ungroup() %>%
    transmute(constant = 1.0) %>%
    summarize(tot = sum(constant)) %>%
    pull()
}

dRemote %>% nrow2()
## [1] 3
dLocal %>% nrow2()
## [1] 3
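Another route, alluded to above, is to explicitly sniff for the Spark case and dispatch to sparklyr::sdf_nrow(). A sketch of that idea follows (it assumes Spark tbls carry the tbl_spark class, and it re-uses the nrow2() fall-back defined above for other remote sources):

count_rows <- function(d) {
  # Spark-specific fast path
  if (inherits(d, "tbl_spark")) {
    return(sparklyr::sdf_nrow(d))
  }
  # local tables and other remote sources: nrow(), then the constant-column trick
  nrow2(d)
}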
We are still experimenting with work-arounds in the replyr package (but it is necessarily ugly code).
spark_disconnect(sc)