When you are conducting a business analysis project with a data extract from the company’s internal system, professional risk management suggests you make sure you are not missing any records or double counting any records. But you certainly don’t want to look at every record. Yikes!
Auditors solve this predicament with control totals. When the sums of key fields and the numbers of records match known values, usually from some well-established “production report,” it can be assumed your data “reconciles.” *
What does it mean to calculate “control totals” of a general data.frame?
For numeric columns it’s obvious to use sum, the only question being what to do with missing values. For non-numeric columns, “sum” is not well-defined. You could say NA, but IMO it would be more informative to show the number of non-NA values.
Hmm, maybe this has already been done before.
- You can calculate sum, or a version of count, but not both
- You must know the names of the columns in advance.
Here is my one-line function called controlTotals:
sapply(x, function(x) if (is.numeric(x)) sum(as.numeric(x), na.rm = na.rm) else isum(!is.na(x)))
In typical business use cases, the most important fields to check should not have missing values, so na.rm=FALSE by default. In that case NA’s in a column are detected by
- numeric: NA returned
- non-numeric: the number of non-NA values will be less than the number of rows.
If NA’s in the data are acceptable, then set na.rm=TRUE and the values returned will be
- numeric: the sum of all non-NA values (or NA if all are NA)
- non-numeric: always the number of non-NA values.
Here is an example.
c = letters[1:3], d = c(“high”, NA, “low”))
a b c d
6 NA 3 2
> controlTotals(extract, TRUE)
a b c d
6 4 3 2
In the first, default case, column b’s control total is NA and column d’s control total is less than the number of records. Aha! there’s something missing in both columns! Better check that out.
- R-technicality: In the definition of controlTotals you may be wondering why as.numeric is called after is.numeric has already been established. That’s because integer’s are numeric, and summing integers can result in an overflow if the field contains large values, as when whole numbers are used for “record ID’s.” R recovers gracefully with only a warning in those situations, but error, recovery and warning are easily avoided with this extra step. BTW, this extra step was found in the code for gdata‘s nobs method (kudos Gregory Warnes et. al.)
* P&C actuaries note: reconciliation is a professional concern not restricted to Schedule P