Data-driven unit testing for data scientists and quant developers alike

Often overlooked, testing is a critical process that saves time over the long term and enables building complex systems. Unit tests for model-driven systems (descriptive, predictive, or prescriptive analytics) differ from those for standard software: effective unit testing of computational systems places particular emphasis on the data used in the tests. Below are some guidelines to streamline your unit testing and avoid ambiguous results.

Every test must add information

First, what are we trying to accomplish with testing? At a very high level, we want certainty from our code. Unfortunately, we don't always implement things correctly. Sometimes, when we add new functionality, we break old functionality. Worse, we can't always control the inputs to our system, and the system fails when we least expect it. Testing helps us understand how our system will behave under different conditions, which increases our confidence in its reliability. With greater reliability, we can build other systems on top of the software.

If a test doesn't provide any useful information, it should be eliminated. Such tests are a waste of time and act like noise in the code. Suppose I write a trivial function to convert a timestamp to GMT. I often load data saved in GMT, but sometimes it gets converted to the local timezone when read in, so I need to convert it back. The function looks like this:

to_gmt <- function(x) as.POSIXct(as.character(x), tz='GMT')

Do I need to test this? I could, but how much information would a test provide? Not much. If I chose to test this function, I would probably focus on an edge case, such as what happens if I don't get a POSIXct as input.
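If I did, a minimal sketch of such an edge-case test (my own illustration, not from the original article) might pin down what happens when a character timestamp is passed in, since to_gmt coerces with as.character anyway:

assert('to_gmt accepts a character timestamp', {
  # assumption: a character input should still parse as GMT, because
  # to_gmt coerces via as.character before re-parsing
  x <- "2018-03-01 16:00:00"
  (format(to_gmt(x), usetz = TRUE) == "2018-03-01 16:00:00 GMT")
})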

Let's look at a slightly more complicated example. Suppose I want to compute the precision of my model. I often implement this from scratch to reduce the dependencies in my code. One implementation is

precision <- function(predicted, actual) {
  cm <- table(actual, predicted)
  cm['TRUE','TRUE'] / sum(cm[,'TRUE'])
}

Do I need to test this? As a sanity check, we probably want to test one expected scenario, since it's easy to confuse the implementation of precision and recall. We might also want to verify that the result is unaffected by the ordering of the labels.
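For example, a sketch of those two tests might look like the following. The data and the expected value of 0.5 are hand-constructed for illustration; the first dataset is chosen so that recall would be 2/3, which would catch a precision/recall mix-up, and the second reorders the data so the FALSE cases come first, one plausible reading of "unaffected by the ordering of the labels".

assert('precision matches a hand-computed value', {
  predicted <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)
  actual    <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
  # 2 true positives out of 4 positive predictions
  (precision(predicted, actual) == 0.5)
})

assert('precision is unaffected by the ordering of the labels', {
  predicted <- c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
  actual    <- c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE)
  (precision(predicted, actual) == 0.5)
})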

Test one situation per unit test

If you’ve ever answered a question with “it depends”, you know that different situations result in different answers. The same is true of testing: each distinct situation is what we want to test. From the perspective of the scientific method, we know that to determine whether a particular variable has an effect on an experiment, all other variables must be held constant. The same holds for testing. When a test containing multiple scenarios fails, how do you know what caused the failure? Determining the cause takes more work than it would if the scenarios were separated into their own tests.

For example, using testit in R, suppose I chose to test two scenarios in my to_gmt function. The first scenario is a sanity test to make sure it does what I expect, while the second is an edge case (and not really valid — why?).

assert('to_gmt works', {
  x1 <- as.POSIXct("2018-03-01 16:00:00")
  x2 <- as.POSIXct("2018-03-01 16:00:00+0200")
  (format(to_gmt(x1), usetz = TRUE) == "2018-03-01 16:00:00 GMT")
  (format(to_gmt(x2), usetz = TRUE) == "2018-03-01 14:00:00 GMT")
})

If this test fails, what caused the problem? It's easier to know by creating two distinct tests.

assert('to_gmt sanity test', {
  x <- as.POSIXct("2018-03-01 16:00:00")
  (format(to_gmt(x), usetz = TRUE) == "2018-03-01 16:00:00 GMT")
})

assert('to_gmt when input has embedded timezone', {
  x <- as.POSIXct("2018-03-01 16:00:00+0200")
  (format(to_gmt(x), usetz = TRUE) == "2018-03-01 14:00:00 GMT")
})

Focus on edge cases

Mathematical functions have a standard suite of tests that need to be performed. These include normal operation (the sanity or smoke test), followed by boundary conditions, followed by bad data scenarios. Boundary conditions map to the domain of a function. In these cases, data typically has the correct type, but the value is not appropriate for the function. For example, division by 0 is undefined, so it’s important to know what happens when a 0 is given to the division operator. Similarly, the logarithm isn’t defined for negative numbers, so a negative number is out of bounds and should be tested. Adhering to the rule that each test must add information, there isn’t any reason to add more than one test, since all negative numbers are members of the same boundary condition.

Bad data scenarios are different from boundary scenarios. Here, the type of data is simply wrong. For log, this might be a character. There are innumerable bad data scenarios, and they don’t all need to be tested (see below for more discussion).
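As a sketch of both kinds of test for log (my own illustration, using testit's has_error helper):

assert('log of a negative number is NaN', {
  # boundary condition: the value is numeric but outside the domain
  (is.nan(suppressWarnings(log(-1))))
})

assert('log of a character input fails', {
  # bad data: the type itself is wrong
  (has_error(log('a')))
})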

Boundary conditions almost always need to be tested. Some bad data scenarios need to be tested. How do you distinguish between the two? Consider matrix inversion. A boundary condition is a singular matrix. A bad data scenario is a non-square matrix.
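Using base R's solve as the inversion routine, a quick sketch of both scenarios (again my own illustration) might be:

assert('inverting a singular matrix fails', {
  # boundary condition: a square numeric matrix, but not invertible
  m <- matrix(c(1, 2, 2, 4), nrow = 2)
  (has_error(solve(m)))
})

assert('inverting a non-square matrix fails', {
  # bad data: the shape is simply wrong for inversion
  m <- matrix(1:6, nrow = 2)
  (has_error(solve(m)))
})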

Make test coverage proportional to usage and impact

Not all scenarios need to be tested. Indeed, any student of logic knows that, where deduction can’t be applied, it’s impossible to prove a statement absolutely true. In testing, the concept of coverage describes how much of a system is tested. Simplistic test coverage tools check whether each function (or method) has a test. Sometimes they’ll perform code analysis to look for different conditions that may require testing. Unfortunately, coverage doesn’t understand boundary conditions and edge cases, so attempting to achieve 100% test coverage is not only a fool’s errand but also a red herring.

As an alternative, an efficient heuristic is to evaluate how important a function is and write tests commensurate with that importance. Two ways to measure importance are 1) how often the function is used, and 2) how bad it is if the function fails. If the answer is either “a lot” or “pretty bad”, then you likely need a lot of tests pretty badly.

Use realistic test cases

It’s tempting to construct explicit data structures to test with, even if that’s not the intended use case. It’s better to write tests that are consistent with a function’s raison d’être. One reason is that test cases can act like documentation when the test scenarios are realistic. If they aren’t realistic, the test cases only demonstrate the mechanics of an operation without communicating the semantics of the operation.

Consider the which function, which takes a logical vector and returns a set of ordinals. A mechanical test constructs inputs explicitly and exclusively tests the mechanical operation of the function.

assert('which returns correct ordinals', {
  all.equal(which(c(TRUE,FALSE,TRUE,FALSE)), c(1,3))
})

This test does indeed verify the function’s behavior, but it doesn’t communicate how the function is used in the real world. A better test provides some context, giving insight into the semantics of the function. For example, which extracts ordinals based on some condition, so a better test illustrates this use case by generating the logical vector via a conditional expression.

assert('which returns correct ordinals', {
  x <- 11:14
  all.equal(which(x %% 2 == 1), c(1,3))
})

The test now indicates that which typically takes a conditional expression that evaluates to a logical vector.

Favor programmatic assertions over manual assertions

How often do you write code from scratch versus copying and pasting a snippet from elsewhere? Reusing bits of code is common in testing. Many unit testing frameworks involve a lot of boilerplate, not to mention that tests can have repetitive setup tasks. When creating data for tests and evaluating results afterward, it’s important to do so programmatically.

Consider a function reverse that reverses a vector. This is what manual assertions look like.

assert('reverse a vector', {
  x <- c(2,3,4)
  act <- reverse(x)
  (act[1] == 4)
  (act[2] == 3)
  (act[3] == 2)
})

These assertions only work for vectors of length 3. Sometimes it's not worth programmatically generating data for a test, but manual assertions are particularly fragile. After duplicating a code snippet, you may change some of the input data without updating the assertions, and in some cases the test will still pass even though it no longer fully verifies the behavior.
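For instance (a scenario of my own, not from the article), suppose the test above is copied for a length-four input and the three assertions are updated, but nobody notices that the fourth element is never checked:

assert('reverse a longer vector', {
  x <- c(2,3,4,5)
  act <- reverse(x)
  (act[1] == 5)
  (act[2] == 4)
  (act[3] == 3)
  # act[4] is never checked, so a buggy reverse that drops the last
  # element, e.g. function(x) rev(x)[1:3], would still pass this test
})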

A better test checks equality programmatically. Now the assertion works generically for any two vectors of equal length. Notice that I also explicitly defined the expected vector, further templatizing the code.

assert('reverse a vector', {
  x <- c(2,3,4)
  exp <- c(4,3,2)
  act <- reverse(x)
  all.equal(act, exp)
})

Use data that unequivocally verifies function behavior

Let’s return to the scientific method. Earlier I mentioned that, to control variables, each scenario should get its own test. Even adhering to this rule, there’s still a chance that a test is ambiguous. Consider the reverse function once more. I test it like this:

assert('reverse a list 1', {
  x <- c(2,2,2,2)
  exp <- c(2,2,2,2)
  act <- reverse(x)
  all.equal(act, exp)
})

Is this a good choice of data for the test? No, because we can't tell whether or not the function did the right thing. There are all sorts of function implementations that can yield the same result, such as function(x) x, or function(x) sample(x). So this data does not produce an unequivocal interpretation of the result.

How about x <- c(2,2,3,3)? This is also not optimal, because the duplicate values can hide extraneous functionality. Better to use unique values. What about x <- c(4,3,2,1)? These are unique, but can I get the correct answer using a different algorithm? Sure, with function(x) x[order(x)], so this isn’t an unequivocal result.

In short, if there exists more than one plausible explanation for how a result is produced, the test is not unequivocal. Spend more time choosing better test data.
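Putting this together, a less ambiguous test (my own sketch) uses unique, non-monotonic values, so that neither the identity function, sorting, nor random sampling could plausibly produce the expected result:

assert('reverse a vector', {
  x <- c(7, 2, 9, 4)
  exp <- c(4, 9, 2, 7)
  act <- reverse(x)
  all.equal(act, exp)
})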

Explicitly distinguish between values and ordinals

In R, ordinals (a.k.a. array indices) are first-class. Ordinals conveniently represent a set and can be used across the columns or rows of a data frame. Ordinals are clearly different from values, but we often use values that look like ordinals in tests. That choice of test data makes the results ambiguous to interpret.

For example, in my earlier test of which, I chose the input as x <- 11:14. What if I chose x <- 1:4 instead?

assert('which returns correct ordinals', {
  x <- 1:4
  all.equal(which(x %% 2 == 1), c(1,3))
})

Mechanically, I'm still testing the same thing, but I've introduced ambiguity into the test. If I'm unfamiliar with which, can I determine definitively what the function is returning: ordinals or values? It’s unclear because my values look like ordinals. Using a different range of values helps to distinguish between the two.

Conclusion

Testing is an important part of model development. To get the most out of it, follow some simple guidelines: pay attention to the data you use in your tests, and focus on testing the most important parts of your model and/or system. While the examples here are in R, the same guidelines apply to any language your models are written in, including Python and JavaScript.

If you found this article helpful, please share it on social media. Better yet, write better tests!

Brian Lee Yung Rowe is founder and CEO of Pez.AI, a data science company that happens to make chatbots. Learn more about how Pez.AI is creating the first data-driven chatbot platform at pez.ai.
