Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We’re evolving in an increasingly data-driven world. And since critical decisions are taken based on results produced by data scientists and data analysts, they need to be be able to trust the tools they use. It is now increasingly common to add continuous integration to software packages and libraries, to ensure the code is not crashing, and that future updates don’t change your code output (snapshot tests). But one type of test still remains uncommon: tests for statistical correctness. That is, tests that ensure the algorithm implemented in your package actually produce the correct results.

It is likely that most statistical package authors run some tests on their own during development but there doesn’t seem to be guidelines on how to test statistical correctness in a solid and standard way 1.

In this blog post, we explore various methods to ensure the statistical correctness of your software. We argue that these tests should be part of your continuous integration system, to ensure your tools remains valid throughout its life, and to let users verify how you validate your package. Finally, we show how these principles are implemented in the Epiverse TRACE tools.

The approaches presented here are non-exclusive and should ideally all be added to your tests. However, they are presented in order of stringency and priority to implement. We also take a example of a function computing the centroid of a list of points to demonstrate how you would integrate the recommendations from this post with the `{testthat}` R package, often used from unit testing:

```#' Compute the centroid of a set of points
#'
#' @param coords Coordinates of the points as a list of vectors. Each element of the
#'   list is a point.
#'
#' @returns A vector of coordinates of the same length of each element of
#'   `coords`
#'
#' @examples
#' centroid(list(c(0, 1, 5, 3), c(8, 6, 4, 3), c(10, 2, 3, 7)))
#'
centroid <- function(coords) {

# ...
# Skip all the necessary input checking for the purpose of this demo
# ...

coords_mat <- do.call(rbind, coords)

return(colMeans(coords_mat))

}```

# Compare your results to the reference implementation

The most straightforward and most solid way to ensure your implementation is valid is to compare your results to the results of the reference implementation. The reference implementation can be a package in another language, an example with toy data in the scientific article introducing the method, etc.

For example, the `{gemma2}` R package, which re-implements the methods from the GEMMA tool written in C++, verifies that values produced by both tools match:

```test_that("Results of gemma2 equal those of GEMMA v 0.97", {
expect_equal(Sigma_ee, diag(c(18.559, 12.3672)), tolerance = 0.0001)
expect_equal(Sigma_uu, diag(c(82.2973, 41.9238)), tolerance = 0.0001)
})```
Example with `centroid()`
```library(testthat)

test_that("centroid() in 1D produces the same results as mean()", {

x <- list(1, 5, 3, 10, 5)

expect_identical(centroid(x), mean(unlist(x)))

})```
`Test passed 🎉`

Note that even if a reference implementation doesn’t exist, it is still good practice to compare your implementation to competing ones. Discrepancies might reveal a bug in your implementation or theirs but in any case, finding it out is beneficial to the community.

However, this approach cannot be used in all cases. Indeed, there may not be a reference implementation in your case. Or it might be difficult to replicate identical computations in the case of algorithm with stochasticity 2.

# Compare to a theoretical upper or lower bound

An alternative strategy is to compare your result to theoretical upper or lower bound. This offers a weaker guarantee that your implementation and your results are correct but it can still allow you to detect important mistakes.

Example with `centroid()`
```test_that("centroid() is inside the hypercube containing the data points", {

x <- list(c(0, 1, 5, 3), c(8, 6, 4, 3), c(10, 2, 3, 7))

expect_true(all(centroid(x) <= Reduce(pmax, x)))
expect_true(all(centroid(x) >= Reduce(pmin, x)))

})```
`Test passed 🥳`

You can see a real-life example of this kind of test in the `{finalsize}` R package. `{finalsize}` computes the final proportion of infected in a heterogeneous population according to an SIR model. Theory predicts that the number of infections is maximal in a well-mixed population:

```# Calculates the upper limit of final size given the r0
# The upper limit is given by a well mixed population
upper_limit <- function(r0) {
f <- function(par) {
abs(1 - exp(-r0 * par[1]) - par[1])
}
opt <- optim(
par = 0.5, fn = f,
lower = 0, upper = 1,
method = "Brent"
)
opt
}```

# Verify that output is changing as expected when a single parameter varies

An even looser way to test statistical correctness would be to control that output varies as expected when you update some parameters. This could be for example, checking that the values you return increase when you increase or decrease one of your input parameters.

Example with `centroid()`
```test_that("centroid() increases when coordinates from one point increase", {

x <- list(c(0, 1, 5, 3), c(8, 6, 4, 3), c(10, 2, 3, 7))

y <- x
y[[1]] <- y[[1]] + 1

expect_true(all(centroid(x) < centroid(y)))

})```
`Test passed 🎉`

An example of this test in an actual R package can again be found in the finalsize package:

```r0_low <- 1.3
r0_high <- 3.3

epi_outcome_low <- final_size(
r0 = r0_low,
<...>
)
epi_outcome_high <- final_size(
r0 = r0_high,
<...>
)

test_that("Higher values of R0 result in a higher number of infectious in all groups", {
expect_true(
all(epi_outcome_high\$p_infected > epi_outcome_low\$p_infected)
)
})```

# Conclusion: automated validation vs peer-review

In this post, we’ve presented different methods to automatically verify the statistical correctness of your statistical software. We would like to highlight one more time that it’s important to run these tests are part of your regular integration system, instead of running them just once at the start of the development. This will prevent the addition of possible errors in the code and show users what specific checks you are doing. By doing so, you are transparently committing to the highest quality.

Multiple voices in the community are pushing more towards peer-review as a proxy for quality and validity:

We would like to highlight that automated validation and peer review are not mutually exclusive and answer slightly different purposes.

On the one hand, automated validation fails to catch more obscure bugs and edge cases. For example, a bug that would be difficult to detect via automated approach is the use of bad Random Number Generators when running in parallel.

But on the other hand, peer-review is less scalable, and journals usually have some editorial policy that might not make your package a good fit. Additionally, peer-review usually happens at one point in time while automated validation can, and should, be part of the continuous integration system.

Ideally, peer-review and automated validation should work hand-in-hand, with review informing the addition of new automated valiation tests.

## Footnotes

1. But see the “testing statistical software” post from Alex Hayes where he presents his process to determine if he deems a statistical package trustworthy or not, and rOpenSci Statistical Software Peer Review book.↩︎

2. Setting the random seed is not enough to compare implementations across programming languages because different languages use different kind of Random Number Generators.↩︎