[This article was first published on Stories by Sebastian Wolf on Medium, and kindly contributed to R-bloggers].

#### Software can save lives! R and Python both rank among the top 10 programming languages today. Both come out of open-source and research environments and are now moving into industry, where testing software is essential. So why talk about human-readable tests?

Let me introduce you to my working environment. I write R code on a daily basis, and from this code we build software that is used in projects every day. Many people agree that software should be tested, even thoroughly tested until 100% code coverage is reached. If you do not agree, you can stop reading now.

Additionally, my work directly influences people’s lives, and not only how they live but whether they stay alive. I write code in a clinical environment. An untested feature in my software could produce an outcome that your doctor misinterprets, and the resulting wrong treatment decision could cause you pain. The same applies to whoever coded the microcontroller software for your car’s steering. If you steer left to avoid hitting a wall, you do not want your car to steer right and leave you flat and dead. Such software must be thoroughly tested, too.

### The clinical environment and regulatory authorities

What is special about software in a clinical environment is that every application has to be checked, whether it is a medical device, a drug, or even just the process for producing that drug. This checking process is overseen by the government. If you are taking a drug or your blood is analyzed by a medical device, you want authorities to make sure that it saves your life or at least makes you healthier. If you’re from the U.S., that authority is the FDA. If you live in Europe, you have to find out which of our 26 different authorities is responsible for you; in Germany it’s the TÜV, in France it’s the ANSM.

Let’s narrow the scope of these authorities a bit. I would like to talk about a classic example of a clinical application: a urine test strip. Everybody has seen one. Such strips can test, for example, for sugar in your urine. If there is sugar in your urine, you should be checked for diabetes. So the authority has to make sure that the whole process around the test strip improves your diagnosis. It makes sure the test strip can tell whether you should be examined further for diabetes. If you have diabetes, the test strip should at least give a hint.

Let’s make the scope a bit smaller. To evaluate the test strip after it was dipped into your urine, a doctor plugs it into a test strip reader. The reader gives the result of your test as a number. How does it do that? Software evaluates the measurements of the sensor and prints the test outcome to a display according to a specific algorithm. Regulatory authorities need to be able to check that:

1. the algorithm is right.
2. the algorithm was implemented right.
3. the software takes the right input from the right sensor.
4. the display of the device gets the right outcome of the software.

Step 1 basically needs good documentation of the software and the algorithm. This is a different topic. Steps 2 to 4 can be covered by software testing, ideally automated testing. Let’s assume sensor and display are working fine and have already been tested.

### The test case

The device I just made up shall be tested now. It consists of a sensor and a display, and in between sits a chip that runs the software.

We now need a set of sensor values, each paired with the value that shall be displayed on the screen of the device:

Sensor Value    Display Value
3               12
5               15
5.5             15
8               24
1               too low to evaluate
3               12
2.9             12
24.2            too high to evaluate

The algorithm might not be clear to you from these numbers alone. But let’s assume a detailed description is available:

The algorithm of device123 shall evaluate values smaller than 2 as “too low to evaluate”, values smaller than 5 as “12”, values smaller than 7 as “15”, values smaller than 15 as “24” and values above 15 as “too high to evaluate”.
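The description above can be sketched as a small R function. This is only an illustration written for this article, not the code of a real device. Note that the description leaves a sensor value of exactly 15 unspecified; the sketch assumes it counts as too high.

```r
# Hypothetical implementation of the device123 algorithm description.
# Assumption: a value of exactly 15 is treated as "too high to evaluate",
# because the description does not say what happens at 15.
device123 <- function(sensor) {
  stopifnot(is.numeric(sensor), length(sensor) == 1, !is.na(sensor))
  if (sensor < 2) {
    "too low to evaluate"
  } else if (sensor < 5) {
    12
  } else if (sensor < 7) {
    15
  } else if (sensor < 15) {
    24
  } else {
    "too high to evaluate"
  }
}

# Example calls using two rows of the sensor/display table
device123(sensor = 3)
device123(sensor = 24.2)
```

The function returns a number for valid measurements and a character string otherwise, which mirrors the mixed "Display Value" column of the table.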

When the regulatory authority decides whether to approve device123, its task is not to test the algorithm itself. Their job is to check that the producer of the device checked the algorithm and its software implementation. Therefore the following two things have to exist:

1. Test cases
2. A test report telling how the test cases were evaluated

Test cases in the programming language R can be written with the packages RUnit or testthat. Both allow developers and testers to check the software. In testthat, the test cases from the table above could look like this, for example:

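library(testthat)
# A sensor value of 1 is below the lower limit of the algorithm
test_that("1 is interpreted correctly", {
  expect_equal(device123(sensor = 1), "too low to evaluate")
})
# A sensor value of 8 falls into the range that is displayed as 24
test_that("8 is evaluated correctly", {
  expect_equal(device123(sensor = 8), 24)
})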
...

For people who read R code every day, this looks great. The testthat package will tell you whether your function device123 returns exactly the expected value when called with 1 or 8. The problem is that this output does not show, for a successful test, what the expected value was or what the input was. It only reports how many tests were run and which ones failed. See this example from Hadley Wickham’s blog:

Expectation : ...........
rv : ...
Variance : .72....

Each line represents a test file. Each . represents a passed test. Each number represents a failed test. The numbers index into a list of failures that provides more details:

1. Failure(@test-device123.R#5): 8 is evaluated correctly -----
device123(8) not equal to 24
Mean relative difference: 3

From this output you can assume that all tests ran, and you can check which of them were successful. But it is just command-line output, readable only for people who are used to R.
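To illustrate the kind of information that is missing from such output, here is a minimal base R sketch that runs the sensor/display pairs from the table above and records the input, the expected value, the actual value and the status of every test. The function `run_cases`, the column names and the compact `device123` stand-in are all hypothetical and exist only for this example.

```r
# Stand-in for the software under test (hypothetical). It follows the
# algorithm description and assumes that exactly 15 counts as too high.
device123 <- function(sensor) {
  if (sensor < 2) return("too low to evaluate")
  if (sensor < 5) return(12)
  if (sensor < 7) return(15)
  if (sensor < 15) return(24)
  "too high to evaluate"
}

# Test cases from the sensor/display table. Expected values are stored as
# text so that numbers and messages fit into a single column.
cases <- data.frame(
  sensor   = c(3, 5, 5.5, 8, 1, 2.9, 24.2),
  expected = c("12", "15", "15", "24", "too low to evaluate",
               "12", "too high to evaluate"),
  stringsAsFactors = FALSE
)

# Run every case and keep input, expected value, actual value and status.
run_cases <- function(cases, fun) {
  actual <- vapply(cases$sensor,
                   function(s) as.character(fun(s)),
                   character(1))
  data.frame(
    sensor   = cases$sensor,
    expected = cases$expected,
    actual   = actual,
    status   = ifelse(actual == cases$expected, "passed", "FAILED"),
    stringsAsFactors = FALSE
  )
}

report <- run_cases(cases, device123)
print(report, row.names = FALSE)
```

A table like this is still plain console output, but it already lists what was tested and with which result, which is the information a non-programmer would look for.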

I would like to come back to you as a patient. You pay the regulatory authority with your taxes. Do you expect the person at the regulatory authority to know R or the cryptic output it produces? Do you really expect to walk into your doctor’s office and see a “proof” sign on the devices, telling you that someone at the authority looked into the code of each device?

My answer is no. I want the regulatory authority to focus on the values the device gives to the doctor, and maybe on the chemistry of the test strip. Software should be something that simply works and does the described job. If it is well documented, it must follow its documentation. The responsibility for the test cases therefore lies with the company writing the software. This company has to show the authority that the software was tested. The authority just has to make sure that this process was valid.

### How do we allow regulatory authorities to understand test cases?

We now know that automated software testing makes it possible to check whether device123 implements the right algorithm. The major problem is that someone has to read the code, test the code, and check whether the test was valid. Testing code with code does not seem to be a convincing way to demonstrate validity. It will be hard for a company to tell the authority: see, we tested code with code, and here is a bunch of cryptic command-line output you can read that proves it.

No, you want something nice.

If you’re a .NET developer, there is a really simple solution for this. It is called SpecFlow. It lets you write human-readable test cases that are really easy to interpret:

Feature: Device123. We prepare a device that can use our algorithm to get a screen value out of a sensor value.
Scenario: Check number 1
Given the sensor measures 1 in the device
Then the result should be "too low to evaluate" on the screen
Scenario: Check number 8
Given the sensor measures 8 in the device
Then the result should be 24 on the screen

The outcome of those tests is presented in nicely formatted reports.

But I’m not a .NET developer, and using SpecFlow to write tests for R or Python is rather hard.

### A solution for human readable tests in R

Our team came up with a solution for human-readable tests called RTest. It’s an R package that uses XML files to test other R packages and produces its reports as documents. We know that XML is not as nice as a pseudo language, but I think it’s a great way to start. Our XML file for the example would look like this:

<device.TestCase>
  <ID>Test Case1</ID>
  <synopsis>
    <author>Sebastian Wolf</author>
    <date>2018-05-25</date>
    <desc>Test device123</desc>
  </synopsis>
  <tests>
    <device test-desc="Test return value of input 1">
      <params><sensor value="1" type="numeric"/></params>
      <reference>
        <variable value="too low to evaluate" type="character"/>
      </reference>
    </device>
    <device test-desc="Test return value of input 8">
      <params><sensor value="8" type="numeric"/></params>
      <reference>
        <variable value="24" type="numeric"/>
      </reference>
    </device>
  </tests>
</device.TestCase>

This setup not only makes it possible to define the inputs and reference values for testing the algorithm inside the device, but also to record some basic information about the test itself, such as who wrote it and when they started writing it. This information shall of course be verified by source-code control and a co-developer.
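To give a rough idea of how XML-defined test cases can drive tests, the following sketch parses a shortened version of the file above and compares each reference value with the result of the tested function. It is not RTest's implementation or API. It assumes that the xml2 package is installed, and `typed_value` and the `device123` stand-in are hypothetical helpers written for this example.

```r
library(xml2)  # assumption: xml2 is installed; RTest itself may work differently

# Stand-in for the software under test (hypothetical).
device123 <- function(sensor) {
  if (sensor < 2) return("too low to evaluate")
  if (sensor < 5) return(12)
  if (sensor < 7) return(15)
  if (sensor < 15) return(24)
  "too high to evaluate"
}

# Shortened version of the test case file (synopsis omitted).
test_case_xml <- '<device.TestCase>
  <tests>
    <device test-desc="Test return value of input 1">
      <params><sensor value="1" type="numeric"/></params>
      <reference><variable value="too low to evaluate" type="character"/></reference>
    </device>
    <device test-desc="Test return value of input 8">
      <params><sensor value="8" type="numeric"/></params>
      <reference><variable value="24" type="numeric"/></reference>
    </device>
  </tests>
</device.TestCase>'

# Convert the "value" attribute to the R type named in the "type" attribute.
typed_value <- function(node) {
  value <- xml_attr(node, "value")
  if (identical(xml_attr(node, "type"), "numeric")) as.numeric(value) else value
}

doc   <- read_xml(test_case_xml)
tests <- xml_find_all(doc, ".//tests/device")

# Execute each test and collect description, input, reference and outcome.
results <- lapply(tests, function(test) {
  input    <- typed_value(xml_find_first(test, "./params/sensor"))
  expected <- typed_value(xml_find_first(test, "./reference/variable"))
  actual   <- device123(sensor = input)
  data.frame(
    description = xml_attr(test, "test-desc"),
    input       = input,
    expected    = as.character(expected),
    actual      = as.character(actual),
    passed      = isTRUE(all.equal(actual, expected)),
    stringsAsFactors = FALSE
  )
})
report <- do.call(rbind, results)
print(report, row.names = FALSE)
```

Keeping the test definition in a data file and the execution logic in code means a reviewer only has to read the XML and the resulting report, not the R code that connects them.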

The outcome of running these test cases with RTest is a test report document.

The test report shows for each test not only how it was executed, but also the execution time, whether it was successful, the reference value and the outcome. Someone who knows from the algorithm description what the software shall do can now read the test case and the test report, see what was tested, and judge whether this makes sense. For co-workers who are new to the project, it is also much easier to find their way into it. Reading test cases and report outcomes allows them to see within a minute which parts of the project still have problems and which functions are not yet tested.

### Summary

Understanding how R software was validated no longer requires an R programmer. The environment presented here allows people to see how the software was tested. I think that human-readable tests will make statistical software more fail-proof, easier to understand and more sophisticated. R has already made its way out of research into clinical environments and even the car industry, but the process is not finished yet. Many more tools will be needed before regulatory authorities can trust such a big open-source project. Human-readable test cases are a first step in helping companies demonstrate the validity of their open-source solutions. Using R and a good testing framework will make people’s lives safer, because you’ll have not only great statistical tools, but great validated statistical tools.

The ideas and opinions expressed in this post are those of the author alone, and are not to be construed as representing the opinions of his employer or anyone else.

Why do we need human readable tests for a programming language? was originally published in Data Driven Investor on Medium.