I’m thrilled to announce the release of schematic, an R package that helps you (the developer) communicate data validation problems to non-technical users. With schematic, you can leverage tidyselect selectors and other conveniences to compare incoming data against a schema, avoiding punishing issues caused by invalid or poor quality data.
schematic can now be installed via CRAN:
install.packages("schematic")
Learn more about schematic by checking out the docs.
Motivation
Having built and deployed a number of Shiny apps and APIs that require users to upload data, I noticed a common pain point: how do I communicate, in simple terms, that there are issues with the data and, more importantly, what those issues are? I needed a way to present the user with error messages that satisfy two needs:
- Simple and non-technical: allow developers to explain the problem rather than forcing users to understand the technical aspects of each test (you don’t want to have to explain to users what is.logical means).
- Holistic checking: present all validation issues rather than stopping evaluation on the first failure.
There already exist a number of data validation packages for R, including (but not limited to) pointblank, data.validator, and validate, so why introduce a new player? schematic certainly shares similarities with many of these packages, but where I think it innovates over existing solutions is in its unique combination of the following:
- Lightweight: Minimal dependencies with a clear focus on checking data without the bells and whistles of graphics, tables, and whatnot.
- User-focused but developer-friendly: Developers (especially those approaching from a tidyverse mentality) will like the expressive syntax; users will appreciate the informative instructions on how to comprehensively fix data issues (no more whack-a-mole with fixing one problem only to learn there are many others).
- Easy to integrate into applications (e.g., Shiny, Plumber): Schematic returns error messages rather than reports or data.frames, meaning that you don’t need additional logic to trigger a run time error; just pass along the error message in a notification or error code.
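As a rough illustration of that last point, the sketch below captures a validation error as plain text using base R's tryCatch(). The helper name validation_message() and the stand-in validator are hypothetical and not part of schematic; in a real app the expression passed in would be the package's own check call.

```r
# Hypothetical helper (not part of schematic): evaluate a validation call and
# return NULL if it succeeds, or the error message as a string if it fails.
validation_message <- function(expr) {
  tryCatch(
    {
      expr  # forces the validation call inside tryCatch()
      NULL
    },
    error = function(e) conditionMessage(e)
  )
}

# Stand-in validator so this sketch runs without any packages installed.
# With schematic, the call would instead look like:
#   validation_message(check_schema(data = upload, schema = my_schema))
fake_check <- function(data) {
  if (anyNA(data$id)) stop("Schema Error: column `id` has empty values")
  invisible(data)
}

msg <- validation_message(fake_check(data.frame(id = c(1, NA))))
ok <- validation_message(fake_check(data.frame(id = c(1, 2))))
```

The resulting string (or NULL when the data passes) can then be handed to whatever your application uses to report problems, for example a Shiny notification or an HTTP error response body.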
How it works
All R errors that appear in this post are intentional for the purpose of demonstrating schematic’s error messaging.
Schematic is extremely simple. You only need to do two things: create a schema and then check a data.frame against the schema.
A schema is a set of rules for columns in a data.frame. A rule consists of two parts:
- Selector – the column(s) on which to apply the rule
- Predicate – a function that must return a single TRUE or FALSE indicating the pass or fail of the check
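To make the predicate contract concrete, here is a small sketch of a custom predicate written in base R. The function name is_valid_age() and the 0 to 120 range are assumptions made up for this example; the only requirement taken from the description above is that the function receives a column and returns a single TRUE or FALSE.

```r
# Hypothetical custom predicate: takes a whole column (a vector) and
# returns exactly one TRUE or FALSE.
is_valid_age <- function(x) {
  is.numeric(x) && all(x >= 0 & x <= 120, na.rm = TRUE)
}

is_valid_age(c(19.2, 10, 22.5))  # every value is within range
is_valid_age(c(19, -3))          # one value is out of range
is_valid_age(c("19", "20"))      # not numeric at all

# In a schema, such a predicate would sit on the right-hand side of a rule:
#   schema(age ~ is_valid_age)
```

Using && rather than & keeps the result to a single value and skips the range check entirely when the column is not numeric.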
Let’s imagine a scenario where we have survey data and we want to ensure it matches our expectations. Here’s some sample survey data:
survey_data <- data.frame(
  id = c(1:3, NA, 5),
  name = c("Emmett", "Billy", "Sally", "Woolley", "Duchess"),
  age = c(19.2, 10, 22.5, 19, 19),
  sex = c("M", "M", "F", "M", NA),
  q_1 = c(TRUE, FALSE, FALSE, FALSE, TRUE),
  q_2 = c(FALSE, FALSE, TRUE, TRUE, TRUE),
  q_3 = c(TRUE, TRUE, TRUE, TRUE, FALSE)
)
We declare a schema using schema() and provide it with rules following the format selector ~ predicate:
library(schematic)
my_schema <- schema(
  id ~ is_incrementing,
  id ~ is_all_distinct,
  c(name, sex) ~ is.character,
  c(id, age) ~ is_whole_number,
  education ~ is.factor,
  sex ~ function(x) all(x %in% c("M", "F")),
  starts_with("q_") ~ is.logical,
  final_score ~ is.numeric
)
Then we use check_schema to evaluate our data against the schema. Any and all errors will be captured in the error message:
check_schema(
  data = survey_data,
  schema = my_schema
)
Error in `check_schema()`:
! Schema Error:
- Columns `education` and `final_score` missing from data
- Column `id` failed check `is_incrementing`
- Column `age` failed check `is_whole_number`
- Column `sex` failed check `function(x) all(x %in% c("M", "F"))`
The error message will combine columns into a single statement if they share the same validation issue. schematic will also automatically report if any columns declared in the schema are missing from the data.
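To illustrate the idea of reporting every failure instead of stopping at the first one, here is a rough base R sketch. It is not how schematic is implemented; the checks list and the collect_failures() helper are hypothetical names invented for this example.

```r
# Illustrative only: NOT schematic's implementation.
# Each named check takes a data.frame and returns a single TRUE or FALSE.
checks <- list(
  "id has no missing values" = function(df) !anyNA(df$id),
  "age is numeric" = function(df) is.numeric(df$age),
  "sex is 'M' or 'F'" = function(df) all(df$sex %in% c("M", "F"))
)

# Run every check, then return the names of all checks that did not pass.
collect_failures <- function(df, checks) {
  passed <- vapply(checks, function(check) isTRUE(check(df)), logical(1))
  names(checks)[!passed]
}

df <- data.frame(id = c(1, NA), age = c(30, 41), sex = c("M", "X"))
failures <- collect_failures(df, checks)
failures
```

Because every check is evaluated before anything is reported, the user sees both problems (the missing id and the unexpected sex value) in one pass.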
Customizing the message
By default the error message is helpful for developers, but if you need to communicate the schema mismatch to a non-technical person they’ll have trouble understanding some or all of the errors. You can customize the output of each rule by inputting the rule as a named argument.
Let’s fix up the previous example to make the messages more understandable.
my_helpful_schema <- schema(
  "values are increasing" = id ~ is_incrementing,
  "values are all distinct" = id ~ is_all_distinct,
  "is a string" = c(name, sex) ~ is.character,
  "is a string with specific levels" = education ~ is.factor,
  "is a whole number (no decimals)" = c(id, age) ~ is_whole_number,
  "has only entries 'F' or 'M'" = sex ~ function(x) all(x %in% c("M", "F")),
  "includes only TRUE or FALSE" = starts_with("q_") ~ is.logical,
  "is a number" = final_score ~ is.numeric
)
check_schema(
  data = survey_data,
  schema = my_helpful_schema
)
Error in `check_schema()`:
! Schema Error:
- Columns `education` and `final_score` missing from data
- Column `id` failed check `values are increasing`
- Column `age` failed check `is a whole number (no decimals)`
- Column `sex` failed check `has only entries 'F' or 'M'`
And that’s really all there is to it. schematic does come with a few handy predicate functions like is_whole_number(), which is a more permissive version of is.integer(): it accepts columns stored as numeric or double but still requires non-decimal values.
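The distinction matters because is.integer() tests how a vector is stored, not what values it holds. The base R sketch below shows the difference; whole_number_like() is a hypothetical stand-in written for this example and may not match the package's actual implementation of is_whole_number().

```r
x <- c(1, 2, 3)   # whole values, but stored as double
y <- 1:3          # stored as integer

is.integer(x)     # FALSE: checks storage type only
is.integer(y)     # TRUE

# Hypothetical stand-in (an assumption, not schematic's code):
# accept any numeric vector whose non-missing values have no decimal part.
whole_number_like <- function(x) {
  is.numeric(x) && all(x == floor(x), na.rm = TRUE)
}

whole_number_like(x)             # TRUE
whole_number_like(c(19.2, 10))   # FALSE
```

Data read from CSV or JSON often arrives as double even when every value is whole, which is why a value-based check is usually the friendlier choice for uploaded data.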
Moreover, schematic includes a handful of modifiers that allow you to change the behavior of some predicates, for instance, allowing NAs with mod_nullable():
# Before using `mod_nullable()` this rule triggered an error
my_schema <- schema(
  "all values are increasing (except empty values)" = id ~ mod_nullable(is_incrementing)
)
check_schema(
  data = survey_data,
  schema = my_schema
)
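Conceptually, a modifier of this kind can be thought of as a function that takes a predicate and returns a new predicate. The base R sketch below shows that pattern under stated assumptions: allow_na() and is_strictly_increasing() are hypothetical stand-ins written for this example, not schematic's own code, and in practice you would use the package's mod_nullable().

```r
# Hypothetical stand-in for a modifier: wraps a predicate so that missing
# values are dropped before the predicate is applied.
allow_na <- function(predicate) {
  function(x) predicate(x[!is.na(x)])
}

# Hypothetical predicate: TRUE when each value is larger than the previous one.
is_strictly_increasing <- function(x) all(diff(x) > 0)

id <- c(1:3, NA, 5)

plain <- is_strictly_increasing(id)               # NA: the gap makes it undecidable
wrapped <- allow_na(is_strictly_increasing)(id)   # TRUE once the NA is dropped
```

Wrapping keeps the original predicate untouched, so the same check can be used strictly on one column and leniently on another.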
Conclusion
In the end, my hope is to make schematic as simple as possible and help both developers and users. It’s a package I designed initially with the sole intention of saving myself from writing validation code that takes up 80% of the actual codebase.[1] I hope you find it useful too.
Notes
This post was created using R version 4.5.0 (2025-04-11) and schematic version 0.1.0.
Footnotes
1. Not an exaggeration. I have a Plumber API that allows users to POST data to be processed. 80% of that Plumber code is to validate the incoming data.
