I’m thrilled to announce the release of schematic, an R package that helps you (the developer) communicate data validation problems to non-technical users. With schematic, you can leverage tidyselect selectors and other conveniences to compare incoming data against a schema, avoiding punishing issues caused by invalid or poor quality data.
schematic can now be installed via CRAN:
    install.packages("schematic")
Learn more about schematic by checking out the docs.
Motivation
Having built and deployed a number of Shiny apps and APIs that require users to upload data, I noticed a common pain point: how do I communicate in simple terms that there are issues with the data and, more importantly, what those issues are? I needed a way to present the user with error messages that satisfy two needs:
- Simple and non-technical: allow developers to explain the problem rather than forcing users to understand the technical aspects of each test (you don’t want to have to explain to users what is.logical means).
- Holistic checking: present all validation issues rather than stopping evaluation on the first failure.
There already exist a number of data validation packages for R, including (but not limited to) pointblank, data.validator, and validate; so why introduce a new player? schematic certainly shares similarities with many of these packages, but where I think it innovates over existing solutions is in its unique combination of the following:
- Lightweight: Minimal dependencies with a clear focus on checking data without the bells and whistles of graphics, tables, and whatnot.
- User-focused but developer-friendly: Developers (especially those approaching from a tidyverse mentality) will like the expressive syntax; users will appreciate the informative instructions on how to comprehensively fix data issues (no more whack-a-mole with fixing one problem only to learn there are many others).
- Easy to integrate into applications (e.g., Shiny, Plumber): Schematic returns error messages rather than reports or data.frames, meaning that you don’t need additional logic to trigger a run time error; just pass along the error message in a notification or error code.
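As a rough sketch of that last point, the snippet below wraps a validation call in `tryCatch()` and hands back the error text as a plain string. The helper name `validation_message()` and the stand-ins `failing_check()` and `passing_check()` are hypothetical and exist only so the example runs with base R; in a real app the wrapped call would be schematic's own checking function.

```r
# Hypothetical helper (not part of schematic): run a validation step and
# return its error message as a string, or NULL when nothing fails.
validation_message <- function(validate) {
  tryCatch(
    {
      validate()
      NULL
    },
    error = function(e) conditionMessage(e)
  )
}

# Stand-ins for a real validation call, so the sketch runs with base R alone
failing_check <- function() stop("Schema Error: column `age` failed check")
passing_check <- function() invisible(TRUE)

msg <- validation_message(failing_check)
ok  <- validation_message(passing_check)

if (!is.null(msg)) {
  # In a Shiny server this is where the message could go into a notification;
  # in a Plumber endpoint it could become the body of an error response.
  message(msg)
}
```

Because the result is just a string (or `NULL`), the calling code only has to decide where to display it.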
How it works
All R errors that appear in this post are intentional for the purpose of demonstrating schematic’s error messaging.
Schematic is extremely simple. You only need to do two things: create a schema and then check a data.frame against the schema.
A schema is a set of rules for columns in a data.frame. A rule consists of two parts:
- Selector – the column(s) on which to apply the rule
- Predicate – a function that must return a single TRUE or FALSE indicating the pass or fail of the check
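To make the predicate requirement concrete, here is a small base R sketch. The function name `is_non_negative()` is made up for illustration, and dropping `NA`s with `na.rm = TRUE` is a choice of this example rather than anything schematic prescribes. A comparison such as `x >= 0` returns one value per element, so it has to be collapsed with `all()` before it can act as a predicate.

```r
ages <- c(19.2, 10, 22.5, 19, 19)

# A vectorized comparison returns one logical value per element,
# which is not yet a valid predicate result.
per_element <- ages >= 0
length(per_element)

# Hypothetical custom predicate: collapse the element-wise result
# into a single TRUE or FALSE with all().
# Assumption of this sketch: missing values are ignored.
is_non_negative <- function(x) {
  is.numeric(x) && all(x >= 0, na.rm = TRUE)
}

is_non_negative(ages)
is_non_negative(c(3, -1, 4))
```

A function like this could then sit on the right-hand side of a rule, for example `age ~ is_non_negative`.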
Let’s imagine a scenario where we have survey data and we want to ensure it matches our expectations. Here’s some sample survey data:
    survey_data <- data.frame(
      id = c(1:3, NA, 5),
      name = c("Emmett", "Billy", "Sally", "Woolley", "Duchess"),
      age = c(19.2, 10, 22.5, 19, 19),
      sex = c("M", "M", "F", "M", NA),
      q_1 = c(TRUE, FALSE, FALSE, FALSE, TRUE),
      q_2 = c(FALSE, FALSE, TRUE, TRUE, TRUE),
      q_3 = c(TRUE, TRUE, TRUE, TRUE, FALSE)
    )
We declare a schema using schema() and provide it with rules following the format selector ~ predicate:
    library(schematic)

    my_schema <- schema(
      id ~ is_incrementing,
      id ~ is_all_distinct,
      c(name, sex) ~ is.character,
      c(id, age) ~ is_whole_number,
      education ~ is.factor,
      sex ~ function(x) all(x %in% c("M", "F")),
      starts_with("q_") ~ is.logical,
      final_score ~ is.numeric
    )
Then we use check_schema() to evaluate our data against the schema. Any and all errors will be captured in the error message:
    check_schema(
      data = survey_data,
      schema = my_schema
    )
    Error in `check_schema()`:
    ! Schema Error:
    - Columns `education` and `final_score` missing from data
    - Column `id` failed check `is_incrementing`
    - Column `age` failed check `is_whole_number`
    - Column `sex` failed check `function(x) all(x %in% c("M", "F"))`
The error message will combine columns into a single statement if they share the same validation issue. schematic will also automatically report if any columns declared in the schema are missing from the data.
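The idea of reporting every failure at once, rather than stopping at the first, can be illustrated in base R. This is only a conceptual sketch with made-up names (`collect_failures()`, `toy`, and the checks passed to it); it is not how schematic is implemented and it does not group columns the way schematic's message does.

```r
# Conceptual sketch only: names are hypothetical, not schematic's internals.
# `checks` maps a column name to a predicate returning a single TRUE/FALSE.
collect_failures <- function(data, checks) {
  problems <- character(0)
  for (col in names(checks)) {
    if (!col %in% names(data)) {
      problems <- c(problems, sprintf("Column `%s` missing from data", col))
    } else if (!isTRUE(checks[[col]](data[[col]]))) {
      problems <- c(problems, sprintf("Column `%s` failed its check", col))
    }
  }
  problems
}

toy <- data.frame(id = c(1, 2, 2), age = c(19.2, 10, 22.5))

problems <- collect_failures(
  toy,
  list(
    id    = function(x) anyDuplicated(x) == 0,  # all values distinct
    age   = function(x) all(x == round(x)),     # no fractional part
    score = is.numeric                          # column absent from `toy`
  )
)

# Combine every problem into one message; passing this to stop() would
# raise a single error that lists all of them.
msg <- paste(c("Schema Error:", paste("-", problems)), collapse = "\n")
cat(msg, "\n")
```

The loop never exits early, so a user sees the duplicate `id`, the decimal `age`, and the missing `score` column in one pass.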
Customizing the message
By default the error message is helpful for developers, but if you need to communicate the schema mismatch to a non-technical person, they’ll have trouble understanding some or all of the errors. You can customize the output of each rule by supplying the rule as a named argument; the name replaces the predicate in the error message.
Let’s fix up the previous example to make the messages more understandable.
    my_helpful_schema <- schema(
      "values are increasing" = id ~ is_incrementing,
      "values are all distinct" = id ~ is_all_distinct,
      "is a string" = c(name, sex) ~ is.character,
      "is a string with specific levels" = education ~ is.factor,
      "is a whole number (no decimals)" = c(id, age) ~ is_whole_number,
      "has only entries 'F' or 'M'" = sex ~ function(x) all(x %in% c("M", "F")),
      "includes only TRUE or FALSE" = starts_with("q_") ~ is.logical,
      "is a number" = final_score ~ is.numeric
    )

    check_schema(
      data = survey_data,
      schema = my_helpful_schema
    )
    Error in `check_schema()`:
    ! Schema Error:
    - Columns `education` and `final_score` missing from data
    - Column `id` failed check `values are increasing`
    - Column `age` failed check `is a whole number (no decimals)`
    - Column `sex` failed check `has only entries 'F' or 'M'`
And that’s really all there is to it. schematic does come with a few handy predicate functions like is_whole_number(), which is a more permissive version of is.integer() that allows for columns stored as numeric or double but still requires non-decimal values.
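The difference from `is.integer()` is easiest to see in base R. The function `whole_number_sketch()` below is a hypothetical stand-in written for this illustration, not schematic's implementation of `is_whole_number()`, and ignoring `NA`s is an assumption of the sketch.

```r
# Numbers typed without an `L` suffix are stored as doubles,
# so is.integer() is FALSE even though the values are whole.
ids <- c(1, 2, 3)
is.integer(ids)

# Hypothetical stand-in for a "whole number" check: accept any numeric
# storage type as long as no value has a fractional part.
# Assumption of this sketch: missing values are ignored.
whole_number_sketch <- function(x) {
  is.numeric(x) && all(x == round(x), na.rm = TRUE)
}

whole_number_sketch(ids)                        # whole values stored as double
whole_number_sketch(c(19.2, 10, 22.5, 19, 19))  # contains decimals
```

This matters for uploaded data because values read from files usually arrive as doubles, even when every entry is a whole number.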
Moreover, schematic includes a handful of modifiers that allow you to change the behavior of some predicates, for instance, allowing NAs with mod_nullable():
    # Before using `mod_nullable()` this rule triggered an error
    my_schema <- schema(
      "all values are increasing (except empty values)" = id ~ mod_nullable(is_incrementing)
    )

    check_schema(
      data = survey_data,
      schema = my_schema
    )
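Conceptually, a modifier is a function that takes a predicate and returns a new predicate. The base R sketch below shows that pattern with hypothetical names (`nullable_sketch()`, `incrementing_sketch()`); it is not schematic's source code, and treating "incrementing" as strictly increasing is an assumption made for the example.

```r
# Hypothetical stand-in predicate: TRUE when each value exceeds the previous one.
# With an NA in the input, all() yields NA, and isTRUE() turns that into FALSE.
incrementing_sketch <- function(x) {
  isTRUE(all(diff(x) > 0))
}

# Hypothetical modifier: wrap a predicate so it only sees non-missing values.
nullable_sketch <- function(predicate) {
  function(x) predicate(x[!is.na(x)])
}

id <- c(1:3, NA, 5)

incrementing_sketch(id)                    # the NA makes the check fail
nullable_sketch(incrementing_sketch)(id)   # NA dropped before checking
```

Because the wrapper returns an ordinary function, the same pattern can wrap any predicate without changing the predicate itself.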
Conclusion
In the end, my hope is to make schematic as simple as possible and help both developers and users. It’s a package I designed initially with the sole intention of saving myself from writing validation code that takes up 80% of the actual codebase.[1] I hope you find it useful too.
Notes
This post was created using R version 4.5.0 (2025-04-11) and schematic version 0.1.0.
Footnotes
[1] Not an exaggeration. I have a Plumber API that allows users to POST data to be processed. 80% of that Plumber code is to validate the incoming data.