The validate package is our attempt to make checking data against domain knowledge as easy as possible. Here is an example.
library(magrittr)
library(validate)
iris %>% check_that(
  Sepal.Width > 0.5 * Sepal.Length
  , mean(Sepal.Width) > 0
  , if ( Sepal.Width > 0.5*Sepal.Length) Sepal.Length > 10
) %>% summary()
#   rule items passes fails nNA error warning expression
# 1   V1   150     66    84   0 FALSE   FALSE Sepal.Width > 0.5 * Sepal.Length
# 2   V2     1      1     0   0 FALSE   FALSE mean(Sepal.Width) > 0
# 3   V3   150     84    66   0 FALSE   FALSE !(Sepal.Width > 0.5 * Sepal.Length) | Sepal.Length > 10
The summary gives an overview of the number of items checked. For an aggregated test, such as the one where we test the mean of a variable, only one item is tested: the whole Sepal.Width column. The other rules are tested on each record in iris. Furthermore, the numbers of items that pass, fail, or could not be evaluated because of missingness are reported.
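To make the counting concrete, here is a minimal base-R sketch of how the items, passes, fails and nNA columns could be derived from a rule's logical result. The toy data frame and the helper count_outcomes are hypothetical illustrations, not part of validate, and this is not a description of how the package computes its summary internally.

```r
# Hypothetical toy data with one missing value (not from the post)
dat <- data.frame(x = c(1, 2, NA, 4), y = c(2, 1, 3, 3))

# A record-wise rule yields one logical value per record
record_wise <- dat$x > dat$y

# An aggregate rule yields a single logical value for the whole column
aggregate_rule <- mean(dat$y) > 0

# Hypothetical helper: tally the outcomes of one rule
count_outcomes <- function(res) {
  c(items  = length(res),               # number of values checked
    passes = sum(res, na.rm = TRUE),    # TRUE results
    fails  = sum(!res, na.rm = TRUE),   # FALSE results
    nNA    = sum(is.na(res)))           # could not be evaluated
}

count_outcomes(record_wise)
count_outcomes(aggregate_rule)
```

In this sketch the record-wise rule is counted once per row, with the row containing the missing value ending up under nNA, while the aggregate rule counts as a single item.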
In validate, data validation rules are considered objects of computation that may be stored, read, manipulated and investigated. The validator object supports such activities, so validation rules can be reused.
v <- validator(
  ratio = Sepal.Width > 0.5 * Sepal.Length
  , mean = mean(Sepal.Width) > 0
  , cnd = if ( Sepal.Width > 0.5*Sepal.Length) Sepal.Length > 10
)
v
# Object of class 'validator' with 3 elements:
#  ratio: Sepal.Width > 0.5 * Sepal.Length
#  mean : mean(Sepal.Width) > 0
#  cnd  : !(Sepal.Width > 0.5 * Sepal.Length) | Sepal.Length > 10
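The printed cnd rule shows that the conditional if (A) B is represented as the logical expression !(A) | B. The following base-R sketch, which does not use validate itself, checks that reading on iris. The variable names A, B and implies are chosen here purely for illustration.

```r
# A: the condition, B: the consequence of the conditional rule
A <- iris$Sepal.Width > 0.5 * iris$Sepal.Length
B <- iris$Sepal.Length > 10

# "if (A) B" only fails for records where A holds and B does not
implies <- !A | B

# No iris record has Sepal.Length > 10, so the rule passes exactly
# for the records where the condition A is FALSE
any(B)
identical(implies, !A)
table(implies)
```

This is why the passes and fails of the conditional rule mirror those of the first rule in the summary shown earlier.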
We can confront the iris data set with this validator. The results are stored in a validation object.
cf <- confront(iris, v)
cf
# Object of class 'validation'
# Call:
#     confront(x = iris, dat = v)
#
# Confrontations: 3
# With fails    : 2
# Warnings      : 0
# Errors        : 0
barplot(cf,main="iris")
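Because the rules live in a validator object, the same object can be confronted with other data as well. The sketch below assumes the validate package is installed and reuses the rules from this post on the first 50 rows of iris. The names first_rows, cf_sub and s are hypothetical.

```r
library(validate)

# Same rules as defined earlier in the post
v <- validator(
  ratio = Sepal.Width > 0.5 * Sepal.Length
  , mean = mean(Sepal.Width) > 0
  , cnd = if (Sepal.Width > 0.5 * Sepal.Length) Sepal.Length > 10
)

# Reuse the validator on a different data set (here: a subset of iris)
first_rows <- iris[1:50, ]
cf_sub <- confront(first_rows, v)

# The summary is a data frame, so the counts can be used in further code
s <- summary(cf_sub)
s[, c("items", "passes", "fails", "nNA")]
```

The record-wise rules are now checked on 50 items each, while the aggregate rule on the mean still counts as one item.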
- If this post got you interested, you can go through our introductory vignette
- Some theory on data validation can be found here
- We’d love to hear your suggestions, opinions, and bug reports here
- An introduction on how to retrieve and store rules from text files can be found in a second vignette
- GitHub repo, CRAN page