Why Data Validation
Data validation is a crucial step in any data science project. It ensures that data is clean, well-formatted, and ready to feed into ML models and dashboards. Clean data also minimizes errors further down the line: functions and model training pipelines often throw errors when presented with missing values, incorrect data types, or out-of-range data. Validating the data before it enters the program avoids the wasted time and money these failures cause.
It’s also important to note that data validation is not a one-off task. Updating ML models requires new data, and the volume of input will likely change, so scalable, automated validation needs to run in the workflow with every update. The question now becomes: how do you achieve all of this?
- Why Data Validation
- Data Validation With data.validator
- Getting Started
- Custom HTML Reporting
- Example of data.validator in Production
Data Validation With data.validator
Today we will look at data.validator, an R package that offers scalable and reproducible data validation in a user-friendly way. It handles data validation beyond simple structure and format checks, and its reporting tools support preventative maintenance and make it easier to identify and track the story behind the data. Some features of data.validator include:
- Validation in %>% pipelines with functions: validate_if(), validate_cols(), and validate_rows()
- Support for predicate functions from the assertr package, such as in_set() and within_bounds()
- Functions for creating user-friendly reports that can be sent by email, stored in a logs folder, or generated automatically with RStudio Connect
- Customizable HTML reports
Getting Started
There are two options to install the package: the stable release from CRAN or the latest development version from GitHub.
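Both installation routes can be sketched as follows. This assumes the package is published on CRAN as data.validator and developed in the Appsilon/data.validator GitHub repository. The remotes helper is one common way to install from GitHub and is an assumption here, not necessarily the tool the authors used.

```r
# Option 1: stable release from CRAN
install.packages("data.validator")

# Option 2: latest development version from GitHub
# (assumes the 'remotes' package is already installed;
#  if not, run install.packages("remotes") first)
remotes::install_github("Appsilon/data.validator")
```

Only one of the two commands is needed. The development version may contain changes that are not yet on CRAN.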
Step 1. First, create a blank report object:
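A minimal sketch of this step, assuming the data_validation_report() constructor described in the package README. The variable name report is our own choice.

```r
library(data.validator)

# Create an empty report object; validation results are added to it later
report <- data_validation_report()
```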
Step 2. Next, load your data set and prepare it for data validation. We will use the standard mtcars data set for this demonstration.
After creating the empty report object above, we can start validating the data set with the validate() function, passing the data set and a name for it as arguments.
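As a sketch of this step, the call below wraps mtcars in validate(). The name string is only an illustrative label and can be anything that identifies the data set in the report.

```r
library(data.validator)

# mtcars ships with base R (datasets package)
data(mtcars)

# validate() tags the data set with a name and prepares it
# for the validate_*() functions that follow in the chain
validation <- validate(mtcars, name = "Verifying cars dataset")
```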
Step 3. After the validate() call, we chain the validate_*() functions and predicates with the %>% operator to validate the data.
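The chain below is a sketch of what such a pipeline might look like on mtcars. The specific rules and descriptions are illustrative choices, not the article's original rules. The predicates come from assertr, which is loaded explicitly.

```r
library(assertr)
library(magrittr)
library(data.validator)

validated <- validate(mtcars, name = "Verifying cars dataset") %>%
  # Logical expression evaluated on the data: drat must be positive
  validate_if(drat > 0, description = "Column drat has only positive values") %>%
  # Column predicate: vs and am may only contain 0 or 1
  validate_cols(in_set(c(0, 1)), vs, am,
                description = "vs and am values equal 0 or 1 only") %>%
  # Predicate generator: mpg must lie within 3 standard deviations of its mean
  validate_cols(within_n_sds(3), mpg,
                description = "mpg within 3 standard deviations") %>%
  # Row reduction plus predicate: at most 2 missing values per row
  validate_rows(num_row_NAs, within_bounds(0, 2), vs, am, mpg,
                description = "Not too many NAs in rows")
```

Each validate_*() call records its outcome and passes the data on, so a failed rule does not stop the remaining checks from running.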
Step 4. We can also add custom predicates by first defining a function and then using it inside validate_*() functions.
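One way to write a custom predicate is as a small function factory. The between() helper below is a hypothetical name used for illustration. It returns a function that checks each value against a closed interval.

```r
library(magrittr)
library(data.validator)

# Predicate factory: returns a function that checks a <= x <= b element-wise
between <- function(a, b) {
  function(x) {
    a <= x & x <= b
  }
}

validated <- validate(mtcars, name = "Verifying cars dataset") %>%
  # cyl in mtcars takes the values 4, 6 and 8, so this rule records violations
  validate_cols(between(0, 4), cyl,
                description = "cyl values between 0 and 4")
```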
Step 5. Once all the validations are done, we call add_results() with the report object to add the validation results to the report created in Step 1.
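Put together with the earlier steps, this might look like the sketch below. The single rule is only a placeholder for a full validation chain.

```r
library(magrittr)
library(data.validator)

report <- data_validation_report()

validate(mtcars, name = "Verifying cars dataset") %>%
  validate_if(drat > 0, description = "Column drat has only positive values") %>%
  # Store the outcome of the chain in the report object
  add_results(report)
```

Several validation chains, even on different data sets, can be added to the same report object this way.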
Step 6. Finally, we print the report or generate an HTML document.
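A sketch of both options follows, assuming the print() method and the save_report() function from the package README. With default arguments, save_report() is assumed to write validation_report.html to the working directory. Rendering the HTML relies on rmarkdown and pandoc being available.

```r
library(magrittr)
library(data.validator)

report <- data_validation_report()

validate(mtcars, name = "Verifying cars dataset") %>%
  validate_if(drat > 0, description = "Column drat has only positive values") %>%
  add_results(report)

# Text summary in the console
print(report)

# HTML document (assumed default: validation_report.html in the working directory)
save_report(report)
```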
We can turn off certain parts of the report like this:
We can also view the raw report like this:
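For instance, assuming the get_results() accessor from the package README, the raw results can be pulled out as a data frame and inspected like any other table:

```r
library(magrittr)
library(data.validator)

report <- data_validation_report()

validate(mtcars, name = "Verifying cars dataset") %>%
  validate_if(drat > 0, description = "Column drat has only positive values") %>%
  add_results(report)

# One row per validation rule, including its description and outcome
raw_results <- get_results(report)
raw_results
```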
data.validator provides other ways of saving the report:
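The sketch below assumes the save_results() and save_summary() functions from the package README. The file names are arbitrary examples.

```r
library(magrittr)
library(data.validator)

report <- data_validation_report()

validate(mtcars, name = "Verifying cars dataset") %>%
  validate_if(drat > 0, description = "Column drat has only positive values") %>%
  add_results(report)

# Raw results as a CSV table
save_results(report, "results.csv")

# Plain-text summary, suitable for a log file
save_summary(report, "validation_log.txt")
```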
Custom HTML Reporting
data.validator also supports custom report templates. Results can be shown with various interactive elements (e.g., a leaflet map). In the example below, you can see the validation results from using a predicate function, assertr::within_n_sds(3), to check that Polish district populations lie within 3 standard deviations of the mean.
You may find a predefined report template here. To use the template as a base, load the package in RStudio and go to File > New File > R Markdown > From template > Simple structure for HTML report summary. Here you can modify the template with custom titles and graphics.
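To make the within_n_sds(3) predicate mentioned above more concrete, here is a small sketch of how it behaves on its own. The Polish district data is not included in this article, so the mpg column of mtcars is used purely as a stand-in.

```r
library(assertr)

# within_n_sds(3) is a predicate generator: given a numeric vector, it
# computes the mean and standard deviation and returns a predicate that
# checks whether values lie within 3 standard deviations of the mean
is_within_3_sds <- within_n_sds(3)(mtcars$mpg)

# Apply the predicate to the same column; TRUE means "within bounds"
flags <- is_within_3_sds(mtcars$mpg)
table(flags)
```

Inside a validation chain, the same generator is passed directly to validate_cols(), which derives the bounds from the column being checked.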
Example of data.validator in Production
Workflow for data.validator can be implemented as follows:
- Running RStudio Connect Scheduler (daily)
  - Scheduler sources the data from a PostgreSQL table and validates it based on predefined rules.
  - Based on validation results, a new data.validator report is created.
- Data Response and Action
  - Violation occurrence:
    - data provider and person responsible for data quality receive a report via email
    - thanks to assertr functionality, the report is easily understandable for both technical and non-technical personnel
    - data provider makes the required data fixes
  - Passes inspection:
    - a specific trigger is sent in order to reload Shiny data
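The branching logic of this workflow could be sketched roughly as follows. Everything here is an assumption for illustration. notify_data_provider() and trigger_shiny_reload() are hypothetical stand-ins for the email and Shiny-reload steps, mtcars stands in for the PostgreSQL table, and the type column of get_results() is assumed to mark failed rules with "error".

```r
library(magrittr)
library(data.validator)

# Hypothetical stand-ins for the email and Shiny-reload steps
notify_data_provider <- function(report) {
  message("Violations found: emailing the report to the data provider")
}
trigger_shiny_reload <- function() {
  message("All checks passed: triggering Shiny data reload")
}

run_daily_validation <- function(data) {
  report <- data_validation_report()

  # Predefined rules (a single illustrative rule here)
  validate(data, name = "Daily data check") %>%
    validate_if(drat > 0, description = "Column drat has only positive values") %>%
    add_results(report)

  # Assumption: failed rules are marked "error" in the `type` column
  results <- get_results(report)
  has_violations <- any(results$type == "error")

  if (has_violations) {
    notify_data_provider(report)
  } else {
    trigger_shiny_reload()
  }
  invisible(has_violations)
}

# In production the data would be read from the PostgreSQL table;
# mtcars is used here only so the sketch runs on its own
run_daily_validation(mtcars)
```

Scheduling the script daily would be handled by RStudio Connect rather than by the code itself.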
Whether your dataset was built internally or pulled from external sources, you need to check that it meets the expectations you have defined. Detecting incomplete, duplicate, corrupt, or irrelevant data can be a huge undertaking, but if left unaddressed, such data can negatively impact your analysis. That’s why Appsilon developed the data.validator package: to easily compose and integrate validation rules, scale for fluctuating volumes of data, and deliver clear, customizable reports.