Assess Your DATA QUALITY in R

[This article was first published on business-science.io, and kindly contributed to R-bloggers.]

This article is part of R-Tips Weekly, a weekly video tutorial that shows you step-by-step how to do common R coding tasks.


Skimr is my go-to R package for fast data quality assessment and my first step in exploratory data analysis. Before I do anything else, I check data quality with skimr.


Use Skimr for Data Quality
Exploratory Data Analysis

The Data Quality Report from skimr

Rapid Data Quality Checks in R
Automatic Data Quality Reporting

Data Scientists spend 80% of their time understanding, exploring, wrangling, and preparing data for analysis.

This is way too long!

We can speed this up. One tool I use in EVERY SINGLE DATA PROJECT is called skimr. It’s my go-to.

PRO TIP: I’ve added links to skimr and two more SUPER-IMPORTANT R PACKAGES FOR EDA on Page 3 of my Ultimate R Cheatsheet. 👇



You can use my Ultimate R Cheatsheet to help you learn R. It consolidates the most important R packages (the ones I use every day) into one cheatsheet, and skimr is listed on Page 3.

How Skimr Works
Automatic Data Quality Reporting

One of the coolest features of Skimr is the ability to create a Data Quality Report in one line of code. This automates:

  • Data Profiling
  • Summaries for Numeric, Categorical, Text, Date, and Nested List Columns, and even Dplyr Groups

Ultimately, this saves the Data Scientist SO MUCH TIME. ⌛
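
Here is a minimal sketch of that one-line report. It assumes the skimr package is installed, and it uses the built-in `iris` data set purely as a stand-in (my choice for illustration, not a data set from the tutorial).

```r
library(skimr)

# One line of code: skim() profiles every column of the data frame.
# `iris` is only a placeholder data set for this sketch.
report <- skim(iris)

# Printing the result shows the data quality report in the console.
print(report)
```

The result is a regular data frame underneath (class `skim_df`), with one row per column of the input, so it can be saved or filtered like any other table.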

Missing Data, Categorical & Numeric Reporting (Starwars)

The “starwars” data set has 87 Star Wars characters with various attributes. This is a messy data set containing a lot of missing values and nested list-columns.
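
A short sketch of how this report could be produced, assuming `starwars` is the copy that ships with the dplyr package and that skimr is installed:

```r
library(dplyr)   # assumed source of the starwars data set
library(skimr)

# Skim the whole data set. The report groups its output by column type
# (character, list, numeric) and reports missing values for each column.
starwars_report <- skim(starwars)

print(starwars_report)
```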

Overall Data Summary
Number of Rows/Columns, Data Types by Column, Group Variables.

Character Summaries
Missing / completion rate, number of unique observations, and text features.

List Summaries (nested column)
Number of unique elements in each list.

Numeric Summaries
Missing/completion rates and distributions.
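
If you want to look at one of these sections on its own, skimr provides `yank()` to pull a single column type out of the report. The sketch below is one way to do that, again assuming `starwars` comes from dplyr; the object names are only illustrative.

```r
library(dplyr)   # assumed source of the starwars data set
library(skimr)

skimmed <- skim(starwars)

# yank() extracts the summary rows for one column type.
character_summary <- yank(skimmed, "character")  # missing rate, unique values, text features
list_summary      <- yank(skimmed, "list")       # nested list-columns
numeric_summary   <- yank(skimmed, "numeric")    # missing rate and distribution statistics

# Example: show the numeric columns and how complete each one is.
print(numeric_summary[, c("skim_variable", "n_missing", "complete_rate")])
```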

Time Series Reporting (Economics)

The “economics” data set has a date feature called “date” and several numeric features. We’ll focus on the date feature.

Date Summaries
Missing/completion rates, min/max dates, and the number of unique dates.
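
To focus on the date feature only, you can pass the column name to `skim()`. This is a sketch that assumes `economics` is the data set bundled with ggplot2, where the date column is named `date`.

```r
library(ggplot2)  # assumed source of the economics data set
library(skimr)

# Passing a column name restricts the report to that column.
date_report <- skim(economics, date)

print(date_report)
```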

Grouped Time Series Reporting (Economics Long)

The “economics_long” data set has been pivoted to long format so that the time series from “economics” are stacked on top of one another, which is perfect for a groupwise skim analysis.

Grouped Date Summaries
Each of these is provided by group: missing/completion rates, min/max dates, and the number of unique dates.
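
A sketch of the grouped version: group the data with dplyr before skimming, and skimr reports each group separately. It assumes `economics_long` comes from ggplot2 and that its grouping column is named `variable`; selecting only `date` and `variable` first is my own simplification to keep the output focused on dates.

```r
library(dplyr)
library(ggplot2)  # assumed source of the economics_long data set
library(skimr)

grouped_report <- economics_long %>%
  select(date, variable) %>%   # keep only the date and the grouping column
  group_by(variable) %>%       # one group per stacked time series
  skim()                       # date summary is computed per group

print(grouped_report)
```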


Assessing data quality with skimr is like:

Just skim your data.




