Check Data Quality with padr

June 26, 2017

(This article was first published on That’s so Random, and kindly contributed to R-bloggers)

The padr package was designed to prepare datetime data for analysis. That is, to take raw, timestamped data and quickly convert it into a tidy format that can be analyzed with all the tidyverse tools. Recently, a colleague and I discovered a second use for the package that I had not anticipated: checking data quality. Every analysis should include a check that the data are as expected. In the case of timestamped data, observations are sometimes missing due to a technical malfunction of the system that produced them. Here are two examples that show how pad and thicken can be leveraged to detect problems in timestamped data quickly.

Regular observations

Let's imagine our system produces a value every five minutes. We want to analyse the data of the last couple of months and start with some routine checks. We quickly find that the number of records is not what we expected.

library(tidyverse)
library(padr)
regular_system %>% head
## # A tibble: 6 x 2
##             timestamp  value
##                <dttm>  <dbl>
## 1 2017-03-01 00:00:00 423.69
## 2 2017-03-01 00:05:00 434.51
## 3 2017-03-01 00:10:00 206.01
## 4 2017-03-01 00:15:00 432.83
## 5 2017-03-01 00:20:00 220.07
## 6 2017-03-01 00:25:00 393.44
seq(regular_system$timestamp %>% min, regular_system$timestamp %>% max, 
    by = "5 min") %>% length()
## [1] 32456
nrow(regular_system)
## [1] 32454

Two observations are missing here. With pad they are located in no time.

regular_system %>% 
  mutate(check_var = 1) %>% 
  pad() %>% 
  filter(is.na(check_var))
## pad applied on the interval: 5 min
## # A tibble: 2 x 3
##             timestamp value check_var
##                <dttm> <dbl>     <dbl>
## 1 2017-06-08 11:55:00    NA        NA
## 2 2017-06-08 12:00:00    NA        NA

There we are: apparently the system took a lunch break on June the 8th.
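The data behind regular_system are not included in this post. As a rough cross-check that needs no packages, the same kind of gap can be located in base R by comparing the observed timestamps with the full expected sequence. The sketch below runs on a simulated stand-in; sim_system, dropped and missing_ts are made-up names for illustration, and this is not how pad works internally.

```r
# Hypothetical stand-in for regular_system: one value every five minutes
# for a single day, with two consecutive observations deliberately removed.
set.seed(1)
full_ts <- seq(as.POSIXct("2017-03-01 00:00:00", tz = "UTC"),
               as.POSIXct("2017-03-02 00:00:00", tz = "UTC"),
               by = "5 min")
sim_system <- data.frame(timestamp = full_ts,
                         value = round(runif(length(full_ts), 200, 450), 2))
dropped <- as.POSIXct(c("2017-03-01 11:55:00", "2017-03-01 12:00:00"),
                      tz = "UTC")
sim_system <- sim_system[!sim_system$timestamp %in% dropped, ]

# Every timestamp we expect between the first and the last observation.
expected <- seq(min(sim_system$timestamp), max(sim_system$timestamp),
                by = "5 min")

# Timestamps that were expected but never observed.
missing_ts <- expected[!expected %in% sim_system$timestamp]
missing_ts
```

Unlike pad, this requires stating the interval by hand, which is exactly the step padr takes off your hands.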

Irregular observations

Now a more common situation. The system only produces data when it has something to tell us. This makes the observations irregular. This server produces a message each time some event happens.

irregular_system %>% head
## # A tibble: 6 x 2
##            time_stamp  message
##                    
## 1 2016-10-09 00:02:01  [email protected]
## 2 2016-10-09 00:07:01 #A222IWL
## 3 2016-10-09 00:11:01  [email protected]
## 4 2016-10-09 00:17:00     WW#5
## 5 2016-10-09 00:17:00  [email protected]
## 6 2016-10-09 00:17:01     WW#5

Here, too, the server might be temporarily down; however, this is not so easy to locate. It is helpful to apply thicken, then aggregate, pad, and fill, and finally plot the result. We might want to look at several different intervals, so let's make the function as generic as possible.

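# Count the records per interval and plot them, so that gaps show up as dips.
thicken_plot <- function(x, interval) {
  x %>% 
    thicken(interval, "ts_thick") %>%  # add a column with the timestamp rounded to the interval
    count(ts_thick) %>%                # number of records per interval
    pad() %>%                          # insert rows for intervals without any records
    fill_by_value() %>%                # fill the NA counts of those rows with the default, 0
    ggplot(aes(ts_thick, n)) +
    geom_line()
}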
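To make clear what ends up on the y-axis, here is roughly the same computation sketched in base R on a tiny made-up event log. The names events, offsets, width and result are assumptions for illustration only, and the code mimics the outcome of the thicken, count, pad and fill steps, not padr's implementation.

```r
# Hypothetical irregular event log (not the data of this post): event times
# given as seconds after midnight, with nothing between 01:00 and 02:00.
start   <- as.POSIXct("2016-10-09 00:00:00", tz = "UTC")
offsets <- c(121, 421, 661, 1020, 1020, 1021, 3000, 7500, 7800, 9000)
events  <- data.frame(time_stamp = start + offsets)

# Round every timestamp down to the start of its 30-minute interval
# (the role of thicken).
width <- 30 * 60
bin   <- floor(as.numeric(events$time_stamp) / width) * width

# All intervals between the first and the last one, including empty ones
# (the role of pad).
all_bins <- seq(min(bin), max(bin), by = width)

# Events per interval; intervals without events get 0
# (the roles of count and fill_by_value).
n <- tabulate(match(bin, all_bins), nbins = length(all_bins))

result <- data.frame(
  ts_thick = as.POSIXct(all_bins, origin = "1970-01-01", tz = "UTC"),
  n        = n
)
result
```

The two intervals starting at 01:00 and 01:30 come out with a count of zero, which is the kind of dip the plots are meant to reveal.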

Let's look at 10-minute intervals.

thicken_plot(irregular_system, "10 min")
## pad applied on the interval: 10 min

[Plot: number of messages per 10-minute interval]

That's not too helpful; maybe a little coarser.

thicken_plot(irregular_system, "30 min")
## pad applied on the interval: 30 min

[Plot: number of messages per 30-minute interval]

Now we see a dip in the middle of the day on October 10th, whereas on all the other days there is ample activity during these hours. There must be something wrong here that has to be sorted out. That wraps up these two quick examples of how to use padr for data quality checking.

I will present the package during a lightning talk at useR next week (Wednesday 5:50 pm at 4.02 Wild Gallery). Hope to meet many of you during the conference!
