Quickly Check your id Variables

[This article was first published on That’s so Random, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Virtually every dataset has them; id variables that link a record to a subject and/or time point. Often one column, or a combination of columns, forms the unique id of a record. For instance, the combination of patient_id and visit_id, or ip_adress and visit_time. The first step in most of my analyses is almost always checking the uniqueness of a variable, or a combination of variables. If it is not unique, may assumptions about the data may be wrong, or there are data quality issues. Since I do this so often, I decided to make a little wrapper around this procedure. The unique_id function will return TRUE if the evaluated variables indeed are the unique key to a record. If not, it will return all the records for which the id variable(s) are duplicated so we can pinpoint the problem right away. It uses dplyr v.0.7.1, so make sure that it is loaded.

library(dplyr)
some_df <- data_frame(a = c(1, 2, 3, 3, 4), b = 101:105, val = round(rnorm(5), 1))
some_df %>% unique_id(a)
## # A tibble: 2 x 3
##       a     b   val
##   <dbl> <int> <dbl>
## 1     3   103  -1.8
## 2     3   104   1.1
some_df %>% unique_id(a, b)
## [1] TRUE

Here you find the source code of the function. You can also obtain it by installing the package accompanying this blog using devtools::install.github(edwinth/thatssorandom).

unique_id <- function(x, ...) {
  id_set <- x %>% select(...)
  id_set_dist <- id_set %>% distinct
  if (nrow(id_set) == nrow(id_set_dist)) {
    TRUE
  } else {
    non_unique_ids <- id_set %>% 
      filter(id_set %>% duplicated) %>% 
      distinct()
    suppressMessages(
      inner_join(non_unique_ids, x) %>% arrange(...)
    )
  }
}

To leave a comment for the author, please follow the link and comment on their blog: That’s so Random.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)