TidyR Challenge: Update

[This article was first published on Jeffrey Horner, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In my last post I described a data set that was a bit cumbersome to parse and I wanted to tidy it up before I could begin visually exploring. You know, the fun part. I wasn’t 100% happy with my solution so I candidly asked the internet to Help Me Do My Job.

One solution involved a development version of data.table; it had some issues so I’ll leave it to the reader to explore futher.

Another solution by a commenter named Riley was perfect, though. Of course I had to tweak it just a bit to drop some temporary columns, but that’s okay. I’ll take it.

Thanks Riley for Helping Me Do My Job! I owe you a beer 🙂

The Fake Data

library(wakefield)
library(tidyr)
library(dplyr)

d <- r_data_frame(
      n=100,
      id,
      r_series(date_stamp,15,name='foo_date'),
      r_series(level,15,name='foo_supply'),
      r_series(date_stamp,10,name='bar_date'),
      r_series(level,10,name='bar_supply'),
      r_series(date_stamp,3,name='baz_date'),
      r_series(level,3,name='baz_supply')
  )

Test Function for Equivalence

# Create a true ordered data frame and drop any extraneous classes for each column
true_ordered_df <- function(x){
  x$ID <- as.character(x$ID); class(x$ID) <- 'character'
  x$med_date <- as.Date(x$med_date); class(x$med_date) <- 'Date'
  x$med_supply <- as.integer(x$med_supply); class(x$med_supply) <- 'integer'
  x$med_name <- as.character(x$med_name); class(x$med_name) <- 'character'
  x <- data.frame(
          ID=x$ID, 
          med_date=x$med_date,  
          med_supply=x$med_supply, 
          med_name=x$med_name,
          stringsAsFactors=FALSE
  )
  x <- x[with(x,order(ID,med_date,med_supply,med_name)),]
  row.names(x) <- NULL
  x
}

My Original Solution

# foo
med_dates <- d %>% 
    select(ID,foo_date_1:foo_date_15) %>% 
    gather(med_seq, med_date, foo_date_1:foo_date_15)
med_dates$med_seq <- as.integer(sub('^foo_date_','',med_dates$med_seq))
med_supply <- d %>% 
    select(ID,foo_supply_1:foo_supply_15) %>% 
    gather(med_seq, med_supply, foo_supply_1:foo_supply_15)
med_supply$med_seq <- as.integer(sub('^foo_supply_','',med_supply$med_seq))
foo <- left_join(med_dates,med_supply, by=c('ID','med_seq')) %>% 
    select(ID,med_date,med_supply)
foo$med_name <- 'foo'

# bar
med_dates <- d %>% 
    select(ID,bar_date_1:bar_date_10) %>% 
    gather(med_seq, med_date, bar_date_1:bar_date_10)
med_dates$med_seq <- as.integer(sub('^bar_date_','',med_dates$med_seq))
med_supply <- d %>% 
    select(ID,bar_supply_1:bar_supply_10) %>% 
    gather(med_seq, med_supply, bar_supply_1:bar_supply_10)
med_supply$med_seq <- as.integer(sub('^bar_supply_','',med_supply$med_seq))
bar <- left_join(med_dates,med_supply, by=c('ID','med_seq')) %>% 
    select(ID,med_date,med_supply)
bar$med_name <- 'bar'

# baz
med_dates <- d %>% 
    select(ID,baz_date_1:baz_date_3) %>% 
    gather(med_seq, med_date, baz_date_1:baz_date_3)
med_dates$med_seq <- as.integer(sub('^baz_date_','',med_dates$med_seq))
med_supply <- d %>% 
    select(ID,baz_supply_1:baz_supply_3) %>% 
    gather(med_seq, med_supply, baz_supply_1:baz_supply_3)
med_supply$med_seq <- as.integer(sub('^baz_supply_','',med_supply$med_seq))
baz <- left_join(med_dates,med_supply, by=c('ID','med_seq')) %>% 
    select(ID,med_date,med_supply)
baz$med_name <- 'baz'

scripts <- true_ordered_df(rbind(foo,bar,baz))

Riley’s Solution

Produces a warning about dropping attributes, but, meh…

scripts2 <-d %>%
  # gather up all non-ID fields
  gather(Name,Value,2:ncol(.)) %>% 
  # the column names from the original data frame (now they're rows in the 
  # gathered data frame) are of the form [drug name]_[date/supply]_[sequence];
  # use 'extract' to split these out from the names
  extract(Name,c('drug','property','sequence'),
  '([[:alnum:]]+)\_([[:alnum:]]+)\_([0-9]+)') %>%
  # spread the property and value back out
  spread(property,Value) %>%
  # change the date field back to a date. for me, the gather operation
  # coerced the date fields to numeric with apparently a starting date 
  # of 1970-01-01. Also change the sequence field to int
  mutate(med_date=as.Date(date,origin='1970-01-01'),
  med_seq=as.integer(sequence), med_name=drug, med_supply=supply) %>%
  # sort the results
  arrange(ID,med_date,med_supply,med_name,sequence)

## Warning: attributes are not identical across measure variables; they will
## be dropped

scripts2 <- true_ordered_df(scripts2)

all.equal(scripts,scripts2)

## [1] TRUE

Yay!

To leave a comment for the author, please follow the link and comment on their blog: Jeffrey Horner.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)