TidyR Challenge: Update

Posted on May 8, 2015 by Jeffrey Horner in R bloggers | 0 Comments

[This article was first published on Jeffrey Horner, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In my last post I described a data set that was a bit cumbersome to parse and I wanted to tidy it up before I could begin visually exploring. You know, the fun part. I wasn’t 100% happy with my solution so I candidly asked the internet to Help Me Do My Job.

One solution involved a development version of data.table; it had some issues so I’ll leave it to the reader to explore futher.

Another solution by a commenter named Riley was perfect, though. Of course I had to tweak it just a bit to drop some temporary columns, but that’s okay. I’ll take it.

Thanks Riley for Helping Me Do My Job! I owe you a beer 🙂

The Fake Data

library(wakefield)
library(tidyr)
library(dplyr)

d <- r_data_frame(
      n=100,
      id,
      r_series(date_stamp,15,name='foo_date'),
      r_series(level,15,name='foo_supply'),
      r_series(date_stamp,10,name='bar_date'),
      r_series(level,10,name='bar_supply'),
      r_series(date_stamp,3,name='baz_date'),
      r_series(level,3,name='baz_supply')
  )

Test Function for Equivalence

# Create a true ordered data frame and drop any extraneous classes for each column
true_ordered_df <- function(x){
  x$ID <- as.character(x$ID); class(x$ID) <- 'character'
  x$med_date <- as.Date(x$med_date); class(x$med_date) <- 'Date'
  x$med_supply <- as.integer(x$med_supply); class(x$med_supply) <- 'integer'
  x$med_name <- as.character(x$med_name); class(x$med_name) <- 'character'
  x <- data.frame(
          ID=x$ID, 
          med_date=x$med_date,  
          med_supply=x$med_supply, 
          med_name=x$med_name,
          stringsAsFactors=FALSE
  )
  x <- x[with(x,order(ID,med_date,med_supply,med_name)),]
  row.names(x) <- NULL
  x
}

My Original Solution

# foo
med_dates <- d %>% 
    select(ID,foo_date_1:foo_date_15) %>% 
    gather(med_seq, med_date, foo_date_1:foo_date_15)
med_dates$med_seq <- as.integer(sub('^foo_date_','',med_dates$med_seq))
med_supply <- d %>% 
    select(ID,foo_supply_1:foo_supply_15) %>% 
    gather(med_seq, med_supply, foo_supply_1:foo_supply_15)
med_supply$med_seq <- as.integer(sub('^foo_supply_','',med_supply$med_seq))
foo <- left_join(med_dates,med_supply, by=c('ID','med_seq')) %>% 
    select(ID,med_date,med_supply)
foo$med_name <- 'foo'

# bar
med_dates <- d %>% 
    select(ID,bar_date_1:bar_date_10) %>% 
    gather(med_seq, med_date, bar_date_1:bar_date_10)
med_dates$med_seq <- as.integer(sub('^bar_date_','',med_dates$med_seq))
med_supply <- d %>% 
    select(ID,bar_supply_1:bar_supply_10) %>% 
    gather(med_seq, med_supply, bar_supply_1:bar_supply_10)
med_supply$med_seq <- as.integer(sub('^bar_supply_','',med_supply$med_seq))
bar <- left_join(med_dates,med_supply, by=c('ID','med_seq')) %>% 
    select(ID,med_date,med_supply)
bar$med_name <- 'bar'

# baz
med_dates <- d %>% 
    select(ID,baz_date_1:baz_date_3) %>% 
    gather(med_seq, med_date, baz_date_1:baz_date_3)
med_dates$med_seq <- as.integer(sub('^baz_date_','',med_dates$med_seq))
med_supply <- d %>% 
    select(ID,baz_supply_1:baz_supply_3) %>% 
    gather(med_seq, med_supply, baz_supply_1:baz_supply_3)
med_supply$med_seq <- as.integer(sub('^baz_supply_','',med_supply$med_seq))
baz <- left_join(med_dates,med_supply, by=c('ID','med_seq')) %>% 
    select(ID,med_date,med_supply)
baz$med_name <- 'baz'

scripts <- true_ordered_df(rbind(foo,bar,baz))

Riley’s Solution

Produces a warning about dropping attributes, but, meh…

scripts2 <-d %>%
  # gather up all non-ID fields
  gather(Name,Value,2:ncol(.)) %>% 
  # the column names from the original data frame (now they're rows in the 
  # gathered data frame) are of the form [drug name]_[date/supply]_[sequence];
  # use 'extract' to split these out from the names
  extract(Name,c('drug','property','sequence'),
  '([[:alnum:]]+)\_([[:alnum:]]+)\_([0-9]+)') %>%
  # spread the property and value back out
  spread(property,Value) %>%
  # change the date field back to a date. for me, the gather operation
  # coerced the date fields to numeric with apparently a starting date 
  # of 1970-01-01. Also change the sequence field to int
  mutate(med_date=as.Date(date,origin='1970-01-01'),
  med_seq=as.integer(sequence), med_name=drug, med_supply=supply) %>%
  # sort the results
  arrange(ID,med_date,med_supply,med_name,sequence)

## Warning: attributes are not identical across measure variables; they will
## be dropped

scripts2 <- true_ordered_df(scripts2)

all.equal(scripts,scripts2)

## [1] TRUE

Yay!

To leave a comment for the author, please follow the link and comment on their blog: Jeffrey Horner.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

TidyR Challenge: Update

The Fake Data

Test Function for Equivalence

My Original Solution

Riley’s Solution

Yay!

Related

The Fake Data

Test Function for Equivalence

My Original Solution

Riley’s Solution

Yay!

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)