TidyR Challenge: Update
[This article was first published on Jeffrey Horner, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In my last post I described a data set that was a bit cumbersome to parse and I wanted to tidy it up before I could begin visually exploring. You know, the fun part. I wasn’t 100% happy with my solution so I candidly asked the internet to Help Me Do My Job.
One solution involved a development version of data.table; it had some issues so I’ll leave it to the reader to explore futher.
Another solution by a commenter named Riley was perfect, though. Of course I had to tweak it just a bit to drop some temporary columns, but that’s okay. I’ll take it.
Thanks Riley for Helping Me Do My Job! I owe you a beer 🙂
The Fake Data
library(wakefield) library(tidyr) library(dplyr) d <- r_data_frame( n=100, id, r_series(date_stamp,15,name='foo_date'), r_series(level,15,name='foo_supply'), r_series(date_stamp,10,name='bar_date'), r_series(level,10,name='bar_supply'), r_series(date_stamp,3,name='baz_date'), r_series(level,3,name='baz_supply') )
Test Function for Equivalence
# Create a true ordered data frame and drop any extraneous classes for each column true_ordered_df <- function(x){ x$ID <- as.character(x$ID); class(x$ID) <- 'character' x$med_date <- as.Date(x$med_date); class(x$med_date) <- 'Date' x$med_supply <- as.integer(x$med_supply); class(x$med_supply) <- 'integer' x$med_name <- as.character(x$med_name); class(x$med_name) <- 'character' x <- data.frame( ID=x$ID, med_date=x$med_date, med_supply=x$med_supply, med_name=x$med_name, stringsAsFactors=FALSE ) x <- x[with(x,order(ID,med_date,med_supply,med_name)),] row.names(x) <- NULL x }
My Original Solution
# foo med_dates <- d %>% select(ID,foo_date_1:foo_date_15) %>% gather(med_seq, med_date, foo_date_1:foo_date_15) med_dates$med_seq <- as.integer(sub('^foo_date_','',med_dates$med_seq)) med_supply <- d %>% select(ID,foo_supply_1:foo_supply_15) %>% gather(med_seq, med_supply, foo_supply_1:foo_supply_15) med_supply$med_seq <- as.integer(sub('^foo_supply_','',med_supply$med_seq)) foo <- left_join(med_dates,med_supply, by=c('ID','med_seq')) %>% select(ID,med_date,med_supply) foo$med_name <- 'foo' # bar med_dates <- d %>% select(ID,bar_date_1:bar_date_10) %>% gather(med_seq, med_date, bar_date_1:bar_date_10) med_dates$med_seq <- as.integer(sub('^bar_date_','',med_dates$med_seq)) med_supply <- d %>% select(ID,bar_supply_1:bar_supply_10) %>% gather(med_seq, med_supply, bar_supply_1:bar_supply_10) med_supply$med_seq <- as.integer(sub('^bar_supply_','',med_supply$med_seq)) bar <- left_join(med_dates,med_supply, by=c('ID','med_seq')) %>% select(ID,med_date,med_supply) bar$med_name <- 'bar' # baz med_dates <- d %>% select(ID,baz_date_1:baz_date_3) %>% gather(med_seq, med_date, baz_date_1:baz_date_3) med_dates$med_seq <- as.integer(sub('^baz_date_','',med_dates$med_seq)) med_supply <- d %>% select(ID,baz_supply_1:baz_supply_3) %>% gather(med_seq, med_supply, baz_supply_1:baz_supply_3) med_supply$med_seq <- as.integer(sub('^baz_supply_','',med_supply$med_seq)) baz <- left_join(med_dates,med_supply, by=c('ID','med_seq')) %>% select(ID,med_date,med_supply) baz$med_name <- 'baz' scripts <- true_ordered_df(rbind(foo,bar,baz))
Riley’s Solution
Produces a warning about dropping attributes, but, meh…
scripts2 <-d %>% # gather up all non-ID fields gather(Name,Value,2:ncol(.)) %>% # the column names from the original data frame (now they're rows in the # gathered data frame) are of the form [drug name]_[date/supply]_[sequence]; # use 'extract' to split these out from the names extract(Name,c('drug','property','sequence'), '([[:alnum:]]+)\_([[:alnum:]]+)\_([0-9]+)') %>% # spread the property and value back out spread(property,Value) %>% # change the date field back to a date. for me, the gather operation # coerced the date fields to numeric with apparently a starting date # of 1970-01-01. Also change the sequence field to int mutate(med_date=as.Date(date,origin='1970-01-01'), med_seq=as.integer(sequence), med_name=drug, med_supply=supply) %>% # sort the results arrange(ID,med_date,med_supply,med_name,sequence) ## Warning: attributes are not identical across measure variables; they will ## be dropped scripts2 <- true_ordered_df(scripts2) all.equal(scripts,scripts2) ## [1] TRUE
Yay!
To leave a comment for the author, please follow the link and comment on their blog: Jeffrey Horner.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.