TidyR Challenge: Help Me Do My Job

May 6, 2015
By

(This article was first published on Jeffrey Horner, and kindly contributed to R-bloggers)

Last week I was handed a drug prescription data set and asked to create some interesting graphics. But before I can even get to the fun part, I was faced with actually transforming the set into something that ggplot2 could read.

Obviously I can’t share the data, but Tyler Rinker has created a fantastic package called wakefield that creates “random data sets quickly” and I was able to mimic the prescription data structure exactly!

Fake Data with Wakefield

So the fake data set below contains one row per patient (e.g. ID) with lots and lots of columns that encapsulates three different drug types (foo, bar, and baz) with their prescription fill date and prescription supply amount:

library(wakefield)
library(tidyr)
library(dplyr)

d <- r_data_frame( n=100, id, r_series(date_stamp,15,name='foo_date'), r_series(level,15,name='foo_supply'), r_series(date_stamp,10,name='bar_date'), r_series(level,10,name='bar_supply'), r_series(date_stamp,3,name='baz_date'), r_series(level,3,name='baz_supply') )

names(d)

##  [1] "ID"            "foo_date_1"    "foo_date_2"    "foo_date_3"   
##  [5] "foo_date_4"    "foo_date_5"    "foo_date_6"    "foo_date_7"   
##  [9] "foo_date_8"    "foo_date_9"    "foo_date_10"   "foo_date_11"  
## [13] "foo_date_12"   "foo_date_13"   "foo_date_14"   "foo_date_15"  
## [17] "foo_supply_1"  "foo_supply_2"  "foo_supply_3"  "foo_supply_4" 
## [21] "foo_supply_5"  "foo_supply_6"  "foo_supply_7"  "foo_supply_8" 
## [25] "foo_supply_9"  "foo_supply_10" "foo_supply_11" "foo_supply_12"
## [29] "foo_supply_13" "foo_supply_14" "foo_supply_15" "bar_date_1"   
## [33] "bar_date_2"    "bar_date_3"    "bar_date_4"    "bar_date_5"   
## [37] "bar_date_6"    "bar_date_7"    "bar_date_8"    "bar_date_9"   
## [41] "bar_date_10"   "bar_supply_1"  "bar_supply_2"  "bar_supply_3" 
## [45] "bar_supply_4"  "bar_supply_5"  "bar_supply_6"  "bar_supply_7" 
## [49] "bar_supply_8"  "bar_supply_9"  "bar_supply_10" "baz_date_1"   
## [53] "baz_date_2"    "baz_date_3"    "baz_supply_1"  "baz_supply_2" 
## [57] "baz_supply_3"

And you can see from the names of the columns there can be a maximum of 15 foo prescriptions, a maximum of 10 bar prescriptions, and a maximum of 3 baz prescriptions per patient.

How Can I Create A TidyR Data Set?

Here’s my solution, but as you can see there’s 3 sections of repeated code and surely they can be compacted into an elegant and beautiful piece of R poetry. Can you help me do my job?

# foo
med_dates <- d %>% 
    select(ID,foo_date_1:foo_date_15) %>% 
    gather(med_seq, med_date, foo_date_1:foo_date_15)
med_dates$med_seq <- as.integer(sub('^foo_date_','',med_dates$med_seq))
med_supply <- d %>% 
    select(ID,foo_supply_1:foo_supply_15) %>% 
    gather(med_seq, med_supply, foo_supply_1:foo_supply_15)
med_supply$med_seq <- as.integer(sub('^foo_supply_','',med_supply$med_seq))
foo <- left_join(med_dates,med_supply, by=c('ID','med_seq')) %>% 
    select(ID,med_date,med_supply)
foo$med_name <- 'foo'

# bar

med_dates <- d %>% select(ID,bar_date_1:bar_date_10) %>% gather(med_seq, med_date, bar_date_1:bar_date_10) med_dates$med_seq <- as.integer(sub('^bar_date_','',med_dates$med_seq)) med_supply <- d %>% select(ID,bar_supply_1:bar_supply_10) %>% gather(med_seq, med_supply, bar_supply_1:bar_supply_10) med_supply$med_seq <- as.integer(sub('^bar_supply_','',med_supply$med_seq)) bar <- left_join(med_dates,med_supply, by=c('ID','med_seq')) %>% select(ID,med_date,med_supply) bar$med_name <- 'bar'

# baz

med_dates <- d %>% select(ID,baz_date_1:baz_date_3) %>% gather(med_seq, med_date, baz_date_1:baz_date_3) med_dates$med_seq <- as.integer(sub('^baz_date_','',med_dates$med_seq)) med_supply <- d %>% select(ID,baz_supply_1:baz_supply_3) %>% gather(med_seq, med_supply, baz_supply_1:baz_supply_3) med_supply$med_seq <- as.integer(sub('^baz_supply_','',med_supply$med_seq)) baz <- left_join(med_dates,med_supply, by=c('ID','med_seq')) %>% select(ID,med_date,med_supply) baz$med_name <- 'baz'

scripts <- rbind(foo,bar,baz) scripts

## Source: local data frame [2,800 x 4]
## 
##     ID   med_date med_supply med_name
## 1  001 2014-06-06          1      foo
## 2  002 2014-06-06          2      foo
## 3  003 2014-06-06          2      foo
## 4  004 2014-06-06          1      foo
## 5  005 2014-06-06          1      foo
## 6  006 2014-06-06          4      foo
## 7  007 2014-06-06          1      foo
## 8  008 2014-06-06          1      foo
## 9  009 2014-06-06          1      foo
## 10 010 2014-06-06          2      foo
## .. ...        ...        ...      ...

To leave a comment for the author, please follow the link and comment on their blog: Jeffrey Horner.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)