collapse might rock your world


This post was prompted by the NHS-R Community, specifically a question posted to the help-with-r channel on their Slack.

Someone asked about obtaining unique values from a 60M-row, 6-column dataset.

Naturally, data.table came up, and a few of us made suggestions for improvement, but we didn’t seem to be hitting on a magic solution.

Without being able to get my hands on the actual data, I decided to have a play around at home.

I don’t know what the real-life data structure was, but this is the simple four-column dataset I produced initially, named testdata:

str(testdata)
# Classes 'data.table' and 'data.frame': 60000000 obs. of 4 variables:
#  $ RANDOM_STRING: chr  "aaaaa" "aaaaa" "aaaaa" "aaaaa" ...
#  $ START_DATE   : Date, format: "2015-01-10" "2015-01-11" "2015-01-13" "2015-01-24" ...
#  $ randint      : int  1 4 4 5 5 3 8 9 4 5 ...
#  $ EXPIRY_DATE  : Date, format: "2015-11-02" "2015-10-29" "2015-10-27" "2015-09-11" ...
#  - attr(*, ".internal.selfref")=<externalptr>
#  - attr(*, "sorted")= chr [1:4] "RANDOM_STRING" "START_DATE" "EXPIRY_DATE" "randint"

This was created and saved to an RDS file, and converted to a keyed data.table as part of the process.
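I don’t have the original generation code to hand, but a minimal sketch along these lines produces a table of the same shape (the string pool size, date ranges and integer range are my own assumptions, chosen only to match the column types shown above):

library(data.table)

set.seed(42)
n <- 60e6  # 60 million rows, as in the original question

# A hypothetical pool of 5-character strings to sample from
pool <- replicate(1000, paste(sample(letters, 5, replace = TRUE), collapse = ""))

testdata <- data.table(
  RANDOM_STRING = sample(pool, n, replace = TRUE),
  START_DATE    = as.Date("2015-01-01") + sample(0:364, n, replace = TRUE),
  randint       = sample(1:10, n, replace = TRUE),
  EXPIRY_DATE   = as.Date("2015-06-01") + sample(0:364, n, replace = TRUE)
)

# Keying sorts the table, giving the "sorted" attribute seen in str() above
setkey(testdata, RANDOM_STRING, START_DATE, EXPIRY_DATE, randint)
saveRDS(testdata, "testdata.rds")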

haskey(testdata)
# [1] TRUE

First up, calling data.table's unique() method:

system.time(uniqdat <- unique(testdata))

Timings to return 5,993,540 unique observations:

############################## 
# user    system elapsed 
# 20.01   10.52   37.42 
##############################

Does it make any difference if I don’t assign the result to a variable?

system.time(unique(testdata))

############################## 
#  user  system elapsed 
# 17.88    8.50   25.47 
##############################

Looks like it does: elapsed time drops by around 12 seconds without the assignment.

Now, in the real-life example the analyst was seeing severe timing issues (hours rather than seconds), so I went looking for alternative solutions and chanced upon the collapse package:

It’s pretty amazing. After a couple of false starts, because I hadn’t read the documentation, I successfully called funique() - you have to specify the columns you want to find unique values for.

system.time(uniqdat2 <- funique(testdata, cols = 1:4))

############################## 
# user  system elapsed 
# 4.33    1.74    8.23 
##############################

Testing without assignment:

system.time(funique(testdata, cols = 1:4))
############################## 
# user    system  elapsed
# 4.66    2.60    8.00 
##############################

Pretty cool - a huge reduction in time, with the elapsed time down from 25-37 seconds to around 8.
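As an aside, if all that’s needed is the number of distinct rows rather than the rows themselves, collapse also offers fnunique(), which skips materialising the deduplicated table. A minimal sketch (I haven’t timed this on the full data):

library(collapse)

# Count distinct rows directly, without building the deduplicated table
fnunique(testdata)
# should return 5993540, matching the count above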

From the introductory vignette:

“collapse is a C/C++ based package for data transformation and statistical computing in R. Its aims are:

To facilitate complex data transformation, exploration and computing tasks in R.

To help make R code fast, flexible, parsimonious and programmer friendly.”

I look forward to getting more familiar with this package.

Finally, out of interest, I tried dplyr timings. I had already seen that dplyr often outperforms data.table for finding unique / distinct values:

# dplyr for thoroughness

system.time(uniqdat3 <- distinct(testdata))
############################## 
# user    system  elapsed
# 14.87    2.86   20.82 
##############################

Not as fast as collapse, but faster than data.table.

I also tried without assigning to a variable, but each time I got ‘out of memory’ errors.

My personal laptop is a bit underpowered for this dataset (only 8 GB of RAM) and it couldn’t handle it.
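For context, here is a back-of-envelope footprint check (my own estimate, not measured on the analyst’s data): the two Date columns and the character pointers cost 8 bytes per row each, plus 4 bytes for the integer, so 60M rows comes to roughly 1.7 GB before unique() or distinct() take any working copies.

# Rough size check: 60e6 rows * (8 + 8 + 4 + 8) bytes is ~1.7 GB,
# and deduplication needs working copies on top of that
format(object.size(testdata), units = "GB")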

So now I am in the process of upgrading my laptop (it’s easier to swap out the SSD and RAM than to get a new one), which will double my RAM, at which point I might try this again.

I’m impressed with collapse; it’s a really nice package (just look at the detail in the logo - you can tell this is the real deal), and I can’t wait to become more conversant with its functionality.

Post-publication update

I upgraded my RAM (now 16 GB) and repeated the exercise. Timings are as follows:

# user   system  elapsed
# 19.31    1.46    14.82   data.table (with assignment)
# 18.38    1.64    13.72   data.table (no assignment)

#  4.70    0.59     5.42   collapse (with assignment)
#  4.36    0.50     4.92   collapse (no assignment)

# 13.52    0.33    14.18   dplyr (with assignment)
# 12.65    0.53    13.45   dplyr (no assignment)
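The ranking holds after the upgrade. A single system.time() call is fairly noisy, though, so if I revisit this I’d repeat each call a few times; something like the bench package could handle that (a sketch on my part, assuming bench is installed - check = FALSE because the three methods can return differently classed results):

library(bench)

# Repeat each method a few times; check = FALSE since the return
# classes (data.table vs tibble) and row order may differ
bench::mark(
  data.table = unique(testdata),
  collapse   = funique(testdata, cols = 1:4),
  dplyr      = distinct(testdata),
  iterations = 5,
  check      = FALSE
)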
