Solve common R problems efficiently with data.table

[This article was first published on Jan Gorecki - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I was recently browsing stackoverflow.com (often called SO) for the most voted questions under R tag.
To my surprise, many questions on the first page were quite well addressed with the data.table package. I found a few other questions that could benefit from a data.table answer, therefore went ahead and answered them.
In this post, I’d like to summarise them along with benchmarks (where possible) and my comments if any.
Many answers under highly voted questions seem to have been posted a while back. data.table is quite actively developed and has had tons of improvements (in terms of speed and memory usage) over the recent years. It might therefore be entirely possible that some of those answers will have even better performance by now.

50 highest voted questions under R tag

Here’s the list of top 50 questions. I’ve marked those for which a data.table answer is available (which is usually quite performant).

INumber of votesQuestion titleUse data.table solution
11153How to make a great R reproducible example?
2621How to sort a dataframe by column(s)?TRUE
3496R Grouping functions: sapply vs. lapply vs. apply. vs. tapplTRUE
4429How can we make xkcd style graphs?
5396How to join (merge) data frames (inner, outer, left, right)?TRUE
6330What statistics should a programmer (or computer scientist)
7314Drop columns in R data frameTRUE
8290Tricks to manage the available memory in an R session
9280Remove rows with NAs in data.frameTRUE
10279Quickly reading very large tables as dataframes in RTRUE
11263How to properly document S4 class slots using Roxygen2?
12250Assignment operators in R: '=' and '<-'
13236Drop factor levels in a subsetted data frameTRUE
14234Plot two graphs in same plot in R
15225What is the difference between require() and library()?
16221data.table vs dplyr: can one do something well the other can
17216In R, why is [ better than subset?
18212R function for testing if a vector contains a given element
19201Expert R users, what's in your .Rprofile?
20197R list to data frameTRUE
21197Rotating and spacing axis labels in ggplot2
22197How to Correctly Use Lists in R?
23192How to convert a factor to an integer\numeric without a loss
24184How can I read command line parameters from an R script?
25184How to unload a package without restarting R?
26182Tools for making latex tables in R
27181In R, what is the difference between the [] and [[]] notatio
28180How can I view the source code for a function?
29171Cluster analysis in R: determine the optimal number of clust
30170How do I install an R package from source?
31162How do I replace NA values with zeros in R?
32152Counting the number of elements with the values of x in a ve
33152Write lines of text to a file in R
34151Standard library function in R for finding the mode?
35150How to trim leading and trailing whitespace in R?
36143How to save a plot as image on the disk?
37139Most underused data visualization
38137Convert data.frame columns from factors to charactersTRUE
39136How to find the length of a string in R?
40134Workflow for statistical analysis and report writing
41132Create an empty data.frame
42130adding leading zeros using R
43129Check existence of directory and create if doesn't exist
44127Run R script from command line
45125Changing column names of a data frame in RTRUE
46120How to set limits for axes in ggplot2 R plots?
47114How to find out which package version is loaded in R?
48112How to plot two histograms together in R?
49112How can 2 strings be concatenated in R
50112How to organize large R programs?

Below are the chosen answers where data.table can be applied. Each one supplied with the usage and timing copied from the linked answer. Click on the question title to view SO question or follow the answer link for a reproducible example and benchmark details.

How to sort a dataframe by column(s)?

Sort dataset dat by variables z and b. Use descending order for z and ascending for b.

setorder(dat, -z, b)  

Timing and memory consumption:

# R-session memory usage (BEFORE) = ~2GB (size of 'dat')
# ------------------------------------------------------------
# Package      function    Time (s)  Peak memory   Memory used
# ------------------------------------------------------------
# doBy          orderBy      409.7        6.7 GB        4.7 GB
# taRifx           sort      400.8        6.7 GB        4.7 GB
# plyr          arrange      318.8        5.6 GB        3.6 GB 
# base R          order      299.0        5.6 GB        3.6 GB
# dplyr         arrange       62.7        4.2 GB        2.2 GB
# ------------------------------------------------------------
# data.table      order        6.2        4.2 GB        2.2 GB
# data.table   setorder        4.5        2.4 GB        0.4 GB
# ------------------------------------------------------------

Arun's answer

R Grouping functions: sapply vs. lapply vs. apply. vs. tapply vs. by vs. aggregate

Having dataset dt with variables x and grp calculate sum of x and length of x by groups specified by grp variable.

dt[, .(sum(x), .N), grp]

Timing in seconds:

#          fun elapsed
#1:  aggregate 109.139
#2:         by  25.738
#3:      dplyr  18.978
#4:     tapply  17.006
#5:     lapply  11.524
#6:     sapply  11.326
#7: data.table   2.686

jangorecki's answer

How to join (merge) data frames (inner, outer, left, right)?

Join data.table dt1 and dt2 having common join column named CustomerId.

# right outer join keyed data.tables
dt1[dt2]
# right outer join unkeyed data.tables - use `on` argument
dt1[dt2, on = "CustomerId"]
# left outer join - swap dt1 with dt2
dt2[dt1, on = "CustomerId"]
# inner join - use `nomatch` argument
dt1[dt2, nomatch=0L, on = "CustomerId"]
# anti join - use `!` operator
dt1[!dt2, on = "CustomerId"]
# inner join
merge(dt1, dt2, by = "CustomerId")
# full outer join
merge(dt1, dt2, by = "CustomerId", all = TRUE)
# see ?merge.data.table arguments for other cases

Timing:

# inner join
#Unit: milliseconds
#       expr        min         lq      mean     median        uq       max neval
#       base 15546.0097 16083.4915 16687.117 16539.0148 17388.290 18513.216    10
#      sqldf 44392.6685 44709.7128 45096.401 45067.7461 45504.376 45563.472    10
#      dplyr  4124.0068  4248.7758  4281.122  4272.3619  4342.829  4411.388    10
# data.table   937.2461   946.0227  1053.411   973.0805  1214.300  1281.958    10

# left outer join
#Unit: milliseconds
#       expr       min         lq       mean     median         uq       max neval
#       base 16140.791 17107.7366 17441.9538 17414.6263 17821.9035 19453.034    10
#      sqldf 43656.633 44141.9186 44777.1872 44498.7191 45288.7406 47108.900    10
#      dplyr  4062.153  4352.8021  4780.3221  4409.1186  4450.9301  8385.050    10
# data.table   823.218   823.5557   901.0383   837.9206   883.3292  1277.239    10

# right outer join
#Unit: milliseconds
#       expr        min         lq       mean     median        uq       max neval
#       base 15821.3351 15954.9927 16347.3093 16044.3500 16621.887 17604.794    10
#      sqldf 43635.5308 43761.3532 43984.3682 43969.0081 44044.461 44499.891    10
#      dplyr  3936.0329  4028.1239  4102.4167  4045.0854  4219.958  4307.350    10
# data.table   820.8535   835.9101   918.5243   887.0207  1005.721  1068.919    10

# full outer join
#Unit: seconds
#       expr       min        lq      mean    median        uq       max neval
#       base 16.176423 16.908908 17.485457 17.364857 18.271790 18.626762    10
#      dplyr  7.610498  7.666426  7.745850  7.710638  7.832125  7.951426    10
# data.table  2.052590  2.130317  2.352626  2.208913  2.470721  2.951948    10

jangorecki's answer

Drop columns in R data frame

Drop columns a and b from dataset DT, or drop columns by names stored in a variable. set function can be also used, it will work on data.frames too.

DT[, c('a','b') := NULL]
# or
del <- c('a','b')
DT[, (del) := NULL]
# or
set(DT, j = 'b', value = NULL)

mnel's answer

No timing here as that process is almost instant, you can expect less memory consumption dropping columns with := operator or set function as it is made by reference. In R version earlier than 3.1 regular data.frame methods would copy your data in memory.

Quickly reading very large tables as dataframes in R

Reading 1 million rows dataset from csv file.

fread("test.csv")

Timing in seconds:

##    user  system elapsed  Method
##   24.71    0.15   25.42  read.csv (first time)
##   17.85    0.07   17.98  read.csv (second time)
##   10.20    0.03   10.32  Optimized read.table
##    3.12    0.01    3.22  fread
##   12.49    0.09   12.69  sqldf
##   10.21    0.47   10.73  sqldf on SO
##   10.85    0.10   10.99  ffdf

mnel's answer

Remove rows with NAs in data.frame

There is na.omit data.table method which can be handly for that using cols argument.
You should not expect performance impovement over data.frame methods other than faster detection of rows to delete (NAs in this example).
Additional memory efficiency is expected in future thanks to Delete rows by reference – data.table#635.

Drop factor levels in a subsetted data frame

Drop levels for all factor columns in a dataset after making subset on it.

upd.cols = sapply(subdt, is.factor)
subdt[, names(subdt)[upd.cols] := lapply(.SD, factor), .SDcols = upd.cols]

jangorecki's answer

No timing, don't expect speed up as the bottleneck is the factor function used to recreate the factor columns, this is also true for droplevels methods.

R list to data frame

Having a list of data.frames called ll, perform row binding on all elements of the list into single dataset.

rbindlist(ll)

Timing:

system.time(ans1 <- rbindlist(ll))
#   user  system elapsed
#  3.419   0.278   3.718

system.time(ans2 <- rbindlist(ll, use.names=TRUE))
#   user  system elapsed
#  5.311   0.471   5.914

system.time(ans3 <- do.call("rbind", ll))
#     user   system  elapsed
# 1097.895 1209.823 2438.452 

Arun's answer

Convert data.frame columns from factors to characters

This is a problem related to the fact that data.frame() by default converts character columns to factor. This is not a problem for data.table() which keeps character column class. So a question simply is already solved for data.table.

dt = data.table(col1 = c("a","b","c"), col2 = 1:3)
sapply(dt, class)

If you have a factor columns in your dataset already and you want to convert them to character you can do the following.

upd.cols = sapply(dt, is.factor)
dt[, names(dt)[upd.cols] := lapply(.SD, as.character), .SDcols = upd.cols]

jangorecki's answer

No timings, speed up is unlikely as these are quite primitive operation. If you are creating factor columns in data.frame (default) then you can get speed up with data.table as you avoid factor function which depending on factor levels can be costly. Still those operations shouldn't be a bottleneck in your workflow.

Changing column names of a data frame in R

Rename columns in dt dataset to good and better.

setnames(dt, c("good", "better"))

jangorecki's answer

No timing here, you should not expect speed up but you can expect less memory consumption due to fact that setnames update names by reference.

Didn't find a problem you are looking for?

It is very likely as I just listed the questions from the first page of the most voted – so among 50 questions only.
Before you ask new question on SO it may be wise to search for existing one because there is quite a lot already answered.

As of 11 Oct 2015, data.table was the 2nd largest tag about an R package

You can effectively search SO using [r] [data.table] my problem here in the SO search bar.
If you decide to ask a question remember about Minimal Reproducible Example (MRE). If you need help on that see the highest voted question on R tag: How to make a great R reproducible example?.
Also if you are new user to data.table you should take a look at Getting started wiki.

Final note on performance

Seeing the above timings you can get a valid impression that data.table can dramatically reduce the time (and resources) required in your data processing workflow.
Yet the above questions doesn't include few other use cases where data.table makes in fact the most stunning impression, those are:

It is good to be aware of those too.

Any comments are welcome in the blog github repo as issues.

To leave a comment for the author, please follow the link and comment on their blog: Jan Gorecki - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)