Solve common R problems efficiently with data.table

December 10, 2015
By

(This article was first published on Jan Gorecki - R, and kindly contributed to R-bloggers)

I was recently browsing stackoverflow.com (often called SO) for the most voted questions under R tag.
To my surprise, many questions on the first page were quite well addressed with the data.table package. I found a few other questions that could benefit from a data.table answer, therefore went ahead and answered them.
In this post, I’d like to summarise them along with benchmarks (where possible) and my comments if any.
Many answers under highly voted questions seem to have been posted a while back. data.table is quite actively developed and has had tons of improvements (in terms of speed and memory usage) over the recent years. It might therefore be entirely possible that some of those answers will have even better performance by now.

50 highest voted questions under R tag

Here’s the list of top 50 questions. I’ve marked those for which a data.table answer is available (which is usually quite performant).

I Number of votes Question title Use data.table solution
1 1153 How to make a great R reproducible example?
2 621 How to sort a dataframe by column(s)? TRUE
3 496 R Grouping functions: sapply vs. lapply vs. apply. vs. tappl TRUE
4 429 How can we make xkcd style graphs?
5 396 How to join (merge) data frames (inner, outer, left, right)? TRUE
6 330 What statistics should a programmer (or computer scientist)
7 314 Drop columns in R data frame TRUE
8 290 Tricks to manage the available memory in an R session
9 280 Remove rows with NAs in data.frame TRUE
10 279 Quickly reading very large tables as dataframes in R TRUE
11 263 How to properly document S4 class slots using Roxygen2?
12 250 Assignment operators in R: '=' and '<-'
13 236 Drop factor levels in a subsetted data frame TRUE
14 234 Plot two graphs in same plot in R
15 225 What is the difference between require() and library()?
16 221 data.table vs dplyr: can one do something well the other can
17 216 In R, why is [ better than subset?
18 212 R function for testing if a vector contains a given element
19 201 Expert R users, what's in your .Rprofile?
20 197 R list to data frame TRUE
21 197 Rotating and spacing axis labels in ggplot2
22 197 How to Correctly Use Lists in R?
23 192 How to convert a factor to an integernumeric without a loss
24 184 How can I read command line parameters from an R script?
25 184 How to unload a package without restarting R?
26 182 Tools for making latex tables in R
27 181 In R, what is the difference between the [] and [[]] notatio
28 180 How can I view the source code for a function?
29 171 Cluster analysis in R: determine the optimal number of clust
30 170 How do I install an R package from source?
31 162 How do I replace NA values with zeros in R?
32 152 Counting the number of elements with the values of x in a ve
33 152 Write lines of text to a file in R
34 151 Standard library function in R for finding the mode?
35 150 How to trim leading and trailing whitespace in R?
36 143 How to save a plot as image on the disk?
37 139 Most underused data visualization
38 137 Convert data.frame columns from factors to characters TRUE
39 136 How to find the length of a string in R?
40 134 Workflow for statistical analysis and report writing
41 132 Create an empty data.frame
42 130 adding leading zeros using R
43 129 Check existence of directory and create if doesn't exist
44 127 Run R script from command line
45 125 Changing column names of a data frame in R TRUE
46 120 How to set limits for axes in ggplot2 R plots?
47 114 How to find out which package version is loaded in R?
48 112 How to plot two histograms together in R?
49 112 How can 2 strings be concatenated in R
50 112 How to organize large R programs?

Below are the chosen answers where data.table can be applied. Each one supplied with the usage and timing copied from the linked answer. Click on the question title to view SO question or follow the answer link for a reproducible example and benchmark details.

How to sort a dataframe by column(s)?

Sort dataset dat by variables z and b. Use descending order for z and ascending for b.

setorder(dat, -z, b)  

Timing and memory consumption:

# R-session memory usage (BEFORE) = ~2GB (size of 'dat')
# ------------------------------------------------------------
# Package      function    Time (s)  Peak memory   Memory used
# ------------------------------------------------------------
# doBy          orderBy      409.7        6.7 GB        4.7 GB
# taRifx           sort      400.8        6.7 GB        4.7 GB
# plyr          arrange      318.8        5.6 GB        3.6 GB 
# base R          order      299.0        5.6 GB        3.6 GB
# dplyr         arrange       62.7        4.2 GB        2.2 GB
# ------------------------------------------------------------
# data.table      order        6.2        4.2 GB        2.2 GB
# data.table   setorder        4.5        2.4 GB        0.4 GB
# ------------------------------------------------------------

Arun's answer

R Grouping functions: sapply vs. lapply vs. apply. vs. tapply vs. by vs. aggregate

Having dataset dt with variables x and grp calculate sum of x and length of x by groups specified by grp variable.

dt[, .(sum(x), .N), grp]

Timing in seconds:

#          fun elapsed
#1:  aggregate 109.139
#2:         by  25.738
#3:      dplyr  18.978
#4:     tapply  17.006
#5:     lapply  11.524
#6:     sapply  11.326
#7: data.table   2.686

jangorecki's answer

How to join (merge) data frames (inner, outer, left, right)?

Join data.table dt1 and dt2 having common join column named CustomerId.

# right outer join keyed data.tables
dt1[dt2]
# right outer join unkeyed data.tables - use `on` argument
dt1[dt2, on = "CustomerId"]
# left outer join - swap dt1 with dt2
dt2[dt1, on = "CustomerId"]
# inner join - use `nomatch` argument
dt1[dt2, nomatch=0L, on = "CustomerId"]
# anti join - use `!` operator
dt1[!dt2, on = "CustomerId"]
# inner join
merge(dt1, dt2, by = "CustomerId")
# full outer join
merge(dt1, dt2, by = "CustomerId", all = TRUE)
# see ?merge.data.table arguments for other cases

Timing:

# inner join
#Unit: milliseconds
#       expr        min         lq      mean     median        uq       max neval
#       base 15546.0097 16083.4915 16687.117 16539.0148 17388.290 18513.216    10
#      sqldf 44392.6685 44709.7128 45096.401 45067.7461 45504.376 45563.472    10
#      dplyr  4124.0068  4248.7758  4281.122  4272.3619  4342.829  4411.388    10
# data.table   937.2461   946.0227  1053.411   973.0805  1214.300  1281.958    10

# left outer join
#Unit: milliseconds
#       expr       min         lq       mean     median         uq       max neval
#       base 16140.791 17107.7366 17441.9538 17414.6263 17821.9035 19453.034    10
#      sqldf 43656.633 44141.9186 44777.1872 44498.7191 45288.7406 47108.900    10
#      dplyr  4062.153  4352.8021  4780.3221  4409.1186  4450.9301  8385.050    10
# data.table   823.218   823.5557   901.0383   837.9206   883.3292  1277.239    10

# right outer join
#Unit: milliseconds
#       expr        min         lq       mean     median        uq       max neval
#       base 15821.3351 15954.9927 16347.3093 16044.3500 16621.887 17604.794    10
#      sqldf 43635.5308 43761.3532 43984.3682 43969.0081 44044.461 44499.891    10
#      dplyr  3936.0329  4028.1239  4102.4167  4045.0854  4219.958  4307.350    10
# data.table   820.8535   835.9101   918.5243   887.0207  1005.721  1068.919    10

# full outer join
#Unit: seconds
#       expr       min        lq      mean    median        uq       max neval
#       base 16.176423 16.908908 17.485457 17.364857 18.271790 18.626762    10
#      dplyr  7.610498  7.666426  7.745850  7.710638  7.832125  7.951426    10
# data.table  2.052590  2.130317  2.352626  2.208913  2.470721  2.951948    10

jangorecki's answer

Drop columns in R data frame

Drop columns a and b from dataset DT, or drop columns by names stored in a variable. set function can be also used, it will work on data.frames too.

DT[, c('a','b') := NULL]
# or
del <- c('a','b')
DT[, (del) := NULL]
# or
set(DT, j = 'b', value = NULL)

mnel's answer

No timing here as that process is almost instant, you can expect less memory consumption dropping columns with := operator or set function as it is made by reference. In R version earlier than 3.1 regular data.frame methods would copy your data in memory.

Quickly reading very large tables as dataframes in R

Reading 1 million rows dataset from csv file.

fread("test.csv")

Timing in seconds:

##    user  system elapsed  Method
##   24.71    0.15   25.42  read.csv (first time)
##   17.85    0.07   17.98  read.csv (second time)
##   10.20    0.03   10.32  Optimized read.table
##    3.12    0.01    3.22  fread
##   12.49    0.09   12.69  sqldf
##   10.21    0.47   10.73  sqldf on SO
##   10.85    0.10   10.99  ffdf

mnel's answer

Remove rows with NAs in data.frame

There is na.omit data.table method which can be handly for that using cols argument.
You should not expect performance impovement over data.frame methods other than faster detection of rows to delete (NAs in this example).
Additional memory efficiency is expected in future thanks to Delete rows by reference – data.table#635.

Drop factor levels in a subsetted data frame

Drop levels for all factor columns in a dataset after making subset on it.

upd.cols = sapply(subdt, is.factor)
subdt[, names(subdt)[upd.cols] := lapply(.SD, factor), .SDcols = upd.cols]

jangorecki's answer

No timing, don't expect speed up as the bottleneck is the factor function used to recreate the factor columns, this is also true for droplevels methods.

R list to data frame

Having a list of data.frames called ll, perform row binding on all elements of the list into single dataset.

rbindlist(ll)

Timing:

system.time(ans1 <- rbindlist(ll))
#   user  system elapsed
#  3.419   0.278   3.718

system.time(ans2 <- rbindlist(ll, use.names=TRUE))
#   user  system elapsed
#  5.311   0.471   5.914

system.time(ans3 <- do.call("rbind", ll))
#     user   system  elapsed
# 1097.895 1209.823 2438.452 

Arun's answer

Convert data.frame columns from factors to characters

This is a problem related to the fact that data.frame() by default converts character columns to factor. This is not a problem for data.table() which keeps character column class. So a question simply is already solved for data.table.

dt = data.table(col1 = c("a","b","c"), col2 = 1:3)
sapply(dt, class)

If you have a factor columns in your dataset already and you want to convert them to character you can do the following.

upd.cols = sapply(dt, is.factor)
dt[, names(dt)[upd.cols] := lapply(.SD, as.character), .SDcols = upd.cols]

jangorecki's answer

No timings, speed up is unlikely as these are quite primitive operation. If you are creating factor columns in data.frame (default) then you can get speed up with data.table as you avoid factor function which depending on factor levels can be costly. Still those operations shouldn't be a bottleneck in your workflow.

Changing column names of a data frame in R

Rename columns in dt dataset to good and better.

setnames(dt, c("good", "better"))

jangorecki's answer

No timing here, you should not expect speed up but you can expect less memory consumption due to fact that setnames update names by reference.

Didn't find a problem you are looking for?

It is very likely as I just listed the questions from the first page of the most voted – so among 50 questions only.
Before you ask new question on SO it may be wise to search for existing one because there is quite a lot already answered.

As of 11 Oct 2015, data.table was the 2nd largest tag about an R package

You can effectively search SO using [r] [data.table] my problem here in the SO search bar.
If you decide to ask a question remember about Minimal Reproducible Example (MRE). If you need help on that see the highest voted question on R tag: How to make a great R reproducible example?.
Also if you are new user to data.table you should take a look at Getting started wiki.

Final note on performance

Seeing the above timings you can get a valid impression that data.table can dramatically reduce the time (and resources) required in your data processing workflow.
Yet the above questions doesn't include few other use cases where data.table makes in fact the most stunning impression, those are:

It is good to be aware of those too.

Any comments are welcome in the blog github repo as issues.

To leave a comment for the author, please follow the link and comment on their blog: Jan Gorecki - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)