Advanced Tips and Tricks with data.table
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
- Tips and tricks learned along the way
- 1. DATA STRUCTURES & ASSIGNMENT
- => Columns of lists
- => Suppressing intermediate output with {}
- => Fast looping with
set
- => Using
shift
for to lead/lag vectors and lists - => Create multiple columns with
:=
in one statement - => Assign a column with
:=
named with a character object - 2.
BY
- => Calculate a function over a group (using
by
) excluding each entity in a second category. - =>
keyby
to key resulting aggregate table - => Using
[1]
,[.N]
,setkey
andby
for within group subsetting - 3. FUNCTIONS
- => Passing
data.table
column names as function arguments - => Beware of scoping within data.table
- 4. PRINTING
- => Print data.table with
[]
- => Hide output from
:=
with knitr
Tips and tricks learned along the way
This is mostly a running list of data.table
tricks that took me a while to figure out either by digging into the official documentation, adapting StackOverflow posts, or more often than not, experimenting for hours. I’d like to persist these discoveries somewhere with more memory than my head (hello internet) so I can reuse them after my mental memory forgets them. A less organized and concise addition to DataCamp’s sweet cheat sheet for the basics.
Most, if not all of these techniques were developed for real data science projects and provided some value to my data engineering. I’ve generalized everything to the mtcars
dataset which might not make this value immediately clear in this slightly contrived context. This list is not intended to be comprehensive as DataCamp’s data.table cheatsheet is. OK, enough disclaimers!
Some more advanced functionality from data.table
creator Matt Dowle here.
1. DATA STRUCTURES & ASSIGNMENT
=> Columns of lists
summary table (long and narrow)
This could be useful, but is easily achievable using traditional methods.
dt <- data.table(mtcars)[, .(cyl, gear)] dt[,unique(gear), by=cyl] ## cyl V1 ## 1: 6 4 ## 2: 6 3 ## 3: 6 5 ## 4: 4 4 ## 5: 4 3 ## 6: 4 5 ## 7: 8 3 ## 8: 8 5
summary table (short and narrow)
Add all categories of gear
for each cyl
to original data.table as a list.
This is more nifty. It’s so simple, I find myself using this trick to quickly explore data ad hoc at the command line. Can also be useful for more serious data engineering.
dt <- data.table(mtcars)[,.(gear, cyl)] dt[,gearsL:=list(list(unique(gear))), by=cyl] head(dt) ## gear cyl gearsL ## 1: 4 6 4,3,5 ## 2: 4 6 4,3,5 ## 3: 4 4 4,3,5 ## 4: 3 6 4,3,5 ## 5: 3 8 3,5 ## 6: 3 6 4,3,5
Accessing elements from a column of lists
Extract second element of each list in gearL1
and create row gearL1
.
This isn’t that groundbreaking, but explores how to access elements of columns which are constructed of lists of lists. lapply
is your friend.
dt[,gearL1:=lapply(gearsL, function(x) x[2])] dt[,gearS1:=sapply(gearsL, function(x) x[2])] head(dt) ## gear cyl gearsL gearL1 gearS1 ## 1: 4 6 4,3,5 3 3 ## 2: 4 6 4,3,5 3 3 ## 3: 4 4 4,3,5 3 3 ## 4: 3 6 4,3,5 3 3 ## 5: 3 8 3,5 5 5 ## 6: 3 6 4,3,5 3 3 str(head(dt[,gearL1])) ## List of 6 ## $ : num 3 ## $ : num 3 ## $ : num 3 ## $ : num 3 ## $ : num 5 ## $ : num 3 str(head(dt[,gearS1])) ## num [1:6] 3 3 3 3 5 3
Calculate all the gear
s for all cars of each cyl
(excluding the current current row).
This can be useful for comparing observations to the mean of groups, where the group mean is not biased by the observation of interest.
dt[,other_gear:=mapply(function(x, y) setdiff(x, y), x=gearsL, y=gear)] head(dt) ## gear cyl gearsL gearL1 gearS1 other_gear ## 1: 4 6 4,3,5 3 3 3,5 ## 2: 4 6 4,3,5 3 3 3,5 ## 3: 4 4 4,3,5 3 3 3,5 ## 4: 3 6 4,3,5 3 3 4,5 ## 5: 3 8 3,5 5 5 5 ## 6: 3 6 4,3,5 3 3 4,5
=> Suppressing intermediate output with {}
This is actually a base R trick that I didn’t discover until working with data.table. See ?`{`
for some doucmentation and examples.
I’ve only used it within the J slot of data.table, it might be more generalizable. I find it pretty useful for generating columns
on the fly when I need to perform some multi-step vectorized operation. It can clean up code by allowing you to reference the same temporary variable
by a concise name rather than rewriting the code to re-compute it.
dt <- data.table(mtcars)
Defaults to just returning the last object defined in the braces unnamed.
dt[,{tmp1=mean(mpg); tmp2=mean(abs(mpg-tmp1)); tmp3=round(tmp2, 2)}, by=cyl] ## cyl V1 ## 1: 6 1.19 ## 2: 4 3.83 ## 3: 8 1.79
We can be more explicit by passing a named list of what we want to keep.
dt[,{tmp1=mean(mpg); tmp2=mean(abs(mpg-tmp1)); tmp3=round(tmp2, 2); list(tmp2=tmp2, tmp3=tmp3)}, by=cyl] ## cyl tmp2 tmp3 ## 1: 6 1.191837 1.19 ## 2: 4 3.833058 3.83 ## 3: 8 1.785714 1.79
Can also write it like this without semicolons.
dt[,{tmp1=mean(mpg) tmp2=mean(abs(mpg-tmp1)) tmp3=round(tmp2, 2) list(tmp2=tmp2, tmp3=tmp3)}, by=cyl] ## cyl tmp2 tmp3 ## 1: 6 1.191837 1.19 ## 2: 4 3.833058 3.83 ## 3: 8 1.785714 1.79
This is trickier with :=
assignments… I don’t think :=
is intended to work when wrapped in {
. Assigning multiple columns with :=
at once
does not allow you to use the first columns you create to use building the ones after it, as we did with =
inside the {
above. Chaining and then dropping unwanted variables is a messy workaround… still exploring this one.
dt <- data.table(mtcars)[,.(cyl, mpg)] dt[,tmp1:=mean(mpg), by=cyl][,tmp2:=mean(abs(mpg-tmp1)), by=cyl][,tmp1:=NULL] head(dt) ## cyl mpg tmp2 ## 1: 6 21.0 1.191837 ## 2: 6 21.0 1.191837 ## 3: 4 22.8 3.833058 ## 4: 6 21.4 1.191837 ## 5: 8 18.7 1.785714 ## 6: 6 18.1 1.191837
=> Fast looping with set
I still haven’t worked much with the loop + set
framework. I’ve been able to achieve pretty much everything with :=
which is more flexible and powerful.
However, if you must loop, set
is orders of magnitude faster than native R assignments within loops. Here’s a snippet from data.table news a while back:
New function set(DT,i,j,value) allows fast assignment to elements of DT. Similar to := but avoids the overhead of [.data.table, so is much faster inside a loop. Less flexible than :=, but as flexible as matrix subassignment. Similar in spirit to setnames(), setcolorder(), setkey() and setattr(); i.e., assigns by reference with no copy at all. M = matrix(1,nrow=100000,ncol=100) DF = as.data.frame(M) DT = as.data.table(M) system.time(for (i in 1:1000) DF[i,1L] <- i) # 591.000s system.time(for (i in 1:1000) DT[i,V1:=i]) # 1.158s system.time(for (i in 1:1000) M[i,1L] <- i) # 0.016s system.time(for (i in 1:1000) set(DT,i,1L,i)) # 0.027s
data.table creators do favor set
for some things, like this task which can also be done w/ lapply
and .SD
. I was actually directed to this solution after I posed this question on StackOverflow. I was also pleased to learn that the
functionality I was looking for – applying a function to a subset of columns with .SDcols
while preserving the untouched columns – was added as a feature request.
dt <- data.table(mtcars)[,1:5, with=F] for (j in c(1L,2L,4L)) set(dt, j=j, value=-dt[[j]]) # integers using 'L' passed for efficiency for (j in c(3L,5L)) set(dt, j=j, value=paste0(dt[[j]],'!!')) head(dt) ## mpg cyl disp hp drat ## 1: -21.0 -6 160!! -110 3.9!! ## 2: -21.0 -6 160!! -110 3.9!! ## 3: -22.8 -4 108!! -93 3.85!! ## 4: -21.4 -6 258!! -110 3.08!! ## 5: -18.7 -8 360!! -175 3.15!! ## 6: -18.1 -6 225!! -105 2.76!!
=> Using shift
for to lead/lag vectors and lists
Note this feature is only available in version 1.9.5 (currently on Github, not CRAN) Base R surprisingly does not have great tools for dealing with leads/lags of vectors that most social science statistical software (Stata, SAS, even FAME which I used in my formative data years) come equipped with out of the box.
dt <- data.table(mtcars)[,.(mpg, cyl)] dt[,mpg_lag1:=shift(mpg, 1)] dt[,mpg_forward1:=shift(mpg, 1, type='lead')] head(dt) ## mpg cyl mpg_lag1 mpg_forward1 ## 1: 21.0 6 NA 21.0 ## 2: 21.0 6 21.0 22.8 ## 3: 22.8 4 21.0 21.4 ## 4: 21.4 6 22.8 18.7 ## 5: 18.7 8 21.4 18.1 ## 6: 18.1 6 18.7 14.3
shift
with by
# creating some data n <- 30 dt <- data.table( date=rep(seq(as.Date('2010-01-01'), as.Date('2015-01-01'), by='year'), n/6), ind=rpois(n, 5), entity=sort(rep(letters[1:5], n/5)) ) setkey(dt, entity, date) # important for ordering dt[,indpct_fast:=(ind/shift(ind, 1))-1, by=entity] lagpad <- function(x, k) c(rep(NA, k), x)[1:length(x)] dt[,indpct_slow:=(ind/lagpad(ind, 1))-1, by=entity] head(dt, 10) ## date ind entity indpct_fast indpct_slow ## 1: 2010-01-01 6 a NA NA ## 2: 2011-01-01 5 a -0.1666667 -0.1666667 ## 3: 2012-01-01 4 a -0.2000000 -0.2000000 ## 4: 2013-01-01 3 a -0.2500000 -0.2500000 ## 5: 2014-01-01 3 a 0.0000000 0.0000000 ## 6: 2015-01-01 3 a 0.0000000 0.0000000 ## 7: 2010-01-01 6 b NA NA ## 8: 2011-01-01 3 b -0.5000000 -0.5000000 ## 9: 2012-01-01 4 b 0.3333333 0.3333333 ## 10: 2013-01-01 1 b -0.7500000 -0.7500000
=> Create multiple columns with :=
in one statement
This is useful, but note that that the columns operated on must be atomic vectors or lists. That is they must exist before running computation.
Building columns referencing other columns in this set need to be done indiviually or chained.
dt <- data.table(mtcars)[,.(mpg, cyl)] dt[,`:=`(avg=mean(mpg), med=median(mpg), min=min(mpg)), by=cyl] head(dt) ## mpg cyl avg med min ## 1: 21.0 6 19.74286 19.7 17.8 ## 2: 21.0 6 19.74286 19.7 17.8 ## 3: 22.8 4 26.66364 26.0 21.4 ## 4: 21.4 6 19.74286 19.7 17.8 ## 5: 18.7 8 15.10000 15.2 10.4 ## 6: 18.1 6 19.74286 19.7 17.8
=> Assign a column with :=
named with a character object
This is the advised way to assign a new column whose name you already have determined and saved as a character. Simply surround the character object in parentheses.
dt <- data.table(mtcars)[, .(cyl, mpg)] thing2 <- 'mpgx2' dt[,(thing2):=mpg*2] head(dt) ## cyl mpg mpgx2 ## 1: 6 21.0 42.0 ## 2: 6 21.0 42.0 ## 3: 4 22.8 45.6 ## 4: 6 21.4 42.8 ## 5: 8 18.7 37.4 ## 6: 6 18.1 36.2
This is old (now deprecated) way which still works for now. Not advised.
thing3 <- 'mpgx3' dt[,thing3:=mpg*3, with=F] head(dt) ## cyl mpg mpgx2 mpgx3 ## 1: 6 21.0 42.0 63.0 ## 2: 6 21.0 42.0 63.0 ## 3: 4 22.8 45.6 68.4 ## 4: 6 21.4 42.8 64.2 ## 5: 8 18.7 37.4 56.1 ## 6: 6 18.1 36.2 54.3
2. BY
=> Calculate a function over a group (using by
) excluding each entity in a second category.
This title probably doesn’t immediately make much sense. Let me explain what I’m going to calculate and why with an example.
We want to compare the mpg
of each car to the average mpg
of cars in the same class (the same # of cylinders). However, we don’t want
to bias the group mean by including the car we want to compare to the average in that average.
This assumption doesn’t appear useful in this example, but assume that gear
+cyl
uniquely identify the cars. In the real project where I faced this
problem, I was calculating an indicator related to an appraiser relative to the average of all other appraisers in their zip3. (cyl
was really zipcode
and gear
was the appraiser’s ID).
METHOD 1: in-line
0.a Biased mean: simple mean by cyl
However we want to know for each row, what is the mean among all the other cars with the same # of cyl
s, excluding that car.
dt <- data.table(mtcars)[,.(cyl, gear, mpg)] dt[, mpg_biased_mean:=mean(mpg), by=cyl] head(dt) ## cyl gear mpg mpg_biased_mean ## 1: 6 4 21.0 19.74286 ## 2: 6 4 21.0 19.74286 ## 3: 4 4 22.8 26.66364 ## 4: 6 3 21.4 19.74286 ## 5: 8 3 18.7 15.10000 ## 6: 6 3 18.1 19.74286
1.a .GRP
without setting key
dt[, dt[!gear %in% unique(dt$gear)[.GRP], mean(mpg), by=cyl], by=gear] #unbiased mean ## gear cyl V1 ## 1: 4 6 19.73333 ## 2: 4 8 15.10000 ## 3: 4 4 25.96667 ## 4: 3 6 19.74000 ## 5: 3 4 27.18000 ## 6: 3 8 15.40000 ## 7: 5 6 19.75000 ## 8: 5 4 26.32222 ## 9: 5 8 15.05000 # check dt[gear!=4 & cyl==6, mean(mpg)] ## [1] 19.73333
1.b Same as 1.a, but a little faster
uid <- unique(dt$gear) dt[, dt[!gear %in% (uid[.GRP]), mean(mpg), by=cyl] , by=gear][order(cyl, gear)] #unbiased mean ## gear cyl V1 ## 1: 3 4 27.18000 ## 2: 4 4 25.96667 ## 3: 5 4 26.32222 ## 4: 3 6 19.74000 ## 5: 4 6 19.73333 ## 6: 5 6 19.75000 ## 7: 3 8 15.40000 ## 8: 4 8 15.10000 ## 9: 5 8 15.05000
Why does this work?
# 1.a pulling it apart with .GRP dt[, .GRP, by=cyl] ## cyl .GRP ## 1: 6 1 ## 2: 4 2 ## 3: 8 3 dt[, .(.GRP, unique(dt$gear)[.GRP]), by=cyl] ## cyl .GRP V2 ## 1: 6 1 4 ## 2: 4 2 3 ## 3: 8 3 5 dt[,dt[, .(.GRP, unique(dt$gear)[.GRP]), by=cyl], by=gear] ## gear cyl .GRP V2 ## 1: 4 6 1 4 ## 2: 4 4 2 3 ## 3: 4 8 3 5 ## 4: 3 6 1 4 ## 5: 3 4 2 3 ## 6: 3 8 3 5 ## 7: 5 6 1 4 ## 8: 5 4 2 3 ## 9: 5 8 3 5
1.b Setting key
setkey(dt, gear) uid <- unique(dt$gear) dt[, dt[!.(uid[.GRP]), mean(mpg), by=cyl] , by=gear] #unbiased mean ## gear cyl V1 ## 1: 3 6 19.74000 ## 2: 3 4 27.18000 ## 3: 3 8 15.40000 ## 4: 4 6 19.73333 ## 5: 4 8 15.10000 ## 6: 4 4 25.96667 ## 7: 5 6 19.75000 ## 8: 5 8 15.05000 ## 9: 5 4 26.32222 mean(dt[cyl==4 & gear!=3,mpg]) # testing ## [1] 27.18 mean(dt[cyl==6 & gear!=3,mpg]) # testing ## [1] 19.74
METHOD 2: using {}
and .SD
{}
is used for to suppress intermediate operations.
Building up
No surpises here.
dt[, .SD[, mean(mpg)], by=gear] # same as `dt[, mean(mpg), by=gear]` ## gear V1 ## 1: 3 16.10667 ## 2: 4 24.53333 ## 3: 5 21.38000 dt[, .SD[, mean(mpg), by=cyl], by=gear] # same as `dt[, mean(mpg), by=.(cyl, by=gear)]` ## gear cyl V1 ## 1: 3 6 19.750 ## 2: 3 8 15.050 ## 3: 3 4 21.500 ## 4: 4 6 19.750 ## 5: 4 4 26.925 ## 6: 5 4 28.200 ## 7: 5 8 15.400 ## 8: 5 6 19.700
Nested data.tables and by
statements
This chunk shows what happens with two by
statements nested within two different data.tables. Explanatory purposes only - not necessary for our task.
n
counts the # of cars in that cyl
. N
counts the number of cars by cyl
and gear
.
dt[,{ vbar = sum(mpg) n = .N .SD[,.(n, .N, sum_in_gear_cyl=sum(mpg), sum_in_cyl=vbar), by=gear] } , by=cyl] ## cyl gear n N sum_in_gear_cyl sum_in_cyl ## 1: 6 3 7 2 39.5 138.2 ## 2: 6 4 7 4 79.0 138.2 ## 3: 6 5 7 1 19.7 138.2 ## 4: 8 3 14 12 180.6 211.4 ## 5: 8 5 14 2 30.8 211.4 ## 6: 4 3 11 1 21.5 293.3 ## 7: 4 4 11 8 215.4 293.3 ## 8: 4 5 11 2 56.4 293.3 dt[,sum(mpg), by=cyl] # test ## cyl V1 ## 1: 6 138.2 ## 2: 8 211.4 ## 3: 4 293.3
Calculating “unbiased mean”
This is in a summary table. This would need to be merged back onto dt
if that is desired.
dt[,{ vbar = mean(mpg) n = .N .SD[,(n*vbar-sum(mpg))/(n-.N),by=gear] } , by=cyl] ## cyl gear V1 ## 1: 6 3 19.74000 ## 2: 6 4 19.73333 ## 3: 6 5 19.75000 ## 4: 8 3 15.40000 ## 5: 8 5 15.05000 ## 6: 4 3 27.18000 ## 7: 4 4 25.96667 ## 8: 4 5 26.32222
METHOD 3: Super Fast Mean calculation
Non-function direct way
Using a vectorized approach to calculate the unbiased mean for each combination of gear
and cyl
. Mechanically,
it calculates the “biased average” for all cars by cyl
. Then subtract off the share of cars with the combination of gear
and cyl
that we want to exclude from the average and add that share. Then extrapolate out this pared down mean.
dt <- data.table(mtcars)[,.(mpg,cyl,gear)] dt[,`:=`(avg_mpg_cyl=mean(mpg), Ncyl=.N), by=cyl] dt[,`:=`(Ncylgear=.N, avg_mpg_cyl_gear=mean(mpg)), by=.(cyl, gear)] dt[,unbmean:=(avg_mpg_cyl*Ncyl-(Ncylgear*avg_mpg_cyl_gear))/(Ncyl-Ncylgear)] setkey(dt, cyl, gear) head(dt) ## mpg cyl gear avg_mpg_cyl Ncyl Ncylgear avg_mpg_cyl_gear unbmean ## 1: 21.5 4 3 26.66364 11 1 21.500 27.18000 ## 2: 22.8 4 4 26.66364 11 8 26.925 25.96667 ## 3: 24.4 4 4 26.66364 11 8 26.925 25.96667 ## 4: 22.8 4 4 26.66364 11 8 26.925 25.96667 ## 5: 32.4 4 4 26.66364 11 8 26.925 25.96667 ## 6: 30.4 4 4 26.66364 11 8 26.925 25.96667
Wrapping up code below into a function
leaveOneOutMean <- function(dt, ind, bybig, bysmall) { dtmp <- copy(dt) # copy so as not to alter original dt object w intermediate assignments dtmp <- dtmp[is.na(get(ind))==F,] dtmp[,`:=`(avg_ind_big=mean(get(ind)), Nbig=.N), by=.(get(bybig))] dtmp[,`:=`(Nbigsmall=.N, avg_ind_big_small=mean(get(ind))), by=.(get(bybig), get(bysmall))] dtmp[,unbmean:=(avg_ind_big*Nbig-(Nbigsmall*avg_ind_big_small))/(Nbig-Nbigsmall)] return(dtmp[,unbmean]) } dt <- data.table(mtcars)[,.(mpg,cyl,gear)] dt[,unbiased_mean:=leaveOneOutMean(.SD, ind='mpg', bybig='cyl', bysmall='gear')] dt[,biased_mean:=mean(mpg), by=cyl] head(dt) ## mpg cyl gear unbiased_mean biased_mean ## 1: 21.0 6 4 19.73333 19.74286 ## 2: 21.0 6 4 19.73333 19.74286 ## 3: 22.8 4 4 25.96667 26.66364 ## 4: 21.4 6 3 19.74000 19.74286 ## 5: 18.7 8 3 15.40000 15.10000 ## 6: 18.1 6 3 19.74000 19.74286
Speed check
Method 3 is roughly 100x faster than the other two. Great for this narrow task with the vectorization built in, but less generalizable; The other two methods allow any function to be passed.
dt <- data.table(mtcars) dt <- dt[sample(1:.N, 100000, replace=T), ] # increase # of rows in mtcars dt$gear <- sample(1:300, nrow(dt), replace=T) # adding in more cateogries
Method 3:
system.time(dt[,unbiased_mean_vectorized:=leaveOneOutMean(.SD, ind='mpg', bybig='cyl', bysmall='gear')]) ## user system elapsed ## 0.034 0.003 0.038
Method 2:
system.time(tmp <- dt[,dt[!gear %in% unique(dt$gear)[.GRP], mean(mpg), by=cyl], by=gear] ) ## user system elapsed ## 3.934 0.377 4.311
Method 1:
uid <- unique(dt$gear) system.time(dt[, dt[!gear %in% (uid[.GRP]), mean(mpg), by=cyl] , by=gear][order(cyl, gear)]) ## user system elapsed ## 3.527 0.346 3.874
=> keyby
to key resulting aggregate table
Without keyby
Categories are not sorted
## devtools::install_github('brooksandrew/Rsenal') library('Rsenal') # grabbing depthbin function tmp <- dt[, .(N=.N, sum=sum(vs), mean=mean(vs)/.N), by=depthbin(mpg, 5, labelOrder=T)] tmp ## depthbin N sum mean ## 1: [10.4,15.2] 1/5 24978 0 0.000000e+00 ## 2: (24.4,33.9] 5/5 18740 15591 4.439508e-05 ## 3: (17.8,21] 3/5 21934 6221 1.293077e-05 ## 4: (15.2,17.8] 2/5 15585 3089 1.271757e-05 ## 5: (21,24.4] 4/5 18763 18763 5.329638e-05 tmp[,barplot(mean, names=depthbin, las=2)]
## [,1] ## [1,] 0.7 ## [2,] 1.9 ## [3,] 3.1 ## [4,] 4.3 ## [5,] 5.5
With keyby
## devtools::install_github('brooksandrew/Rsenal') library('Rsenal') tmp <- dt[, .(N=.N, sum=sum(vs), mean=mean(vs)/.N), keyby=depthbin(mpg, 5, labelOrder=T)] tmp ## depthbin N sum mean ## 1: [10.4,15.2] 1/5 24978 0 0.000000e+00 ## 2: (15.2,17.8] 2/5 15585 3089 1.271757e-05 ## 3: (17.8,21] 3/5 21934 6221 1.293077e-05 ## 4: (21,24.4] 4/5 18763 18763 5.329638e-05 ## 5: (24.4,33.9] 5/5 18740 15591 4.439508e-05 tmp[,barplot(mean, names=depthbin, las=2)]
## [,1] ## [1,] 0.7 ## [2,] 1.9 ## [3,] 3.1 ## [4,] 4.3 ## [5,] 5.5
=> Using [1]
, [.N]
, setkey
and by
for within group subsetting
take highest value of column A when column B is highest by group
Max of qsec
for each category of cyl
(this is easy)
dt <- data.table(mtcars)[, .(cyl, mpg, qsec)] dt[, max(qsec), by=cyl] ## cyl V1 ## 1: 6 20.22 ## 2: 4 22.90 ## 3: 8 18.00
value of qsec
when mpg
is the highest per category of cyl
(this is trickier)
setkey(dt, mpg) dt[,qsec[.N], by=cyl] ## cyl V1 ## 1: 8 17.05 ## 2: 6 19.44 ## 3: 4 19.90
value of qsec
when mpg
is the lowest per category of cyl
dt[,qsec[1], by=cyl] ## cyl V1 ## 1: 8 17.98 ## 2: 6 18.90 ## 3: 4 18.60
value of qsec
when mpg
is the median per category of cyl
dt[,qsec[round(.N/2)], by=cyl] ## cyl V1 ## 1: 8 18.0 ## 2: 6 15.5 ## 3: 4 16.7
subset rows within by statement
V1
is the standard deviation of mpg
by cyl
. V2
is the standard deviation of mpg
for just the first half of mpg
.
dt <- data.table(mtcars) setkey(dt,mpg) dt[, .(sd(mpg), sd(mpg[1:round(.N/2)])), by=cyl] ## cyl V1 V2 ## 1: 8 2.560048 2.0926174 ## 2: 6 1.453567 0.8981462 ## 3: 4 4.509828 1.7728508
3. FUNCTIONS
=> Passing data.table
column names as function arguments
Method 1: No quotes, and deparse
+ substitute
This way seems more data.table-ish because it maintains the practice of not using quotes on variable names in most cases.
dt <- data.table(mtcars)[,.(cyl, mpg)] myfunc <- function(dt, v) { v2=deparse(substitute(v)) dt[,v2, with=F][[1]] # [[1]] returns a vector instead of a data.table } myfunc(dt, mpg) ## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 ## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 ## [29] 15.8 19.7 15.0 21.4
Method 2: quotes and get
However I tend to pass through column names as characters (quoted) and use get
each time I reference that column. That can be annoying if you have a long function
repeatedly reference column names, but I often need to write such few lines of code with data.table, it hasn’t struck me as terribly unslick, yet.
dt <- data.table(mtcars) myfunc <- function(dt, v) dt[,get(v)] myfunc(dt, 'mpg') ## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 ## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 ## [29] 15.8 19.7 15.0 21.4
=> Beware of scoping within data.table
data.frame
way
When you add something to a data.frame
within a function that exists in the global environment, it does not affect that object in the
global environment unless you return and reassign it as such, or you use the <<-
operator.
df <- mtcars[,c('cyl', 'mpg')] add_column_df <- function(df) { df$addcol1<- 'here in func!' df$addcol2 <<- 'in glob env!' return(df) }
When we call the function, we see addcol1
in the output. But not addcol2
. That’s because it’s been added to the df
in the global environment one level up.
head(add_column_df(df)) ## cyl mpg addcol1 ## Mazda RX4 6 21.0 here in func! ## Mazda RX4 Wag 6 21.0 here in func! ## Datsun 710 4 22.8 here in func! ## Hornet 4 Drive 6 21.4 here in func! ## Hornet Sportabout 8 18.7 here in func! ## Valiant 6 18.1 here in func!
Here is addcol2
, but not addcol
.
head(df) ## cyl mpg addcol2 ## Mazda RX4 6 21.0 in glob env! ## Mazda RX4 Wag 6 21.0 in glob env! ## Datsun 710 4 22.8 in glob env! ## Hornet 4 Drive 6 21.4 in glob env! ## Hornet Sportabout 8 18.7 in glob env! ## Valiant 6 18.1 in glob env!
data.table
way
Unlike data.frame, the :=
operator adds a column to both the object living in the global environment and used in the function. I think this is because
these objects are tactually the same object. data.table shaves computation time by not making copies unless explicity directed to.
dt <- data.table(mtcars) add_column_dt <- function(dat) { dat[,addcol:='sticking_to_dt!'] # hits dt in glob env return(dat) } head(add_column_dt(dt)) # addcol here ## mpg cyl disp hp drat wt qsec vs am gear carb addcol ## 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 sticking_to_dt! ## 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 sticking_to_dt! ## 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 sticking_to_dt! ## 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 sticking_to_dt! ## 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 sticking_to_dt! ## 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 sticking_to_dt! head(dt) # addcol also here ## mpg cyl disp hp drat wt qsec vs am gear carb addcol ## 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 sticking_to_dt! ## 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 sticking_to_dt! ## 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 sticking_to_dt! ## 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 sticking_to_dt! ## 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 sticking_to_dt! ## 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 sticking_to_dt!
So something like this renaming the local version using copy
bypasses this behavior, but is likely somewhat less efficient (and elegant). I suspect there’s a cleaner and/or faster way to do this: keep some variables
local to the function while persisting and returning other columns.
dt <- data.table(mtcars) add_column_dt <- function(dat) { datloc <- copy(dat) datloc[,addcol:='not sticking_to_dt!'] # hits dt in glob env return(datloc) } head(add_column_dt(dt)) # addcol here ## mpg cyl disp hp drat wt qsec vs am gear carb addcol ## 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 not sticking_to_dt! ## 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 not sticking_to_dt! ## 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 not sticking_to_dt! ## 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 not sticking_to_dt! ## 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 not sticking_to_dt! ## 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 not sticking_to_dt! head(dt) # addcol not here ## mpg cyl disp hp drat wt qsec vs am gear carb ## 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 ## 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 ## 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 ## 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 ## 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
4. PRINTING
=> Print data.table with []
Nothing groundbreaking here, but a small miscellaneous piece of functionality.
In data.frame
world, wrapping an expression in ()
prints the output to the console. This also works with data.table, but there is another way.
In data.table
this is achieved by appending []
to the end of the expression. I find this useful because when I’m exploring at the console, I
don’t usually decide to print the output until I’m almost done and I’m already at the end of the expression I’ve written.
# data.frame way of printing after an assignment df <- head(mtcars) # doesn't print (df <- head(mtcars)) # does print ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 # data.table way of printing after an assignment dt <- data.table(head(mtcars)) # doesn't print dt[,hp2wt:=hp/wt][] # does print ## mpg cyl disp hp drat wt qsec vs am gear carb hp2wt ## 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 41.98473 ## 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 38.26087 ## 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 40.08621 ## 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 34.21462 ## 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 50.87209 ## 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 30.34682
=> Hide output from :=
with knitr
It used to be that assignments using the :=
operator printed the object to console when knitting documents with knitr
and rmarkdown
. This is actually fixed in data.table v1.9.5. However at the time of my writing, this currently not available on CRAN… only Github. For 1.9.4 users, this StackOverflow post has some hacky solutions. This least impedance approach I found was simply wrapping
the expression in invisible
. Other solutions alter the way you use data.table which I didn’t like.
dt <- data.table(mtcars) dt[,mpg2qsec:=mpg/qsec] # will print with knitr invisible(dt[,mpg2qsec:=mpg/qsec]) # won't print with knitr
This article is also published on R-Bloggers
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.