data.table version 1.8.1 – now allows numeric columns and big numbers (via bit64) in keys!

May 9, 2012
(This article was first published on R-statistics blog » R, and kindly contributed to R-bloggers)

This is a guest post written by Branson Owen, an enthusiastic R and data.table user.

Wow, a long-desired feature of data.table finally came true in version 1.8.1! data.table now allows numeric columns and big numbers (via bit64) in keys! This is quite a big thing to me, and I believe to many other R users too. Now I can hardly think of any weakness of data.table. Oh, and did I mention that it has also started to support character columns in keys (rather than coercing them to factor)?

For readers who are not yet familiar with the data.table package but are interested in it: data.table is an enhanced data.frame for high-speed indexing, ordered joins, assignment, grouping and list columns in a short and flexible syntax. You can take a look at some task examples here:

News from the datatable-help mailing list:

* New functions chmatch() and %chin%, faster versions of match() and %in% for character vectors. They are about 4 times faster than match() on the example in ?chmatch.

* New function set(DT,i,j,value) allows fast assignment to elements of DT.

    M = matrix(1,nrow=100000,ncol=100)
    DF = as.data.frame(M)
    DT = as.data.table(M)
    system.time(for (i in 1:1000) DF[i,1L] <- i)     # 591.000s
    system.time(for (i in 1:1000) DT[i,V1:=i])       # 1.158s
    system.time(for (i in 1:1000) M[i,1L] <- i)      # 0.016s
    system.time(for (i in 1:1000) set(DT,i,1L,i))    # 0.027s

* Numeric columns (type ‘double’) are now allowed in keys and ad hoc by. Other types which use ‘double’ (such as POSIXct and bit64) can now be fully supported.
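
To make two of the items above concrete, here is a minimal sketch of chmatch(), %chin% and a key on a column of type double. The data and the names x, lookup, price and item are made up for illustration and do not come from the mailing-list announcement; the sketch assumes a data.table version of 1.8.1 or later.

```r
library(data.table)

# chmatch() and %chin% mirror match() and %in% for character vectors
x      <- c("b", "z", "a")
lookup <- c("a", "b", "c")
pos    <- chmatch(x, lookup)   # positions of x in lookup, NA when absent
found  <- x %chin% lookup      # logical membership test

# A key on a column of type double (illustrative data, not from the post)
DT <- data.table(price = c(2.5, 1.25, 10), item = c("tea", "milk", "wine"))
setkey(DT, price)              # sorts DT by price and marks price as the key
hit <- DT[J(1.25)]             # keyed lookup on the numeric value
```

Here pos and found should agree with match(x, lookup) and x %in% lookup, and hit should contain the single row whose price is 1.25. Before 1.8.1, setting a key on a double column like price was not possible.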

For advanced and creative users, data.table also began officially supporting list columns a while ago (rather than supporting them by accident). For example, a column can be a list of vectors, where each vector has a different length. This allows very flexible and creative ways to manipulate data.
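
As a small sketch of such a list of vectors with different lengths, consider the following. The column names vals, n and total and the data are invented for illustration; this is only one way to work with such a column, not a recommended recipe.

```r
library(data.table)

# Illustrative data: each cell of 'vals' holds a vector of a different length
DT <- data.table(ID = 1:3, vals = list(1:2, c(10, 20, 30), 5))

# Per-row summaries computed from the list column and added by reference
DT[, n := sapply(vals, length)]   # number of elements in each cell
DT[, total := sapply(vals, sum)]  # sum of the elements in each cell
```

Each row keeps its own vector in vals, while n and total are ordinary atomic columns derived from it.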

The code example below uses a “function column”, i.e. a list of functions:

    > DT = data.table(ID=1:4,A=rnorm(4),B=rnorm(4),fn=list(min,max))
    > str(DT)
    Classes ‘data.table’ and 'data.frame':  4 obs. of  4 variables:
     $ ID: int  1 2 3 4
     $ A : num  -0.7135 -2.5217 0.0265 1.0102
     $ B : num  -0.4116 0.4032 0.1098 0.0669
     $ fn:List of 4
      ..$ :function (..., na.rm = FALSE)
      ..$ :function (..., na.rm = FALSE)
      ..$ :function (..., na.rm = FALSE)
      ..$ :function (..., na.rm = FALSE)

    > DT[,fn[[1]](A,B),by=ID]
         ID          V1
    [1,]  1 -0.71352508
    [2,]  2  0.40322625
    [3,]  3  0.02648949
    [4,]  4  1.01022266