Access data quickly and easily: data.table package

June 27, 2012

(This article was first published on Milano R net, and kindly contributed to R-bloggers)

This article gives a brief overview of the data.table package written by M. Dowle, T. Short and S. Lianoglou.

A data.table is an extension of a data.frame, created to reduce the user's working time in two ways:

  1. programming time
  2. compute time

The data.table syntax is inspired by R's matrix indexing syntax A[B], where A is a matrix and B is a 2-column matrix.
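
For readers who have not met the A[B] form before, here is a small base R sketch of it. The matrices are made up for illustration: each row of B is read as a (row, column) pair to pick out of A.

```r
# Base R only: index a matrix with a 2-column matrix of (row, column) pairs
A <- matrix(1:9, nrow = 3)      # filled column by column
B <- cbind(c(1, 2), c(3, 1))    # asks for A[1, 3] and A[2, 1]
picked <- A[B]
print(picked)
```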

Because a data.table is also a data.frame, it is compatible with R functions and packages that accept a data.frame.
The big advantage of a data.table over a data.frame is that it treats tables like tables in a database, with remarkably fast data access.
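
A quick way to check the "is a data.frame" claim is to look at the class of an object. This is only a minimal sketch with a toy table; the column names are arbitrary.

```r
library(data.table)

DT <- data.table(x = c("a", "b", "c"), v = 1:3)

class(DT)            # lists both "data.table" and "data.frame"
is.data.frame(DT)    # TRUE, so code written for data.frames accepts DT

# A base R idiom for data.frames works unchanged on DT
col_classes <- sapply(DT, class)
print(col_classes)
```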

A data.table is created exactly like a data.frame; the syntax is the same.

library(data.table) # load the package first
DF = data.frame(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)

DF and DT hold the same data, but on DT we can create an index by defining a key.

setkey(DT,x)
tables()
     NAME NROW MB COLS  KEY
[1,] DT      9  1 x,y,v x
Total: 1MB

DT has been re-ordered according to the values of the x column.

A key consists of one or more columns, which may be integer, factor, character or some other class.
A data.table does not have rownames; instead it may have a key of one or more columns, set with setkey. This key may be used for row indexing instead of rownames.

Now we can subset the data:
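
To see what setting a key does, the sketch below starts from a small table whose rows are deliberately out of order (the data are invented for illustration) and inspects it with haskey() and key().

```r
library(data.table)

# Rows deliberately out of order
DT <- data.table(x = c("c", "a", "b", "a"), v = 1:4)

haskey(DT)              # FALSE: no key yet
setkey(DT, x)           # sorts DT by x, modifying DT in place
key(DT)                 # "x"
print(DT)

a_rows <- DT["a"]       # keyed lookup, used where rownames would be on a data.frame
print(a_rows)
```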

DT["b",] # extract data for key-column = “b”
DT[,v] # extract the v column

Subsetting by key can be 100+ times faster than scanning the column with ==.

A data.table is like a data.frame but i and j can be expressions of column names directly.
Furthermore i may itself be a data.table which invokes a fast table join using binary search in O(log n).
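
As a rough illustration of both points, the sketch below uses column names directly in i and j, and then passes a data.table as i. The `lookup` table is a made-up example, not something from the package.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(1, 3, 6), times = 3),
                 v = 1:9)
setkey(DT, x)

big   <- DT[v > 5]              # i is an expression on column v
total <- DT[x == "b", sum(v)]   # i filters the rows, j computes on them

# i as a data.table: each row of lookup is matched against the key of DT
lookup <- data.table(x = c("a", "c"))   # hypothetical lookup table
joined <- DT[lookup]
print(joined)
```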

We can easily add new data:

DT[,w:=rep(1:3,3)] # add a w column (recent data.table versions need rep() here, := no longer recycles 1:3)

Because := modifies the table by reference, without copying it, this can be 500+ times faster than DF[i,j] = value
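
A few more uses of := are sketched below on a toy table. The column names w and tmp are arbitrary; the point is that every call changes DT itself, with no assignment back to DT.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), v = 1:9)

DT[, w := v * 2L]         # add a column computed from v
DT[x == "a", w := 0L]     # update only the rows where x is "a"

DT[, tmp := 1L]           # add a scratch column ...
DT[, tmp := NULL]         # ... and remove it again

print(DT)
```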

or join data.tables

setkey(DT,x,y) # key on both columns, so the join matches x and y
DT[J("a",3:6), nomatch=0] # inner join (J is an alias of data.table)
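
The difference between keeping and dropping unmatched rows is easy to miss, so here is a small sketch of a join on a two-column key. The table `wanted` is invented for illustration; by default unmatched rows come back with NA, and nomatch = 0 drops them.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(1, 3, 6), times = 3),
                 v = 1:9)
setkey(DT, x, y)                                   # two-column key

wanted <- data.table(x = "a", y = c(3, 4, 5, 6))   # rows to look up

outer_join <- DT[wanted]                # unmatched rows of wanted are kept, v is NA
inner_join <- DT[wanted, nomatch = 0]   # unmatched rows of wanted are dropped

print(outer_join)
print(inner_join)
```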

or fast grouping

DT[,sum(v),by=x]
DT[,list(vSum=sum(v),
         vMin=min(v),
         vMax=max(v)),
   by=list(x,y)]

Grouping this way can be 10+ times faster than tapply(), with a syntax much simpler than the data.frame equivalent.
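
For comparison, the sketch below computes the same group sums with tapply() on a data.frame and with by= on a data.table. It only checks that the two agree; it makes no attempt to measure speed.

```r
library(data.table)

DF <- data.frame(x = rep(c("a", "b", "c"), each = 3), v = 1:9)
DT <- as.data.table(DF)

by_tapply <- tapply(DF$v, DF$x, sum)             # named array: a, b, c
by_dt     <- DT[, list(vSum = sum(v)), by = x]   # data.table with columns x and vSum

print(by_tapply)
print(by_dt)
```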

In a data.table, a column can be a list, so each cell of that column can be a different type:

  • each cell can be a vector
  • each cell can itself be a data.table
  • list columns can be combined with i and by


data.table(x=letters[1:3],
           y=list(1:10,
                  letters[1:4],
                  data.table(a=1:3,b=4:6)))
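
To make the list column a little more concrete, this sketch stores the same table in a variable (the name L is arbitrary) and looks inside each cell of y.

```r
library(data.table)

L <- data.table(x = letters[1:3],
                y = list(1:10,
                         letters[1:4],
                         data.table(a = 1:3, b = 4:6)))

# First class of each cell in the list column
cell_types <- sapply(L$y, function(cell) class(cell)[1])
print(cell_types)

# The third cell is itself a data.table and can be queried as one
inner_tbl <- L$y[[3]]
b_total <- inner_tbl[, sum(b)]
print(b_total)
```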

In conclusion, a data.table is identical to a data.frame except that:

  • it doesn’t have rownames
  • selecting a single row will always return a single-row data.table, not a vector
  • the comma is optional inside [], so DT[3] returns the 3rd row as a 1-row data.table
  • the i argument inside [] works like a call to subset()
  • the j argument, as in [,...], works like a call to with()
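
The points in this list can be tried side by side. The sketch below is just one way to compare the two, using a toy DF and DT with the same contents.

```r
library(data.table)

DF <- data.frame(x = rep(c("a", "b", "c"), each = 3), v = 1:9)
DT <- as.data.table(DF)

row3 <- DT[3]                  # no comma needed: 3rd row, still a data.table
is.data.table(row3)            # TRUE

rows_dt <- DT[v > 7]           # i behaves like subset()
rows_df <- subset(DF, v > 7)

sum_dt <- DT[, sum(v)]         # j behaves like with()
sum_df <- with(DF, sum(v))
```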

This implies:

  • up to 10 times less memory
  • up to 10 times faster to create, and copy
  • simpler R code
