Working with “large” datasets, with dplyr and data.table

Posted on May 4, 2015 by arthur charpentier in R bloggers | 0 Comments

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A few months ago, I was doing some training on data science for actuaries, and I started to get interesting puzzeling questions. For instance, Fleur was working on telematic data, and she’s been challenging my (rudimentary) knowledge of R. As claimed by Donald Knuth, “we should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil“. So usually, in my courses, and my training, codes are very basic, and easy to understand. But usually poorly efficient. Since I was challenged, to work on very large datasets, we’ve been working on R functions to manipulate those possibly (very) large dataset, and to run some simple functions as fast as possible (with simple filter and aggregation functions).

In order to illustrate, let us generate our “large” telematic dataset. Assume that we have 10,000 drivers, each of them drives about 200 times, and each time, we have, say, 80 locations. That mean around 160 million observations. It is “large”, but not huge.

> rm(list=ls())
> N_id=10000
> N_tr=200
> T_tr=80

In order to have a code as general as possible, assume that we have some kind of randomness,

> set.seed(1)
> N=rpois(N_id,N_tr)
> N_traj=rpois(sum(N),T_tr)

By “observation”, we consider a driver Id., a Trajectory Id., and a location (latitude and longitude) at some specific dates (e.g. every 15 sec.). Again, just because we want some dataset to illustrate, swe will draw drivers’s home randomly (here uniformly on some square)

> origin_lat=runif(N_id,-5,5)
> origin_lon=runif(N_id,-5,5)

And, then, from those locations, we generate a 2-dimensional random walk,

> lat=lon=Traj_Id=rep(NA,sum(N_traj))
> Pers_Id=rep(NA,length(N_traj))
> s=1
> for(i in 1:N_id){Pers_Id[s:(s+N[i])]=i;s=s+N[i]}
> s=1
> for(i in 1:length(N_traj)){lat[s:(s+N_traj[i])]=origin_lat[Pers_Id[i]]+
+  cumsum(c(0,rnorm(N_traj[i]-1,0,sd=.2)));
+  lon[s:(s+N_traj[i])]=origin_lon[Pers_Id[i]]+
+  cumsum(c(0,rnorm(N_traj[i]-1,0,sd=.2)));
+  s=s+N_traj[i]}

We have something which looks like

Then, we create variables for the Driver Id., and the Trajectory Id.,

> Pers_Id=rep(NA,sum(N_traj))
> N0=cumsum(c(0,N))
> s=1
> for(i in 1:N_id){Pers_Id[s:(s+sum(N_traj[(N0[i]+1):N0[i+1]]))]=i;s=s+sum(N_traj[(N0[i]+1):N0[i+1]])} 
> s=1
> for(i in 1:sum(N)){Traj_Id[s:(s+N_traj[i])]=i;s=s+N_traj[i]}

Now, we should store that dataset. Consider for instance a standard data frame,

> df=data.frame(Pers_Id,Traj_Id,lat,lon)

But in order to run faster codes (or supposed to be faster), we can consider a local data frame,

> library(dplyr)
> ldf=data_frame(Pers_Id,Traj_Id,lat,lon)

or a data table,

> library(data.table)
> dt=data.table(Pers_Id,Traj_Id,lat,lon)

Observe that those files are almost the same size,

> object.size(df)
3839908904 bytes

Actually, I have already created those files. The three files are stored on Dropbox, e.g. dt_json_2.RData. From that file, two alternative techniques can be considered

store the file on a local drive, and then load it

It will not be that long (even on a basic laptop)

> rm(list=ls())
> library(data.table)
> system.time( load("dt_json_2.RData") )
       user      system     elapsed
      21.53        1.33       27.71

load the file directly from Dropbox

That’s a bit tricky, but it is possible

> dropbox_ldf <- "https://dl.dropboxusercontent.com/s/l7rolinwojn37e2/ldf_json_2.RData"
> dropbox_df  <- "https://dl.dropboxusercontent.com/s/wzr19v9pyl7ah0j/df_json_2.RData"
> dropbox_dt  <- "https://dl.dropboxusercontent.com/s/nlwi1mmsxr5f23k/dt_json_2.RData"
 
> source_https <- function(loc,...){
+  curl=getCurlHandle()
+  curlSetOpt(cookiejar="cookies.txt",
+ useragent="Mozilla/5.0", followlocation=TRUE)
+  tmp <- getURLContent(loc, .opts=list(ssl.verifypeer=FALSE),curl=curl)
+  fch <- source(tmp)
+ fch
+ }

Then, you have to be (really) patient (unless you have a very good internet connexion).

Now that we have our dataset, consider the following simple problem. Given a final location, say, close to the (0,0), i.e. final distance to (0,0) smaller than 1, what was the initial location ? One the graph below, given that a trajectory ended in the black circle, what could have been the location of the starting point ?

Here, we can use the three different dataset, that rely on three different packages.

Using the (standard) data frame format

Consider here our data frame file,

> system.time(load("df_json_2.RData"))
       user      system     elapsed
      21.76        1.39       29.59

In order to get location of the initial starting point given information on the ending point, let us create two variables

> system.time(n <- nrow(df))
       user      system     elapsed
          0           0           0 
> system.time(df$first<-c(1,df$Traj_Id[2:n]!=
df$Traj_Id[1:(n-1)]))
       user      system     elapsed
       3.12        1.23        4.37 
> system.time(df$last<-c(df$Traj_Id[2:n]!=
df$Traj_Id[1:(n-1)],1))
       user      system     elapsed
       2.90        6.23       31.58

I don’t know why the second one was that long. Maybe because of memory issues. Because our dataset is now quite large

> object.size(df)
6399847720 bytes

On a Windows machine, we might start have problems, because we also want to create a variable to test wether our final location was in the black disk, or not,

> lat_0=0
> lon_0=0
> system.time(df$test <-(lat_0-df$lat)^2+(lon_0-df$lon)^2) <=1)&(df$last==1)
Error: cannot allocate vector of size 1.2 Gb/pre>

Maybe we can skip the creating of the two previous variables

> df$first <- NULL
> df$last <- NULL
> object.size(df)
3839908904 bytes
> system.time(df$test)
       user      system     elapsed
       9.39       11.10       41.17

Let us create a small vector, with the list of all trajectories

> system.time(list_Traj <- unique(df$Traj_Id[df$test==TRUE]))
       user      system     elapsed
       0.72        0.25        0.98

and then, we can extract initial location of all those trajectories

> system.time(base <- df[(df$Traj_Id %in% list_Traj)& (c(1,df$Traj_Id[2:n]!=df$Traj_Id[1:(n-1)])==1),c("lat","lon")])
       user      system     elapsed
       11.7         2.7        14.4 
> head(base)
             lat      lon
556   -0.9597891 2.469243
2866  -0.9597891 2.469243
4321  -0.9597891 2.469243
5677  -0.9597891 2.469243
9403  -0.9597891 2.469243
10432 -0.9597891 2.469243
> nrow(base0
[1]  63453

Even with 160 million lines, that was not too painful. Based on that table, it is possible to get the distribution of the initial location,

> X <- base[,c("lon","lat")]
> library(KernSmooth)
> kde2d <- bkde2D(X, bandwidth=c(bw.ucv(X[,1]),bw.ucv(X[,2])),gridsize = c(251L, 251L))
> image(x=kde2d$x1, y=kde2d$x2,z=kde2d$fhat,col=
rev(heat.colors(100)))
> contour(x=kde2d$x1, y=kde2d$x2,z=kde2d$fhat, add=TRUE)

Here it took about one minute to extract the information out of those 160 million lines. What about other formats ?

Using the data table format

Let us load our data table

> rm(list=ls())
> library(data.table)
> system.time( load("dt_json_2.RData") )
       user      system     elapsed
      21.53        1.33       27.71

First, we should specify the key,

> system.time( setkey(dt,Traj_Id) )
       user      system     elapsed
       0.38        0.09        0.47

A first idea to get the starting point of all trajectories is to use

> system.time(depart <-dt[!duplicated(Traj_Id)]) 
       user      system     elapsed
       2.48        0.65        3.14

but it is possible to use a more elegant code, to get the starting and the ending point

> system.time( depart <- dt[J(unique(Traj_Id)), mult = "first"])
       user      system     elapsed
       2.64        0.75        3.39 
> system.time( arrivee <- dt[J(unique(Traj_Id)), mult = "last"] )
       user      system     elapsed
       2.81        0.51        3.33

Then we can create a distance to (0,0) variable

> lat_0=0
> lon_0=0
> system.time( arrivee[,dist:=(lat-lat_0)^2+(lon-lon_0)^2] )
       user      system     elapsed
       0.03        0.08        1.60

and finally, create a dataset with all trajectories that ended in the black disk

> system.time( fin <- subset(arrivee,dist <= 1) )
       user      system     elapsed
       0.04        0.00        0.78

We can remove variables that we do not need

> system.time( fin[,Pers_Id:=NULL] )
       user      system     elapsed
       0.00        0.00        0.07 
> system.time( fin[,lat:=NULL] )
       user      system     elapsed
        0.0         0.0         0.2 
> system.time( fin[,lon:=NULL] )
       user      system     elapsed
          0           0           0

set the key (just in case)

> system.time( setkey(fin, Traj_Id) )
       user      system     elapsed
       0.00        0.00        0.42

and then, use a merge, with the starting point dataset

> system.time( base <<- merge(fin,depart,all.x=TRUE) )
       user      system     elapsed
       0.02        0.00        0.17

and we’re done

> system.time( base <<- merge(fin,depart,all.x=TRUE) )
       user      system     elapsed
       0.02        0.00        0.17 
> 
> head(base)
   Traj_Id       dist Pers_Id        lat      lon
1:       8 0.41251163       1 -0.9597891 2.469243
2:      36 0.34545373       1 -0.9597891 2.469243
3:      54 0.24766671       1 -0.9597891 2.469243
4:      71 0.00210023       1 -0.9597891 2.469243
5:     117 0.00755432       1 -0.9597891 2.469243
6:     130 0.82806342       1 -0.9597891 2.469243

Hopefully, we have the same dataset as previously. Here, it was quite fast, less than 10 seconds.

Using a local data frame

A finaly alternative is to use our local data frame

> rm(list=ls())
> library(dplyr)
> library(data.table)
> system.time( load("ldf_json_2.RData") )
       user      system     elapsed
      27.53        1.77       69.84

We can adapt our previous code here,

> system.time( ldepart <<- ldf %>% group_by(Traj_Id) %>%
+ summarise(first_lat=head(lat,1),
first_lon=head(lon,1)) )
       user      system     elapsed
      60.82        1.46       80.54 

> system.time( larrive <<- ldf %>% group_by(Traj_Id) %>%
+ summarise(last_lat=tail(lat,1),last_lon=tail(lon,1)) )
       user      system     elapsed
      60.81        0.31       62.15 

> lat_0=0
> lon_0=0
> system.time( system.time( larrive <<- mutate(larrive,dist=(last_lat-lat_0)^2+(last_lon-lon_0)^2) ))
       user      system     elapsed
       0.08        0.00        0.09 
> system.time( lfin <<- filter(larrive,dist<=1) )
       user      system     elapsed
       0.05        0.00        0.04

Here also, we end with a merging function

> system.time( lbase <- left_join(lfin,ldepart) )
Joining by: "Traj_Id"
       user      system     elapsed
       0.53        0.05        0.66

One more time, the output contains the same information

> head(lbase)
Source: local data frame [63,453 x 6]
 
  Traj_Id    last_lat     last_lon        dist  first_lat first_lon
1       8  0.41374639  0.491248980 0.412511638 -0.9597891  2.469243
2      36  0.58774352  0.003360806 0.345453735 -0.9597891  2.469243
3      54  0.34479069 -0.358867800 0.247666719 -0.9597891  2.469243
4      71 -0.04341135  0.014686416 0.002100236 -0.9597891  2.469243
5     117 -0.05103682 -0.070353141 0.007554322 -0.9597891  2.469243
6     130 -0.56196768 -0.715720445 0.828063425 -0.9597891  2.469243

And here, it took about 2 minutes… The longest part was to extract those first and last observations. So far, it looks like data.table is just perfect to deal with those “large” datasets.

To leave a comment for the author, please follow the link and comment on their blog: Freakonometrics » R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Working with “large” datasets, with dplyr and data.table

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)