Working with “large” datasets, with dplyr and data.table
A few months ago, I was doing some training on data science for actuaries, and I started to get interesting, puzzling questions. For instance, Fleur was working on telematic data, and she has been challenging my (rudimentary) knowledge of R. As claimed by Donald Knuth, “we should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil”. So usually, in my courses and training sessions, the code is very basic and easy to understand, but usually poorly efficient. Since I was challenged to work on very large datasets, we have been looking at R functions to manipulate those possibly (very) large datasets, and to run some simple operations (filtering and aggregating) as fast as possible.
In order to illustrate, let us generate our “large” telematic dataset. Assume that we have 10,000 drivers, each of them drives about 200 times and, each time, we record, say, 80 locations. That means around 160 million observations. It is “large”, but not huge.
> rm(list=ls())
> N_id=10000
> N_tr=200
> T_tr=80
In order to keep the code as general as possible, assume that there is some randomness in those numbers,
> set.seed(1)
> N=rpois(N_id,N_tr)
> N_traj=rpois(sum(N),T_tr)
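Just as a quick sanity check (a small sketch, the exact values depend on the seed), we can verify that we get roughly 2 million trajectories and 160 million observations,

> sum(N)       # total number of trajectories, around 10,000 x 200 = 2 million
> sum(N_traj)  # total number of observations, around 160 million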
By “observation”, we consider a driver Id., a trajectory Id., and a location (latitude and longitude) at some specific dates (e.g. every 15 sec.). Again, just because we want some dataset to illustrate, we will draw the drivers’ homes randomly (here, uniformly on some square),
> origin_lat=runif(N_id,-5,5)
> origin_lon=runif(N_id,-5,5)
And, then, from those locations, we generate a 2-dimensional random walk,
> lat=lon=Traj_Id=rep(NA,sum(N_traj))
> Pers_Id=rep(NA,length(N_traj))
> s=1
> for(i in 1:N_id){Pers_Id[s:(s+N[i]-1)]=i;s=s+N[i]}
> s=1
> for(i in 1:length(N_traj)){
+   lat[s:(s+N_traj[i]-1)]=origin_lat[Pers_Id[i]]+
+     cumsum(c(0,rnorm(N_traj[i]-1,0,sd=.2)));
+   lon[s:(s+N_traj[i]-1)]=origin_lon[Pers_Id[i]]+
+     cumsum(c(0,rnorm(N_traj[i]-1,0,sd=.2)));
+   s=s+N_traj[i]}
We then have something which looks like a collection of two-dimensional random walks, starting from each driver’s home.
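Just to visualize one of them, here is a minimal sketch (not the code used to produce the actual figure) to plot the first simulated trajectory, and the home of the first driver,

> plot(lon[1:N_traj[1]], lat[1:N_traj[1]], type="l", col="blue",
+   xlab="longitude", ylab="latitude")
> points(origin_lon[1], origin_lat[1], pch=19, col="red")  # home of driver 1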
Then, we create variables for the Driver Id., and the Trajectory Id.,
> Pers_Id=rep(NA,sum(N_traj))
> N0=cumsum(c(0,N))
> s=1
> for(i in 1:N_id){
+   Pers_Id[s:(s+sum(N_traj[(N0[i]+1):N0[i+1]])-1)]=i;
+   s=s+sum(N_traj[(N0[i]+1):N0[i+1]])}
> s=1
> for(i in 1:sum(N)){Traj_Id[s:(s+N_traj[i]-1)]=i;s=s+N_traj[i]}
Now, we should store that dataset. Consider for instance a standard data frame,
> df=data.frame(Pers_Id,Traj_Id,lat,lon)
But in order to run faster code (or code supposed to be faster), we can consider a local data frame,
> library(dplyr)
> ldf=data_frame(Pers_Id,Traj_Id,lat,lon)
or a data table,
> library(data.table)
> dt=data.table(Pers_Id,Traj_Id,lat,lon)
Observe that those three objects have almost the same size,
> object.size(df)
3839908904 bytes
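We can check the two other objects, and save the three of them as .RData files (a sketch; the sizes of ldf and dt are not reproduced here, and the save() calls simply assume the file names used below),

> object.size(ldf)
> object.size(dt)
> save(df, file="df_json_2.RData")
> save(ldf, file="ldf_json_2.RData")
> save(dt, file="dt_json_2.RData")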
Actually, I have already created those files, and the three of them are stored on Dropbox (e.g. dt_json_2.RData). To use such a file, two alternative techniques can be considered:
- store the file on a local drive, and then load it
It will not take that long (even on a basic laptop),
> rm(list=ls())
> library(data.table)
> system.time( load("dt_json_2.RData") )
   user  system elapsed 
  21.53    1.33   27.71 
- load the file directly from Dropbox
That is a bit tricky, but it is possible,
> library(RCurl)
> dropbox_ldf <- "https://dl.dropboxusercontent.com/s/l7rolinwojn37e2/ldf_json_2.RData"
> dropbox_df <- "https://dl.dropboxusercontent.com/s/wzr19v9pyl7ah0j/df_json_2.RData"
> dropbox_dt <- "https://dl.dropboxusercontent.com/s/nlwi1mmsxr5f23k/dt_json_2.RData"
> source_https <- function(loc,...){
+   curl=getCurlHandle()
+   curlSetOpt(cookiejar="cookies.txt",
+     useragent="Mozilla/5.0", followlocation=TRUE, curl=curl)
+   # download the binary .RData content, store it in a temporary file, and load it
+   tmp <- getBinaryURL(loc, .opts=list(ssl.verifypeer=FALSE), curl=curl)
+   fch <- tempfile(fileext=".RData")
+   writeBin(tmp, fch)
+   load(fch, envir=.GlobalEnv)
+ }
Then, you have to be (really) patient (unless you have a very good internet connection).
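For instance, to get the data table directly from Dropbox (a sketch, not timed here; it assumes the source_https function defined above, and a working connection),

> source_https(dropbox_dt)
> dim(dt)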
Now that we have our dataset, consider the following simple problem: given a final location close to the origin (0,0), i.e. a final distance to (0,0) smaller than 1, what was the initial location? On the graph below, given that a trajectory ended in the black circle, what could have been the location of its starting point?
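To visualize that target region, here is a small sketch (on a random subset of the points, since plotting 160 million of them is hopeless; it assumes the data frame df is loaded),

> idx <- sample(nrow(df), 1e5)
> plot(df$lon[idx], df$lat[idx], pch=".", col="grey",
+   xlab="longitude", ylab="latitude")
> symbols(0, 0, circles=1, inches=FALSE, add=TRUE, lwd=2)  # the black disk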
Here, we can use the three different datasets, which rely on three different packages.
- Using the (standard) data frame format
Consider here our data frame file,
> system.time(load("df_json_2.RData")) user system elapsed 21.76 1.39 29.59
In order to get the location of the initial starting point, given information on the ending point, let us create two variables,
> system.time(n <- nrow(df))
   user  system elapsed 
      0       0       0 
> system.time(df$first <- c(1,df$Traj_Id[2:n]!=df$Traj_Id[1:(n-1)]))
   user  system elapsed 
   3.12    1.23    4.37 
> system.time(df$last <- c(df$Traj_Id[2:n]!=df$Traj_Id[1:(n-1)],1))
   user  system elapsed 
   2.90    6.23   31.58 
I do not know why the second one took that long; it might be a memory issue, since our dataset is now quite large,
> object.size(df)
6399847720 bytes
On a Windows machine, we might start to have problems, because we also want to create a variable to test whether the final location was in the black disk, or not,
> lat_0=0
> lon_0=0
> system.time(df$test <- ((lat_0-df$lat)^2+(lon_0-df$lon)^2 <= 1)&(df$last==1))
Error: cannot allocate vector of size 1.2 Gb
Maybe we can skip the creation of the two previous variables,
> df$first <- NULL
> df$last <- NULL
> object.size(df)
3839908904 bytes
> system.time(df$test <- ((lat_0-df$lat)^2+(lon_0-df$lon)^2 <= 1)&
+   (c(df$Traj_Id[2:n]!=df$Traj_Id[1:(n-1)],1)==1))
   user  system elapsed 
   9.39   11.10   41.17 
Let us create a small vector, with the list of all trajectories ending in the disk,
> system.time(list_Traj <- unique(df$Traj_Id[df$test==TRUE]))
   user  system elapsed 
   0.72    0.25    0.98 
and then, we can extract the initial location of all those trajectories,
> system.time(base <- df[(df$Traj_Id %in% list_Traj)&
+   (c(1,df$Traj_Id[2:n]!=df$Traj_Id[1:(n-1)])==1),c("lat","lon")])
   user  system elapsed 
   11.7     2.7    14.4 
> head(base)
             lat      lon
556   -0.9597891 2.469243
2866  -0.9597891 2.469243
4321  -0.9597891 2.469243
5677  -0.9597891 2.469243
9403  -0.9597891 2.469243
10432 -0.9597891 2.469243
> nrow(base)
[1] 63453
Even with 160 million lines, that was not too painful. Based on that table, it is possible to get the distribution of the initial location,
> X <- base[,c("lon","lat")] > library(KernSmooth) > kde2d <- bkde2D(X, bandwidth=c(bw.ucv(X[,1]),bw.ucv(X[,2])),gridsize = c(251L, 251L)) > image(x=kde2d$x1, y=kde2d$x2,z=kde2d$fhat,col= rev(heat.colors(100))) > contour(x=kde2d$x1, y=kde2d$x2,z=kde2d$fhat, add=TRUE)
Here, it took about one minute to extract the information out of those 160 million lines. What about the other formats?
- Using the data table format
Let us load our data table
> rm(list=ls())
> library(data.table)
> system.time( load("dt_json_2.RData") )
   user  system elapsed 
  21.53    1.33   27.71 
First, we should specify the key,
> system.time( setkey(dt,Traj_Id) )
   user  system elapsed 
   0.38    0.09    0.47 
A first idea to get the starting point of all trajectories is to use
> system.time(depart <- dt[!duplicated(Traj_Id)])
   user  system elapsed 
   2.48    0.65    3.14 
but it is possible to use more elegant code, to get both the starting and the ending points,
> system.time( depart <- dt[J(unique(Traj_Id)), mult="first"] )
   user  system elapsed 
   2.64    0.75    3.39 
> system.time( arrivee <- dt[J(unique(Traj_Id)), mult="last"] )
   user  system elapsed 
   2.81    0.51    3.33 
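An alternative, not timed here, is to use grouping with .SD to extract the first and the last row of each trajectory (a sketch that should return the same tables, up to the order of the rows),

> depart  <- dt[, .SD[1L], by=Traj_Id]   # first observation of each trajectory
> arrivee <- dt[, .SD[.N], by=Traj_Id]   # last observation of each trajectory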
Then, we can create a variable measuring the (squared) distance to (0,0),
> lat_0=0
> lon_0=0
> system.time( arrivee[,dist:=(lat-lat_0)^2+(lon-lon_0)^2] )
   user  system elapsed 
   0.03    0.08    1.60 
and finally, create a dataset with all trajectories that ended in the black disk
> system.time( fin <- subset(arrivee,dist <= 1) )
   user  system elapsed 
   0.04    0.00    0.78 
We can remove variables that we do not need
> system.time( fin[,Pers_Id:=NULL] )
   user  system elapsed 
   0.00    0.00    0.07 
> system.time( fin[,lat:=NULL] )
   user  system elapsed 
    0.0     0.0     0.2 
> system.time( fin[,lon:=NULL] )
   user  system elapsed 
      0       0       0 
set the key (just in case)
> system.time( setkey(fin, Traj_Id) )
   user  system elapsed 
   0.00    0.00    0.42 
and then, use a merge, with the starting point dataset
> system.time( base <<- merge(fin,depart,all.x=TRUE) )
   user  system elapsed 
   0.02    0.00    0.17 
and we’re done
> head(base)
   Traj_Id       dist Pers_Id        lat      lon
1:       8 0.41251163       1 -0.9597891 2.469243
2:      36 0.34545373       1 -0.9597891 2.469243
3:      54 0.24766671       1 -0.9597891 2.469243
4:      71 0.00210023       1 -0.9597891 2.469243
5:     117 0.00755432       1 -0.9597891 2.469243
6:     130 0.82806342       1 -0.9597891 2.469243
And we do get the same dataset as previously. Here, it was quite fast, less than 10 seconds.
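For the record, the whole data.table extraction can also be written as a short chain (a sketch that should give the same rows, with slightly different columns; lat_0 and lon_0 are the ones defined above),

> arrivee2 <- dt[J(unique(Traj_Id)), mult="last"][(lat-lat_0)^2+(lon-lon_0)^2 <= 1]
> base2 <- dt[J(unique(Traj_Id)), mult="first"][Traj_Id %in% arrivee2$Traj_Id]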
- Using a local data frame
A final alternative is to use our local data frame,
> rm(list=ls())
> library(dplyr)
> library(data.table)
> system.time( load("ldf_json_2.RData") )
   user  system elapsed 
  27.53    1.77   69.84 
We can adapt our previous code here,
> system.time( ldepart <<- ldf %>% group_by(Traj_Id) %>%
+   summarise(first_lat=head(lat,1), first_lon=head(lon,1)) )
   user  system elapsed 
  60.82    1.46   80.54 
> system.time( larrive <<- ldf %>% group_by(Traj_Id) %>%
+   summarise(last_lat=tail(lat,1), last_lon=tail(lon,1)) )
   user  system elapsed 
  60.81    0.31   62.15 
> lat_0=0
> lon_0=0
> system.time( larrive <<- mutate(larrive,dist=(last_lat-lat_0)^2+(last_lon-lon_0)^2) )
   user  system elapsed 
   0.08    0.00    0.09 
> system.time( lfin <<- filter(larrive,dist<=1) )
   user  system elapsed 
   0.05    0.00    0.04 
Here also, we end with a merging function
> system.time( lbase <- left_join(lfin,ldepart) )
Joining by: "Traj_Id"
   user  system elapsed 
   0.53    0.05    0.66 
Once more, the output contains the same information,
> head(lbase)
Source: local data frame [63,453 x 6]

  Traj_Id    last_lat     last_lon        dist  first_lat first_lon
1       8  0.41374639  0.491248980 0.412511638 -0.9597891  2.469243
2      36  0.58774352  0.003360806 0.345453735 -0.9597891  2.469243
3      54  0.34479069 -0.358867800 0.247666719 -0.9597891  2.469243
4      71 -0.04341135  0.014686416 0.002100236 -0.9597891  2.469243
5     117 -0.05103682 -0.070353141 0.007554322 -0.9597891  2.469243
6     130 -0.56196768 -0.715720445 0.828063425 -0.9597891  2.469243
And here, it took about 2 minutes… The longest part was extracting those first and last observations. So far, it looks like data.table is just perfect for dealing with those “large” datasets.
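One last remark: dplyr also has first() and last() functions, which might be faster than head(.,1) and tail(.,1) when extracting those first and last observations. I have not timed it, but a more compact version could look like

> lbase2 <- ldf %>% group_by(Traj_Id) %>%
+   summarise(first_lat=first(lat), first_lon=first(lon),
+             last_lat=last(lat), last_lon=last(lon)) %>%
+   mutate(dist=(last_lat-lat_0)^2+(last_lon-lon_0)^2) %>%
+   filter(dist <= 1)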