Importing Large NDJSON Files into R
I ran into this problem recently when trying to import the data my Twitter scraper produced, and thought it might make a worthwhile post.
The file I was trying to import was ~30GB, which is absolutely monstrous. This was in part due to all of the fields I didn’t bother dropping before writing the records to my data.json file.
The Process
The first thing I needed to do was figure out a manageable size. Thankfully the ndjson format keeps each record on a single line, so I could split the lines into however many files were needed, based on a known number of records my system could process within my memory (RAM) limit. I decided on 50,000 records per file, knowing my system could handle about 800,000 before filling up my RAM and paging file, and that I planned on parallelizing the process across 16 threads (16 × 50,000 = 800,000) to speed it up quite dramatically.
I made sure I had an empty folder to write the split file segments to, and ran this command from my working directory in Terminal.
split -l 50000 data.json ./import/tweets_
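Before moving on, it doesn’t hurt to confirm the split actually produced the chunks you expect. A quick check from R (just a sketch; it assumes the chunks landed in ./import with the tweets_ prefix from the command above):

#Count the chunk files that split produced
chunks <- list.files(path = "./import", pattern = "^tweets_")
length(chunks)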
Simple, right? Now we will probably want to see the variables (technically properties, since these are JavaScript objects).
head -1 import/tweets_da | grep -oP '"([a-zA-Z0-9\-_]+)"\:'
This gives you output similar to this:
"id": "text": "source": "truncated": "user": "id": "name": "location": "url": "description": "protected": "verified": "lang": "following": "notifications": "geo": "coordinates": "place": "contributors": "id": "text": "source": "truncated": "user": "id": "name": "location": "url": "description": "protected": "verified": ...
Regular expressions are the best, aren’t they? Now for the R code that makes this buildup actually worthwhile.
library("data.table")
library("parallel")
library("jsonlite")
#Parallelize this process on 16 threads
cluster <- makeCluster(16)
#Export the jsonlite function stream_in to the cluster
clusterExport(cluster,list("stream_in"))
#Create an empty list to hold the data frame from each file
import <- list()
#Run this function on every file in the ./import directory
import <- parLapply(cluster,list.files(path = "./import"),function(file) {
#jsonlite function to convert the ndjson file to a dataframe
df <- stream_in(file(paste0("./import/",file)))
#select which columns to keep
df <- df[,c("text","created_at","lat","lng","id_str")]
return(df)
})
#rbindlist (from the data.table library) combines the list of data frames into one data.table
df <- rbindlist(import)
#Now you can stop the cluster
stopCluster(cluster)
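At this point df holds only the five columns, so it is small enough to save off for later rather than redoing the 30GB import. A minimal sketch using data.table’s fwrite (the output filename is just a placeholder):

#Write the slimmed-down table to disk for future sessions
fwrite(df, "tweets_slim.csv")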
Now the system won’t bonk, since it only keeps 5 variables! You will notice your RAM fluctuate quite a bit while reading in the files, since the initial stream_in() loads all of the properties into the data frame (sometimes with nesting). Once the extra columns are dropped, the memory is freed up. Happy programming!