Importing Large NDJSON Files into R

February 8, 2018

(This article was first published on RLang.io | R Language Programming, and kindly contributed to R-bloggers)

I ran into this problem recently when trying to import the data my Twitter scraper produced, and thought it might make a worthwhile post.

The file I was trying to import was ~30 GB, which is absolutely monstrous. This was in part due to all of the fields I didn't bother dropping before writing the records to my data.json file.

The Process

The first thing I needed to do was figure out a manageable size. Thankfully the ndjson format keeps each record on a single line, so I could split the lines into however many files it took, based on the number of records my system could process within its memory (RAM) limit. I decided on 50,000 records per file, knowing my system could handle about 800,000 before filling up my RAM and paging file, and that I planned on parallelizing the process across 16 threads to speed it up dramatically.
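If you want to sanity-check a chunk size for your own machine, a rough back-of-the-envelope approach is to stream in a small sample and scale from its memory footprint. Here is a minimal sketch; the 1,000-line sample and the ~2 GB per-worker budget are illustrative numbers, not measurements from my setup.

library("jsonlite")
# Rough sanity check of the chunk size: parse a small sample of records and
# see how much memory each one takes once it becomes a data frame.
# The 1,000-line sample and the ~2 GB per-worker budget are illustrative only.
sample_lines <- readLines("data.json", n = 1000)
sample_df <- stream_in(textConnection(sample_lines), verbose = FALSE)
bytes_per_record <- as.numeric(object.size(sample_df)) / nrow(sample_df)
# Records that would fit in roughly 2 GB of RAM per worker
floor(2 * 1024^3 / bytes_per_record)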

I made sure I had an empty folder to write the split file segments to, and ran this command from my working directory in Terminal.

split -l 50000 data.json ./import/tweets_
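If split isn't available on your system (on Windows, for example), the same chunking can be done from R itself. Here is a rough sketch that mirrors the command above; the numbered file names are my own illustrative choice rather than split's aa/ab suffixes.

# Pure-R stand-in for split: read data.json 50,000 lines at a time and write
# each chunk to ./import/. The numbered file names are illustrative.
con <- file("data.json", open = "r")
chunk <- 0
repeat {
  lines <- readLines(con, n = 50000)
  if (length(lines) == 0) break
  chunk <- chunk + 1
  writeLines(lines, sprintf("./import/tweets_%03d", chunk))
}
close(con)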

Simple, right? Now we will probably want to see the variables (technically properties, since these are JavaScript objects).

head -1 import/tweets_da | grep -oP '"([a-zA-Z0-9\-_]+)"\:'

This gives you output similar to the following:

"id":
"text":
"source":
"truncated":
"user":
"id":
"name":
"location":
"url":
"description":
"protected":
"verified":
"lang":
"following":
"notifications":
"geo":
"coordinates":
"place":
"contributors":
"id":
"text":
"source":
"truncated":
"user":
"id":
"name":
"location":
"url":
"description":
"protected":
"verified":
...

Regular expressions are the best, aren't they?
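If you'd rather stay in R, the same property names can be pulled out by parsing a single record. A minimal sketch, assuming the first chunk that split produced is named tweets_aa:

library("jsonlite")
# Parse one record from the first chunk and list its property names.
# Assumes split named the first chunk tweets_aa, as in the grep example above.
first_record <- fromJSON(readLines("./import/tweets_aa", n = 1))
names(first_record)
# Nested objects, such as user, carry their own properties
names(first_record$user)

Now for the R code which makes this buildup actually worthwhile.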

library("data.table")
library("parallel")
library("jsonlite")
#Parallize this process on 16 threads
cluster <- makeCluster(16)
#Export the jsonlite function stream_in to the cluster
clusterExport(cluster,list("stream_in"))
#Create an empty list for the dataframe for each file
import <- list()
#Run this function on every file in the ./import directory
import <- parLapply(cluster,list.files(path = "./import"),function(file) {
    #jsonlite function to convert the ndjson file to a dataframe
  df <- stream_in(file(paste0("./import/",file)))
    #select which columns to keep
  df <- df[,c("text","created_at","lat","lng","id_str")]
  return(df)
})
#function called from the data.table library
df <- rbindlist(import)
#Now you can stop the cluster
stopCluster(cluster)

Now the system won't bonk, since it is only keeping five variables! You will notice your RAM usage fluctuate quite a bit while reading in the files, since the initial stream_in() loads all of the properties into the data frame (sometimes with nesting). Once the extra columns are dropped, the memory is freed up.
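With only five columns left, the combined table should be small enough to save and reuse, so the 30 GB import never has to be run again. Either of these is a standard option; the file names are just examples.

# Save the slimmed-down table so the 30 GB import never has to be repeated.
# The file names are just examples.
saveRDS(df, "tweets_subset.rds")             # compact, R-native format
data.table::fwrite(df, "tweets_subset.csv")  # plain CSV via data.table

Happy programming 🙂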
