Reading files in JSON format – a comparison between R and Python
A file format that I am seeing more and more often is the JSON (JavaScript Object Notation) format. JSON is an open standard format in human-readable form that is used to transmit data between servers and web applications. Below is a typical example of data in JSON format.
{
"votes":
{
"funny": 0,
"useful": 7,
"cool": 0
},
"user_id": "CR2y7yEm4X035ZMzrTtN9Q",
"name": "Jim",
"average_stars": 5.0,
"review_count": 6,
"type": "user"
}
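To make the structure concrete, here is a minimal Python sketch that parses the record above with the standard `json` module. The variable names (`record_text`, `record`) are my own and only serve the illustration.

```python
import json

# The example record from above, stored as a single JSON string.
record_text = """
{
  "votes": {"funny": 0, "useful": 7, "cool": 0},
  "user_id": "CR2y7yEm4X035ZMzrTtN9Q",
  "name": "Jim",
  "average_stars": 5.0,
  "review_count": 6,
  "type": "user"
}
"""

# json.loads turns the text into nested Python dicts, strings and numbers.
record = json.loads(record_text)

print(record["name"])             # Jim
print(record["votes"]["useful"])  # 7
```

Nested objects such as `votes` become nested dictionaries, so individual fields are reached by chaining keys.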
In this post, I will compare the performance of R and Python when reading data in JSON format. More specifically, I will conduct an extremely simple analysis of the famous Yelp Houston-based user ratings file (~216 MB), which consists of reading the data and plotting a histogram of the ratings given by users. I tried to keep the workload in both scripts as similar as possible, so that I can establish which language is faster.
In R:
# import required packages
library("rjson")

# define function read_json
'read_json' <- function()
{
  # read json file
  json.file <- sprintf("%s/data/yelp_academic_dataset_review.json", getwd())
  raw.json <- scan(json.file, what="raw()", sep="\n")

  # format json text to human-readable text
  json.data <- lapply(raw.json, function(x) fromJSON(x))

  # extract user rating information
  user.rating <- unlist(lapply(json.data, function(x) x$stars))

  # not shown
  #hist(user.rating)
}

# compute total time needed
elapsed <- system.time(read_json())
elapsed
   user  system elapsed
 32.295   0.509  38.172
In Python:
# import modules
import json
import glob
import os
import time

# start process time
start = time.clock()

# read in yelp data
yelp_files = "%s/data/yelp_academic_dataset_review.json" % os.getcwd()
yelp_data = []
with open(yelp_files) as f:
    for line in f:
        yelp_data.append(json.loads(line))

# extract user rating information
user_rating = []
for item in yelp_data:
    user_rating.append(item[u'stars'])

elapsed = (time.clock() - start)
elapsed
12.520227999999996
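The script above relies on `time.clock()`, which was removed in Python 3.8. For anyone rerunning the test on a current interpreter, the following is a rough sketch of how the same steps might look. The helper names `read_ratings` and `timed` are hypothetical, and the sketch extracts the `stars` field while reading rather than storing every parsed record first, so its timings are not directly comparable to the figure reported above. Because `time.clock()` reported processor time on Unix and wall-clock time on Windows, the sketch returns both.

```python
import json
import time


def read_ratings(path):
    """Read a file with one JSON object per line and return the 'stars' values."""
    ratings = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            ratings.append(json.loads(line)["stars"])
    return ratings


def timed(func, *args):
    """Run func(*args) and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()  # wall-clock timer
    cpu_start = time.process_time()   # CPU time of the current process
    result = func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu
```

Calling `timed(read_ratings, path)` with `path` pointing at the review file would return the list of ratings together with the wall-clock and CPU seconds, which roughly correspond to the `elapsed` and `user` + `system` columns of R's `system.time()`.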
As expected, Python was significantly faster than R when reading this JSON file: roughly three times faster (12.5 s vs. 38.2 s). In fact, experience tells me that this will be the case for almost any file format… 🙂