Reading files in JSON format – a comparison between R and Python

[This article was first published on Stat Of Mind, and kindly contributed to R-bloggers.]

A file format that I am seeing more and more often is JSON (JavaScript Object Notation). JSON is an open, human-readable standard that is commonly used to transmit data between servers and web applications. Below is a typical example of data in JSON format.

{"votes":
 {
  "funny": 0,
  "useful": 7,
  "cool": 0
 },
 "user_id": "CR2y7yEm4X035ZMzrTtN9Q",
 "name": "Jim",
 "average_stars": 5.0,
 "review_count": 6,
 "type": "user"
}
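
To make the structure concrete, here is a minimal Python sketch that parses the record above with the standard json module and reads one nested and one top-level field. The variable names (record_text, record) are my own and only for illustration.

```python
import json

# The sample record shown above, as a JSON string.
record_text = """
{"votes": {"funny": 0, "useful": 7, "cool": 0},
 "user_id": "CR2y7yEm4X035ZMzrTtN9Q",
 "name": "Jim",
 "average_stars": 5.0,
 "review_count": 6,
 "type": "user"}
"""

# json.loads turns a JSON object into a Python dict;
# nested objects become nested dicts.
record = json.loads(record_text)

useful_votes = record["votes"]["useful"]   # nested field
average_stars = record["average_stars"]    # top-level field

print(record["name"], useful_votes, average_stars)
```

JSON numbers map to Python int or float depending on whether they carry a decimal point, so "review_count" comes back as an int and "average_stars" as a float.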

In this post, I will compare the performance of R and Python when reading data in JSON format. More specifically, I will conduct an extremely simple analysis of the famous Yelp Houston-based user ratings file (~216 MB), which will consist of reading the data and plotting a histogram of the ratings given by users. I tried to ensure that the workload in both scripts was as similar as possible, so that I can establish which language is quicker.

In R:

# import required packages
library("rjson")

# define function read_json
read_json <- function()
{
  # read the file as one string per line (each line holds one JSON record)
  json.file <- sprintf("%s/data/yelp_academic_dataset_review.json", getwd())
  raw.json <- scan(json.file, what=character(), sep="\n")

  # parse each JSON line into an R list
  json.data <- lapply(raw.json, fromJSON)

  # extract user rating information
  user.rating <- unlist(lapply(json.data, function(x) x$stars))

  # not shown
  # hist(user.rating)
}

# compute total time needed
elapsed <- system.time(read_json())
elapsed
   user  system elapsed 
 32.295   0.509  38.172 

In Python:

# import modules
import json
import os
import time

# start process time (CPU time used by this process)
start = time.process_time()

# read in yelp data (one JSON record per line)
yelp_file = "%s/data/yelp_academic_dataset_review.json" % os.getcwd()
yelp_data = []
with open(yelp_file) as f:
    for line in f:
        yelp_data.append(json.loads(line))

# extract user rating information
user_rating = []
for item in yelp_data:
    user_rating.append(item['stars'])

elapsed = time.process_time() - start
elapsed
12.520227999999996
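
R's system.time() reports CPU time (user, system) and wall-clock time (elapsed) separately, whereas the Python script prints a single number. For a like-for-like comparison it helps to capture both kinds of time on the Python side as well. Below is a rough sketch using time.process_time() and time.perf_counter() from the standard library. The helper name time_call is made up for illustration, and sum over a range merely stands in for the real workload.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, cpu_seconds, wall_seconds).

    cpu_seconds is user + system CPU time of this process,
    wall_seconds is elapsed real time.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, cpu_seconds, wall_seconds


# Stand-in workload; in practice this would be the JSON-reading function.
result, cpu_seconds, wall_seconds = time_call(sum, range(1000000))
print("cpu: %.3f s, wall: %.3f s" % (cpu_seconds, wall_seconds))
```

The CPU figure is the one to set against R's user + system columns, and the wall-clock figure against R's elapsed column.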

As expected, Python was significantly faster than R (12.5s vs. 38.2s) when reading this JSON file. In fact, experience tells me that this will be the case for almost any file format… 🙂
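
Neither script shows the histogram step itself. Here is a minimal Python sketch of it that tallies the 'stars' field line by line, without keeping every parsed record in memory, and prints a crude text histogram. The function names (count_ratings, text_histogram) and the three sample lines are invented for illustration; they are not taken from the Yelp file.

```python
import json
from collections import Counter


def count_ratings(lines):
    """Tally the 'stars' field over an iterable of JSON-encoded lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:          # skip blank lines
            continue
        counts[json.loads(line)["stars"]] += 1
    return counts


def text_histogram(counts, width=40):
    """Return one row per rating, with bars scaled to the largest bin."""
    if not counts:
        return []
    largest = max(counts.values())
    rows = []
    for stars, n in sorted(counts.items()):
        bar = "#" * max(1, round(width * n / largest))
        rows.append("%s | %s %d" % (stars, bar, n))
    return rows


# Made-up sample lines standing in for the real review file.
sample = [
    '{"stars": 5, "text": "great"}',
    '{"stars": 4, "text": "good"}',
    '{"stars": 5, "text": "great again"}',
]
counts = count_ratings(sample)
for row in text_histogram(counts):
    print(row)
```

Because count_ratings accepts any iterable of strings, an open file object can be passed in directly, for example inside `with open(path) as f:`.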
