Recently we were building a Shiny App in which we had to load data from a very large dataframe. This directly impacted the app initialization time, so we had to look into different ways of reading data from files into R (in our case customer-provided csv files) and identify the best one.
The goal of my post is to compare:
- utils, which was the standard way of reading csv files to R in RStudio,
- readr, which replaced the former method as the standard way of doing it in RStudio,
First, let’s generate some random data
and save the files to disk to evaluate the loading time. Besides the
csv format we will also need
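The code for this step is not reproduced here, so what follows is only a minimal sketch of how such data could be generated and saved. The column names, the row count and the temporary file paths are assumptions made for illustration, and the feather step runs only if the feather package happens to be installed.

```r
# Hypothetical sample data; the size and the columns are assumptions
set.seed(123)
n <- 1e4
df <- data.frame(
  id    = seq_len(n),
  value = rnorm(n),
  group = sample(letters, n, replace = TRUE)
)

# Temporary paths, so the sketch does not touch any real files
csv_path     <- tempfile(fileext = ".csv")
rds_path     <- tempfile(fileext = ".rds")
feather_path <- tempfile(fileext = ".feather")

write.csv(df, csv_path, row.names = FALSE)  # plain text
saveRDS(df, rds_path)                       # compressed by default

# feather is treated as an optional dependency in this sketch
if (requireNamespace("feather", quietly = TRUE)) {
  feather::write_feather(df, feather_path)
}
```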
Next, let’s check our file sizes:
As we can see, both the csv and feather format files take up much more storage space: csv more than 6 times and feather more than 4 times as much as the compressed formats.
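A size comparison like this can be done with base R's file.size. The sketch below is an illustration on a small throwaway data set, not the data from this post, so the ratios it prints will differ from the ones quoted above; the data and paths are assumptions.

```r
# Small hypothetical data set written in two formats
set.seed(123)
df <- data.frame(
  value = rnorm(1e4),
  group = sample(letters, 1e4, replace = TRUE)
)
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
write.csv(df, csv_path, row.names = FALSE)
saveRDS(df, rds_path)

# Sizes in bytes, then expressed relative to the RDS file
sizes <- file.size(c(csv_path, rds_path))
names(sizes) <- c("csv", "rds")
print(sizes)
print(sizes / sizes[["rds"]])
```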
We will use the microbenchmark library to compare the reading times of the following methods:
in 10 rounds.
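As a rough sketch of how such a benchmark can be set up: the example below times only read.csv and readRDS on a small hypothetical data set, as stand-ins for the full list of methods. The data and paths are assumptions, and the benchmark is skipped if microbenchmark is not installed.

```r
# Hypothetical data written to temporary files
set.seed(1)
df <- data.frame(
  x = rnorm(1e4),
  y = sample(letters, 1e4, replace = TRUE)
)
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
write.csv(df, csv_path, row.names = FALSE)
saveRDS(df, rds_path)

# Each expression is evaluated 10 times, matching the 10 rounds
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  res <- microbenchmark::microbenchmark(
    read.csv = utils::read.csv(csv_path),
    readRDS  = readRDS(rds_path),
    times    = 10
  )
  print(res)
}
```

Other readers can be added as further named arguments in the same call.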
And the winner is… feather! However, using feather requires prior conversion of the file to the feather format.
readRDS can improve performance (second and third place in terms of speed) and has the benefit of storing a smaller, compressed file. In both cases you will have to convert your file to the proper format first.
When it comes to reading from csv, fread significantly beats read.csv, and thus is the best option to read a csv file.
In our case we decided to go with the feather file, since conversion from csv to this format is just a one-time job and we had no strict storage limit that would have pushed us toward a smaller, compressed format.
The final workflow was:
- reading the csv file provided by our customer,
- writing it to the feather format,
- loading the feather file on app initialization.
The first two tasks were done once, outside of the Shiny App context.
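The workflow above could look roughly like the sketch below. It assumes data.table's fread for the csv step and the feather package's write_feather and read_feather for the conversion and loading steps; the tiny data set and the temporary paths are hypothetical, and the whole thing is skipped if either package is missing.

```r
# Stand-in for the customer-provided csv (hypothetical content)
csv_path     <- tempfile(fileext = ".csv")
feather_path <- tempfile(fileext = ".feather")
write.csv(
  data.frame(id = 1:5, value = c(2.5, 3.1, 4.7, 1.2, 9.9)),
  csv_path,
  row.names = FALSE
)

have_pkgs <- requireNamespace("data.table", quietly = TRUE) &&
  requireNamespace("feather", quietly = TRUE)

if (have_pkgs) {
  # One-time conversion, run outside of the Shiny App
  customer_data <- data.table::fread(csv_path)
  feather::write_feather(customer_data, feather_path)

  # On app initialization, only this fast read is needed
  app_data <- feather::read_feather(feather_path)
}
```

Keeping the conversion out of the app means the slower csv parse is paid once rather than on every start.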
There is also quite an interesting benchmark done by Hadley here on reading complete files into R. Unfortunately, if you use the functions defined in that post, you will end up with a character type object, and you will have to apply string manipulations to obtain a commonly used dataframe.
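To illustrate the difference between a raw character object and a dataframe, here is a small sketch using base R's readChar. This is not taken from the linked benchmark, whose functions are not shown here; it is just one way of reading a complete file as text, and the sample file is hypothetical.

```r
# Hypothetical csv file to read back
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = 1:3, name = c("a", "b", "c")),
  csv_path,
  row.names = FALSE
)

# Reading the complete file gives one long character string
raw_text <- readChar(csv_path, nchars = file.size(csv_path))
print(class(raw_text))

# An extra parsing step is still needed to get a dataframe
parsed <- read.csv(text = raw_text)
print(class(parsed))
```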