Establishing Meaningful Performance Comparisons between R and Python

(This article was first published on syknapptic, and kindly contributed to R-bloggers)

R vs Python

Performance comparisons between R and Python suck.

Most seem to be run in Jupyter Notebook, and many use Python’s rpy2 library to run poorly optimized R code. I’m not an anti-for() loop purist (yes, you can use them effectively in R), but thanks to the base::*apply() family and their beautiful purrr::map*() children, there are usually better solutions.

Unfortunately, some of these comparisons arbitrarily test loops in R where you would never, ever do so.

In a language where vectors serve as the fundamental data structure, it makes no sense that code like this receives such prominent treatment in seemingly every test…

normal_distibution <- rnorm(2500)

bad_R <- vector(mode = "numeric", length = length(normal_distibution))

for(i in seq_along(normal_distibution)) {
  bad_R[i] <- normal_distibution[i] * normal_distibution[i]
}

If we had to do something explicitly “loopy”, we’d still probably do something like this…

not_so_good_R <- vapply(normal_distibution, function(x) x^2, numeric(1))

identical(bad_R, not_so_good_R)
## [1] TRUE

… but it’s still taking advantage of the fact that normal_distibution is a homogeneous collection of atomic values: a vector.

all(is.vector(normal_distibution), is.atomic(normal_distibution))
## [1] TRUE

With that in mind, just do this…

good_R <- normal_distibution^2

identical(bad_R, good_R)
## [1] TRUE

In Python, using reticulate here, we can do this in a whole bunch of ways…

py_run_string(
"
normal_distibution_py = r.normal_distibution

py_index_results = [None]*len(normal_distibution_py)
py_append_results = []
py_dict_results = {}

", convert = FALSE)

py_loop_index <- (
"for i in range(len(normal_distibution_py)):
  py_index_results[i] = normal_distibution_py[i]**2
")

py_loop_append <- (
"for i in normal_distibution_py:
  py_append_results.append(i**2)
")

py_loop_dict <- (
"for i in range(len(normal_distibution_py)):
  py_dict_results[i] = normal_distibution_py[i]**2
")

py_list_comp <- (
"
[x**2 for x in normal_distibution_py]
"
)

… but what runs fastest?

speeds <- mark(
  for(i in seq_along(normal_distibution)) bad_R[i] <- normal_distibution[i] * normal_distibution[i],
  vapply(normal_distibution, function(x) x^2, numeric(1)),
  normal_distibution^2,
  
  py_run_string(py_loop_index, convert = FALSE),
  py_run_string(py_loop_append, convert = FALSE),
  py_run_string(py_loop_dict, convert = FALSE),
  py_run_string(py_list_comp, convert = FALSE),
  
  check = FALSE, iterations = 100
  ) 
Table 1: “Looping” Comparison

            expression                                                          mean      median
Good R      normal_distibution^2                                                2.02us    1.98us
Python      [x**2 for x in normal_distibution_py]                               675.43us  595.75us
Python      for i in normal_distibution_py: py_append_results.append(i**2)      935.9us   864.79us
Python      for i in range(len(normal_distibution_py)):
              py_dict_results[i] = normal_distibution_py[i]**2                  1.2ms     1.11ms
Python      for i in range(len(normal_distibution_py)):
              py_index_results[i] = normal_distibution_py[i]**2                 1.37ms    1.16ms
Not-Good R  vapply(normal_distibution, function(x) x^2, numeric(1))             2.18ms    1.79ms
Bad R       for (i in normal_distibution) bad_R[i] <- normal…                   55.08ms   49.69ms

In these conditions and for this task, we can say two things:

  • All the Python solutions are faster than the poorly-optimized R solutions.
  • The optimized R solution is faster than all the Python solutions.
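One way to take reticulate out of the equation entirely is to time the pure-Python candidates with Python’s own timeit module. A minimal sketch (the data and variable names here are made up for illustration; absolute times won’t match Table 1):

```python
# Cross-check the pure-Python timings without the reticulate bridge.
import timeit

data = [0.1 * i for i in range(2500)]

# list comprehension
t_comp = timeit.timeit(lambda: [x ** 2 for x in data], number=100)

# explicit append loop
def append_loop():
    out = []
    for x in data:
        out.append(x ** 2)
    return out

t_append = timeit.timeit(append_loop, number=100)

print(f"list comp: {t_comp:.4f}s, append loop: {t_append:.4f}s")
```

The relative ordering (comprehension ahead of the append loop) should broadly agree with the table above, even though the machinery differs.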

That said, there are issues with this test.

Are we really testing the same thing?

In terms of the exact steps the computer takes to crunch the numbers? No, but that’s not a very realistic or useful standard.

In terms of reaching a desired result? Ignoring that pure Python list()s are not inherently homogeneous, yes.

py_run_string("py_append_results = []")
py_run_string(py_loop_append)
all.equal(good_R, py$py_append_results)
## [1] TRUE
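It’s also worth noting that the genuinely idiomatic Python counterpart to R’s vectorized ^2 is not a loop at all but a numpy array, which, like an R vector, is a homogeneous collection. A quick sketch, separate from the benchmark above (numpy is already in the conda environment used later):

```python
# numpy arrays are homogeneous like R vectors, so the idiomatic Python
# analogue of `normal_distibution^2` is also vectorized.
import numpy as np

arr = np.random.normal(size=2500)
vectorized = arr ** 2           # no explicit loop
looped = [x ** 2 for x in arr]  # the list-comprehension version

# same values, element for element
assert np.allclose(vectorized, looped)
```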

Is running the Python code through R’s reticulate actually fair?

Is it less fair than running rpy2 in Python? After running all these tests, I’d say that reticulate is fairer.

Is this even a good task to compare performance?

Based on the number of articles including a similar test, you’d almost think so. I don’t entirely agree, as that’s a bit reductionist: the R solution is only the variable followed by literally two characters: ^2.

But, I do think it serves as a great example of fundamental differences in the languages.

Considering the above results and the simplicity of the good R solution, it illustrates how easily arbitrary handicaps can be placed on the R code, which you’ll find in many of these “language war” articles. I hope that’s simply due to ignorant assumptions, but if so, the author shouldn’t be writing an article claiming authority.

While there are articles that do make a point of notifying the reader that the tests are lacking, some sell the results as gospel anyway. Others seem to dismiss the merits of rigor entirely.

In a field referred to as “Data Science”, the mountain of articles discussing such poor metrics is concerning. Consider how many newcomers seem to use them when choosing a language in which to invest their time, and often money. (BTW the answer is both, but get great at one before tackling the other).

With that in mind, what would an objective comparison look like?

Here’s a barrage of tests applied to a task that’s both common in practice and common in these “language war” tests: reading a .csv file to a data frame. This is a task for which many articles assert Python’s superiority, despite the evidence here and elsewhere.

However, the real goal is to experiment with methods that can be used to make future tests involving less trivial tasks more objective and thus more useful to everyone.

I also think it’s a cool demonstration of some RStudio and {reticulate} sweetness. I hope it spurs some interest in how awesome a multilingual workflow can be.

If you want to skip a pile of monotonous code, go ahead and jump to the results.

Otherwise, the entire workflow is here to scrutinize…

library(bench)
library(kableExtra); options(knitr.kable.NA = "")
library(scales)
library(tidyverse)

Reproducible Python Environment

library(reticulate)
conda_create("r-py-benchmarks", c("python=3.6", "numpy", "pandas"))
use_condaenv("r-py-benchmarks", required = TRUE)

The Data

The data come from a neutral third party in the form of a .csv, which can be obtained from Majestic Million CSV.

Download and Read Data Set

file_url <- "http://downloads.majestic.com/majestic_million.csv"
temp_file <- tempfile(fileext = ".csv")

download.file(file_url, destfile = temp_file)

test_df <- read_csv(temp_file)

Quick Inspection

glimpse(test_df)
## Observations: 1,000,000
## Variables: 12
## $ GlobalRank      1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ TldRank         1, 2, 3, 4, 5, 6, 1, 7, 8, 9, 2, 10, 3, 11, 12,...
## $ Domain          "google.com", "facebook.com", "youtube.com", "t...
## $ TLD             "com", "com", "com", "com", "com", "com", "org"...
## $ RefSubNets      463232, 451237, 410764, 409068, 303679, 292966,...
## $ RefIPs          2963708, 3046847, 2444016, 2546940, 1139322, 13...
## $ IDN_Domain      "google.com", "facebook.com", "youtube.com", "t...
## $ IDN_TLD         "com", "com", "com", "com", "com", "com", "org"...
## $ PrevGlobalRank  1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ PrevTldRank     1, 2, 3, 4, 5, 6, 1, 7, 8, 9, 2, 10, 3, 11, 12,...
## $ PrevRefSubNets  462861, 451086, 410676, 408692, 303296, 292918,...
## $ PrevRefIPs      2966284, 3049605, 2447455, 2549623, 1138675, 13...
test_df %>%
  summarise_all(funs(sum(is.na(.)))) %>% # where the NAs at?
  gather(Variable, NAs) %>% 
  kable() %>% 
  kable_styling(full_width = FALSE)
Variable NAs
GlobalRank 0
TldRank 0
Domain 0
TLD 0
RefSubNets 0
RefIPs 0
IDN_Domain 0
IDN_TLD 0
PrevGlobalRank 0
PrevTldRank 0
PrevRefSubNets 0
PrevRefIPs 0

Write the .csv Files

Each subset below is written to its own file via readr::write_csv().

Small

The “small” .csv consists of the first 100 rows.

(small_df <- test_df %>% 
  slice(1:100))
## # A tibble: 100 x 12
##    GlobalRank TldRank Domain TLD   RefSubNets RefIPs IDN_Domain IDN_TLD
##                                
##  1          1       1 googl~ com       463232 2.96e6 google.com com    
##  2          2       2 faceb~ com       451237 3.05e6 facebook.~ com    
##  3          3       3 youtu~ com       410764 2.44e6 youtube.c~ com    
##  4          4       4 twitt~ com       409068 2.55e6 twitter.c~ com    
##  5          5       5 micro~ com       303679 1.14e6 microsoft~ com    
##  6          6       6 linke~ com       292966 1.35e6 linkedin.~ com    
##  7          7       1 wikip~ org       287420 1.24e6 wikipedia~ org    
##  8          8       7 plus.~ com       284103 1.46e6 plus.goog~ com    
##  9          9       8 insta~ com       277145 1.37e6 instagram~ com    
## 10         10       9 apple~ com       276152 1.05e6 apple.com  com    
## # ... with 90 more rows, and 4 more variables: PrevGlobalRank ,
## #   PrevTldRank , PrevRefSubNets , PrevRefIPs 
(small_rows <- nrow(small_df)) %>% comma() %>% cat("rows")
## 100 rows
path_small_csv <- "test-data/small_csv.csv"
write_csv(small_df, path_small_csv)

Medium

The “medium” .csv consists of the first 5,000 rows.

(medium_df <- test_df %>% 
  slice(1:5000))
## # A tibble: 5,000 x 12
##    GlobalRank TldRank Domain TLD   RefSubNets RefIPs IDN_Domain IDN_TLD
##                                
##  1          1       1 googl~ com       463232 2.96e6 google.com com    
##  2          2       2 faceb~ com       451237 3.05e6 facebook.~ com    
##  3          3       3 youtu~ com       410764 2.44e6 youtube.c~ com    
##  4          4       4 twitt~ com       409068 2.55e6 twitter.c~ com    
##  5          5       5 micro~ com       303679 1.14e6 microsoft~ com    
##  6          6       6 linke~ com       292966 1.35e6 linkedin.~ com    
##  7          7       1 wikip~ org       287420 1.24e6 wikipedia~ org    
##  8          8       7 plus.~ com       284103 1.46e6 plus.goog~ com    
##  9          9       8 insta~ com       277145 1.37e6 instagram~ com    
## 10         10       9 apple~ com       276152 1.05e6 apple.com  com    
## # ... with 4,990 more rows, and 4 more variables: PrevGlobalRank ,
## #   PrevTldRank , PrevRefSubNets , PrevRefIPs 
(med_rows <- nrow(medium_df)) %>% comma() %>% cat("rows")
## 5,000 rows
path_medium_csv <- "test-data/medium_csv.csv"
write_csv(medium_df, path_medium_csv)

Big

The “big” .csv stacks all 1,000,000 rows five times, creating a 5,000,000 row .csv.

(big_df <- test_df %>% 
  rerun(.n = 5) %>% 
  bind_rows())
## # A tibble: 5,000,000 x 12
##    GlobalRank TldRank Domain TLD   RefSubNets RefIPs IDN_Domain IDN_TLD
##                                
##  1          1       1 googl~ com       463232 2.96e6 google.com com    
##  2          2       2 faceb~ com       451237 3.05e6 facebook.~ com    
##  3          3       3 youtu~ com       410764 2.44e6 youtube.c~ com    
##  4          4       4 twitt~ com       409068 2.55e6 twitter.c~ com    
##  5          5       5 micro~ com       303679 1.14e6 microsoft~ com    
##  6          6       6 linke~ com       292966 1.35e6 linkedin.~ com    
##  7          7       1 wikip~ org       287420 1.24e6 wikipedia~ org    
##  8          8       7 plus.~ com       284103 1.46e6 plus.goog~ com    
##  9          9       8 insta~ com       277145 1.37e6 instagram~ com    
## 10         10       9 apple~ com       276152 1.05e6 apple.com  com    
## # ... with 4,999,990 more rows, and 4 more variables:
## #   PrevGlobalRank , PrevTldRank , PrevRefSubNets ,
## #   PrevRefIPs 
(big_rows <- nrow(big_df)) %>% comma() %>% cat("rows")
## 5,000,000 rows
path_big_csv <- "test-data/big_csv.csv"
write_csv(big_df, path_big_csv)

The Code

The following steps were taken to “standardize” code.

  • R and Python functions:
    1. File paths are assigned to a "*_csv.csv" variable.
    2. The column data types are identified ahead of time via a *_col_specs variable in order to maximize read speed. In future tests, it would be interesting to skip this step.
      • All “numeric” data are read as double via:
        • "double" for utils::read.csv() and data.table::fread()
        • readr::col_double() for readr::read_csv()
        • float for pandas.read_csv()
      • This is to standardize numeric usage as my understanding is that both R’s doubles and Python’s floats are doubles in the underlying C code. It also prevents the need to import numpy in every call to a Python script. If this is incorrect, don’t hesitate to say so.
    3. The function assigns the result to an internal df variable.
    4. The function explicitly return()s the data frame.
  • .R and .py Script Execution:
    • .R scripts are called via system() instead of source() as source() appeared to offer a potentially unfair advantage.
    • Similarly, .py scripts were tested via system(), reticulate::py_run_file(), and reticulate::py_run_string() instead of reticulate::source_python(), to minimize the amount of steps required for execution and minimize potential handicaps.
  • .R and .py Script Code:
    1. Relevant packages are loaded via R’s library() or Python’s import.
    2. File paths are assigned to a "*_csv.csv" variable.
    3. The column data types are identified ahead of time via a *_col_specs variable.
      • All “numeric” data are read as doubles.
    4. Data frames are assigned to a variable upon reading the file.
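The double/float assumption in step 2 is easy to verify from the Python side: CPython’s float is a C double (IEEE 754 binary64), matching R’s numeric type. A quick check:

```python
# CPython floats are C doubles: 53-bit significand, 11-bit exponent.
import sys

assert sys.float_info.mant_dig == 53   # double-precision significand
assert sys.float_info.max_exp == 1024  # double-precision exponent range
print(sys.float_info)
```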
inspect_script <- function(path) {
  url_base <-  "https://github.com/syknapptic/syknapptic/tree/master/content/post/"
  contents <- read_lines(path)
  cat("File available at", paste0(url_base, path), "\n")
  cat("```\n")
  cat("# ", path, " ", rep("=", (80 - nchar(path) - 2)), "\n", sep = "")
  contents %>% walk(cat, "\n")
  cat("```\n\n")
}

R

“Base” – utils::read.csv()

Local R Function

base_col_specs <- c("double", "double", "character",
                    "character", "double", "double",
                    "character", "character", "double",
                    "double", "double", "double")

base_test <- function(path) {
  df <- read.csv(file = path, colClasses = base_col_specs)
  
  return(df)
}

Scripts to Source by Operating System via system()

c("r/base_test_small.R", "r/base_test_med.R", "r/base_test_big.R") %>% 
  walk(inspect_script)

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/base_test_small.R

# r/base_test_small.R ===========================================================
path_small_csv <- "test-data/small_csv.csv" 
 
base_col_specs <- c("double", "double", "character", 
                    "character", "double", "double", 
                    "character", "character", "double", 
                    "double", "double", "double") 
 
df <- read.csv(file = path_small_csv, colClasses = base_col_specs) 

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/base_test_med.R

# r/base_test_med.R =============================================================
path_medium_csv <- "test-data/medium_csv.csv" 
 
base_col_specs <- c("double", "double", "character", 
                    "character", "double", "double", 
                    "character", "character", "double", 
                    "double", "double", "double") 
 
df <- read.csv(file = path_medium_csv, colClasses = base_col_specs) 

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/base_test_big.R

# r/base_test_big.R =============================================================
path_big_csv <- "test-data/big_csv.csv" 
 
base_col_specs <- c("double", "double", "character", 
                    "character", "double", "double", 
                    "character", "character", "double", 
                    "double", "double", "double") 
 
df <- read.csv(file = path_big_csv, colClasses = base_col_specs) 

readr::read_csv()

Local R Function

library(readr)

readr_col_specs <- list(col_double(), col_double(), col_character(),
                        col_character(), col_double(), col_double(),
                        col_character(), col_character(), col_double(),
                        col_double(), col_double(), col_double())

readr_test <- function(path) {
  df <- read_csv(file = path, col_types = readr_col_specs)
  
  return(df)
}

Scripts to Source by Operating System via system()

c("r/readr_test_small.R", "r/readr_test_med.R", "r/readr_test_big.R") %>% 
  walk(inspect_script)

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/readr_test_small.R

# r/readr_test_small.R ==========================================================
library(readr) 
 
path_small_csv <- "test-data/small_csv.csv" 
 
readr_col_specs <- list(col_double(), col_double(), col_character(), 
                        col_character(), col_double(), col_double(), 
                        col_character(), col_character(), col_double(), 
                        col_double(), col_double(), col_double()) 
 
df <- read_csv(file = path_small_csv, col_types = readr_col_specs) 

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/readr_test_med.R

# r/readr_test_med.R ============================================================
library(readr) 
 
path_medium_csv <- "test-data/medium_csv.csv" 
 
readr_col_specs <- list(col_double(), col_double(), col_character(), 
                        col_character(), col_double(), col_double(), 
                        col_character(), col_character(), col_double(), 
                        col_double(), col_double(), col_double()) 
 
df <- read_csv(file = path_medium_csv, col_types = readr_col_specs) 

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/readr_test_big.R

# r/readr_test_big.R ============================================================
library(readr) 
 
path_big_csv <- "test-data/big_csv.csv" 
 
readr_col_specs <- list(col_double(), col_double(), col_character(), 
                        col_character(), col_double(), col_double(), 
                        col_character(), col_character(), col_double(), 
                        col_double(), col_double(), col_double()) 
 
df <- read_csv(file = path_big_csv, col_types = readr_col_specs) 

data.table::fread()

Local R Function

library(data.table)

datatable_col_specs <- c("double", "double", "character",
                         "character", "double", "double",
                         "character", "character", "double",
                         "double", "double", "double")

datatable_test <- function(path) {
  df <- fread(file = path, colClasses = datatable_col_specs)
  
  return(df)
}

Scripts to Source by Operating System via system()

c("r/datatable_test_small.R", "r/datatable_test_med.R", "r/datatable_test_big.R") %>% 
  walk(inspect_script)

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/datatable_test_small.R

# r/datatable_test_small.R ======================================================
library(data.table) 
 
path_small_csv <- "test-data/small_csv.csv" 
 
datatable_col_specs <- c("double", "double", "character", 
                         "character", "double", "double", 
                         "character", "character", "double", 
                         "double", "double", "double") 
 
df <- fread(file = path_small_csv, colClasses = datatable_col_specs) 

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/datatable_test_med.R

# r/datatable_test_med.R ========================================================
library(data.table) 
 
path_medium_csv <- "test-data/medium_csv.csv" 
 
datatable_col_specs <- c("double", "double", "character", 
                         "character", "double", "double", 
                         "character", "character", "double", 
                         "double", "double", "double") 
 
df <- fread(file = path_medium_csv, colClasses = datatable_col_specs) 

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/datatable_test_big.R

# r/datatable_test_big.R ========================================================
library(data.table) 
 
path_big_csv <- "test-data/big_csv.csv" 
 
datatable_col_specs <- c("double", "double", "character", 
                         "character", "double", "double", 
                         "character", "character", "double", 
                         "double", "double", "double") 
 
df <- fread(file = path_big_csv, colClasses = datatable_col_specs) 

Python

pandas.read_csv()

Local Python Function

import pandas
path_small_csv = 'test-data/small_csv.csv'
path_medium_csv = 'test-data/medium_csv.csv'
path_big_csv = 'test-data/big_csv.csv'
pandas_col_specs = {
  'GlobalRank':float, 'TldRank':float, 'Domain':str,
  'TLD':str, 'RefSubNets':float, 'RefIPs':float,
  'IDN_Domain':str, 'IDN_TLD':str, 'PrevGlobalRank':float,
  'PrevTldRank':float, 'PrevRefSubNets':float, 'PrevRefIPs':float
  }
def pandas_test_small():
  df = pandas.read_csv(filepath_or_buffer = path_small_csv,
                dtype = pandas_col_specs, low_memory = False)
  return(df)
  
def pandas_test_medium():
  df = pandas.read_csv(filepath_or_buffer = path_medium_csv,
                dtype = pandas_col_specs, low_memory = False)
  return(df)
  
def pandas_test_big():
  df = pandas.read_csv(filepath_or_buffer = path_big_csv,
                dtype = pandas_col_specs, low_memory = False)
  return(df)
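The dtype-spec pattern these functions rely on can be demonstrated on an in-memory CSV without touching the test files (the two-row CSV and the column subset here are made up for illustration):

```python
import io
import pandas

# tiny stand-in for the real files, using a subset of the real columns
csv_text = "GlobalRank,Domain\n1,google.com\n2,facebook.com\n"
specs = {"GlobalRank": float, "Domain": str}

df = pandas.read_csv(io.StringIO(csv_text), dtype=specs)

# GlobalRank is parsed straight to float64 rather than inferred as int64,
# mirroring the "all numerics as doubles" standardization above
assert list(df.dtypes.astype(str)) == ["float64", "object"]
```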

Scripts to Source via system() and reticulate::py_run_file(..., convert = FALSE)

c("py/pandas_test_small.py", "py/pandas_test_med.py", "py/pandas_test_big.py") %>% 
  walk(inspect_script)

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/py/pandas_test_small.py

# py/pandas_test_small.py =======================================================
import pandas 
 
path_small_csv = 'test-data/small_csv.csv' 
   
pandas_col_specs = { 
  'GlobalRank':float, 'TldRank':float, 'Domain':str, 
  'TLD':str, 'RefSubNets':float, 'RefIPs':float, 
  'IDN_Domain':str, 'IDN_TLD':str, 'PrevGlobalRank':float, 
  'PrevTldRank':float, 'PrevRefSubNets':float, 'PrevRefIPs':float 
  } 
 
df = pandas.read_csv(filepath_or_buffer = path_small_csv, 
                     dtype = pandas_col_specs, low_memory = False) 

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/py/pandas_test_med.py

# py/pandas_test_med.py =========================================================
import pandas 
 
path_medium_csv = 'test-data/medium_csv.csv' 
 
pandas_col_specs = { 
  'GlobalRank':float, 'TldRank':float, 'Domain':str, 
  'TLD':str, 'RefSubNets':float, 'RefIPs':float, 
  'IDN_Domain':str, 'IDN_TLD':str, 'PrevGlobalRank':float, 
  'PrevTldRank':float, 'PrevRefSubNets':float, 'PrevRefIPs':float 
  } 
 
df = pandas.read_csv(filepath_or_buffer = path_medium_csv, 
                     dtype = pandas_col_specs, low_memory = False) 

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/py/pandas_test_big.py

# py/pandas_test_big.py =========================================================
import pandas 
 
path_big_csv = 'test-data/big_csv.csv' 
 
pandas_col_specs = { 
  'GlobalRank':float, 'TldRank':float, 'Domain':str, 
  'TLD':str, 'RefSubNets':float, 'RefIPs':float, 
  'IDN_Domain':str, 'IDN_TLD':str, 'PrevGlobalRank':float, 
  'PrevTldRank':float, 'PrevRefSubNets':float, 'PrevRefIPs':float 
  } 
 
df = pandas.read_csv(filepath_or_buffer = path_big_csv, 
                     dtype = pandas_col_specs, low_memory = False) 

reticulate::py_run_string(..., convert = FALSE)

py_run_string(
"
import pandas

path_small_csv = 'test-data/small_csv.csv'
path_medium_csv = 'test-data/medium_csv.csv'
path_big_csv = 'test-data/big_csv.csv'

pandas_col_specs = {
  'GlobalRank':float, 'TldRank':float, 'Domain':str,
  'TLD':str, 'RefSubNets':float, 'RefIPs':float,
  'IDN_Domain':str, 'IDN_TLD':str, 'PrevGlobalRank':float,
  'PrevTldRank':float, 'PrevRefSubNets':float, 'PrevRefIPs':float
  }

def retic_pandas_test_small():
  df = pandas.read_csv(filepath_or_buffer = path_small_csv,
                dtype = pandas_col_specs, low_memory = False)
  return(df)

def retic_pandas_test_medium():
  df = pandas.read_csv(filepath_or_buffer = path_medium_csv,
                dtype = pandas_col_specs, low_memory = False)
  return(df)

def retic_pandas_test_big():
  df = pandas.read_csv(filepath_or_buffer = path_big_csv,
                dtype = pandas_col_specs, low_memory = False)
  return(df)

", convert = FALSE
)

Dependencies Only

c("r/test_load_readr.R", "r/test_load_datatable.R", "py/test_load_pandas.py") %>% 
  walk(inspect_script)

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/test_load_readr.R

# r/test_load_readr.R ===========================================================
library(readr) 

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/test_load_datatable.R

# r/test_load_datatable.R =======================================================
library(data.table) 

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/py/test_load_pandas.py

# py/test_load_pandas.py ========================================================
import pandas 

The Test

100 iterations were run to provide a reasonable balance between rigor and compute time.

n_iterations <- 100
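One reason to look at medians (as the tables here do) rather than means alone: with only 100 iterations, a handful of GC pauses or OS hiccups can drag the mean well away from typical performance, while the median shrugs them off. A toy illustration with made-up timings:

```python
# 95 "normal" runs at 1 ms plus 5 outliers at 50 ms (e.g. GC pauses)
import statistics

times_ms = [1.0] * 95 + [50.0] * 5

assert statistics.median(times_ms) == 1.0  # unaffected by the outliers
assert statistics.mean(times_ms) == 3.45   # dragged up ~3.5x
```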

All the code was tested via the {bench} package and its bench::mark() function. This package was selected over alternatives simply as a chance to take it for a test drive.

The convert argument of the reticulate::py_run_string() and reticulate::py_run_file() calls is set to FALSE to minimize any handicap.

results <- mark(
  base_test(path_small_csv),
  readr_test(path_small_csv),
  datatable_test(path_small_csv),
  system("Rscript r/base_test_small.R"),
  system("Rscript r/readr_test_small.R"),
  system("Rscript r/datatable_test_small.R"),
  
  py$pandas_test_small(),
  py_run_string("retic_pandas_test_small()", convert = FALSE),
  py_run_file("py/pandas_test_small.py", convert = FALSE),
  system("python py/pandas_test_small.py"),

  base_test(path_medium_csv),
  readr_test(path_medium_csv),
  datatable_test(path_medium_csv),
  system("Rscript r/base_test_med.R"),
  system("Rscript r/readr_test_med.R"),
  system("Rscript r/datatable_test_med.R"),
  
  py$pandas_test_medium(),
  py_run_string("retic_pandas_test_medium()", convert = FALSE),
  py_run_file("py/pandas_test_med.py", convert = FALSE),
  system("python py/pandas_test_med.py"),

  base_test(path_big_csv),
  readr_test(path_big_csv),
  datatable_test(path_big_csv),
  system("Rscript r/base_test_big.R"),
  system("Rscript r/readr_test_big.R"),
  system("Rscript r/datatable_test_big.R"),

  py$pandas_test_big(),
  py_run_string("retic_pandas_test_big()", convert = FALSE),
  py_run_file("py/pandas_test_big.py", convert = FALSE),
  system("python py/pandas_test_big.py"),

  check = FALSE, filter_gc = FALSE, iterations = n_iterations
  )
package_results <- mark(
  system("Rscript r/test_load_readr.R"),
  system("Rscript r/test_load_datatable.R"),
  system("python py/test_load_pandas.py"),
  
  check = FALSE, filter_gc = FALSE, iterations = n_iterations
)

Initial Carpentry

package_results_df <- package_results %>% 
  unnest() %>% 
  mutate(package = case_when(
    str_detect(expression, "datatable") ~ "data.table",
    str_detect(expression, "readr") ~ "readr",
    str_detect(expression, "pandas") ~ "pandas"
  )) %>% 
  mutate(call = case_when(
    package == "data.table" ~ "library(data.table)",
    package == "readr" ~ "library(readr)",
    package == "pandas" ~ "import pandas"
  ))

package_medians_df <- package_results_df %>% 
  rename(median_package = median, min_package = min, max_package = max) %>%
  distinct(package, median_package, min_package, max_package) %>% 
  add_row(median_package = bench_time(0), package = "utils")
all_exprs <- results$expression
system_calls <- all_exprs %>% str_subset("^system\\(")
local_r_fun_calls <- all_exprs %>% str_subset("^(base|readr|datatable)_test\\(")
python_eng_calls <- all_exprs %>% str_subset("^py\\$")
reticulate_calls <- all_exprs %>% str_subset("py_run")
knitr_calls <- c(local_r_fun_calls, python_eng_calls, reticulate_calls)

results_df <- results %>%
  unnest() %>%
  mutate(package = case_when(
    str_detect(expression, "datatable") ~ "data.table",
    str_detect(expression, "readr") ~ "readr",
    str_detect(expression, "pandas") ~ "pandas",
    TRUE ~ "utils"
  )) %>% 
  mutate(call = case_when(
    str_detect(expression, "base") ~ "utils::read.csv()",
    str_detect(expression, "readr") ~ "readr::read_csv()",
    str_detect(expression, "datatable") ~ "data.table::fread()",
    str_detect(expression, "py_run_string") ~ "reticulate::py_run_string()",
    str_detect(expression, "py_run_file") ~ "reticulate::py_run_file()",
    str_detect(expression, "pandas") ~ "pandas.read_csv()"
      ) %>%
      str_pad(max(nchar(.)), side = "right") # enforce left alignment in plots
    ) %>%
  mutate(execution_type = case_when(
    expression %in% system_calls ~ "Sourced Script",
    expression %in% knitr_calls ~ "knitr Engine"
    )) %>%
  mutate(dependency_status = case_when(
    expression %in% system_calls ~ "Dependencies Loaded on Execution (Sourced Script)",
    expression %in% knitr_calls ~ "Dependencies Pre-Loaded")) %>% 
  mutate(lang = if_else(str_detect(expression, "pandas"), "Python", "R")) %>%
  mutate(file_size = str_extract(expression, "small|med|big")) %>%
  mutate(rows = case_when(
    file_size == "small" ~ small_rows,
    file_size == "med" ~ med_rows,
    file_size == "big" ~ big_rows
    )) %>% 
  left_join(package_medians_df, by = "package")

gg_df <- results_df %>%
  mutate(n_rows = rows) %>% 
  arrange(rows) %>%
  mutate(rows = rows %>%
           comma() %>%
           paste("Rows") %>%
           as_factor()
        ) %>%
  group_by(expression) %>% 
  mutate(med_time = as.numeric(median(time))) %>% 
  ungroup() %>% 
  arrange(desc(med_time)) %>%
  mutate(call = as_factor(call)) %>%
  arrange(desc(lang)) %>%
  mutate(lang = as_factor(lang))

The Results

theme_simple <- function(pnl_ln_col = "black", line_type = "dotted", cap_size = 10,
                         facet = NULL, ...) {
  theme_minimal(15, "serif") +
  theme(legend.title = element_blank(), 
        legend.text = element_text(size = 12),
        legend.position = "top",
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(colour = pnl_ln_col, linetype = line_type),
        legend.key.size = unit(1.5, "lines"), 
        axis.text.y = element_text("mono", face = "bold", hjust = 0, size = 12),
        plot.caption = element_text(size = cap_size),
        ...)
}

prep_lab <- function(lab) {
  lab <- substitute(lab)
  bquote(italic(paste("   ", .(lab), "   ")))
}

t_R <- prep_lab(t[R])
t_Python <- prep_lab(t[Python])
t_import_pandas <- prep_lab(t[Python]~-~max~group("(",t[import~~pandas],")"))

plot_times <- function(df, ...) {
  plot_init <- df %>%
    ggplot(aes(call, time)) +
    stat_ydensity(aes(fill = lang, color = lang), scale = "width", bw = 0.01, trim = FALSE) +
    scale_fill_manual(values = c("#165CAA", "#ffde57"), labels = c(t_R, t_Python)) +
    scale_color_manual(values = c("#BFC2C5", "#4584b6"), labels = c(t_R, t_Python)) +
    coord_flip() +
    theme_simple()
  if (length(vars(...))) {
    n_rows <- sort(df$n_rows, decreasing = TRUE)[[1]]
    plot_fin <- plot_init + 
      facet_wrap(vars(...), ncol = 1, scales = "free") +
      labs(x = NULL, y = "Execution Time", 
           title = str_glue("CSV to Data Frame: {comma(n_rows)} Rows"),
           caption = str_glue("{n_iterations} iterations"))
  } else {
    plot_fin <- plot_init + 
      labs(x = NULL, y = "Execution Time", title = "Dependency Load Times",
           caption = str_glue("{n_iterations} iterations")) +
      geom_text(aes(y = median, label = paste("Median Time:", median)), 
                color = "darkgreen", nudge_x = 0.515)
  }
  
  plot_fin
}

Execution Times

At 100 rows, R is faster, with base R’s utils::read.csv() finishing first.

gg_df %>%
  filter(file_size == "small") %>% 
  plot_times(facet = dependency_status)

At 5,000 rows, R is still faster. In the sourced scripts, pandas.read_csv() has nearly caught up with utils::read.csv(), but data.table::fread() has pulled away.

gg_df %>%
  filter(file_size == "med") %>% 
  plot_times(facet = dependency_status)

At 5,000,000 rows, we’ve reached a size where the time differences would actually be noticeable.

The advantage of utils::read.csv()’s lack of dependencies has run its course, and pandas.read_csv() is faster in nearly every case.

That said, readr::read_csv() is still faster than pandas.read_csv() and, as most R users would expect, data.table::fread() is by far the fastest.

gg_df %>%
  filter(file_size == "big") %>% 
  plot_times(facet = dependency_status)
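The median-of-repeated-runs methodology behind these comparisons can be sketched with Python’s standard library alone. This is not the harness used for the results above (those came from bench::mark and sourced scripts); the file size, column count, and repeat count here are illustrative assumptions:

```python
import csv
import os
import statistics
import tempfile
import timeit

# Write a small throwaway CSV (100 rows x 3 columns) to time against.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["a", "b", "c"])
    for i in range(100):
        writer.writerow([i, i * 2, i * 3])

def parse():
    # Stand-in for pandas.read_csv(): parse the whole file into memory.
    with open(path, newline="") as f:
        return list(csv.reader(f))

# Repeat the timing and summarize with the median, echoing the
# median-based comparisons in the plots above.
times = timeit.repeat(parse, number=1, repeat=25)
print(f"median parse time: {statistics.median(times):.6f}s")
print(f"rows parsed (incl. header): {len(parse())}")
```

The same pattern scales to the real contenders: swap `parse` for a closure around pandas.read_csv() (or, on the R side, use bench::mark, which additionally reports memory allocation and garbage collections).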

tl;dr

gg_df %>% 
  mutate(dependency_status = dependency_status %>% 
           str_remove("\\s\\(.*$") %>% 
           str_replace("Loaded on", "Loaded\non")
         ) %>% 
  ggplot(aes(call, time)) +
    stat_ydensity(aes(fill = lang, color = lang), scale = "width", bw = 0.01, 
                  trim = FALSE) +
    scale_fill_manual(values = c("#165CAA", "#ffde57"), labels = c(t_R, t_Python)) +
    scale_color_manual(values = c("#BFC2C5", "#4584b6"), labels = c(t_R, t_Python)) +
    coord_flip() +
  theme_simple(pnl_ln_col = "gray") +
  theme(axis.text = element_text(size = 8), strip.text  = element_text(size = 12),
        strip.text.y  = element_text(face = "bold", size = 15),
        panel.background = element_rect(fill = "transparent", size = 0.5)) +
  facet_grid(rows ~ dependency_status, scales = "free", switch = "y", space = "free") +
  labs(x = NULL, y = "Time", title = "R vs Python - CSV to Data Frame",
       caption = "12 columns, 100 iterations each")

Appendices

Dependency Load Times

package_results_df %>%
  mutate(lang = if_else(str_detect(expression, "pandas"), "Python", "R")) %>% 
  arrange(desc(lang)) %>%
  mutate(lang = as_factor(lang)) %>% 
  plot_times()

gg_df %>% 
  filter(dependency_status == "Dependencies Loaded on Execution (Sourced Script)") %>%  
  filter(file_size == "big") %>% 
  mutate(adjusted_time = if_else(lang == "Python", time - max_package, NA_real_))  %>% 
  rename(original_time = time) %>% 
  gather(time_type, time, original_time, adjusted_time) %>% 
  drop_na(time) %>% 
  mutate(descrip = case_when(
    lang == "R" ~ "Original R Time",
    lang == "Python" & time_type == "original_time" ~ "Original Python Time",
    lang == "Python" & time_type == "adjusted_time" ~ "Adjusted Python Time"
    )) %>% 
  arrange(desc(descrip)) %>% 
  mutate(descrip = as_factor(descrip)) %>% 
  ggplot(aes(call, time, fill = descrip)) +
  stat_ydensity(width = 1, size = 0, color = "transparent", scale = "width", bw = 0.01,
                trim = FALSE) +
  scale_fill_manual(values = c("#165CAA", "#ffde57", "#ff9051"), 
                    labels = c(t_R, t_Python, t_import_pandas)) +
  guides(fill = guide_legend(nrow = 3, label.hjust = 0)) +
  coord_flip() +
  theme_simple() +
  labs(x = NULL, y = "Execution Time",
       title = "Comparing Sourced Scripts with Adjusted Python Times",
       caption = str_glue("CSV to Data Frame: {comma(big_rows)} Rows")
      )

Summary Tables

results_df %>%
  select(rows, lang, execution_type, call, mean, median, `itr/sec`, n_gc, mem_alloc) %>% 
  distinct() %>% 
  arrange(rows, desc(lang)) %>%
  mutate(rows = comma(rows), 
         `itr/sec` = round(`itr/sec`, 2),
         n_gc = ifelse(execution_type == "Sourced Script", "unknown", n_gc),
         mem_alloc = ifelse(execution_type == "Sourced Script", "unknown", mem_alloc)) %>% 
  mutate_at(vars(-c(rows, lang)), 
            funs(cell_spec(., background = ifelse(lang == "R", "#f2f2f2", "#edf9ff"),
                              color = ifelse(lang == "R", "#002963", "#809100"))
                )) %>% 
  mutate(lang = lang %>% cell_spec(background = ifelse(lang == "R", "#f2f2f2", "#edf9ff"),
                                   color = ifelse(lang == "R", "#002647", "#809100"))) %>% 
  mutate(n_gc = if_else(str_detect(n_gc, "unknown"), "unknown", n_gc),
         mem_alloc = if_else(str_detect(mem_alloc, "unknown"), "unknown", mem_alloc)) %>%
  rename(garbage_collections = n_gc, language = lang, memory_allocated = mem_alloc) %>%
  rename_all(funs(str_to_title(str_replace(., "_", " ")))) %>%
  kable(caption = "CSV to Data Frame Times", escape = FALSE, digits = 2) %>%
  kable_styling(bootstrap_options = "condensed", font_size = 12) %>% 
  collapse_rows(columns = 1:3, valign = "top")
Table 2: CSV to Data Frame Times
| Rows | Language | Execution Type | Call | Mean | Median | Itr/Sec | Garbage Collections | Memory Allocated |
|---|---|---|---|---|---|---|---|---|
| 100 | R | knitr Engine | utils::read.csv() | 891.85us | 834.96us | 1121.26 | 0 | 364168 |
| 100 | R | knitr Engine | readr::read_csv() | 3.05ms | 2.9ms | 327.96 | 0 | 143680 |
| 100 | R | knitr Engine | data.table::fread() | 1.4ms | 1.36ms | 711.99 | 0 | 276896 |
| 100 | R | Sourced Script | utils::read.csv() | 205.24ms | 200.39ms | 4.87 | unknown | unknown |
| 100 | R | Sourced Script | readr::read_csv() | 548.78ms | 512.73ms | 1.82 | unknown | unknown |
| 100 | R | Sourced Script | data.table::fread() | 322.13ms | 295.43ms | 3.1 | unknown | unknown |
| 100 | Python | knitr Engine | pandas.read_csv() | 74.99ms | 48.48ms | 13.34 | 9 | 1298288 |
| 100 | Python | knitr Engine | reticulate::py_run_string() | 5.02ms | 3.75ms | 199.04 | 1 | 2840 |
| 100 | Python | knitr Engine | reticulate::py_run_file() | 5.78ms | 3.75ms | 173.05 | 0 | 9504 |
| 100 | Python | Sourced Script | pandas.read_csv() | 559.01ms | 530.84ms | 1.79 | unknown | unknown |
| 5,000 | R | knitr Engine | utils::read.csv() | 18.53ms | 18.37ms | 53.98 | 0 | 1975368 |
| 5,000 | R | knitr Engine | readr::read_csv() | 10.17ms | 9.75ms | 98.3 | 0 | 1621744 |
| 5,000 | R | knitr Engine | data.table::fread() | 5ms | 4.73ms | 200.02 | 0 | 675120 |
| 5,000 | R | Sourced Script | utils::read.csv() | 221.48ms | 216.81ms | 4.52 | unknown | unknown |
| 5,000 | R | Sourced Script | readr::read_csv() | 529.75ms | 522.03ms | 1.89 | unknown | unknown |
| 5,000 | R | Sourced Script | data.table::fread() | 296.08ms | 294.28ms | 3.38 | unknown | unknown |
| 5,000 | Python | knitr Engine | pandas.read_csv() | 83.11ms | 67.52ms | 12.03 | 10 | 3058232 |
| 5,000 | Python | knitr Engine | reticulate::py_run_string() | 24.05ms | 24.31ms | 41.58 | 0 | 2840 |
| 5,000 | Python | knitr Engine | reticulate::py_run_file() | 22.14ms | 21.67ms | 45.16 | 0 | 2840 |
| 5,000 | Python | Sourced Script | pandas.read_csv() | 577.44ms | 552.81ms | 1.73 | unknown | unknown |
| 5,000,000 | R | knitr Engine | utils::read.csv() | 19.55s | 19.4s | 0.05 | 168 | 2052818952 |
| 5,000,000 | R | knitr Engine | readr::read_csv() | 7.11s | 7.08s | 0.14 | 107 | 1567378576 |
| 5,000,000 | R | knitr Engine | data.table::fread() | 2.73s | 2.6s | 0.37 | 28 | 665639952 |
| 5,000,000 | R | Sourced Script | utils::read.csv() | 23.25s | 23.07s | 0.04 | unknown | unknown |
| 5,000,000 | R | Sourced Script | readr::read_csv() | 10.06s | 10.05s | 0.1 | unknown | unknown |
| 5,000,000 | R | Sourced Script | data.table::fread() | 3.78s | 3.78s | 0.26 | unknown | unknown |
| 5,000,000 | Python | knitr Engine | pandas.read_csv() | 20.23s | 19.67s | 0.05 | 23 | 1921138232 |
| 5,000,000 | Python | knitr Engine | reticulate::py_run_string() | 13.26s | 13.25s | 0.08 | 0 | 2840 |
| 5,000,000 | Python | knitr Engine | reticulate::py_run_file() | 13.26s | 13.26s | 0.08 | 0 | 2840 |
| 5,000,000 | Python | Sourced Script | pandas.read_csv() | 13.8s | 13.77s | 0.07 | unknown | unknown |
package_results_df %>% 
  mutate(lang = if_else(str_detect(expression, "\\.py"), "Python", "R"),
         `itr/sec` = round(`itr/sec`, 2)) %>% 
  select(lang, call, min, mean, median, max, `itr/sec`) %>% 
  distinct() %>% 
  mutate_at(vars(-lang), 
            funs(cell_spec(., background = ifelse(lang == "R", "#f2f2f2", "#edf9ff"),
                              color = ifelse(lang == "R", "#002963", "#809100"))
            )) %>% 
  mutate(lang = lang %>% cell_spec(background = ifelse(lang == "R", "#f2f2f2", "#edf9ff"),
                                   color = ifelse(lang == "R", "#002647", "#809100"))) %>% 
  rename(language = lang) %>% 
  rename_all(str_to_title) %>% 
  kable(caption = "Dependency Load Times", escape = FALSE, digits = 2) %>%
  kable_styling(bootstrap_options = "condensed", font_size = 12) %>% 
  collapse_rows(columns = 1, valign = "top")
Table 3: Dependency Load Times
| Language | Call | Min | Mean | Median | Max | Itr/Sec |
|---|---|---|---|---|---|---|
| R | library(readr) | 506ms | 510ms | 507ms | 552ms | 1.96 |
| R | library(data.table) | 305ms | 308ms | 305ms | 407ms | 3.25 |
| Python | import pandas | 569ms | 611ms | 607ms | 911ms | 1.64 |
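On the Python side, a module import is cached after the first time, so a cold-load cost like the import pandas row above only shows up in a fresh interpreter per iteration, which is why the sourced-script comparisons matter. A minimal stdlib sketch of that measurement, using json as a stand-in module (swap in pandas if it is installed):

```python
import subprocess
import sys

# Imports are cached in sys.modules, so cold-import time must be
# measured in a brand-new interpreter process each run.
# `json` is a hypothetical stand-in for `pandas` here.
snippet = (
    "import time; t0 = time.perf_counter(); "
    "import json; "
    "print(time.perf_counter() - t0)"
)
out = subprocess.run([sys.executable, "-c", snippet],
                     capture_output=True, text=True, check=True)
cold_import_seconds = float(out.stdout.strip())
print(f"cold import time: {cold_import_seconds:.6f}s")
```

Repeating this and taking the min/median/max would reproduce the shape of Table 3, though absolute numbers depend entirely on the machine and the module being imported.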

Environment

IDE

rstudio_info <- rstudioapi::versionInfo() # obtain in interactive session
write_rds(rstudio_info, "test-data/rstudio_info.rds")
read_rds("test-data/rstudio_info.rds") %>% 
  as_tibble() %>% 
  mutate(IDE = "RStudio") %>% 
  select(IDE, mode, version) %>% 
  mutate(version = as.character(version)) %>% 
  kable() %>% 
  kable_styling(full_width = FALSE)
| IDE | mode | version |
|---|---|---|
| RStudio | desktop | 1.1.453 |

R

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] data.table_1.11.5     bindrcpp_0.2.2        forcats_0.3.0        
##  [4] stringr_1.3.1         dplyr_0.7.6           purrr_0.2.5          
##  [7] readr_1.1.1           tidyr_0.8.1           tibble_1.4.2.9004    
## [10] ggplot2_3.0.0.9000    tidyverse_1.2.1.9000  scales_0.5.0.9000    
## [13] reticulate_1.9.0.9001 kableExtra_0.9.0      bench_1.0.1          
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17      lubridate_1.7.4   lattice_0.20-35  
##  [4] utf8_1.1.4        assertthat_0.2.0  digest_0.6.15    
##  [7] psych_1.8.4       R6_2.2.2          cellranger_1.1.0 
## [10] plyr_1.8.4        evaluate_0.10.1   highr_0.7        
## [13] httr_1.3.1        blogdown_0.7.1    pillar_1.3.0.9000
## [16] rlang_0.2.1       lazyeval_0.2.1    readxl_1.1.0     
## [19] rstudioapi_0.7    Matrix_1.2-14     rmarkdown_1.10.7 
## [22] selectr_0.4-1     foreign_0.8-70    munsell_0.5.0    
## [25] broom_0.4.5       compiler_3.5.1    modelr_0.1.2     
## [28] xfun_0.3          pkgconfig_2.0.1   mnormt_1.5-5     
## [31] htmltools_0.3.6   tidyselect_0.2.4  bookdown_0.7     
## [34] fansi_0.2.3       viridisLite_0.3.0 crayon_1.3.4     
## [37] withr_2.1.2       grid_3.5.1        nlme_3.1-137     
## [40] jsonlite_1.5      gtable_0.2.0      magrittr_1.5     
## [43] cli_1.0.0         stringi_1.2.3     profmem_0.5.0    
## [46] reshape2_1.4.3    xml2_1.2.0        htmldeps_0.1.0   
## [49] tools_3.5.1       glue_1.2.0        hms_0.4.2        
## [52] parallel_3.5.1    yaml_2.1.19       colorspace_1.3-2 
## [55] rvest_0.3.2       knitr_1.20.8      bindr_0.1.1      
## [58] haven_1.1.2

Python

import sys
import numpy
import pandas
print(sys.version)
## 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]
print(numpy.__version__)
## 1.14.5
print(pandas.__version__)
## 0.23.1

System

CPU

cat("CPU:\n", system("wmic cpu get name", intern = TRUE)[[2]])
## CPU:
##  Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz  

Memory

ram_df <- system("wmic MEMORYCHIP get BankLabel, Capacity, Speed", intern = TRUE) %>% 
  str_trim() %>% 
  as_tibble() %>% 
  slice(2:3) %>% 
  separate(value, into = c("BankLabel", "Capacity", "Speed"), sep = "\\s{2,}")

ram_df %>% 
  rename_all(str_replace, "L", " L") %>% 
  kable() %>% 
  kable_styling(full_width = FALSE)
| Bank Label | Capacity | Speed |
|---|---|---|
| DIMM A | 17179869184 | 2400 |
| DIMM B | 17179869184 | 2400 |

ram_df %>% 
  mutate(Capacity = as.numeric(Capacity) / 1e9,
         Speed = as.numeric(Speed)) %>% 
  summarise(`Capacity in GB` = sum(Capacity),
            `Speed in MHz` = unique(Speed)) %>% 
  kable() %>% 
  kable_styling(full_width = FALSE)
| Capacity in GB | Speed in MHz |
|---|---|
| 34.35974 | 2400 |

Storage

cat("SSD:\n", system("wmic diskdrive get Model", intern = TRUE)[[2]])
## SSD:
##  PM951 NVMe SAMSUNG 512GB  
