Establishing Meaningful Performance Comparisons between R and Python

(This article was first published on syknapptic, and kindly contributed to R-bloggers)

R vs Python

Performance comparisons between R and Python suck.

Most seem to be run in Jupyter Notebook, and many use Python’s rpy2 library to run poorly optimized R code. I’m not an anti-for() loop purist (yes, you can use them effectively in R), but thanks to the base::*apply() family and their beautiful purrr::map*() children, there are usually better solutions.

Unfortunately, some of these comparisons arbitrarily test loops in R where you would never, ever do so.

In a language where vectors serve as the fundamental data structure, it makes no sense that code like this receives such prominent treatment in seemingly every test…

normal_distibution <- rnorm(2500)

bad_R <- vector(mode = "numeric", length = length(normal_distibution))

for(i in seq_along(normal_distibution)) {
  bad_R[i] <- normal_distibution[i] * normal_distibution[i]
}

If we had to do something explicitly “loopy”, we’d still probably do something like this…

not_so_good_R <- vapply(normal_distibution, function(x) x^2, numeric(1))

identical(bad_R, not_so_good_R)
## [1] TRUE

… but it’s still taking advantage of the fact that normal_distibution is a homogeneous collection of atomic values: a vector.

all(is.vector(normal_distibution), is.atomic(normal_distibution))
## [1] TRUE

With that in mind, just do this…

good_R <- normal_distibution^2

identical(bad_R, good_R)
## [1] TRUE

In Python, using reticulate here, we can do this in a whole bunch of ways…

py_run_string(
"
normal_distibution_py = r.normal_distibution

py_index_results = [None]*len(normal_distibution_py)
py_append_results = []
py_dict_results = {}

", convert = FALSE)

py_loop_index <- (
"for i in range(len(normal_distibution_py)):
  py_index_results[i] = normal_distibution_py[i]**2
")

py_loop_append <- (
"for i in normal_distibution_py:
  py_append_results.append(i**2)
")

py_loop_dict <- (
"for i in range(len(normal_distibution_py)):
  py_dict_results[i] = normal_distibution_py[i]**2
")

py_list_comp <- (
"
[x**2 for x in normal_distibution_py]
"
)

… but what runs fastest?

speeds <- mark(
  for(i in seq_along(normal_distibution)) bad_R[i] <- normal_distibution[i] * normal_distibution[i],
  vapply(normal_distibution, function(x) x^2, numeric(1)),
  normal_distibution^2,
  
  py_run_string(py_loop_index, convert = FALSE),
  py_run_string(py_loop_append, convert = FALSE),
  py_run_string(py_loop_dict, convert = FALSE),
  py_run_string(py_list_comp, convert = FALSE),
  
  check = FALSE, iterations = 100
  ) 
Table 1: “Looping” Comparison

            expression                                                          mean      median
Good R      normal_distibution^2                                                2.02us    1.98us
Python      [x**2 for x in normal_distibution_py]                               675.43us  595.75us
Python      for i in normal_distibution_py: py_append_results.append(i**2)      935.9us   864.79us
Python      for i in range(len(normal_distibution_py)):
              py_dict_results[i] = normal_distibution_py[i]**2                  1.2ms     1.11ms
Python      for i in range(len(normal_distibution_py)):
              py_index_results[i] = normal_distibution_py[i]**2                 1.37ms    1.16ms
Not-Good R  vapply(normal_distibution, function(x) x^2, numeric(1))             2.18ms    1.79ms
Bad R       for (i in normal_distibution) bad_R[i] <- normal…                   55.08ms   49.69ms

In these conditions and for this task, we can say two things:

  • All the Python solutions are faster than the poorly-optimized R solutions.
  • The optimized R solution is faster than all the Python solutions.
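One way to take reticulate out of the equation entirely is to time the pure-Python candidates with Python’s own timeit module. A minimal sketch (the data and variable names here are made up for illustration; absolute times won’t match Table 1):

```python
# Cross-check the pure-Python timings without the reticulate bridge.
import timeit

data = [0.1 * i for i in range(2500)]

# list comprehension
t_comp = timeit.timeit(lambda: [x ** 2 for x in data], number=100)

# explicit append loop
def append_loop():
    out = []
    for x in data:
        out.append(x ** 2)
    return out

t_append = timeit.timeit(append_loop, number=100)

print(f"list comp: {t_comp:.4f}s, append loop: {t_append:.4f}s")
```

The relative ordering (comprehension ahead of the append loop) should broadly agree with the table above, even though the machinery differs.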

That said, there are issues with this test.

Are we really testing the same thing?

In terms of the exact steps the computer takes to crunch the numbers? No, but that’s not a very realistic or useful standard.

In terms of reaching a desired result? Ignoring that pure Python list()s are not inherently homogeneous, yes.

py_run_string("py_append_results = []")
py_run_string(py_loop_append)
all.equal(good_R, py$py_append_results)
## [1] TRUE
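It’s also worth noting that the genuinely idiomatic Python counterpart to R’s vectorized ^2 is not a loop at all but a numpy array, which, like an R vector, is a homogeneous collection. A quick sketch, separate from the benchmark above (numpy is already in the conda environment used later):

```python
# numpy arrays are homogeneous like R vectors, so the idiomatic Python
# analogue of `normal_distibution^2` is also vectorized.
import numpy as np

arr = np.random.normal(size=2500)
vectorized = arr ** 2           # no explicit loop
looped = [x ** 2 for x in arr]  # the list-comprehension version

# same values, element for element
assert np.allclose(vectorized, looped)
```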

Is running the Python code through R’s reticulate actually fair?

Is it less fair than running rpy2 in Python? After running all these tests, I’d say that reticulate is fairer.

Is this even a good task to compare performance?

Based on the number of articles including a similar test, you’d almost think so. I don’t entirely agree, as that’s a bit reductionist: the R solution is only the variable followed by literally two characters: ^2.

But, I do think it serves as a great example of fundamental differences in the languages.

Considering the above results and the simplicity of the good R solution, it illustrates how easily arbitrary handicaps can be placed on the R code, which you’ll find in many of these “language war” articles. I hope that’s simply due to ignorant assumptions, but if so, the author shouldn’t be writing an article claiming authority.

While there are articles that do make a point of notifying the reader that the tests are lacking, some sell the results as gospel anyway. Others seem to dismiss the merits of rigor entirely.

In a field referred to as “Data Science”, the mountain of articles discussing such poor metrics is concerning. Consider how many newcomers seem to use them when choosing a language in which to invest their time, and often money. (BTW the answer is both, but get great at one before tackling the other).

With that in mind, what would an objective comparison look like?

Here’s a barrage of tests applied to a task that’s both common in practice and common in these “language war” tests: reading a .csv file to a data frame. This is a task for which many articles assert Python’s superiority, despite the evidence here and elsewhere.

However, the real goal is to experiment with methods that can be used to make future tests involving less trivial tasks more objective and thus more useful to everyone.

I also think it’s a cool demonstration of some RStudio and {reticulate} sweetness. I hope it spurs some interest in how awesome a multilingual workflow can be.

If you want to skip a pile of monotonous code, go ahead and jump to the results.

Otherwise, the entire workflow is here to scrutinize…

library(bench)
library(kableExtra); options(knitr.kable.NA = "")
library(scales)
library(tidyverse)

Reproducible Python Environment

library(reticulate)
conda_create("r-py-benchmarks", c("python=3.6", "numpy", "pandas"))
use_condaenv("r-py-benchmarks", required = TRUE)

The Data

The data come from a neutral third party in the form of a .csv, which can be obtained from Majestic Million CSV.

Download and Read Data Set

file_url <- "http://downloads.majestic.com/majestic_million.csv"
temp_file <- tempfile(fileext = ".csv")

download.file(file_url, destfile = temp_file)

test_df <- read_csv(temp_file)

Quick Inspection

glimpse(test_df)
## Observations: 1,000,000
## Variables: 12
## $ GlobalRank      1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ TldRank         1, 2, 3, 4, 5, 6, 1, 7, 8, 9, 2, 10, 3, 11, 12,...
## $ Domain          "google.com", "facebook.com", "youtube.com", "t...
## $ TLD             "com", "com", "com", "com", "com", "com", "org"...
## $ RefSubNets      463232, 451237, 410764, 409068, 303679, 292966,...
## $ RefIPs          2963708, 3046847, 2444016, 2546940, 1139322, 13...
## $ IDN_Domain      "google.com", "facebook.com", "youtube.com", "t...
## $ IDN_TLD         "com", "com", "com", "com", "com", "com", "org"...
## $ PrevGlobalRank  1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ PrevTldRank     1, 2, 3, 4, 5, 6, 1, 7, 8, 9, 2, 10, 3, 11, 12,...
## $ PrevRefSubNets  462861, 451086, 410676, 408692, 303296, 292918,...
## $ PrevRefIPs      2966284, 3049605, 2447455, 2549623, 1138675, 13...
test_df %>%
  summarise_all(funs(sum(is.na(.)))) %>% # where the NAs at?
  gather(Variable, NAs) %>% 
  kable() %>% 
  kable_styling(full_width = FALSE)
Variable NAs
GlobalRank 0
TldRank 0
Domain 0
TLD 0
RefSubNets 0
RefIPs 0
IDN_Domain 0
IDN_TLD 0
PrevGlobalRank 0
PrevTldRank 0
PrevRefSubNets 0
PrevRefIPs 0

Write the .csv Files

Each subset below is written to its own file via readr::write_csv().

Small

The “small” .csv consists of the first 100 rows.

(small_df <- test_df %>% 
  slice(1:100))
## # A tibble: 100 x 12
##    GlobalRank TldRank Domain TLD   RefSubNets RefIPs IDN_Domain IDN_TLD
##                                
##  1          1       1 googl~ com       463232 2.96e6 google.com com    
##  2          2       2 faceb~ com       451237 3.05e6 facebook.~ com    
##  3          3       3 youtu~ com       410764 2.44e6 youtube.c~ com    
##  4          4       4 twitt~ com       409068 2.55e6 twitter.c~ com    
##  5          5       5 micro~ com       303679 1.14e6 microsoft~ com    
##  6          6       6 linke~ com       292966 1.35e6 linkedin.~ com    
##  7          7       1 wikip~ org       287420 1.24e6 wikipedia~ org    
##  8          8       7 plus.~ com       284103 1.46e6 plus.goog~ com    
##  9          9       8 insta~ com       277145 1.37e6 instagram~ com    
## 10         10       9 apple~ com       276152 1.05e6 apple.com  com    
## # ... with 90 more rows, and 4 more variables: PrevGlobalRank ,
## #   PrevTldRank , PrevRefSubNets , PrevRefIPs 
(small_rows <- nrow(small_df)) %>% comma() %>% cat("rows")
## 100 rows
path_small_csv <- "test-data/small_csv.csv"
write_csv(small_df, path_small_csv)

Medium

The “medium” .csv consists of the first 5,000 rows.

(medium_df <- test_df %>% 
  slice(1:5000))
## # A tibble: 5,000 x 12
##    GlobalRank TldRank Domain TLD   RefSubNets RefIPs IDN_Domain IDN_TLD
##                                
##  1          1       1 googl~ com       463232 2.96e6 google.com com    
##  2          2       2 faceb~ com       451237 3.05e6 facebook.~ com    
##  3          3       3 youtu~ com       410764 2.44e6 youtube.c~ com    
##  4          4       4 twitt~ com       409068 2.55e6 twitter.c~ com    
##  5          5       5 micro~ com       303679 1.14e6 microsoft~ com    
##  6          6       6 linke~ com       292966 1.35e6 linkedin.~ com    
##  7          7       1 wikip~ org       287420 1.24e6 wikipedia~ org    
##  8          8       7 plus.~ com       284103 1.46e6 plus.goog~ com    
##  9          9       8 insta~ com       277145 1.37e6 instagram~ com    
## 10         10       9 apple~ com       276152 1.05e6 apple.com  com    
## # ... with 4,990 more rows, and 4 more variables: PrevGlobalRank ,
## #   PrevTldRank , PrevRefSubNets , PrevRefIPs 
(med_rows <- nrow(medium_df)) %>% comma() %>% cat("rows")
## 5,000 rows
path_medium_csv <- "test-data/medium_csv.csv"
write_csv(medium_df, path_medium_csv)

Big

The “big” .csv stacks all 1,000,000 rows five times, creating a 5,000,000 row .csv.

(big_df <- test_df %>% 
  rerun(.n = 5) %>% 
  bind_rows())
## # A tibble: 5,000,000 x 12
##    GlobalRank TldRank Domain TLD   RefSubNets RefIPs IDN_Domain IDN_TLD
##                                
##  1          1       1 googl~ com       463232 2.96e6 google.com com    
##  2          2       2 faceb~ com       451237 3.05e6 facebook.~ com    
##  3          3       3 youtu~ com       410764 2.44e6 youtube.c~ com    
##  4          4       4 twitt~ com       409068 2.55e6 twitter.c~ com    
##  5          5       5 micro~ com       303679 1.14e6 microsoft~ com    
##  6          6       6 linke~ com       292966 1.35e6 linkedin.~ com    
##  7          7       1 wikip~ org       287420 1.24e6 wikipedia~ org    
##  8          8       7 plus.~ com       284103 1.46e6 plus.goog~ com    
##  9          9       8 insta~ com       277145 1.37e6 instagram~ com    
## 10         10       9 apple~ com       276152 1.05e6 apple.com  com    
## # ... with 4,999,990 more rows, and 4 more variables:
## #   PrevGlobalRank , PrevTldRank , PrevRefSubNets ,
## #   PrevRefIPs 
(big_rows <- nrow(big_df)) %>% comma() %>% cat("rows")
## 5,000,000 rows
path_big_csv <- "test-data/big_csv.csv"
write_csv(big_df, path_big_csv)

The Code

The following steps were taken to “standardize” code.

  • R and Python functions:
    1. File paths are assigned to a "*_csv.csv" variable.
    2. The column data types are identified ahead of time via a *_col_specs variable in order to maximize read speed. In future tests, it would be interesting to skip this step.
      • All “numeric” data are read as double via:
        • "double" for utils::read.csv() and data.table::fread()
        • readr::col_double() for readr::read_csv()
        • float for pandas.read_csv()
      • This is to standardize numeric usage as my understanding is that both R’s doubles and Python’s floats are doubles in the underlying C code. It also prevents the need to import numpy in every call to a Python script. If this is incorrect, don’t hesitate to say so.
    3. The function assigns the result to an internal df variable.
    4. The function explicitly return()s the data frame.
  • .R and .py Script Execution:
    • .R scripts are called via system() instead of source() as source() appeared to offer a potentially unfair advantage.
    • Similarly, .py scripts were tested via system(), reticulate::py_run_file(), and reticulate::py_run_string() instead of reticulate::source_python(), to minimize the amount of steps required for execution and minimize potential handicaps.
  • .R and .py Script Code:
    1. Relevant packages are loaded via R’s library() or Python’s import.
    2. File paths are assigned to a "*_csv.csv" variable.
    3. The column data types are identified ahead of time via a *_col_specs variable.
      • All “numeric” data are read as doubles.
    4. Data frames are assigned to a variable upon reading the file.
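The double/float assumption in step 2 is easy to verify from the Python side: CPython’s float is a C double (IEEE 754 binary64), matching R’s numeric type. A quick check:

```python
# CPython floats are C doubles: 53-bit significand, 11-bit exponent.
import sys

assert sys.float_info.mant_dig == 53   # double-precision significand
assert sys.float_info.max_exp == 1024  # double-precision exponent range
print(sys.float_info)
```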
inspect_script <- function(path) {
  url_base <-  "https://github.com/syknapptic/syknapptic/tree/master/content/post/"
  contents <- read_lines(path)
  cat("File available at", paste0(url_base, path), "\n")
  cat("```\n")
  cat("# ", path, " ", rep("=", (80 - nchar(path) - 2)), "\n", sep = "")
  contents %>% walk(cat, "\n")
  cat("```\n\n")
}

R

“Base” – utils::read.csv()

Local R Function

base_col_specs <- c("double", "double", "character",
                    "character", "double", "double",
                    "character", "character", "double",
                    "double", "double", "double")

base_test <- function(path) {
  df <- read.csv(file = path, colClasses = base_col_specs)
  
  return(df)
}

Scripts to Source by Operating System via system()

c("r/base_test_small.R", "r/base_test_med.R", "r/base_test_big.R") %>% 
  walk(inspect_script)

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/base_test_small.R

# r/base_test_small.R ===========================================================
path_small_csv <- "test-data/small_csv.csv" 
 
base_col_specs <- c("double", "double", "character", 
                    "character", "double", "double", 
                    "character", "character", "double", 
                    "double", "double", "double") 
 
df <- read.csv(file = path_small_csv, colClasses = base_col_specs) 

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/base_test_med.R

# r/base_test_med.R =============================================================
path_medium_csv <- "test-data/medium_csv.csv" 
 
base_col_specs <- c("double", "double", "character", 
                    "character", "double", "double", 
                    "character", "character", "double", 
                    "double", "double", "double") 
 
df <- read.csv(file = path_medium_csv, colClasses = base_col_specs) 

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/base_test_big.R

# r/base_test_big.R =============================================================
path_big_csv <- "test-data/big_csv.csv" 
 
base_col_specs <- c("double", "double", "character", 
                    "character", "double", "double", 
                    "character", "character", "double", 
                    "double", "double", "double") 
 
df <- read.csv(file = path_big_csv, colClasses = base_col_specs) 

readr::read_csv()

Local R Function

library(readr)

readr_col_specs <- list(col_double(), col_double(), col_character(),
                        col_character(), col_double(), col_double(),
                        col_character(), col_character(), col_double(),
                        col_double(), col_double(), col_double())

readr_test <- function(path) {
  df <- read_csv(file = path, col_types = readr_col_specs)
  
  return(df)
}

Scripts to Source by Operating System via system()

c("r/readr_test_small.R", "r/readr_test_med.R", "r/readr_test_big.R") %>% 
  walk(inspect_script)

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/readr_test_small.R

# r/readr_test_small.R ==========================================================
library(readr) 
 
path_small_csv <- "test-data/small_csv.csv" 
 
readr_col_specs <- list(col_double(), col_double(), col_character(), 
                        col_character(), col_double(), col_double(), 
                        col_character(), col_character(), col_double(), 
                        col_double(), col_double(), col_double()) 
 
df <- read_csv(file = path_small_csv, col_types = readr_col_specs) 

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/readr_test_med.R

# r/readr_test_med.R ============================================================
library(readr) 
 
path_medium_csv <- "test-data/medium_csv.csv" 
 
readr_col_specs <- list(col_double(), col_double(), col_character(), 
                        col_character(), col_double(), col_double(), 
                        col_character(), col_character(), col_double(), 
                        col_double(), col_double(), col_double()) 
 
df <- read_csv(file = path_medium_csv, col_types = readr_col_specs) 

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/readr_test_big.R

# r/readr_test_big.R ============================================================
library(readr) 
 
path_big_csv <- "test-data/big_csv.csv" 
 
readr_col_specs <- list(col_double(), col_double(), col_character(), 
                        col_character(), col_double(), col_double(), 
                        col_character(), col_character(), col_double(), 
                        col_double(), col_double(), col_double()) 
 
df <- read_csv(file = path_big_csv, col_types = readr_col_specs) 

data.table::fread()

Local R Function

library(data.table)

datatable_col_specs <- c("double", "double", "character",
                         "character", "double", "double",
                         "character", "character", "double",
                         "double", "double", "double")

datatable_test <- function(path) {
  df <- fread(file = path, colClasses = datatable_col_specs)
  
  return(df)
}

Scripts to Source by Operating System via system()

c("r/datatable_test_small.R", "r/datatable_test_med.R", "r/datatable_test_big.R") %>% 
  walk(inspect_script)

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/datatable_test_small.R

# r/datatable_test_small.R ======================================================
library(data.table) 
 
path_small_csv <- "test-data/small_csv.csv" 
 
datatable_col_specs <- c("double", "double", "character", 
                         "character", "double", "double", 
                         "character", "character", "double", 
                         "double", "double", "double") 
 
df <- fread(file = path_small_csv, colClasses = datatable_col_specs) 

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/datatable_test_med.R

# r/datatable_test_med.R ========================================================
library(data.table) 
 
path_medium_csv <- "test-data/medium_csv.csv" 
 
datatable_col_specs <- c("double", "double", "character", 
                         "character", "double", "double", 
                         "character", "character", "double", 
                         "double", "double", "double") 
 
df <- fread(file = path_medium_csv, colClasses = datatable_col_specs) 

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/datatable_test_big.R

# r/datatable_test_big.R ========================================================
library(data.table) 
 
path_big_csv <- "test-data/big_csv.csv" 
 
datatable_col_specs <- c("double", "double", "character", 
                         "character", "double", "double", 
                         "character", "character", "double", 
                         "double", "double", "double") 
 
df <- fread(file = path_big_csv, colClasses = datatable_col_specs) 

Python

pandas.read_csv()

Local Python Function

import pandas
path_small_csv = 'test-data/small_csv.csv'
path_medium_csv = 'test-data/medium_csv.csv'
path_big_csv = 'test-data/big_csv.csv'
pandas_col_specs = {
  'GlobalRank':float, 'TldRank':float, 'Domain':str,
  'TLD':str, 'RefSubNets':float, 'RefIPs':float,
  'IDN_Domain':str, 'IDN_TLD':str, 'PrevGlobalRank':float,
  'PrevTldRank':float, 'PrevRefSubNets':float, 'PrevRefIPs':float
  }
def pandas_test_small():
  df = pandas.read_csv(filepath_or_buffer = path_small_csv,
                dtype = pandas_col_specs, low_memory = False)
  return(df)
  
def pandas_test_medium():
  df = pandas.read_csv(filepath_or_buffer = path_medium_csv,
                dtype = pandas_col_specs, low_memory = False)
  return(df)
  
def pandas_test_big():
  df = pandas.read_csv(filepath_or_buffer = path_big_csv,
                dtype = pandas_col_specs, low_memory = False)
  return(df)
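The dtype-spec pattern these functions rely on can be demonstrated on an in-memory CSV without touching the test files (the two-row CSV and the column subset here are made up for illustration):

```python
import io
import pandas

# tiny stand-in for the real files, using a subset of the real columns
csv_text = "GlobalRank,Domain\n1,google.com\n2,facebook.com\n"
specs = {"GlobalRank": float, "Domain": str}

df = pandas.read_csv(io.StringIO(csv_text), dtype=specs)

# GlobalRank is parsed straight to float64 rather than inferred as int64,
# mirroring the "all numerics as doubles" standardization above
assert list(df.dtypes.astype(str)) == ["float64", "object"]
```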

Scripts to Source via system() and reticulate::py_run_file(..., convert = FALSE)

c("py/pandas_test_small.py", "py/pandas_test_med.py", "py/pandas_test_big.py") %>% 
  walk(inspect_script)

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/py/pandas_test_small.py

# py/pandas_test_small.py =======================================================
import pandas 
 
path_small_csv = 'test-data/small_csv.csv' 
   
pandas_col_specs = { 
  'GlobalRank':float, 'TldRank':float, 'Domain':str, 
  'TLD':str, 'RefSubNets':float, 'RefIPs':float, 
  'IDN_Domain':str, 'IDN_TLD':str, 'PrevGlobalRank':float, 
  'PrevTldRank':float, 'PrevRefSubNets':float, 'PrevRefIPs':float 
  } 
 
df = pandas.read_csv(filepath_or_buffer = path_small_csv, 
                     dtype = pandas_col_specs, low_memory = False) 

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/py/pandas_test_med.py

# py/pandas_test_med.py =========================================================
import pandas 
 
path_medium_csv = 'test-data/medium_csv.csv' 
 
pandas_col_specs = { 
  'GlobalRank':float, 'TldRank':float, 'Domain':str, 
  'TLD':str, 'RefSubNets':float, 'RefIPs':float, 
  'IDN_Domain':str, 'IDN_TLD':str, 'PrevGlobalRank':float, 
  'PrevTldRank':float, 'PrevRefSubNets':float, 'PrevRefIPs':float 
  } 
 
df = pandas.read_csv(filepath_or_buffer = path_medium_csv, 
                     dtype = pandas_col_specs, low_memory = False) 

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/py/pandas_test_big.py

# py/pandas_test_big.py =========================================================
import pandas 
 
path_big_csv = 'test-data/big_csv.csv' 
 
pandas_col_specs = { 
  'GlobalRank':float, 'TldRank':float, 'Domain':str, 
  'TLD':str, 'RefSubNets':float, 'RefIPs':float, 
  'IDN_Domain':str, 'IDN_TLD':str, 'PrevGlobalRank':float, 
  'PrevTldRank':float, 'PrevRefSubNets':float, 'PrevRefIPs':float 
  } 
 
df = pandas.read_csv(filepath_or_buffer = path_big_csv, 
                     dtype = pandas_col_specs, low_memory = False) 

reticulate::py_run_string(..., convert = FALSE)

py_run_string(
"
import pandas

path_small_csv = 'test-data/small_csv.csv'
path_medium_csv = 'test-data/medium_csv.csv'
path_big_csv = 'test-data/big_csv.csv'

pandas_col_specs = {
  'GlobalRank':float, 'TldRank':float, 'Domain':str,
  'TLD':str, 'RefSubNets':float, 'RefIPs':float,
  'IDN_Domain':str, 'IDN_TLD':str, 'PrevGlobalRank':float,
  'PrevTldRank':float, 'PrevRefSubNets':float, 'PrevRefIPs':float
  }

def retic_pandas_test_small():
  df = pandas.read_csv(filepath_or_buffer = path_small_csv,
                dtype = pandas_col_specs, low_memory = False)
  return(df)

def retic_pandas_test_medium():
  df = pandas.read_csv(filepath_or_buffer = path_medium_csv,
                dtype = pandas_col_specs, low_memory = False)
  return(df)

def retic_pandas_test_big():
  df = pandas.read_csv(filepath_or_buffer = path_big_csv,
                dtype = pandas_col_specs, low_memory = False)
  return(df)

", convert = FALSE
)

Dependencies Only

c("r/test_load_readr.R", "r/test_load_datatable.R", "py/test_load_pandas.py") %>% 
  walk(inspect_script)

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/test_load_readr.R

# r/test_load_readr.R ===========================================================
library(readr) 

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/r/test_load_datatable.R

# r/test_load_datatable.R =======================================================
library(data.table) 

File available at https://github.com/syknapptic/syknapptic/tree/master/content/post/py/test_load_pandas.py

# py/test_load_pandas.py ========================================================
import pandas 

The Test

100 iterations were run to provide a reasonable balance between rigor and compute time.

n_iterations <- 100
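One reason to look at medians (as the tables here do) rather than means alone: with only 100 iterations, a handful of GC pauses or OS hiccups can drag the mean well away from typical performance, while the median shrugs them off. A toy illustration with made-up timings:

```python
# 95 "normal" runs at 1 ms plus 5 outliers at 50 ms (e.g. GC pauses)
import statistics

times_ms = [1.0] * 95 + [50.0] * 5

assert statistics.median(times_ms) == 1.0  # unaffected by the outliers
assert statistics.mean(times_ms) == 3.45   # dragged up ~3.5x
```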

All the code was tested via the {bench} package and its bench::mark() function. This package was selected over alternatives simply as a chance to take it for a test drive.

The convert argument of the reticulate::py_run_string() and reticulate::py_run_file() calls is set to FALSE to minimize any handicap.

results <- mark(
  base_test(path_small_csv),
  readr_test(path_small_csv),
  datatable_test(path_small_csv),
  system("Rscript r/base_test_small.R"),
  system("Rscript r/readr_test_small.R"),
  system("Rscript r/datatable_test_small.R"),
  
  py$pandas_test_small(),
  py_run_string("retic_pandas_test_small()", convert = FALSE),
  py_run_file("py/pandas_test_small.py", convert = FALSE),
  system("python py/pandas_test_small.py"),

  base_test(path_medium_csv),
  readr_test(path_medium_csv),
  datatable_test(path_medium_csv),
  system("Rscript r/base_test_med.R"),
  system("Rscript r/readr_test_med.R"),
  system("Rscript r/datatable_test_med.R"),
  
  py$pandas_test_medium(),
  py_run_string("retic_pandas_test_medium()", convert = FALSE),
  py_run_file("py/pandas_test_med.py", convert = FALSE),
  system("python py/pandas_test_med.py"),

  base_test(path_big_csv),
  readr_test(path_big_csv),
  datatable_test(path_big_csv),
  system("Rscript r/base_test_big.R"),
  system("Rscript r/readr_test_big.R"),
  system("Rscript r/datatable_test_big.R"),

  py$pandas_test_big(),
  py_run_string("retic_pandas_test_big()", convert = FALSE),
  py_run_file("py/pandas_test_big.py", convert = FALSE),
  system("python py/pandas_test_big.py"),

  check = FALSE, filter_gc = FALSE, iterations = n_iterations
  )
package_results <- mark(
  system("Rscript r/test_load_readr.R"),
  system("Rscript r/test_load_datatable.R"),
  system("python py/test_load_pandas.py"),
  
  check = FALSE, filter_gc = FALSE, iterations = n_iterations
)

Initial Carpentry

package_results_df <- package_results %>% 
  unnest() %>% 
  mutate(package = case_when(
    str_detect(expression, "datatable") ~ "data.table",
    str_detect(expression, "readr") ~ "readr",
    str_detect(expression, "pandas") ~ "pandas"
  )) %>% 
  mutate(call = case_when(
    package == "data.table" ~ "library(data.table)",
    package == "readr" ~ "library(readr)",
    package == "pandas" ~ "import pandas"
  ))

package_medians_df <- package_results_df %>% 
  rename(median_package = median, min_package = min, max_package = max) %>%
  distinct(package, median_package, min_package, max_package) %>% 
  add_row(median_package = bench_time(0), package = "utils")
all_exprs <- results$expression
system_calls <- all_exprs %>% str_subset("^system\\(")
local_r_fun_calls <- all_exprs %>% str_subset("^(base|readr|datatable)_test\\(")
python_eng_calls <- all_exprs %>% str_subset("^py\\$")
reticulate_calls <- all_exprs %>% str_subset("py_run")
knitr_calls <- c(local_r_fun_calls, python_eng_calls, reticulate_calls)

results_df <- results %>%
  unnest() %>%
  mutate(package = case_when(
    str_detect(expression, "datatable") ~ "data.table",
    str_detect(expression, "readr") ~ "readr",
    str_detect(expression, "pandas") ~ "pandas",
    TRUE ~ "utils"
  )) %>% 
  mutate(call = case_when(
    str_detect(expression, "base") ~ "utils::read.csv()",
    str_detect(expression, "readr") ~ "readr::read_csv()",
    str_detect(expression, "datatable") ~ "data.table::fread()",
    str_detect(expression, "py_run_string") ~ "reticulate::py_run_string()",
    str_detect(expression, "py_run_file") ~ "reticulate::py_run_file()",
    str_detect(expression, "pandas") ~ "pandas.read_csv()"
      ) %>%
      str_pad(max(nchar(.)), side = "right") # enforce left alignment in plots
    ) %>%
  mutate(execution_type = case_when(
    expression %in% system_calls ~ "Sourced Script",
    expression %in% knitr_calls ~ "knitr Engine"
    )) %>%
  mutate(dependency_status = case_when(
    expression %in% system_calls ~ "Dependencies Loaded on Execution (Sourced Script)",
    expression %in% knitr_calls ~ "Dependencies Pre-Loaded")) %>% 
  mutate(lang = if_else(str_detect(expression, "pandas"), "Python", "R")) %>%
  mutate(file_size = str_extract(expression, "small|med|big")) %>%
  mutate(rows = case_when(
    file_size == "small" ~ small_rows,
    file_size == "med" ~ med_rows,
    file_size == "big" ~ big_rows
    )) %>% 
  left_join(package_medians_df, by = "package")

gg_df <- results_df %>%
  mutate(n_rows = rows) %>% 
  arrange(rows) %>%
  mutate(rows = rows %>%
           comma() %>%
           paste("Rows") %>%
           as_factor()
        ) %>%
  group_by(expression) %>% 
  mutate(med_time = as.numeric(median(time))) %>% 
  ungroup() %>% 
  arrange(desc(med_time)) %>%
  mutate(call = as_factor(call)) %>%
  arrange(desc(lang)) %>%
  mutate(lang = as_factor(lang))

The Results

theme_simple <- function(pnl_ln_col = "black", line_type = "dotted", cap_size = 10,
                         facet = NULL, ...) {
  theme_minimal(15, "serif") +
  theme(legend.title = element_blank(), 
        legend.text = element_text(size = 12),
        legend.position = "top",
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(colour = pnl_ln_col, linetype = line_type),
        legend.key.size = unit(1.5, "lines"), 
        axis.text.y = element_text("mono", face = "bold", hjust = 0, size = 12),
        plot.caption = element_text(size = cap_size),
        ...)
}

prep_lab <- function(lab) {
  lab <- substitute(lab)
  bquote(italic(paste("   ", .(lab), "   ")))
}

t_R <- prep_lab(t[R])
t_Python <- prep_lab(t[Python])
t_import_pandas <- prep_lab(t[Python]~-~max~group("(",t[import~~pandas],")"))

plot_times <- function(df, ...) {
  plot_init <- df %>%
    ggplot(aes(call, time)) +
    stat_ydensity(aes(fill = lang, color = lang), scale = "width", bw = 0.01, trim = FALSE) +
    scale_fill_manual(values = c("#165CAA", "#ffde57"), labels = c(t_R, t_Python)) +
    scale_color_manual(values = c("#BFC2C5", "#4584b6"), labels = c(t_R, t_Python)) +
    coord_flip() +
    theme_simple()
  if (length(vars(...))) {
    n_rows <- sort(df$n_rows, decreasing = TRUE)[[1]]
    plot_fin <- plot_init + 
      facet_wrap(vars(...), ncol = 1, scales = "free") +
      labs(x = NULL, y = "Execution Time", 
           title = str_glue("CSV to Data Frame: {comma(n_rows)} Rows"),
           caption = str_glue("{n_iterations} iterations"))
  } else {
    plot_fin <- plot_init + 
      labs(x = NULL, y = "Execution Time", title = "Dependency Load Times",
           caption = str_glue("{n_iterations} iterations")) +
      geom_text(aes(y = median, label = paste("Median Time:", median)), 
                color = "darkgreen", nudge_x = 0.515)
  }
  
  plot_fin
}

Execution Times

At 100 rows, R is faster, with base R’s utils::read.csv() finishing first.

gg_df %>%
  filter(file_size == "small") %>% 
  plot_times(facet = dependency_status)

At 5,000 rows, R is still faster. In the sourced scripts, pandas.read_csv() has nearly caught up with utils::read.csv(), but data.table::fread() has pulled away.

gg_df %>%
  filter(file_size == "med") %>% 
  plot_times(facet = dependency_status)

At 5,000,000 rows, we’ve reached a size where the time differences would actually be noticeable.

The advantage of utils::read.csv()’s lack of dependencies has run its course, and pandas.read_csv() is faster in nearly every case.

That said, readr::read_csv() is still faster than pandas.read_csv() and, as most R users would expect, data.table::fread() is by far the fastest.

gg_df %>%
  filter(file_size == "big") %>% 
  plot_times(facet = dependency_status)
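The median-of-repeated-runs methodology behind these comparisons can be sketched with Python’s standard library alone. This is not the harness used for the results above (those came from bench::mark and sourced scripts); the file size, column count, and repeat count here are illustrative assumptions:

```python
import csv
import os
import statistics
import tempfile
import timeit

# Write a small throwaway CSV (100 rows x 3 columns) to time against.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["a", "b", "c"])
    for i in range(100):
        writer.writerow([i, i * 2, i * 3])

def parse():
    # Stand-in for pandas.read_csv(): parse the whole file into memory.
    with open(path, newline="") as f:
        return list(csv.reader(f))

# Repeat the timing and summarize with the median, echoing the
# median-based comparisons in the plots above.
times = timeit.repeat(parse, number=1, repeat=25)
print(f"median parse time: {statistics.median(times):.6f}s")
print(f"rows parsed (incl. header): {len(parse())}")
```

The same pattern scales to the real contenders: swap `parse` for a closure around pandas.read_csv() (or, on the R side, use bench::mark, which additionally reports memory allocation and garbage collections).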

tl;dr

gg_df %>% 
  mutate(dependency_status = dependency_status %>% 
           str_remove("\\s\\(.*$") %>% 
           str_replace("Loaded on", "Loaded\non")
         ) %>% 
  ggplot(aes(call, time)) +
    stat_ydensity(aes(fill = lang, color = lang), scale = "width", bw = 0.01, 
                  trim = FALSE) +
    scale_fill_manual(values = c("#165CAA", "#ffde57"), labels = c(t_R, t_Python)) +
    scale_color_manual(values = c("#BFC2C5", "#4584b6"), labels = c(t_R, t_Python)) +
    coord_flip() +
  theme_simple(pnl_ln_col = "gray") +
  theme(axis.text = element_text(size = 8), strip.text  = element_text(size = 12),
        strip.text.y  = element_text(face = "bold", size = 15),
        panel.background = element_rect(fill = "transparent", size = 0.5)) +
  facet_grid(rows ~ dependency_status, scales = "free", switch = "y", space = "free") +
  labs(x = NULL, y = "Time", title = "R vs Python - CSV to Data Frame",
       caption = "12 columns, 100 iterations each")

Appendices

Dependency Load Times

package_results_df %>%
  mutate(lang = if_else(str_detect(expression, "pandas"), "Python", "R")) %>% 
  arrange(desc(lang)) %>%
  mutate(lang = as_factor(lang)) %>% 
  plot_times()

gg_df %>% 
  filter(dependency_status == "Dependencies Loaded on Execution (Sourced Script)") %>%  
  filter(file_size == "big") %>% 
  mutate(adjusted_time = if_else(lang == "Python", time - max_package, NA_real_))  %>% 
  rename(original_time = time) %>% 
  gather(time_type, time, original_time, adjusted_time) %>% 
  drop_na(time) %>% 
  mutate(descrip = case_when(
    lang == "R" ~ "Original R Time",
    lang == "Python" & time_type == "original_time" ~ "Original Python Time",
    lang == "Python" & time_type == "adjusted_time" ~ "Adjusted Python Time"
    )) %>% 
  arrange(desc(descrip)) %>% 
  mutate(descrip = as_factor(descrip)) %>% 
  ggplot(aes(call, time, fill = descrip)) +
  stat_ydensity(width = 1, size = 0, color = "transparent", scale = "width", bw = 0.01,
                trim = FALSE) +
  scale_fill_manual(values = c("#165CAA", "#ffde57", "#ff9051"), 
                    labels = c(t_R, t_Python, t_import_pandas)) +
  guides(fill = guide_legend(nrow = 3, label.hjust = 0)) +
  coord_flip() +
  theme_simple() +
  labs(x = NULL, y = "Execution Time",
       title = "Comparing Sourced Scripts with Adjusted Python Times",
       caption = str_glue("CSV to Data Frame: {comma(big_rows)} Rows")
      )

Summary Tables

results_df %>%
  select(rows, lang, execution_type, call, mean, median, `itr/sec`, n_gc, mem_alloc) %>% 
  distinct() %>% 
  arrange(rows, desc(lang)) %>%
  mutate(rows = comma(rows), 
         `itr/sec` = round(`itr/sec`, 2),
         n_gc = ifelse(execution_type == "Sourced Script", "unknown", n_gc),
         mem_alloc = ifelse(execution_type == "Sourced Script", "unknown", mem_alloc)) %>% 
  mutate_at(vars(-c(rows, lang)), 
            funs(cell_spec(., background = ifelse(lang == "R", "#f2f2f2", "#edf9ff"),
                              color = ifelse(lang == "R", "#002963", "#809100"))
                )) %>% 
  mutate(lang = lang %>% cell_spec(background = ifelse(lang == "R", "#f2f2f2", "#edf9ff"),
                                   color = ifelse(lang == "R", "#002647", "#809100"))) %>% 
  mutate(n_gc = if_else(str_detect(n_gc, "unknown"), "unknown", n_gc),
         mem_alloc = if_else(str_detect(mem_alloc, "unknown"), "unknown", mem_alloc)) %>%
  rename(garbage_collections = n_gc, language = lang, memory_allocated = mem_alloc) %>%
  rename_all(funs(str_to_title(str_replace(., "_", " ")))) %>%
  kable(caption = "CSV to Data Frame Times", escape = FALSE, digits = 2) %>%
  kable_styling(bootstrap_options = "condensed", font_size = 12) %>% 
  collapse_rows(columns = 1:3, valign = "top")
Table 2: CSV to Data Frame Times
| Rows | Language | Execution Type | Call | Mean | Median | Itr/Sec | Garbage Collections | Memory Allocated |
|---|---|---|---|---|---|---|---|---|
| 100 | R | knitr Engine | utils::read.csv() | 891.85us | 834.96us | 1121.26 | 0 | 364168 |
| 100 | R | knitr Engine | readr::read_csv() | 3.05ms | 2.9ms | 327.96 | 0 | 143680 |
| 100 | R | knitr Engine | data.table::fread() | 1.4ms | 1.36ms | 711.99 | 0 | 276896 |
| 100 | R | Sourced Script | utils::read.csv() | 205.24ms | 200.39ms | 4.87 | unknown | unknown |
| 100 | R | Sourced Script | readr::read_csv() | 548.78ms | 512.73ms | 1.82 | unknown | unknown |
| 100 | R | Sourced Script | data.table::fread() | 322.13ms | 295.43ms | 3.1 | unknown | unknown |
| 100 | Python | knitr Engine | pandas.read_csv() | 74.99ms | 48.48ms | 13.34 | 9 | 1298288 |
| 100 | Python | knitr Engine | reticulate::py_run_string() | 5.02ms | 3.75ms | 199.04 | 1 | 2840 |
| 100 | Python | knitr Engine | reticulate::py_run_file() | 5.78ms | 3.75ms | 173.05 | 0 | 9504 |
| 100 | Python | Sourced Script | pandas.read_csv() | 559.01ms | 530.84ms | 1.79 | unknown | unknown |
| 5,000 | R | knitr Engine | utils::read.csv() | 18.53ms | 18.37ms | 53.98 | 0 | 1975368 |
| 5,000 | R | knitr Engine | readr::read_csv() | 10.17ms | 9.75ms | 98.3 | 0 | 1621744 |
| 5,000 | R | knitr Engine | data.table::fread() | 5ms | 4.73ms | 200.02 | 0 | 675120 |
| 5,000 | R | Sourced Script | utils::read.csv() | 221.48ms | 216.81ms | 4.52 | unknown | unknown |
| 5,000 | R | Sourced Script | readr::read_csv() | 529.75ms | 522.03ms | 1.89 | unknown | unknown |
| 5,000 | R | Sourced Script | data.table::fread() | 296.08ms | 294.28ms | 3.38 | unknown | unknown |
| 5,000 | Python | knitr Engine | pandas.read_csv() | 83.11ms | 67.52ms | 12.03 | 10 | 3058232 |
| 5,000 | Python | knitr Engine | reticulate::py_run_string() | 24.05ms | 24.31ms | 41.58 | 0 | 2840 |
| 5,000 | Python | knitr Engine | reticulate::py_run_file() | 22.14ms | 21.67ms | 45.16 | 0 | 2840 |
| 5,000 | Python | Sourced Script | pandas.read_csv() | 577.44ms | 552.81ms | 1.73 | unknown | unknown |
| 5,000,000 | R | knitr Engine | utils::read.csv() | 19.55s | 19.4s | 0.05 | 168 | 2052818952 |
| 5,000,000 | R | knitr Engine | readr::read_csv() | 7.11s | 7.08s | 0.14 | 107 | 1567378576 |
| 5,000,000 | R | knitr Engine | data.table::fread() | 2.73s | 2.6s | 0.37 | 28 | 665639952 |
| 5,000,000 | R | Sourced Script | utils::read.csv() | 23.25s | 23.07s | 0.04 | unknown | unknown |
| 5,000,000 | R | Sourced Script | readr::read_csv() | 10.06s | 10.05s | 0.1 | unknown | unknown |
| 5,000,000 | R | Sourced Script | data.table::fread() | 3.78s | 3.78s | 0.26 | unknown | unknown |
| 5,000,000 | Python | knitr Engine | pandas.read_csv() | 20.23s | 19.67s | 0.05 | 23 | 1921138232 |
| 5,000,000 | Python | knitr Engine | reticulate::py_run_string() | 13.26s | 13.25s | 0.08 | 0 | 2840 |
| 5,000,000 | Python | knitr Engine | reticulate::py_run_file() | 13.26s | 13.26s | 0.08 | 0 | 2840 |
| 5,000,000 | Python | Sourced Script | pandas.read_csv() | 13.8s | 13.77s | 0.07 | unknown | unknown |
package_results_df %>% 
  mutate(lang = if_else(str_detect(expression, "\\.py"), "Python", "R"),
         `itr/sec` = round(`itr/sec`, 2)) %>% 
  select(lang, call, min, mean, median, max, `itr/sec`) %>% 
  distinct() %>% 
  mutate_at(vars(-lang), 
            funs(cell_spec(., background = ifelse(lang == "R", "#f2f2f2", "#edf9ff"),
                              color = ifelse(lang == "R", "#002963", "#809100"))
            )) %>% 
  mutate(lang = lang %>% cell_spec(background = ifelse(lang == "R", "#f2f2f2", "#edf9ff"),
                                   color = ifelse(lang == "R", "#002647", "#809100"))) %>% 
  rename(language = lang) %>% 
  rename_all(str_to_title) %>% 
  kable(caption = "Dependency Load Times", escape = FALSE, digits = 2) %>%
  kable_styling(bootstrap_options = "condensed", font_size = 12) %>% 
  collapse_rows(columns = 1, valign = "top")
Table 3: Dependency Load Times
| Language | Call | Min | Mean | Median | Max | Itr/Sec |
|---|---|---|---|---|---|---|
| R | library(readr) | 506ms | 510ms | 507ms | 552ms | 1.96 |
| R | library(data.table) | 305ms | 308ms | 305ms | 407ms | 3.25 |
| Python | import pandas | 569ms | 611ms | 607ms | 911ms | 1.64 |
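On the Python side, a module import is cached after the first time, so a cold-load cost like the import pandas row above only shows up in a fresh interpreter per iteration, which is why the sourced-script comparisons matter. A minimal stdlib sketch of that measurement, using json as a stand-in module (swap in pandas if it is installed):

```python
import subprocess
import sys

# Imports are cached in sys.modules, so cold-import time must be
# measured in a brand-new interpreter process each run.
# `json` is a hypothetical stand-in for `pandas` here.
snippet = (
    "import time; t0 = time.perf_counter(); "
    "import json; "
    "print(time.perf_counter() - t0)"
)
out = subprocess.run([sys.executable, "-c", snippet],
                     capture_output=True, text=True, check=True)
cold_import_seconds = float(out.stdout.strip())
print(f"cold import time: {cold_import_seconds:.6f}s")
```

Repeating this and taking the min/median/max would reproduce the shape of Table 3, though absolute numbers depend entirely on the machine and the module being imported.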

Environment

IDE

rstudio_info <- rstudioapi::versionInfo() # obtain in interactive session
write_rds(rstudio_info, "test-data/rstudio_info.rds")
read_rds("test-data/rstudio_info.rds") %>% 
  as_tibble() %>% 
  mutate(IDE = "RStudio") %>% 
  select(IDE, mode, version) %>% 
  mutate(version = as.character(version)) %>% 
  kable() %>% 
  kable_styling(full_width = FALSE)
| IDE | mode | version |
|---|---|---|
| RStudio | desktop | 1.1.453 |

R

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] data.table_1.11.5     bindrcpp_0.2.2        forcats_0.3.0        
##  [4] stringr_1.3.1         dplyr_0.7.6           purrr_0.2.5          
##  [7] readr_1.1.1           tidyr_0.8.1           tibble_1.4.2.9004    
## [10] ggplot2_3.0.0.9000    tidyverse_1.2.1.9000  scales_0.5.0.9000    
## [13] reticulate_1.9.0.9001 kableExtra_0.9.0      bench_1.0.1          
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17      lubridate_1.7.4   lattice_0.20-35  
##  [4] utf8_1.1.4        assertthat_0.2.0  digest_0.6.15    
##  [7] psych_1.8.4       R6_2.2.2          cellranger_1.1.0 
## [10] plyr_1.8.4        evaluate_0.10.1   highr_0.7        
## [13] httr_1.3.1        blogdown_0.7.1    pillar_1.3.0.9000
## [16] rlang_0.2.1       lazyeval_0.2.1    readxl_1.1.0     
## [19] rstudioapi_0.7    Matrix_1.2-14     rmarkdown_1.10.7 
## [22] selectr_0.4-1     foreign_0.8-70    munsell_0.5.0    
## [25] broom_0.4.5       compiler_3.5.1    modelr_0.1.2     
## [28] xfun_0.3          pkgconfig_2.0.1   mnormt_1.5-5     
## [31] htmltools_0.3.6   tidyselect_0.2.4  bookdown_0.7     
## [34] fansi_0.2.3       viridisLite_0.3.0 crayon_1.3.4     
## [37] withr_2.1.2       grid_3.5.1        nlme_3.1-137     
## [40] jsonlite_1.5      gtable_0.2.0      magrittr_1.5     
## [43] cli_1.0.0         stringi_1.2.3     profmem_0.5.0    
## [46] reshape2_1.4.3    xml2_1.2.0        htmldeps_0.1.0   
## [49] tools_3.5.1       glue_1.2.0        hms_0.4.2        
## [52] parallel_3.5.1    yaml_2.1.19       colorspace_1.3-2 
## [55] rvest_0.3.2       knitr_1.20.8      bindr_0.1.1      
## [58] haven_1.1.2

Python

import sys
import numpy
import pandas
print(sys.version)
## 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]
print(numpy.__version__)
## 1.14.5
print(pandas.__version__)
## 0.23.1

System

CPU

cat("CPU:\n", system("wmic cpu get name", intern = TRUE)[[2]])
## CPU:
##  Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz  

Memory

ram_df <- system("wmic MEMORYCHIP get BankLabel, Capacity, Speed", intern = TRUE) %>% 
  str_trim() %>% 
  as_tibble() %>% 
  slice(2:3) %>% 
  separate(value, into = c("BankLabel", "Capacity", "Speed"), sep = "\\s{2,}")

ram_df %>% 
  rename_all(str_replace, "L", " L") %>% 
  kable() %>% 
  kable_styling(full_width = FALSE)
| Bank Label | Capacity | Speed |
|---|---|---|
| DIMM A | 17179869184 | 2400 |
| DIMM B | 17179869184 | 2400 |

ram_df %>% 
  mutate(Capacity = as.numeric(Capacity) / 1e9,
         Speed = as.numeric(Speed)) %>% 
  summarise(`Capacity in GB` = sum(Capacity),
            `Speed in MHz` = unique(Speed)) %>% 
  kable() %>% 
  kable_styling(full_width = FALSE)
| Capacity in GB | Speed in MHz |
|---|---|
| 34.35974 | 2400 |

Storage

cat("SSD:\n", system("wmic diskdrive get Model", intern = TRUE)[[2]])
## SSD:
##  PM951 NVMe SAMSUNG 512GB  
