Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Trick question: is a `10,000` cell numeric `data.frame` big or small?

In the era of “big data” `10,000` cells is minuscule. Such data could be fit on fewer than `1,000` punched cards (or less than half a box). The joking answer is: it is small when they are selling you the system, but can be considered unfairly large later.

# Example

Let’s look at a few examples in `R`. First let’s set up our examples. A `10,000` row by one column `data.frame` (probably fairly close the common mental model of a `100,000` cell `data.frame`), and a `100,000` column by one row `data.frame` (frankly bit of an abuse, but data warehouse tables with millions of rows and `500` to `1,000` columns are not uncommon).

```dTall <- as.data.frame(matrix(data = 0.0,
nrow = 10000,
ncol = 1))

dWide <- as.data.frame(matrix(data = 0.0,
nrow = 1,
ncol = 10000))```

For our example problem we will try to select (zero) rows based on a condition written against the first column.

# Base R

For standard `R` working with either `data.frame` is not a problem.

`system.time(nrow(dTall[dTall\$V1>0, , drop = FALSE]))`
```##    user  system elapsed
##       0       0       0```
`system.time(nrow(dWide[dWide\$V1>0, , drop = FALSE]))`
```##    user  system elapsed
##   0.060   0.004   0.064```

# `dplyr`

For `dplyr` the tall frame is no problem, but the wide frame can take almost 5 minutes to filter.

`library("dplyr")`
```##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
##
##     filter, lag

## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union```
`system.time(dTall %>% filter(V1>0) %>% tally())`
```##    user  system elapsed
##   0.059   0.003   0.061```
`system.time(dWide %>% filter(V1>0) %>% tally())`
```##    user  system elapsed
##   2.224   0.087   2.320```

We will dig deeper into the `dplyr` timing on the wide table later.

# Databases

Most databases don’t really like to work with a ridiculous number of columns.

## RSQLite

`RSQLite` refuses to worm with the wide frame.

```db <- DBI::dbConnect(RSQLite::SQLite(),
":memory:")```
```DBI::dbWriteTable(db, "dTall", dTall,
overwrite = TRUE,
temporary = TRUE)

DBI::dbWriteTable(db, "dWide", dWide,
overwrite = TRUE,
temporary = TRUE)```
`## Error in rsqlite_send_query(conn@ptr, statement): too many columns on dWide`
`DBI::dbDisconnect(db)`

## RPostgres

`RPostgres` refuses the wide frame, stating a hard limit of `1600` columns.

```db <- DBI::dbConnect(RPostgres::Postgres(),
host = 'localhost',
port = 5432,
user = 'postgres',
```DBI::dbWriteTable(db, "dTall", dTall,
overwrite = TRUE,
temporary = TRUE)

DBI::dbWriteTable(db, "dWide", dWide,
overwrite = TRUE,
temporary = TRUE)```
`## Error in result_create(conn@ptr, statement): Failed to fetch row: ERROR:  tables can have at most 1600 columns`
`DBI::dbDisconnect(db)`

## `sparklyr`

`sparklyr` fails, losing the cluster connection when attempting to write the wide frame.

```spark <- sparklyr::spark_connect(version='2.2.0',
master = "local")```
```DBI::dbWriteTable(spark, "dTall", dTall,
temporary = TRUE)

DBI::dbWriteTable(db, "dWide", dWide,
temporary = TRUE)```
`## Error in connection_quote_identifier(conn@ptr, x): Invalid connection`
`sparklyr::spark_disconnect(spark)`

# Why I care

Some clients have run into intermittent issues on `Spark` at around 700 columns. One step of working around the issue was trying a range of sizes to try and figure out where the issue was and get a repeatable failure ( always an important step in debugging).

# Extra: `dplyr` again at larger scale.

Let’s look a bit more closely at that `dplyr` run-time. We will try to get the nature of the column dependency by pushing the column count ever further up: to `100,000`.

This is still less than a megabyte of data. It can fit on a 1986 era `1.44 MB` floppy disk. ```dWide <- as.data.frame(matrix(data = 0.0,
nrow = 1,
ncol = 100000))

dwt <- system.time(dWide %>% filter(V1>0) %>% tally())
print(dwt)```
```##    user  system elapsed
## 251.441  28.067 283.060```

## Python

For comparison we can measure how long it would take to write the results out to disk, start up a Python interpreter, use Pandas do do the work, and then read the result back in to `R`.

```start_pandas <- Sys.time()
feather::write_feather(dWide, "df.feather")```
```import pandas
import feather
print(type(df))```
`## <class 'pandas.core.frame.DataFrame'>`
`print(df.shape)`
`## (1, 100000)`
```df_filtered = df.query('V1>1')
feather.write_dataframe(df_filtered, 'dr.feather')```
```res <- feather::read_feather('dr.feather')
nrow(res)```
`##  0`
```end_pandas <- Sys.time()
python_duration <- difftime(end_pandas, start_pandas,
unit = "secs")
print(python_duration)```
`## Time difference of 21.47297 secs`
```ratio <- as.numeric(dwt['elapsed'])/as.numeric(python_duration)
print(ratio)```
`##  13.18216`

This is slow, but still 13.2 times faster than using `dplyr`.