
Some weeks ago I tweeted this:

You might think that I tweeted this as an unfunny joke, but I didn’t. GUIs were one of the worst things to happen to statisticians. Clickable interfaces for data analysis are probably one of the greatest sources of mistakes and errors in data processing, very likely costing companies worldwide many millions, and they are a source of constant embarrassment when mistakes happen that cost institutions or people their reputation, and their money.

Remember the infamous Excel mistake by Reinhart and Rogoff? If you don’t know what I’m talking about, you can get up to speed by reading this. I think the most interesting sentence is this one:

The most serious was that, in their Excel spreadsheet, Reinhart and Rogoff had not selected the entire row when averaging growth figures: they omitted data from Australia, Austria, Belgium, Canada and Denmark.

This is a typical mistake that happens when a mouse is used to select data in a GUI instead of typing what you need in a scripting language. Many other mistakes like this happen and remain hidden, potentially for years, or go unreported.
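
To make this concrete, here is a minimal sketch of the scripted equivalent, using a made-up growth_data data frame: the mean is computed over the whole column by construction, so there is no selection rectangle to get wrong.

library(dplyr)

# Made-up data: one row per country
growth_data <- tibble(
  country = c("Australia", "Austria", "Belgium", "Canada", "Denmark"),
  growth  = c(3.2, 2.1, 1.8, 2.9, 2.4)
)

# mean() sees every row of the column, no matter how many there are
growth_data %>%
  summarise(average_growth = mean(growth))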

Recently there was another Excel-related problem in England, where positive Covid tests got lost. For some obscure reason, the raw data, which was encoded in a CSV file, got converted into an Excel spreadsheet, most likely for further analysis. The problem is that the file was saved in the now-obsolete XLS format, which is limited to 65,536 rows, instead of the current XLSX format, which can handle over a million. Because the data was converted to the XLS format, up to 15,841 cases were lost. You can get all the details from this BBC article. Again, this is not entirely Excel’s fault, as it was misused. The problem is that when all you have is a hammer, everything looks like a nail, and Excel is that data analytics hammer. So to the uncultured, everything looks like an Excel problem.
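
There is no magic fix for this kind of silent truncation, but a one-line sanity check after any format conversion would have caught it. A quick sketch, with hypothetical file names:

library(readr)

# Hypothetical file names; the point is the check, not the paths
raw_data       <- read_csv("covid_tests_raw.csv")
converted_data <- readxl::read_xlsx("covid_tests_converted.xlsx")

# A conversion should never change the number of observations
stopifnot(nrow(raw_data) == nrow(converted_data))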

Now, how much of this is the fault of Excel, and how much is the personal responsibility of the people involved? I don’t know, but my point is that if, by magic, GUIs were made to disappear, problems like these would disappear as well. The reason is that if you’re forced to write code to reach the results you want, you avoid a lot of the pitfalls I just described. Working with scripts and the command line forces a discipline upon you; you cannot be lazy and click around. For example, reverse engineering a source code file is much easier than reverse engineering a finished Excel spreadsheet; even poorly written and undocumented code is much better than an Excel spreadsheet. Throw a version control system into the mix, and you have the whole history of the file and the ability to know exactly what happened and when. Add unit tests on top, and you start to get something that is robust, transparent, and much easier to audit.
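
To illustrate the unit test part, here is a minimal sketch using {testthat}, reusing the made-up growth_data from above; the helper function and the expected count are of course illustrative:

library(testthat)

average_growth <- function(df) {
  stopifnot(!anyNA(df$growth))  # fail loudly instead of silently dropping values
  mean(df$growth)
}

test_that("every country enters the average", {
  expect_equal(nrow(growth_data), 5)
  expect_equal(average_growth(growth_data), mean(growth_data$growth))
})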

“But Bruno, not everyone is a programmer!” I hear you scream at your monitor.

My point, again, is that if GUIs did not exist, people would simply have to learn these tools well enough to do their work. What other choice would they have?

Of course, GUIs have been invented, and they’re not going anywhere. So what can you do?

When it comes to statistics and data analysis/processing, you can at least not be part of the problem and avoid using Excel altogether. If we go back to our previous scenario from the financial advisory firm, the first step, which consisted in reverse engineering an Excel file, can be done using {tidyxl}. Let’s take a quick look; the spreadsheet I used as the header image for this blog post comes from the Enron corpus, which is mostly known for being a database of over 600,000 emails from the US company Enron. But it also contains spreadsheets, which are delightful. You can download the one from the picture here (8MB XLSX warning). Opening it in your usual spreadsheet application will probably cause your heart rate to increase to dangerous levels, so avoid that. Instead, let’s take a look at what {tidyxl} does with it:

library(tidyxl)
## Warning: package 'tidyxl' was built under R version 4.0.3
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.3.2     ✔ purrr   0.3.4
## ✔ tibble  3.0.1     ✔ dplyr   1.0.0
## ✔ tidyr   1.1.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

# Adjust the path to wherever you saved the downloaded file
dutch_quigley_9378 <- xlsx_cells("dutch_quigley_9378.xlsx")

head(dutch_quigley_9378)
## # A tibble: 6 x 21
##   sheet address   row   col is_blank data_type error logical numeric
##   <chr> <chr>   <int> <int> <lgl>    <chr>     <chr> <lgl>     <dbl>
## 1 Swap… A1          1     1 FALSE    character NA    NA           NA
## 2 Swap… D2          2     4 FALSE    character NA    NA           NA
## 3 Swap… E2          2     5 FALSE    character NA    NA           NA
## 4 Swap… F2          2     6 FALSE    character NA    NA           NA
## 5 Swap… G2          2     7 FALSE    character NA    NA           NA
## 6 Swap… D3          3     4 FALSE    character NA    NA           NA
## # … with 12 more variables: date <dttm>, character <chr>,
## #   character_formatted <list>, formula <chr>, is_array <lgl>,
## #   formula_ref <chr>, formula_group <int>, comment <chr>, height <dbl>,
## #   width <dbl>, style_format <chr>, local_format_id <int>

That whole Excel workbook is inside a neat data frame. Imagine that you want to quickly know where all the formulas are:

dutch_quigley_9378 %>%
  filter(!is.na(formula)) %>%
  count(sheet, address)
## # A tibble: 18,776 x 3
##    sheet address     n
##    <chr> <chr>   <int>
##  1 Front B22         1
##  2 Front C13         1
##  3 Front C2          1
##  4 Front C22         1
##  5 Front C25         1
##  6 Front C26         1
##  7 Front C27         1
##  8 Front C28         1
##  9 Front C30         1
## 10 Front C31         1
## # … with 18,766 more rows

With the code above, you can quickly find, for each sheet, where the formulas are. This workbook contains 18,776 formulas. If Hell is a real place, it’s probably an office building full of cubicles where you’ll sit for eternity looking at these spreadsheets and trying to make sense of them.
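
If you’d rather have a quick overview than the full list, one way is to count the formulas per sheet, which shows immediately where the complexity lives:

dutch_quigley_9378 %>%
  filter(!is.na(formula)) %>%
  count(sheet, sort = TRUE)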

Now imagine that you’d like to know what these formulas are, say, for the Swap sheet:

dutch_quigley_9378 %>%
  filter(sheet == "Swap", !is.na(formula)) %>%
  select(address, formula)
## # A tibble: 6,773 x 2
##    address formula
##    <chr>   <chr>
##  1 F1      DAY(EOMONTH(G1,0))
##  2 G1      A11
##  3 E2      BE9
##  4 A3      BQ5
##  5 E3      BF9
##  6 F3      SUM(G3:K3)
##  7 H3      $F$1*H2
##  8 I3      $F$1*I2
##  9 J3      $F$1*J2
## 10 K3      $F$1*K2
## # … with 6,763 more rows

Brilliant! Maybe you’re interested in finding all the "SUM" formulas? Easy!

dutch_quigley_9378 %>%
  filter(sheet == "Swap", !is.na(formula)) %>%
  filter(grepl("SUM", formula)) %>%
  select(address, formula)
## # A tibble: 31 x 2
##    address formula
##    <chr>   <chr>
##  1 F3      SUM(G3:K3)
##  2 E4      SUM(D11:D309)
##  3 F5      SUM(G5:K5)
##  4 E6      SUM(F6:H6)
##  5 BF8     SUM(BF11:BF242)
##  6 B9      SUM(B47:B294)
##  7 AB9     SUM(AB11:AB253)
##  8 AC9     SUM(AC11:AC253)
##  9 AD9     SUM(AD11:AD253)
## 10 AE9     SUM(AE11:AE253)
## # … with 21 more rows

You get the idea. There are many more things you can extract, such as the formatting, the contents of the cells, the comments (and where to find them), and much, much more. This makes it a breeze to make sense of a complex Excel file.
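
For instance, here is a sketch of how you could pull out the cells that carry comments, and, with tidyxl’s xlsx_formats() function, find every bold cell (using the same file as above):

# Cells that carry comments, and where they live
dutch_quigley_9378 %>%
  filter(!is.na(comment)) %>%
  select(sheet, address, comment)

# Formatting is stored separately; local_format_id links each cell to it
formats  <- xlsx_formats("dutch_quigley_9378.xlsx")
bold_ids <- which(formats$local$font$bold)

dutch_quigley_9378 %>%
  filter(local_format_id %in% bold_ids) %>%
  select(sheet, address, character)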