April 20, 2011
By

(This article was first published on CYBAEA Data and Analysis, and kindly contributed to R-bloggers)

For my sins, I have done more than my fair share of analysis in Excel. I am quite capable of building and maintaining 130Mb spreadsheets (I had a dozen of them for one client). Excel is pretty much installed everywhere, so it is sometimes the only way to get started getting commercial value of the data in the organisation. But I don’t like it and let’s have a look at one reason why. In order not to always pick on Microsoft, we use another application, but you get the same results with Excel.

YX1X2X3X4
5.881111
2.566111
11.111111
0.796111
0.006111
0.000111
15.68111
3.74111
8.493111
51.26111
14.27111
7.145111
4.27111
6.154111
10.466111
0.008111
10.422111
17.365111
13.418111
41.670111
2.780111
2.988111
9.627111
0.000111
4.655102
3.133102
24.586102
0.001102
5.564102
9.263102
0.000102
0.000102
3.131102
0.000102
7.565013
9.936013
0.008013
16.676013
16.897013
13.716013
6.355013
2.53013
2.477013
21.743013
23.68004
11.118004
0.007004
3.578004
2.95004
2.943004
2.428004
18.754004
0.005004
2.273004
Spreadsheets are good for some things, but analysing data is not one of them. The example data in the table on the right is from Jeffrey S. Simonoff, “Statistical analysis using Microsoft Excel” (2008), and looks at first (and maybe even second) glance like a reasonable set of observations.

However, the predictors are (accidentally) collinear so no meaningful fit is possible, unless one of them are dropped. We see that very easily if we try to do the analysis using the R statistical computing and analysis platform:

> d <- read.delim("clipboard")  # Read DATA range from clipboard
> summary(lm(Y ~ ., data = d))

Call:
lm(formula = Y ~ ., data = d)

Residuals:
Min      1Q  Median      3Q     Max
-11.222  -5.821  -2.546   3.171  40.750

Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.1945     3.9749   1.055    0.296
X1            0.3862     0.5652   0.683    0.497
X2            0.2308     3.1590   0.073    0.942
X3            3.7072     2.9922   1.239    0.221
X4                NA         NA      NA       NA

Residual standard error: 10.14 on 50 degrees of freedom
Multiple R-squared: 0.04767,	Adjusted R-squared: -0.009466
F-statistic: 0.8343 on 3 and 50 DF,  p-value: 0.4814


We have highlighted the message that R has automatically dropped one of the predictors.

Everybody likes to pick on Excel, so let us load the data into version 3.3.2 of LibreOffice, the free Open Source personal productivity suite, instead. It faithfully implements many of the worst features of Excel. You can grab a copy of the spreadsheet GS-spreadsheet-error.ods yourself and see the results. The relevant function in both Excel and LibreOffice for linear regression is LINEST and applying it to the data set give us:

Of the 16 values returned by the function, fully 12 of them are incorrect (highlighted in red), and the '#VALUE!' entries are the only thing that suggests we may have a problem. (The '#N/A' values are a feature of the function and not a problem.) Excluding the X4 values from the function call gives meaningful (and correct) results:

There is so much wrong with doing even this trivial analysis in a spreadsheet that it is hard to know where to start. Some of the problems:

Instead of giving meaningful errors or warnings, the spreadsheets simply produce garbage results. This is nearly impossible to debug.
No help on how to correct the problem
In the erroneous results of the first figure, there is no clue, no hint, no help to figure out how to correct the problem. You could argue about R correcting the issue ”automagically”, but at least it finds a solution to the problem and tells you about it.
Error prone output formats
I put in the row and column headings because otherwise it is just too hard to read the data. Where does the function stuff the F statistics again?

And don’t get me started on version control and documentation. Don’t even mention that the maths in Excel are wrong. Remember: Friends do not let friends do data analysis in spreadsheets.

# You may also like these posts:

1. I learn something new every day. Thinking I knew pretty much everythging there is to know about Microsofts Excel spreadsheet application, I was surprised to see that you could turn any array into a boolean array depending on a condition by simply writing ( array = value ) , as in these examples: (A1:A10=foo) SUMPRODUCT((B2:B6=B10)*1, C2:C6) This works in Gnumeric but not in OpenOffice 1.4. More notes and examples below.

2. Somebody on the R-help mailing list asked how to get Rmpi working on his Fedora Linux machine so he could do high-performance computing on a cluster of machines (or a single multicore machine) using the R statistical computing and analysis platform . Since it is unusually painful to get working, I might as well copy the instructions here.

3. Insurance pricing is backwards and primitive, harking back to an era before computers. One standard (and good) textbook on the topic is Non-Life Insurance Pricing with Generalized Linear Models by Esbjorn Ohlsson and Born Johansson. We have been doing some work in this area recently. Needing a robust internal training course and documented methodology, we have been working our way through the book again and converting the examples and exercises to R , the statistical computing and analysis platform. This is part of a series of posts containing elements of the R code.

4. We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohl…

5. We are seeing the same thing, if a little less and a little delayed. Does it have to be like this? I dont think it is just the tech industry but any new and hot growth area. Fred Wilson writes in Bubble 2.0 that we are heading for a new bubble, similar to…