# Programmers Should Know R

August 6, 2011
By

(This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers)

Programmers should definitely know how to use R. I don’t mean they should switch from their current language to R, but they should think of R as a handy tool during development.Again and again I find myself working with Java code like the following.

public class SomeBigProject1 {
public static double logStirlingApproximation(final int n) {
return n*(Math.log(n)-1) + 0.5*Math.log(2*Math.PI*n);
}

public static double logFactorial(final int n) {
double r = 0.0;
for(int i=n;i>1;--i) {
r += Math.log(i);
}
return r;
}

public static void main(final String[] args) {
for(int n=1000;n<10000;++n) {
if(Math.abs(logFactorial(n)-logStirlingApproximation(n))>=1.0e-5) {
}
}
}
}


Imagine that this is some humongous project to use Stirling’s Approximation as a replacement for factorial. All the code up until main is great. But the unfortunate developer has hard-coded an acceptance test into main(). If they run their big project all they get out is:

nbad: 7334


The developer needs to re-code and re-build to diagnose the failure, tweak their acceptance criteria or add more measurements.

I strongly recommend a different work pattern. Instead of bringing criteria into the code, bring the data out:

public class SomeBigProject2 {
public static void main(final String[] args) {
System.out.println("n"
+ "\t" + "logFactorial"
+ "\t" + "logStirlingApproximation");
for(int n=1000;n<10000;++n) {
System.out.println(String.valueOf(n)
+ "\t" + SomeBigProject1.logFactorial(n)
+ "\t" + SomeBigProject1.logStirlingApproximation(n));
}
}
}


Capture this output in a file named “data.tsv” and both Microsoft Excel and R can open it. Naturally I prefer to use R (so that is what I will demonstrate). To read the results into R you start up an R and type in a command like the following:

 > d <- read.table('data.tsv',
stringsAsFactors=F,comment.char='',allowEscapes=F)


Most of the arguments controlling the style of file R is to expected (what the field separator is, weather to expect escapes and quotes and so on). The settings I suggest here are the “ultra hardened” settings. If you make sure none of your fields have a tab or line-break in them when you print then it is guaranteed R can read the data (not matter what whacky symbols are in it). On the java side that usually means making sure any varying text fields are run through .replaceAll("\\s+"," ") “just in case.” At this point you can already look at your data with the summary() command:

 > summary(d)

       n         logFactorial   logStirlingApproximation
Min.   :1000   Min.   : 5912   Min.   : 5912
1st Qu.:3250   1st Qu.:23034   1st Qu.:23034
Median :5500   Median :41870   Median :41870
Mean   :5500   Mean   :42536   Mean   :42536
3rd Qu.:7749   3rd Qu.:61653   3rd Qu.:61653
Max.   :9999   Max.   :82100   Max.   :82100


This immediately hints that you should have been thinking in terms of relative error instead of absolute error (since insisting on high absolute accuracy on large results does not always make sense).

You also have access to standard statistical measures of agreement like correlation:

 > with(d,cor(logFactorial,logStirlingApproximation))

result: 1


You can see where your failures were:

 > library(ggplot2)


Yields the graph:

You can see all your failures are in the initial interval. You can then drill in:

 > ggplot(d) + geom_point(aes(x=n,y=logFactorial-logStirlingApproximation))
+ scale_y_log10()


And here we see some things (that are in general true for Stirling’s approximation):

1. It is very accurate.
2. It is always an under estimate.
3. It gets better as n gets larger.

Essentially by poking around with graphs in R you can figure out the nature of your errors (telling you what to fix) and generate findings that tell you how to fix your criteria (perhaps your code is working- but your test wasn’t sensible). The “dump everything and then use R” technique is also particularly good for generating reports on code timings using either geom_histogram or geom_density.

For example, if we had data with a field runTimeMS then it is a simple one-liner to get plot like the following:

 > ggplot(t) + geom_density(aes(x=runTimeMS))


From this graph we can immediately see:

1. Most of our run-times are very fast.
2. We have a heavy right-tail (evidence of “contagion” or one slow-down causing others, like CPU or IO contention).
3. Data is truncated at 100MS (could be something “censoring” the measurement, an exception being thrown or an abort).
4. There is a spike at 30MS (something is true and slow for some subset of the data that isn’t present in the majority).

This is a lot more that would be seen in a mean-only or mean and standard deviation summary. We may even being seeings signs of two different bugs (the truncation and the spike).

In all cases the key is to dump a lot of data in machine readable form and then come back to to analyze. This is far more flexible than hoping to code in the right summaries and then further hoping the summaries don’t miss something important (or that you at least get a chance to notice if they do miss something). Being able to do exploratory statistics on dumps from your code (both results and timing) gives you incredible measurement, tuning and debugging powers. The scriptability of R means any later analysis is as easy as cut and paste.

Related posts:

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...