**Win-Vector Blog » R**, and kindly contributed to R-bloggers)

Programmers should definitely know how to use R. I don’t mean they should switch from their current language to R, but they should think of R as a handy tool during development.Again and again I find myself working with Java code like the following.

```
public class SomeBigProject1 {
public static double logStirlingApproximation(final int n) {
return n*(Math.log(n)-1) + 0.5*Math.log(2*Math.PI*n);
}
public static double logFactorial(final int n) {
double r = 0.0;
for(int i=n;i>1;--i) {
r += Math.log(i);
}
return r;
}
public static void main(final String[] args) {
int nbad = 0;
for(int n=1000;n<10000;++n) {
if(Math.abs(logFactorial(n)-logStirlingApproximation(n))>=1.0e-5) {
++nbad;
}
}
System.out.println("nbad: " + nbad);
}
}
```

Imagine that this is some humongous project to use Stirling’s Approximation as a replacement for factorial. All the code up until main is great. But the unfortunate developer has hard-coded an acceptance test into `main()`

. If they run their big project all they get out is:

nbad: 7334

The developer needs to re-code and re-build to diagnose the failure, tweak their acceptance criteria or add more measurements.

I strongly recommend a different work pattern. Instead of bringing criteria into the code, bring the data out:

```
public class SomeBigProject2 {
public static void main(final String[] args) {
System.out.println("n"
+ "\t" + "logFactorial"
+ "\t" + "logStirlingApproximation");
for(int n=1000;n<10000;++n) {
System.out.println(String.valueOf(n)
+ "\t" + SomeBigProject1.logFactorial(n)
+ "\t" + SomeBigProject1.logStirlingApproximation(n));
}
}
}
```

Capture this output in a file named “data.tsv” and both Microsoft Excel and R can open it. Naturally I prefer to use R (so that is what I will demonstrate). To read the results into R you start up an R and type in a command like the following:

> d <- read.table('data.tsv', header=T,sep='\t',quote='',as.is=T, stringsAsFactors=F,comment.char='',allowEscapes=F)

Most of the arguments controlling the style of file R is to expected (what the field separator is, weather to expect escapes and quotes and so on). The settings I suggest here are the “ultra hardened” settings. If you make sure none of your fields have a tab or line-break in them when you print then it is guaranteed R can read the data (not matter what whacky symbols are in it). On the java side that usually means making sure any varying text fields are run through `.replaceAll("\\s+"," ")`

“just in case.” At this point you can already look at your data with the `summary()`

command:

> summary(d)

n logFactorial logStirlingApproximation Min. :1000 Min. : 5912 Min. : 5912 1st Qu.:3250 1st Qu.:23034 1st Qu.:23034 Median :5500 Median :41870 Median :41870 Mean :5500 Mean :42536 Mean :42536 3rd Qu.:7749 3rd Qu.:61653 3rd Qu.:61653 Max. :9999 Max. :82100 Max. :82100

This immediately hints that you should have been thinking in terms of relative error instead of absolute error (since insisting on high absolute accuracy on large results does not always make sense).

You also have access to standard statistical measures of agreement like correlation:

> with(d,cor(logFactorial,logStirlingApproximation))

result: 1

You can see where your failures were:

> library(ggplot2) > d$bad <- with(d,abs(logFactorial-logStirlingApproximation)>=1.0e-5) > ggplot(d) + geom_point(aes(x=n,y=bad))

Yields the graph:

You can see all your failures are in the initial interval. You can then drill in:

> ggplot(d) + geom_point(aes(x=n,y=logFactorial-logStirlingApproximation)) + scale_y_log10()

And here we see some things (that are in general true for Stirling’s approximation):

- It is very accurate.
- It is always an under estimate.
- It gets better as n gets larger.

Essentially by poking around with graphs in R you can figure out the nature of your errors (telling you what to fix) and generate findings that tell you how to fix your criteria (perhaps your code is working- but your test wasn’t sensible). The “dump everything and then use R” technique is also particularly good for generating reports on code timings using either `geom_histogram`

or `geom_density`

.

For example, if we had data with a field `runTimeMS`

then it is a simple one-liner to get plot like the following:

> ggplot(t) + geom_density(aes(x=runTimeMS))

From this graph we can immediately see:

- Most of our run-times are very fast.
- We have a heavy right-tail (evidence of “contagion” or one slow-down causing others, like CPU or IO contention).
- Data is truncated at 100MS (could be something “censoring” the measurement, an exception being thrown or an abort).
- There is a spike at 30MS (something is true and slow for some subset of the data that isn’t present in the majority).

This is a lot more that would be seen in a mean-only or mean and standard deviation summary. We may even being seeings signs of two different bugs (the truncation and the spike).

In all cases the key is to dump a lot of data in machine readable form and then come back to to analyze. This is far more flexible than hoping to code in the right summaries and then further hoping the summaries don’t miss something important (or that you at least get a chance to notice if they do miss something). Being able to do exploratory statistics on dumps from your code (both results and timing) gives you incredible measurement, tuning and debugging powers. The scriptability of R means any later analysis is as easy as cut and paste.

Related posts:

- Automatic Differentiation with Scala
- R examine objects tutorial
- Learn Logistic Regression (and beyond)

**leave a comment**for the author, please follow the link and comment on his blog:

**Win-Vector Blog » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...