Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

MANY THINGS FOLLOW BENFORD’S LAW. CENTS DON’T.

A commenter on our last post brought up Benford’s law, the idea that naturally occurring numbers follow a predictable pattern. Does Benford’s law apply to the cent amounts taken from our 32 million grocery store prices?

Benford’s law, if you don’t know about it, is an amazing thing. If you know the probability distribution that “natural” numbers should have, you can detect where people might be faking data: phony tax returns, bogus scientific studies, etc.

In Benford’s law, the probability of the leftmost digit of a number being d is

log10(d + 1) – log10(d)

In the chart above, we plot the predictions of Benford’s law in red points. The distribution of leftmost digits in our cents data is shown in the blue bars.

As the chart makes plain, cents after the dollar do not seem distributed in a Benford-like manner. And no one is “faking” cents data. It is simply that Benford’s is a domain specific heuristic. Grocery store prices seem to be chosen strategically. Look at all those nines. Perhaps dollar (as opposed to cent) prices in the grocery store are distributed according to Benford, but we leave that as an exercise for the reader.

Want to play with it yourself? The R / ggplot2 code that made this plot is below, and the anyone who wants our trimmed down, 12 meg, version of the Dominick’s database (just these categories and prices) is welcome to it. It can be downloaded here: http://dangoldstein.com/flash/prices/.

if (!require("ggplot2")) install.packages("ggplot2")

orig = read.csv("prices.tsv.gz", sep = "\t")
orig$cents = round(100 * orig$Price)%%100
# First digit
orig$fd = with(orig, ifelse(cents < 10, round(cents), cents%/%10)) sumif = function(x) {sum(orig$fd == x)}
vsumif = Vectorize(sumif)
df = data.frame(Numeral = c(1:9), count = vsumif(1:9))
df$Probability = df$count/sum(df$count) ben = function(d) {log10(d + 1) - log10(d)} df$benford = ben(1:9)
p = ggplot(df, aes(x = Numeral, y = Probability)) + theme_bw()
p = p + geom_bar(stat = "identity", fill = I("blue"))
p = p + scale_x_continuous(breaks = seq(1:9))
p = p + geom_line(aes(x = Numeral, y = benford, size = 0.1))
p = p + geom_point(aes(x = Numeral, y = benford, color = "red", size = 1))
p + opts(legend.position = "none")
ggsave("benford.png")


Plyr / ggplot author Hadley Wickham submitted some much prettier Hadleyfied code, which I studied. It wasn't quite counting the same thing my code counts, so I modified it so that they do the same thing. Most of the difference had to do with rounding. The result is here:

orig2 <- read.csv("prices.tsv.gz", sep = "\t")
orig2 <- mutate(orig2,
cents = round(100 * Price) %% 100,
fd = ifelse(cents

Playing with system.time() if found that the "mutate" command in plyr is slightly faster than my two calls to modify "orig" and much more compact. Lesson learned: Use mutate over separate calls and over "tranform".

My application of calls to sum(orig$fd==x) is faster than Hadley's count(orig2, "fd"), however "count" is much more general purpose and faster than "table" according to my tests, so I will be using it in the future. ADDENDUM 2: Big Data colleague Jake Hofman proves that awk can do this much faster than the R code above: zcat prices.tsv.gz | awk 'NR > 1 {cents=int(100*$2+0.5) % 100;
counts[substr(cents,1,1)]++} END {for (k in counts) print k,
counts[k];}' > benford.dat