Dollars and cents: How are you at estimating the total bill?

[This article was first published on Decision Science News » R, and kindly contributed to R-bloggers.]


When estimating the cost of a bunch of purchases, a useful heuristic is rounding each item to the nearest dollar. (In fact, on US income tax returns, one is allowed to round and not report the cents.) If the cents portion of prices were uniformly distributed, the following two heuristics would be equally accurate:

* Rounding each item up or down to the nearest dollar and summing
* Rounding each item down, summing, and adding a dollar for every two line items (or 50 cents per item).
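
To see why the two heuristics would tie under that assumption, here is a minimal simulation sketch in R. The basket size, the number of baskets, the dollar range, and the helper names `round_nearest`, `floor_plus_half`, and `simulate_basket` are illustrative assumptions, not part of the original analysis. Cents are drawn uniformly from 0 to 99.

```r
# Simulation sketch: compare the two rounding heuristics when cents are uniform.
# All sizes and names below are illustrative assumptions.
set.seed(1)

n_baskets <- 10000   # number of simulated shopping trips (assumed)
n_items   <- 20      # items per trip (assumed)

# Heuristic 1: round each item to the nearest dollar, then sum
round_nearest <- function(prices) sum(floor(prices + 0.5))

# Heuristic 2: round each item down, sum, and add 50 cents per item
floor_plus_half <- function(prices) sum(floor(prices)) + 0.5 * length(prices)

# Draw whole dollars from 0 to 5 and cents uniformly from 0 to 99
simulate_basket <- function(n) sample(0:5, n, replace = TRUE) +
  sample(0:99, n, replace = TRUE) / 100

# One row per basket, one column per heuristic: estimate minus true total
errors <- t(replicate(n_baskets, {
  prices <- simulate_basket(n_items)
  c(nearest = round_nearest(prices) - sum(prices),
    floor_half = floor_plus_half(prices) - sum(prices))
}))

# Mean error (bias) and mean absolute error for each heuristic
colMeans(errors)
colMeans(abs(errors))
```

Under these assumptions the two columns should show nearly the same bias and nearly the same mean absolute error, which is what "equally accurate" means here.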

But are prices uniformly distributed? Decision Science News wanted to find out.

Fortunately, our alma mater makes publicly available the famous University of Chicago Dominick’s Finer Food Database, which lets us answer this question (for a variety of grocery store items, at least).

We looked at over 32 million purchases comprising:

* 4.8 million cereal purchases
* 2.2 million cracker purchases
* 1.7 million frozen dinner purchases
* 7.2 million frozen entree purchases (though we’re not sure how they differ from “dinners”)
* 4.1 million grooming product purchases
* 4.3 million juice purchases
* 3.3 million laundry product purchases and
* 4.7 million shampoo purchases

The distribution of their prices can be seen above. But what about the cents? We zoom in on them here:

As is plain, there are many “9s prices” (a topic well studied by our marketing colleagues), and there are more prices above 50 cents than below it. The average number of cents turns out to be 57 (median 59).
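
As a rough sketch of how summary figures like these could be computed from a vector of prices, the hypothetical helper `cents_summary` below extracts the cents portion and reports its mean, its median, the share above 50 cents, and the share ending in 9. The function name and the example prices are made up for illustration and are not taken from the Dominick’s data.

```r
# Sketch: summarise the cents portion of a vector of prices.
cents_summary <- function(price) {
  # Work in whole cents to avoid floating-point surprises such as 2.99 - 2
  cents <- round(price * 100) %% 100
  c(mean = mean(cents),
    median = median(cents),
    share_above_50 = mean(cents > 50),
    share_ending_in_9 = mean(cents %% 10 == 9))
}

# Hypothetical prices, made up for illustration only
cents_summary(c(2.99, 3.49, 1.79, 4.99, 0.89, 2.50))
```

Applied to the `Price` column of the full data set, the same function would give the kind of mean and median quoted above.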

In sum (heh), it pays to round properly, though we do think some clever heuristics can exploit the fact that each dollar has on average 57 cents associated with it.
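
One such heuristic might be to round every item down and add 57 cents per item instead of 50. If the cents average 57, the 50-cent version would come in about 7 cents per item too low on average. The sketch below compares the variants on a single basket. The helper `floor_plus_cents` and the basket itself are hypothetical, chosen only to illustrate the arithmetic.

```r
# Sketch: round every item down, then add a fixed number of cents per item.
# cents_per_item = 0.50 is the second heuristic from the list above;
# 0.57 uses the mean number of cents reported in the text.
floor_plus_cents <- function(prices, cents_per_item) {
  sum(floor(prices)) + cents_per_item * length(prices)
}

# Hypothetical basket, made up for illustration (not from the Dominick's data)
basket <- c(2.99, 3.49, 1.79, 4.99, 0.89, 2.59)

actual <- sum(basket)
estimates <- c(
  nearest_dollar = sum(floor(basket + 0.5)),
  floor_plus_50  = floor_plus_cents(basket, 0.50),
  floor_plus_57  = floor_plus_cents(basket, 0.57)
)

# Signed error of each estimate against the true total
estimates - actual
```

A single basket proves nothing either way. A fairer comparison would run all three estimates over many baskets sampled from the real price data.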

Anyone who wants our trimmed-down, 11 MB version of the Dominick’s database (just these categories and prices) is welcome to it. It can be downloaded here:

Plots are made in the R language for statistical computing with Hadley Wickham’s ggplot2 package. The code is here:

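# Install ggplot2 if it is missing, then load it
# (after_stat() below needs ggplot2 3.3.0 or later)
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
library(ggplot2)

# Read the tab-separated price file (columns used below: Price, Item)
orig = read.csv("prices.tsv.gz", sep = "\t")

# Plot a random sample of up to a million purchases
LEN = 1e+06
prices = orig[sample(nrow(orig), min(LEN, nrow(orig))), ]

# Cents portion of each price, as a whole number
prices$cents = round((prices$Price - floor(prices$Price)) * 100, 0)

# Distribution of prices, one panel per item category
p = ggplot(prices, aes(x = Price)) + theme_bw()
p + stat_bin(aes(y = after_stat(density)), binwidth = 0.05, geom = "bar",
    position = "identity") +
    coord_cartesian(xlim = c(0, 6.1)) +
    scale_x_continuous(breaks = seq(0, 6, 0.5)) +
    scale_y_continuous(breaks = seq(1, 2, 1)) +
    facet_grid(Item ~ .)

# Distribution of the cents portion, one panel per item category
p = ggplot(prices, aes(x = cents)) + theme_bw()
p + stat_bin(aes(y = after_stat(density)), binwidth = 1, geom = "bar",
    position = "identity", closed = "left") +
    coord_cartesian(xlim = c(0, 100)) +
    scale_x_continuous(name = "Cents", breaks = seq(0, 100, 10)) +
    facet_grid(Item ~ .)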
