The importance of being unoriginal (and befriending google)

[This article was first published on Life in Code, and kindly contributed to R-bloggers.]

In search of bin counts

I look at histograms and density functions of my data in R on a regular basis. I have some idea of the algorithms behind these, but I’ve never had any reason to go under the hood until now. Lately, I’ve been looking at using the bin counts for things like Shannon entropy (in the very nice entropy package). I figured that binning and counting data would either be supported via a native, dedicated R package, or be quite simple to code. Not finding the former (base graphics hist() uses .Call("bincounts"), which appears undocumented and has a boatload of arguments), I naively failed to search for a package and coded up the following.

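## naive first attempt: snap x to dig decimal places, then count matches per grid value
myhist = function(x, dig=3)  {
    ## as discussed below, trunc() does not honor digits for plain numeric input
    x=trunc(x, digits=dig);
    ## x=round(x, digits=dig);
    aa = bb = seq(0,1,1/10^dig);    ## grid values 0, 0.001, ..., 1
    for (ii in seq_along(aa)) {
        aa[ii] = sum(x==aa[ii])     ## exact floating-point comparison
    };
    return(cbind(bin=bb, dens=aa/length(x)))
}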


## random variates
test = sort(runif(1e4))
get1 = myhist(test)

Trouble in paradise

The idea: truncate the data to a specified precision, and count how many values land in each bin. Well, first I tried round(x) instead of trunc(x), which sorta makes sense but gives results that I still don’t understand. On the other hand, trunc(x) doesn’t take a digits argument? WTF? Of course, I could use sprintf() to make a character string of known precision and convert back to numeric, but string-handling is waaaaaay too much computational overhead. Like towing a kid’s red wagon with a Land Rover…
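For what it's worth, here is a minimal base-R sketch of the truncate-and-count idea that avoids both trunc() and string handling. It assumes the data lie in [0, 1], and bin_counts is a hypothetical helper name, not something from a package. Truncation to dig decimal places is done with floor(x * 10^dig), and the counting is handed to tabulate() instead of a loop over exact == comparisons. Two things probably contribute to the odd round() results: rounding centers each bin on a grid value (so the bins at 0 and 1 are only half as wide as the rest), and comparing doubles with == can miss values that print identically.

```r
## Sketch only: bin_counts is a hypothetical helper, and x is assumed to lie in [0, 1].
bin_counts <- function(x, dig = 3) {
    nbin <- 10^dig
    ## floor(x * nbin) is the truncated value expressed as an integer 0 .. nbin - 1
    idx <- floor(x * nbin) + 1
    ## fold x == 1 into the last bin so it is not dropped
    idx[idx > nbin] <- nbin
    ## tabulate() counts how often each index 1 .. nbin occurs
    counts <- tabulate(idx, nbins = nbin)
    cbind(bin = (seq_len(nbin) - 1) / nbin, dens = counts / length(x))
}

set.seed(1)
test <- runif(1e4)
res <- bin_counts(test)
head(res)
```

Note that this gives 10^dig bins labeled by their left edge, whereas myhist() above uses 10^dig + 1 grid values from 0 to 1 inclusive.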

Dear Google…

An hour of irritation and confusion later, I asked Google and, small wonder, the second search result linked to the ash package, which contains said tool. And it runs somewhere between 100 and 1,000 times faster. It doesn’t return the bin boundaries by default, but it’s good enough for a quick-and-dirty empirical probability mass function.

To be fair, there’s something to be said for cooking up a simple solution to a simple problem, and then realizing that, for one reason or another, the problem isn’t quite as simple as one first thought. On the other hand, sometimes we just want answers. When that’s the case, asking Google is a pretty good bet.

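## their method
library(ash)
## bin1(x, ab, nbin) counts x in nbin equal-width bins over the interval ab;
## the counts are in the $nc element of the returned list
get2 = bin1(test, c(0,1), 1e3+1)$nc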
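Since the counts come back without bin boundaries, here is a rough sketch of how one might rebuild the boundaries and feed the counts into a Shannon entropy estimate. It assumes the bins are nbin equal-width intervals spanning ab, which is my reading of the bin1() documentation. bin_edges and shannon_entropy are hypothetical helper names, and the counts below come from base-R hist() as a stand-in so the example runs without ash.

```r
## Sketch only: bin_edges and shannon_entropy are hypothetical helpers.

## Edges of nbin equal-width bins spanning the interval ab = c(a, b)
bin_edges <- function(ab, nbin) {
    seq(ab[1], ab[2], length.out = nbin + 1)
}

## Plug-in Shannon entropy (in nats) from a vector of bin counts
shannon_entropy <- function(counts) {
    p <- counts / sum(counts)
    p <- p[p > 0]          # empty bins contribute nothing (0 * log(0) taken as 0)
    -sum(p * log(p))
}

## Stand-in counts from base R; a count vector such as get2
## could be passed to shannon_entropy() in the same way.
set.seed(1)
x <- runif(1e4)
edges <- bin_edges(c(0, 1), 10)
counts <- hist(x, breaks = edges, plot = FALSE)$counts
shannon_entropy(counts)    # close to log(10) for near-uniform data
```

The plug-in estimate is the simplest choice; the entropy package offers other estimators that may behave better when many bins are empty.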
