Dial-a-statistic! Featuring R and Estonia

January 16, 2011

(This article was first published on Statisfactions: The Sounds of Data and Whimsy » R, and kindly contributed to R-bloggers)

Did you wake up this morning hoping that you would be able to listen to telephone beeps inspired by Estonian web site metrics? I knew you did!

First things first: I came up with the slightly crazy idea of using the bleepy sounds that telephones make, called “dual-tone multifrequency” (DTMF) tones, as a tool in exploring Benford’s Law. I’ve previously written about the discovery of this seemingly mystical result: Take almost any naturally occurring dataset which consists of amounts of things, like the number of bees per county in the US, and then only look at the leftmost digit of each number in the dataset. Then, the percentage of the time that each digit (1 through 9) begins the number tends to follow a regular pattern: 1 is the leading digit about 30 percent of the time, then 2 about 18 percent of the time, and continuing to decrease in frequency all the way to 9. In other words, it’s most likely that 1 is the first digit of the number of bees in Sonoma County, somewhat less likely that 2 is, and so on.
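For the curious, the pattern has a tidy formula: under Benford’s law the chance that d is the leading digit is log10(1 + 1/d). Here is a minimal sketch of those expected proportions in base R; the variable names are my own and not part of the DTMF functions.

```r
# Expected leading-digit proportions under Benford's law: P(d) = log10(1 + 1/d)
digits <- 1:9
benford <- log10(1 + 1 / digits)
names(benford) <- digits

# Proportions rounded to three decimals; they sum to exactly 1
round(benford, 3)
```

The proportions start near 0.301 for the digit 1 and fall steadily to about 0.046 for the digit 9.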

But what does Benford’s law sound like? To find out, I generated the phone beeps in R (based on the table in this Wikipedia article) corresponding to each of the digits 1 through 9 in order. The whole sample is 10 seconds long, so the length of each tone corresponds to the relative percentage of the time that digit is at the beginning of a number in the dataset. If all digits had the same frequency, they’d each take about 1.1 seconds, but in Benford’s law the “1” tone (as if I’d dialed “1” on a phone) happens for about 3 seconds, the “2” tone for about 1.8, and so on. So some theoretical dataset that obeyed Benford’s law perfectly would sound like this:
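To make the recipe concrete, here is a rough base-R sketch of the idea, not the actual DTMFBenford() code. It assumes the standard DTMF keypad frequencies, an 8000 Hz sample rate that I picked for illustration, and a hypothetical helper called dtmf_tone().

```r
# Standard DTMF keypad frequencies in Hz for the digits 1 through 9
low  <- rep(c(697, 770, 852), each = 3)      # row tone for each digit
high <- rep(c(1209, 1336, 1477), times = 3)  # column tone for each digit

# Hypothetical helper (not the post's function): samples for one digit's tone
dtmf_tone <- function(digit, seconds, samp_rate = 8000) {
  t <- seq_len(round(seconds * samp_rate)) / samp_rate
  # Average the two sine waves so the amplitude stays within [-1, 1]
  (sin(2 * pi * low[digit] * t) + sin(2 * pi * high[digit] * t)) / 2
}

# Split 10 seconds among the digits in proportion to Benford's law
durations <- 10 * log10(1 + 1 / (1:9))
round(durations, 2)

# Concatenate the nine tones into one mono signal
signal <- unlist(Map(dtmf_tone, 1:9, durations))
```

The result is just a numeric vector of samples. To actually hear it, the samples would still need to be wrapped in a sound object and written out, which is what the tuneR package handles in the real functions.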

So what REAL dataset should we choose? After poking around on the delightful ScraperWiki, I found a dataset of Estonian web site metrics. Life doesn’t get any better than this! (All scraping credit goes to mystery hacker intgr.) My new DTMFBenford() functions were raring to get a piece of the action, so I went boldly forth:

source("DTMFBenfordFunctions.R") # Loads DTMF functions and requires tuneR package
Estonia <- read.csv("http://scraperwiki.com/scrapers/export/metrixstation/")
writeWave(DTMFBenford(Estonia$pageviews), "EstoniaPageViews.wav")

First, we grab the data from ScraperWiki. Then, the function DTMFBenford() figures out how often each digit, one through nine, occurs as the leading digit, here in the number of page views per webpage per day. In the sound file below, your left speaker plays the tones of how the digits actually occur in the real dataset, and your right speaker is playing the same sound as above, the way they’d theoretically occur according to Benford’s law:
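The counting step is simple enough to sketch on its own. This is an illustration under my own assumptions, not the code inside DTMFBenford(): the helper names leading_digit() and digit_shares() are hypothetical, and the page-view numbers are made up.

```r
# Hypothetical helper: first significant digit of each nonzero, finite value
leading_digit <- function(x) {
  x <- abs(x[is.finite(x) & x != 0])
  # Scientific notation puts the leading digit first, e.g. "1.45e+03"
  as.integer(substr(sprintf("%.15e", x), 1, 1))
}

# Hypothetical helper: share of values that start with each digit 1 through 9
digit_shares <- function(x) {
  counts <- tabulate(leading_digit(x), nbins = 9)
  setNames(counts / sum(counts), 1:9)
}

# Made-up page-view counts, purely for illustration
views <- c(120, 1450, 23, 187, 9, 310, 1020, 45, 16, 2600)
observed <- digit_shares(views)
observed
```

Each share can then be multiplied by the 10-second total to get the length of that digit’s tone in the left channel.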

As you can hear, the sounds from the left and right speakers are pretty close to each other, indicating that Estonians are very well behaved. (At least as regards Benford’s law.) Smile wide, Benfordians! Drunk on our success, let’s try another one. It looks like there’s another column of our dataset called “newvisitors”. Why not?

writeWave(DTMFBenford(Estonia$newvisitors), "EstoniaNewVisitors_Percent.wav")

What a mess! Fortunately, it’s a really cool-sounding mess. What happened here? It turns out the “newvisitors” column is the percentage of the total visitors that are new. Benford’s law is really about how things are spread out when you are counting things or measuring physical objects, so it works well for the number of page views as well as things like the lengths of dinosaur bones. So these percentages are way out of Benford’s jurisdiction, and the left and right channels of the sound don’t match.
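A quick simulation hints at why. This sketch uses made-up numbers rather than the Estonian data: “counts” spread evenly on a log scale over five orders of magnitude, and “percentages” drawn uniformly between 1 and 100. Both choices are assumptions of mine, picked only to show the contrast.

```r
set.seed(1)

# Leading digit and digit shares, for positive values only
lead <- function(x) as.integer(substr(sprintf("%.15e", x), 1, 1))
shares <- function(x) tabulate(lead(x), nbins = 9) / length(x)

benford <- log10(1 + 1 / (1:9))

# Made-up "counts" spanning five orders of magnitude (an assumption)
counts <- 10^runif(1e5, 0, 5)
# Made-up percentages between 1 and 100 (also an assumption)
percents <- runif(1e5, 1, 100)

round(rbind(benford, counts = shares(counts), percents = shares(percents)), 3)
```

The simulated counts track the Benford row closely, while the bounded percentages come out nearly flat across the nine digits.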

We can use these percentages to calculate the actual count of new visitors, though, since the dataset includes the total visitors for each website as well:

Estonia$newvisitors.absolute <- (Estonia$newvisitors/100) * Estonia$visitors
writeWave(DTMFBenford(Estonia$newvisitors.absolute), "EstoniaNewVisitors_Abs.wav")

Estonian netizens are still Benford’s law-abiding after all. How boring! (I prefer the deviant tones of the percentages; but then again, I also like music like Death Ambient.)

I’m surprised and excited that this was fairly easy to implement in R after hunting down the right add-on packages. I hereby give an extravagant curtsey to the lonely cabal of folks working with sound in R, especially Uwe Ligges (creator of tuneR). I look forward to making many more hideous noises with their unwitting help.

(DO try this at home! The R functions I created are available here with chatty commentary, and the example code for all these sounds is here.)

