Computing kook density in R

September 24, 2012
By

(This article was first published on Digithead's Lab Notebook, and kindly contributed to R-bloggers)

Do you ever see strange lights in the sky? Do you wonder what really goes on in Area 51? Would you like to use your R hacking skills to get to the bottom of the whole UFO conspiracy? Of course, you would!

UFO data from infochimps is the focus of a data munging exercise in Chapter 1 of Machine Learning for Hackers by Drew Conway and John Myles White, two social scientists with a penchant for statistical computing.

The exercise starts with slightly messy data, proceeds through cleaning up some dates. I think I slightly improved on the code given in the book. Have a look (gist:3775873) and see if you agree.

Dividing the data up by state (for sightings in the US), I noticed something funny. My home state of Washington has a lot of UFO sightings. Normalizing by population, this becomes even more pronounced.

I learned a neat trick from the chapter. The transform function helps to compute derived fields in a data.frame. I use transform to compute UFO sightings per capita, after merging in population data by state from the 2000 census.

sightings.by.state <- transform(
 sightings.by.state,
 state=state, state.name=name,
 sightings=sightings,
 sightings.per.cap=sightings/pop)

Creating the plot above, with a pile of ggplot code, we see that Washington state really is off the deep end when it comes to UFO sightings. Our northwest neighbors in Oregon come in second. I asked a couple fellow Washington residents what they thought. The first reasonably conjectured a relationship to the number of air bases. The second Washingtonian gave the explanation I favor: "High kook density".

If you'd like to the data, it's from Chapter 1 of Machine Learning for Hackers. Data and code can be found in John Myles White's github repo.

To leave a comment for the author, please follow the link and comment on his blog: Digithead's Lab Notebook.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.