# Computing kook density in R

September 24, 2012

Do you ever see strange lights in the sky? Do you wonder what really goes on in Area 51? Would you like to use your R hacking skills to get to the bottom of the whole UFO conspiracy? Of course, you would!

UFO data from infochimps is the focus of a data munging exercise in Chapter 1 of Machine Learning for Hackers by Drew Conway and John Myles White, two social scientists with a penchant for statistical computing.

The exercise starts with slightly messy data and proceeds through cleaning up some dates. I think I slightly improved on the code given in the book. Have a look (gist:3775873) and see if you agree.
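The gist has the real code. As a rough illustration of the kind of date cleanup involved, here is a minimal sketch. It assumes the dates arrive as YYYYMMDD strings with a few malformed entries mixed in; the sample vector and the names `raw.dates`, `looks.valid` and `clean.dates` are made up for this example and come from neither the book nor the gist.

```r
# Hypothetical sample of raw date strings (not taken from the real data set).
raw.dates <- c("19951009", "19950510", "0000", "1995-10-09x", "20100230")

# Keep only strings that are exactly eight digits long.
looks.valid <- grepl("^[0-9]{8}$", raw.dates)

# Parse the well-formed strings; impossible dates such as Feb 30 become NA.
parsed <- as.Date(raw.dates[looks.valid], format = "%Y%m%d")

# Drop whatever failed to parse.
clean.dates <- parsed[!is.na(parsed)]
```

Filtering on the string pattern first keeps obviously broken rows away from `as.Date`, and the `NA` check afterwards catches strings that look right but name a date that does not exist.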

Dividing the data up by state (for sightings in the US), I noticed something funny. My home state of Washington has a lot of UFO sightings. Normalized by population, the effect becomes even more pronounced.

I learned a neat trick from the chapter. The transform function helps to compute derived fields in a data.frame. I use transform to compute UFO sightings per capita, after merging in population data by state from the 2000 census.

```r
sightings.by.state <- transform(
  sightings.by.state,
  state=state, state.name=name,
  sightings=sightings,
  # derived field: sightings divided by state population
  sightings.per.cap=sightings/pop)
```
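The snippet above assumes that `sightings.by.state` already carries the `name` and `pop` columns from the census merge. Here is a minimal sketch of how that merge step might look, using two small made-up tables. The data frames `counts` and `population` and every number in them are hypothetical, not the real sighting counts or census figures.

```r
# Hypothetical sighting counts per state (illustrative numbers only).
counts <- data.frame(
  state = c("wa", "or", "ca"),
  sightings = c(30, 12, 60),
  stringsAsFactors = FALSE)

# Hypothetical population table; the post uses 2000 census data instead.
population <- data.frame(
  state = c("wa", "or", "ca"),
  name = c("Washington", "Oregon", "California"),
  pop = c(6000000, 3000000, 30000000),
  stringsAsFactors = FALSE)

# Join the two tables on the shared state abbreviation.
sightings.by.state <- merge(counts, population, by = "state")

# Add the derived per-capita column.
sightings.by.state <- transform(
  sightings.by.state,
  sightings.per.cap = sightings / pop)

# Order states from the highest rate to the lowest.
sightings.by.state <- sightings.by.state[
  order(sightings.by.state$sightings.per.cap, decreasing = TRUE), ]
```

With these toy numbers the state with the most raw sightings ends up last once population is taken into account, which is the point of normalizing.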

Plotting sightings per capita by state, with a pile of ggplot code, shows that Washington state really is off the deep end when it comes to UFO sightings. Our northwest neighbors in Oregon come in second. I asked a couple of fellow Washington residents what they thought. The first reasonably conjectured a relationship to the number of air bases. The second Washingtonian gave the explanation I favor: “High kook density”.
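The plotting code itself is not reproduced in this post. For a sense of what a per-capita chart could look like, here is a minimal sketch that assumes the ggplot2 package is installed. The `rates` data frame and its numbers are invented for illustration, and the bar-chart layout is a guess rather than the original plot.

```r
library(ggplot2)

# Hypothetical per-capita rates (illustrative numbers only).
rates <- data.frame(
  state.name = c("Washington", "Oregon", "California"),
  sightings.per.cap = c(5e-6, 4e-6, 2e-6))

# Horizontal bar chart with states sorted by their rate.
p <- ggplot(rates, aes(x = reorder(state.name, sightings.per.cap),
                       y = sightings.per.cap)) +
  geom_col() +
  coord_flip() +
  labs(x = "State", y = "UFO sightings per capita")

print(p)
```

Using `reorder` sorts the bars by rate instead of alphabetically, so the outlier states are easy to spot.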

If you’d like the data, it’s from Chapter 1 of Machine Learning for Hackers. Data and code can be found in John Myles White’s GitHub repo.
