# The unicorn problem

October 13, 2012
By

(This article was first published on Probability and statistics blog » r, and kindly contributed to R-bloggers)

Let’s say your goal is to observe all known species in a particular biological category. Once a week you go out and collect specimens to identify, or maybe you just bring your binoculars to do some spotting. How long will it take you to cross off every species on your list?

I’ve been wondering this lately since I’ve begun to hang out on the Mushrooms of Québec flickr group. So far they have over 2200 species included in the photos. At least one of the contributors has written a book on the subject, which got me wondering how long it might take him to gather his own photos and field observations on every single known species.

My crude model (see R code at the end of this post) assumes that every time Yves goes out, he has a fixed chance of encountering every given species. In other words, if there were 1000 species to find, and he averages 50 species per hunt, then every species is assigned a 1/20 chance of being found per trip. Thus the total found on any trip would have a Poisson distribution with parameter 50.

This model is unrealistic for lots of reasons (can you spot all the bad assumptions?), but it does show one of the daunting problems with this task: the closer you get to the end, the harder it becomes to find the last few species. In particular, you can be stuck at 1 for a depressingly long time. Run the simulation with different options and look at the graphs you get. I’m calling this “The unicorn problem,” after Nic Cage’s impossible-to-rob car in the movie Gone in 60 Seconds.

Do you have a real-life unicorn of one sort or another?

```   species = 1000 findP = 1/20 trials = 100 triesToFindAll = rep(0, trials)       for(i in 1:trials) { triesNeeded = 0   leftToFind = rep(1, species) leftNow = species numberLeft = c()   while (leftNow > 0) {   found = sample( c(0,1), 1000, replace=TRUE, prob = c((1-findP),findP))   leftToFind = leftToFind - found   leftNow = length(leftToFind[leftToFind > 0])   numberLeft = c(numberLeft, leftNow)   triesNeeded = triesNeeded + 1   }   if(i == 1) { plot.ts(numberLeft, xlim=c(0, 2*triesNeeded), col="blue", ylab="Species left to find", xlab="Attempts") } else { lines(numberLeft, col="blue") }   triesToFindAll[i] = triesNeeded } ```

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...