June 25, 2012

(This article was first published on distributed ecology, and kindly contributed to R-bloggers)

Remember 1992? I had just turned 13 and was still a year away from high school, when my true descent into nerdom and the internet would begin. Back then it was with a local BBS (Bulletin Board System) run by a guy in a trailer park named Charlie, and a 1200 baud modem on an old Compaq 386. But some intrepid ecologists were already forging ahead with what we take for granted 20 years later: Ecolog. Last September, Ecolog hit 13,000 subscribers. I took the data that David mentioned in his post, plotted it out, and fitted a curve to it. The fit was based on an assumption of there being 100 subscribers when Ecolog first started. On Sunday he posted that there were now 15,000 subscribers, so I decided to see how good my fit was. Here’s the original fit with the new data point and the prediction (in red). I fit this exponential growth equation:
$N(d) = N_0 e^{rd}$

where r is the growth rate, d is the number of days since Sept. 1, 1992, and
$N_0=100$

The solid black dot is the true value; the red dot is the fitted value for the new data point.
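As a sketch, the original fit pins $N_0$ at 100 and lets nls() estimate r. The data below are synthetic stand-ins generated from the model itself (the post's actual `dat` isn't shown here), so the fitted rate just recovers the value used to generate them:

```r
# Hypothetical stand-in data: days since Sept. 1, 1992 and subscriber counts,
# generated from the model with N_0 = 100 and r = 0.00045
days <- c(5000, 5300, 5600, 5900, 6200, 6500, 6800, 7100, 7300)
subs <- round(100 * exp(0.00045 * days))

# Fit the growth rate r with N_0 held fixed at 100
fit <- nls(subs ~ 100 * exp(r * days), start = list(r = 5e-4))
coef(fit)["r"]
```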

What should that initial starting value be? Clearly it’s not 100. Let’s begin by assuming that Ecolog grew exponentially. We’re missing a huge amount of data, so I’ll say up front that yes, I know any number of weird things could have happened between the start year of 1992 and the first year we have data, 2006. But exponential growth isn’t an unreasonable assumption. We know it’s not linear; a linear fit gives you ~ -15,000 subscribers at time 0! We could make guesses and keep plotting to eyeball our fit, but the first problem to solve is how to figure out the best-fitting value for $N_0$. It’s a perfect problem for R’s built-in optimization function, optim(). First we define a function that we want to minimize: here we just run nls() with a candidate value for $N_0$, return the deviance, and plug that function into optim().
# Deviance of the exponential fit for a candidate N_0 (param);
# nls() estimates r with N_0 held fixed
fp.est <- function(param, dat) {
  z <- nls(dat[, 2] ~ param * exp(r * dat[, 1]), start = list(r = 0.001))
  return(deviance(z))
}
# Minimize the deviance over N_0, bounded between 1 and 6000
tmp <- optim(1, fp.est, dat = dat, method = "L-BFGS-B", lower = 1, upper = 6000)
That gives us a perfectly reasonable point estimate of 545 and, as you can see, vastly improves our fit.

We can even get a little creative with the data and come up with a measure of standard deviation. We do this with a jackknife-style estimate of variance. Jackknife estimates are a type of resampling that works by repeatedly calculating a statistic while leaving out one or more of the data points. All we do is form every combination of 8 out of our 9 data points, run the optimization routine on each, and use the spread of the resulting estimates. This yields a final estimate for the number of subscribers Ecolog had at the very beginning of:
$N_0 = 545 \pm 33.8$
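The jackknife step can be sketched like this. `dat` and `fp.est` follow the post, but the numbers are synthetic stand-ins (exact exponential data with $N_0 = 545$), since the original `dat` isn't shown; I also start optim() at 100 rather than 1 just to keep the inner nls() calls well behaved:

```r
# Hypothetical stand-in data generated from the model with N_0 = 545, r = 0.00045
days <- c(5000, 5300, 5600, 5900, 6200, 6500, 6800, 7100, 7300)
subs <- round(545 * exp(0.00045 * days))
dat  <- cbind(days, subs)

# Deviance of the fit for a candidate N_0, as in the post
fp.est <- function(param, dat) {
  z <- nls(dat[, 2] ~ param * exp(r * dat[, 1]), start = list(r = 5e-4))
  deviance(z)
}

# Leave one point out at a time and re-run the optimization
jack <- sapply(seq_len(nrow(dat)), function(i) {
  optim(100, fp.est, dat = dat[-i, ], method = "L-BFGS-B",
        lower = 1, upper = 6000)$par
})

# Jackknife standard error from the spread of the leave-one-out estimates
n <- length(jack)
jack.se <- sqrt((n - 1) / n * sum((jack - mean(jack))^2))
c(estimate = mean(jack), se = jack.se)
```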

I have no idea if this is right, but it’s a neat little example of using optim() and jackknifing to solve a problem where we have good intuition (that exponential growth is the right function) and not very much data. Here’s a gist of all the code that this post used.

Very cool Ted! Why not use nls() to fit both parameters? I haven't messed with your code to see if that would work, but I can't think of a reason offhand why it wouldn't.

Thanks Drew. It does work; it gives you about the same answer. I presented this solution because I was thinking of the problem in two steps: one where we actually should know N_0, and then using that to estimate r. But you're right, other than as a trivial example of using optim() there's no reason not to use nls(). The only real reason I can think of is that the optim() method here is a bit more robust to hitting errors like: "Error….Singular gradient"
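Drew's suggestion can be sketched in one call; the data here are the same kind of synthetic stand-ins (exact exponential with $N_0 = 545$, r = 0.00045), so nls() should recover both generating values:

```r
# Hypothetical stand-in data generated from the model with N_0 = 545, r = 0.00045
days <- c(5000, 5300, 5600, 5900, 6200, 6500, 6800, 7100, 7300)
subs <- round(545 * exp(0.00045 * days))

# Let nls() estimate both N_0 and r at once
fit2 <- nls(subs ~ N0 * exp(r * days), start = list(N0 = 100, r = 5e-4))
coef(fit2)
```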