A note about the estimating the number of visits to a site

October 21, 2013
By

(This article was first published on My contRibution » R, and kindly contributed to R-bloggers)

Motivation

How many times do users visit your site? A very relevant question in web analysis.
One major issue that biases your estimation is the users who save no cookies, these users change their ID on every page visit so they look like 1 time visitors and thus inflate the proportion of 1 in the distribution. This has an enormous effect on the bounce rate you calculate to your site.
Another minor issue, which is a by product of the solution I’m about to offer, is that most distribution functions have a support of x>=0, while visits start at 1. Read on to see the suggested solution.

I’ve stumbled upon this issue a few times and would like to suggest the following solution to estimate the latent proportion of 1 time visitors via maximum likelihood estimation.

Assumptions

  1. There exists a constant proportion (P) of users who don’t save cookies.
  2. The number of visits to the site by users who do save cookies (X) follows some known distribution function (f) who’s parameters are unknown. The distribution is truncated at 0, as we will never see the users who visit 0 times. The CDF for f is denoted as F.

We define Y as the distribution of visits to your site :

Pr(y=1) = P + (1-P) * f(1) / (1-F(0))

For Y=y other than 1 :

Pr(Y=y) = (1-P) * f(y) / (1-F(0))

You might ask yourself why f(y) is  weighted with 1/(1-F(0)), since we are looking at the truncated version of the distribution this weighting factor is essential to keep the integral of the truncated distribution at 1. I’ll leave you to figure out the math.

Example

Here is an example when assuming the distribution f is Poisson.

first we construct the likelihood function:

Ind <- function(l) {
 if(is.logical(l)) as.numeric(l)
 else stop('"l" must be logical')
 }
fy <- function(x, lambda, log = FALSE) {
 if(!log) { Ind(x>=1) * dpois(x, lambda) / ppois(0, lambda, lower.tail = FALSE)}
 else { Ind(x>=1) * ( dpois(x, lambda, log=TRUE) - ppois(0, lambda, lower.tail = FALSE, log=TRUE))}
 }
Lik <- function(par,x) {
 p <- par[1]
 lambda <- par[2]
 n1 <- sum(x==1)
 n0 <- sum(x!=1)
 x0 <- x[x!=1]
n1 * log(p + (1-p) * fy(1,lambda)) + n0 * log(1-p) + sum(fy(x0,lambda,log=TRUE))
 }

Let’s create some data :

x <- c(rpois(5000,5),rep(1,600)) # a sample of 5000 values from a Poisson distribution with lambda = 5, artificially inflated with 600 1s.
par <- c(mean(x==1),mean(x)) # starting values for the estimation process
parOptim <- optim(par,fn=Lik,x=x,method="Nelder-Mead",control=list(fnscale=-1)) # using optim() to find the MLE
parOptim$par[1]
[1] 0.1092563

This means the estimated P is 10.9% of inflated one time visitors. Be sure to remove these guys from the distribution of visitors you get. But be sure to test a few distributions and find a way to judge between them prior to estimating the final numbers.

Good luck.


To leave a comment for the author, please follow the link and comment on his blog: My contRibution » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.