A note about the estimating the number of visits to a site

[This article was first published on My contRibution » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Motivation

How many times do users visit your site? A very relevant question in web analysis.
One major issue that biases your estimation is the users who save no cookies, these users change their ID on every page visit so they look like 1 time visitors and thus inflate the proportion of 1 in the distribution. This has an enormous effect on the bounce rate you calculate to your site.
Another minor issue, which is a by product of the solution I’m about to offer, is that most distribution functions have a support of x>=0, while visits start at 1. Read on to see the suggested solution.

I’ve stumbled upon this issue a few times and would like to suggest the following solution to estimate the latent proportion of 1 time visitors via maximum likelihood estimation.

Assumptions

  1. There exists a constant proportion (P) of users who don’t save cookies.
  2. The number of visits to the site by users who do save cookies (X) follows some known distribution function (f) who’s parameters are unknown. The distribution is truncated at 0, as we will never see the users who visit 0 times. The CDF for f is denoted as F.

We define Y as the distribution of visits to your site :

Pr(y=1) = P + (1-P) * f(1) / (1-F(0))

For Y=y other than 1 :

Pr(Y=y) = (1-P) * f(y) / (1-F(0))

You might ask yourself why f(y) is  weighted with 1/(1-F(0)), since we are looking at the truncated version of the distribution this weighting factor is essential to keep the integral of the truncated distribution at 1. I’ll leave you to figure out the math.

Example

Here is an example when assuming the distribution f is Poisson.

first we construct the likelihood function:

Ind <- function(l) {
 if(is.logical(l)) as.numeric(l)
 else stop('"l" must be logical')
 }
fy <- function(x, lambda, log = FALSE) {
 if(!log) { Ind(x>=1) * dpois(x, lambda) / ppois(0, lambda, lower.tail = FALSE)}
 else { Ind(x>=1) * ( dpois(x, lambda, log=TRUE) - ppois(0, lambda, lower.tail = FALSE, log=TRUE))}
 }
Lik <- function(par,x) {
 p <- par[1]
 lambda <- par[2]
 n1 <- sum(x==1)
 n0 <- sum(x!=1)
 x0 <- x[x!=1]
n1 * log(p + (1-p) * fy(1,lambda)) + n0 * log(1-p) + sum(fy(x0,lambda,log=TRUE))
 }

Let’s create some data :

x <- c(rpois(5000,5),rep(1,600)) # a sample of 5000 values from a Poisson distribution with lambda = 5, artificially inflated with 600 1s.
par <- c(mean(x==1),mean(x)) # starting values for the estimation process
parOptim <- optim(par,fn=Lik,x=x,method="Nelder-Mead",control=list(fnscale=-1)) # using optim() to find the MLE
parOptim$par[1]
[1] 0.1092563

This means the estimated P is 10.9% of inflated one time visitors. Be sure to remove these guys from the distribution of visitors you get. But be sure to test a few distributions and find a way to judge between them prior to estimating the final numbers.

Good luck.


To leave a comment for the author, please follow the link and comment on their blog: My contRibution » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)