Mean Value from Grouped Data

December 7, 2012
By

(This article was first published on Statistical Research » R, and kindly contributed to R-bloggers)

Occasionally, I will get requests from clients to calculate the mean. Most of the time it’s a simple request
but from time-to-time the data was originally from grouped data. A common approach is to take the midpoint of each of the groups and just assume that all respondents within that group average out to the midpoint.  The biggest problem I have run into has been trying to decide what to do with two extremes.  What does one do with the age ’65+’ category or the ‘over $100,000′ category.  Since there is really no mid-point one must make a (somewhat arbitrary) determination on what value to use for those categories.  Over the years I have toyed around with various approaches but have found using this Monte Carlo approach has worked the best and seems to be most defensible.  In this example I use voters who were asked their age and were given categorical response options. The final deliverable will be the average age and standard deviation. 

Group Frequency
18-30 25
30-45 43
45-60 42
60+ 31

I have written some code that I have traditionally used for this type of work but have since found that the LearnBayes package contains the necessary functions as well.  For the purpose of making this code usable in other settings I have combined some of the code I have written with some of the functions from the LearnBayes package.  This way I have been able to modify some of the algorithms and graphs to work under a variety of settings.

I welcome any comment on other approaches people have taken to solve similar problems using grouped, categorical data.


###Note that this function is also available through the VGAM package.
###However, it is displayed here as part of the example.
library(graphics);
laplace.dist <- function (logpost, mode, ...)
{
options(warn = -1)
fit <- optim(mode, logpost, gr = NULL, ..., hessian = TRUE, control = list(fnscale = -1))
options(warn = 0)
mode <- fit$par
hess <- -solve(fit$hessian)
l.mode <- length(mode)
int <- l.mode/2 * log(2 * pi) + 0.5 * log(det(hess)) + logpost(mode, ...)
ret.val <- list(mode = mode, var = hess, int = int, converge = fit$convergence == 0)
return(ret.val)
}
metropolis.randwalk <- function (logpost, proposal, start, m, ...)
{
pb <- length(start)
Mpar <- array(0, c(m, pb))
b <- matrix(t(start))
lb <- logpost(start, ...)
a <- chol(proposal$var)
scale <- proposal$scale
accept <- 0
for (i in 1:m) {
bc <- b + scale * t(a) %*% array(rnorm(pb), c(pb, 1))
lbc <- logpost(t(bc), ...)
prob <- exp(lbc - lb)
if (is.na(prob) == FALSE) {
if (runif(1) < prob) {
lb <- lbc
b <- bc
accept <- accept + 1
}
}
Mpar[i, ] <- b
}
accept <- accept/m
ret.val <- list(par = Mpar, accept = accept)
return(ret.val)
}
DataGroup <- function(theta, data){
cpoints=data$b

freq <- data$f
nbins <- length(cpoints)
m <- theta[1]
logsigma <- theta[2]
z <- 0*m
s <- exp(logsigma)
z <- freq[1]*log(pnorm(cpoints[1],m,s))
for(j in 1:(nbins-1)){
z <- z+freq[j+1]*log(pnorm(cpoints[j+1],m,s)-pnorm(cpoints[j],m,s))
}
z <- z+freq[nbins+1]*log(1-pnorm(cpoints[nbins],m,s))
return(z)
}
contour.plot = function (logf, limits, data, ...) {
log.f = function(theta, data) {
if (is.matrix(theta) == TRUE) {
val = matrix(0, c(dim(theta)[1], 1))
for (j in 1:dim(theta)[1]) {
val[j] = logf(theta[j,], data)
}
} else {
val = logf(theta, data)
}
return(val)
}
ng <- 50
x0 <- seq(limits[1], limits[2], len = ng)
y0 <- seq(limits[3], limits[4], len = ng)
X <- outer(x0, rep(1, ng))
Y <- outer(rep(1, ng), y0)
n2 <- ng^2
Z <- log.f(cbind(X[1:n2], Y[1:n2]), data)
Z <- Z - max(Z)
Z <- matrix(Z, c(ng, ng))
contour(x0, y0, Z, levels = seq(-6.9, 0, by = 2.3), lwd = 2)
}
lo <- c(18,30,45,60)
hi <- c(30,45,60, Inf)
d <- list(int.lo=lo, int.hi=hi, b = c(30,45,60), f=c(25,43,42,31))
y <- c(rep((18+30)/2,25),rep((30+45)/2,43),rep((45+60)/2,42),rep(61,31))
mean.y <- mean(y)
log.sd.y <- log(sd(y))
start <- c(mean.y,log.sd.y)
fit <- laplace.dist(DataGroup,start,d)
fit
modal.sds <- sqrt(diag(fit$var))
proposal <- list(var=fit$var,scale=2)
fit2 <- metropolis.randwalk(DataGroup,proposal,start,10000,d)
fit2$accept
post.means <- apply(fit2$par,2,mean)
post.sds <- apply(fit2$par,2,sd)
cbind(c(fit$mode),modal.sds) ##These should be similar
cbind(post.means,post.sds)
contour.plot(DataGroup,c(min(post.means[1])-10,max(post.means[1]+10),min(post.means[2]-.5),max(post.means[2]+.5)),d,
xlab="mu",ylab="log sigma")

points(fit2$par[1:2000,1],fit2$par[1:2000,2])

 

To leave a comment for the author, please follow the link and comment on his blog: Statistical Research » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.