Mean Value from Grouped Data

[This article was first published on Statistical Research » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Occasionally, I will get requests from clients to calculate the mean. Most of the time it’s a simple request
but from time-to-time the data was originally from grouped data. A common approach is to take the midpoint of each of the groups and just assume that all respondents within that group average out to the midpoint.  The biggest problem I have run into has been trying to decide what to do with two extremes.  What does one do with the age ’65+’ category or the ‘over $100,000′ category.  Since there is really no mid-point one must make a (somewhat arbitrary) determination on what value to use for those categories.  Over the years I have toyed around with various approaches but have found using this Monte Carlo approach has worked the best and seems to be most defensible.  In this example I use voters who were asked their age and were given categorical response options. The final deliverable will be the average age and standard deviation. 

Group Frequency
18-30 25
30-45 43
45-60 42
60+ 31

I have written some code that I have traditionally used for this type of work but have since found that the LearnBayes package contains the necessary functions as well.  For the purpose of making this code usable in other settings I have combined some of the code I have written with some of the functions from the LearnBayes package.  This way I have been able to modify some of the algorithms and graphs to work under a variety of settings.

I welcome any comment on other approaches people have taken to solve similar problems using grouped, categorical data.


###Note that this function is also available through the VGAM package.
###However, it is displayed here as part of the example.
library(graphics);
laplace.dist <- function (logpost, mode, ...)
{
options(warn = -1)
fit <- optim(mode, logpost, gr = NULL, ..., hessian = TRUE, control = list(fnscale = -1))
options(warn = 0)
mode <- fit$par
hess <- -solve(fit$hessian)
l.mode <- length(mode)
int <- l.mode/2 * log(2 * pi) + 0.5 * log(det(hess)) + logpost(mode, ...)
ret.val <- list(mode = mode, var = hess, int = int, converge = fit$convergence == 0)
return(ret.val)
}
metropolis.randwalk <- function (logpost, proposal, start, m, ...)
{
pb <- length(start)
Mpar <- array(0, c(m, pb))
b <- matrix(t(start))
lb <- logpost(start, ...)
a <- chol(proposal$var)
scale <- proposal$scale
accept <- 0
for (i in 1:m) {
bc <- b + scale * t(a) %*% array(rnorm(pb), c(pb, 1))
lbc <- logpost(t(bc), ...)
prob <- exp(lbc - lb)
if (is.na(prob) == FALSE) {
if (runif(1) < prob) {
lb <- lbc
b <- bc
accept <- accept + 1
}
}
Mpar[i, ] <- b
}
accept <- accept/m
ret.val <- list(par = Mpar, accept = accept)
return(ret.val)
}
DataGroup <- function(theta, data){
cpoints=data$b

freq <- data$f
nbins <- length(cpoints)
m <- theta[1]
logsigma <- theta[2]
z <- 0*m
s <- exp(logsigma)
z <- freq[1]*log(pnorm(cpoints[1],m,s))
for(j in 1:(nbins-1)){
z <- z+freq[j+1]*log(pnorm(cpoints[j+1],m,s)-pnorm(cpoints[j],m,s))
}
z <- z+freq[nbins+1]*log(1-pnorm(cpoints[nbins],m,s))
return(z)
}
contour.plot = function (logf, limits, data, ...) {
log.f = function(theta, data) {
if (is.matrix(theta) == TRUE) {
val = matrix(0, c(dim(theta)[1], 1))
for (j in 1:dim(theta)[1]) {
val[j] = logf(theta[j,], data)
}
} else {
val = logf(theta, data)
}
return(val)
}
ng <- 50
x0 <- seq(limits[1], limits[2], len = ng)
y0 <- seq(limits[3], limits[4], len = ng)
X <- outer(x0, rep(1, ng))
Y <- outer(rep(1, ng), y0)
n2 <- ng^2
Z <- log.f(cbind(X[1:n2], Y[1:n2]), data)
Z <- Z - max(Z)
Z <- matrix(Z, c(ng, ng))
contour(x0, y0, Z, levels = seq(-6.9, 0, by = 2.3), lwd = 2)
}
lo <- c(18,30,45,60)
hi <- c(30,45,60, Inf)
d <- list(int.lo=lo, int.hi=hi, b = c(30,45,60), f=c(25,43,42,31))
y <- c(rep((18+30)/2,25),rep((30+45)/2,43),rep((45+60)/2,42),rep(61,31))
mean.y <- mean(y)
log.sd.y <- log(sd(y))
start <- c(mean.y,log.sd.y)
fit <- laplace.dist(DataGroup,start,d)
fit
modal.sds <- sqrt(diag(fit$var))
proposal <- list(var=fit$var,scale=2)
fit2 <- metropolis.randwalk(DataGroup,proposal,start,10000,d)
fit2$accept
post.means <- apply(fit2$par,2,mean)
post.sds <- apply(fit2$par,2,sd)
cbind(c(fit$mode),modal.sds) ##These should be similar
cbind(post.means,post.sds)
contour.plot(DataGroup,c(min(post.means[1])-10,max(post.means[1]+10),min(post.means[2]-.5),max(post.means[2]+.5)),d,
xlab="mu",ylab="log sigma")

points(fit2$par[1:2000,1],fit2$par[1:2000,2])

 

To leave a comment for the author, please follow the link and comment on their blog: Statistical Research » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)