# Mean Value from Grouped Data

**Statistical Research » R**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Occasionally, I will get requests from clients to calculate the mean. Most of the time it’s a simple request

but from time-to-time the data was originally from grouped data. A common approach is to take the midpoint of each of the groups and just assume that all respondents within that group average out to the midpoint. The biggest problem I have run into has been trying to decide what to do with two extremes. What does one do with the age ’65+’ category or the ‘over $100,000′ category. Since there is really no mid-point one must make a (somewhat arbitrary) determination on what value to use for those categories. Over the years I have toyed around with various approaches but have found using this Monte Carlo approach has worked the best and seems to be most defensible. In this example I use voters who were asked their age and were given categorical response options. The final deliverable will be the average age and standard deviation.

Group | Frequency |

18-30 | 25 |

30-45 | 43 |

45-60 | 42 |

60+ | 31 |

I have written some code that I have traditionally used for this type of work but have since found that the *LearnBayes* package contains the necessary functions as well. For the purpose of making this code usable in other settings I have combined some of the code I have written with some of the functions from the *LearnBayes* package. This way I have been able to modify some of the algorithms and graphs to work under a variety of settings.

I welcome any comment on other approaches people have taken to solve similar problems using grouped, categorical data.

###Note that this function is also available through the VGAM package. ###However, it is displayed here as part of the example. library(graphics); laplace.dist <- function (logpost, mode, ...) { options(warn = -1) fit <- optim(mode, logpost, gr = NULL, ..., hessian = TRUE, control = list(fnscale = -1)) options(warn = 0) mode <- fit$par hess <- -solve(fit$hessian) l.mode <- length(mode) int <- l.mode/2 * log(2 * pi) + 0.5 * log(det(hess)) + logpost(mode, ...) ret.val <- list(mode = mode, var = hess, int = int, converge = fit$convergence == 0) return(ret.val) } metropolis.randwalk <- function (logpost, proposal, start, m, ...) { pb <- length(start) Mpar <- array(0, c(m, pb)) b <- matrix(t(start)) lb <- logpost(start, ...) a <- chol(proposal$var) scale <- proposal$scale accept <- 0 for (i in 1:m) { bc <- b + scale * t(a) %*% array(rnorm(pb), c(pb, 1)) lbc <- logpost(t(bc), ...) prob <- exp(lbc - lb) if (is.na(prob) == FALSE) { if (runif(1) < prob) { lb <- lbc b <- bc accept <- accept + 1 } } Mpar[i, ] <- b } accept <- accept/m ret.val <- list(par = Mpar, accept = accept) return(ret.val) } DataGroup <- function(theta, data){ cpoints=data$b freq <- data$f nbins <- length(cpoints) m <- theta[1] logsigma <- theta[2] z <- 0*m s <- exp(logsigma) z <- freq[1]*log(pnorm(cpoints[1],m,s)) for(j in 1:(nbins-1)){ z <- z+freq[j+1]*log(pnorm(cpoints[j+1],m,s)-pnorm(cpoints[j],m,s)) } z <- z+freq[nbins+1]*log(1-pnorm(cpoints[nbins],m,s)) return(z) } contour.plot = function (logf, limits, data, ...) { log.f = function(theta, data) { if (is.matrix(theta) == TRUE) { val = matrix(0, c(dim(theta)[1], 1)) for (j in 1:dim(theta)[1]) { val[j] = logf(theta[j,], data) } } else { val = logf(theta, data) } return(val) } ng <- 50 x0 <- seq(limits[1], limits[2], len = ng) y0 <- seq(limits[3], limits[4], len = ng) X <- outer(x0, rep(1, ng)) Y <- outer(rep(1, ng), y0) n2 <- ng^2 Z <- log.f(cbind(X[1:n2], Y[1:n2]), data) Z <- Z - max(Z) Z <- matrix(Z, c(ng, ng)) contour(x0, y0, Z, levels = seq(-6.9, 0, by = 2.3), lwd = 2) } lo <- c(18,30,45,60) hi <- c(30,45,60, Inf) d <- list(int.lo=lo, int.hi=hi, b = c(30,45,60), f=c(25,43,42,31)) y <- c(rep((18+30)/2,25),rep((30+45)/2,43),rep((45+60)/2,42),rep(61,31)) mean.y <- mean(y) log.sd.y <- log(sd(y)) start <- c(mean.y,log.sd.y) fit <- laplace.dist(DataGroup,start,d) fit modal.sds <- sqrt(diag(fit$var)) proposal <- list(var=fit$var,scale=2) fit2 <- metropolis.randwalk(DataGroup,proposal,start,10000,d) fit2$accept post.means <- apply(fit2$par,2,mean) post.sds <- apply(fit2$par,2,sd) cbind(c(fit$mode),modal.sds) ##These should be similar cbind(post.means,post.sds) contour.plot(DataGroup,c(min(post.means[1])-10,max(post.means[1]+10),min(post.means[2]-.5),max(post.means[2]+.5)),d, xlab="mu",ylab="log sigma") points(fit2$par[1:2000,1],fit2$par[1:2000,2])

**leave a comment**for the author, please follow the link and comment on their blog:

**Statistical Research » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.