Imputation by mean?

[This article was first published on StaTEAstics., and kindly contributed to R-bloggers].

Today I was told that, when computing regional aggregates such as those defined by the United Nations M49 country standard (http://unstats.un.org/unsd/methods/m49/m49regin.htm), I should replace missing values with the regional mean.

I was sceptical about this approach, based on the little I know about missing values, since the assumptions the method requires are extremely strong.

(1) The missing values have to be MCAR (missing completely at random). This assumption is badly violated here, since missing values are more likely to come from less developed countries or from countries where the statistical office is not well established.

(2) The method also requires the data to be relatively symmetric; otherwise the mean is a poor estimate of a typical missing value.
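To make the first point concrete, here is a minimal sketch on simulated data (not World Bank figures). It assumes a log-normal indicator and compares the aggregate total after mean imputation when values go missing completely at random with the total when small values are more likely to go missing. The sample size, the missingness probabilities and the helper `imputeMean` are all arbitrary choices made for this illustration.

```r
## Hypothetical illustration: simulated data, not World Bank figures
set.seed(1)
n = 10000
## Right-skewed "indicator" values (log-normal, an assumed distribution)
x = rlnorm(n, meanlog = 0, sdlog = 1)

## MCAR: every value has the same 30% chance of being missing
mcar = x
mcar[runif(n) < 0.3] = NA

## Not MCAR: values below the median are more likely to be missing
## (the 0.6 and 0.1 probabilities are arbitrary choices for this sketch)
pMiss = ifelse(x < median(x), 0.6, 0.1)
mnar = x
mnar[runif(n) < pMiss] = NA

## Replace each missing value with the mean of the observed values
imputeMean = function(v) {
  v[is.na(v)] = mean(v, na.rm = TRUE)
  v
}

trueTotal = sum(x)
mcarTotal = sum(imputeMean(mcar))
mnarTotal = sum(imputeMean(mnar))

## Ratio of each imputed total to the true total
round(c(MCAR = mcarTotal/trueTotal, notMCAR = mnarTotal/trueTotal), 2)
```

Under these assumptions the MCAR ratio should stay close to 1, while the ratio for the second mechanism should come out above 1, because the observed mean is computed mostly from the larger values.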

So I decided to do some data checking: I downloaded some data from the World Bank (http://data.worldbank.org/) to see what the data look like.

## Load the WDI package, which provides the WDI() download function
library(WDI)

## Read the name file; let's just work with the first 100 variables
indicators = read.csv(file = "http://dl.dropbox.com/u/18161931/WorldBankIndicators.csv",
  stringsAsFactors = FALSE, nrows = 100)
indicators = indicators[-c(1:10), ]

## Download and merge the data. Some variables are not collected in 2010
## and thus they are discarded
WDI.df = WDI(indicator = indicators$WDI_NAME[1], start = 2010, end = 2010)
for(i in 2:NROW(indicators)){
  ## try() keeps the loop going when a download fails
  tmp = try(WDI(indicator = indicators$WDI_NAME[i], start = 2010, end = 2010),
            silent = TRUE)
  ## && short-circuits, so tmp is only indexed after a successful download
  if(!inherits(tmp, "try-error") &&
     (sum(is.na(tmp[, indicators$WDI_NAME[i]])) != NROW(tmp)))
    WDI.df = merge(WDI.df, tmp, by = c("iso2c", "country", "year"))
}

## Produce histograms to examine the suitability of mean imputation.
## Loop over the indicator columns only, skipping the merge keys.
indicatorCols = setdiff(colnames(WDI.df), c("iso2c", "country", "year"))
pdf(file = "dataDist.pdf")
for(ind in indicatorCols){
  x = na.omit(WDI.df[, ind])
  hist(x, breaks = 100, main = ind, xlab = NULL)
  ## Mark the mean with a red vertical line
  abline(v = mean(x), col = "red")
  pctBelowMean = round(100 * sum(x < mean(x))/length(x), 2)
  legend("topright", legend = paste(pctBelowMean,
                       "% of data are below the mean", sep = ""))
}
dev.off()

From the saved plots we can clearly see that a large number of variables are heavily skewed (typical for monetary and population-related data). In addition, the majority of the data lie far below the mean, so if mean imputation were used to compute the aggregates, we would end up with an estimate biased significantly upwards.
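The same pattern is easy to reproduce on simulated data. The sketch below assumes a heavily right-skewed log-normal sample (the sample size and parameters are arbitrary, and the data are not World Bank indicators) and computes the same "percentage below the mean" figure as the histograms, alongside the mean and the median.

```r
## Hypothetical illustration on simulated data (not the World Bank indicators)
set.seed(42)
## Heavily right-skewed sample; the log-normal parameters are assumptions
x = rlnorm(200, meanlog = 0, sdlog = 2)

## Same statistic as in the histogram legend
pctBelowMean = round(100 * mean(x < mean(x)), 2)

## Compare the mean with the median and report the share below the mean
c(mean = mean(x), median = median(x), pctBelowMean = pctBelowMean)
```

For a sample like this the mean is pulled well above the median by a few very large values, which is why a missing value filled in with the mean would usually be larger than most of the observed values.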
