(This article was first published on **SAS and R**, and kindly contributed to R-bloggers)

Nick has a paper in the American Statistician warning about bias in multiple imputation arising from rounding data imputed under a normal assumption. One example where you might run afoul of this is if the data are truly dichotomous or count variables, but you model them as normal (either because your software is unable to model dichotomous values directly or because you prefer the theoretical soundness of multivariate normal imputation to, e.g., chained equations). In such cases, one might impute assuming normality, then round the imputed values to plausible integers. The paper shows theoretically the bias that can result from this process, and also that allowing the "implausible values" will eliminate the bias. (Of course, modeling the missing variable using a logistic regression model will be most appropriate here.)

In another paper, Nick and Stuart Lipsitz (TAS 2001) comment that the method of predictive mean matching (PMM) "ensures that imputed values are plausible, and may be more appropriate if the normality assumption is violated." Briefly, the PMM method predicts a value from a model for both missing and observed values. The imputation for a subject with a missing value is the observed value of the subject with the nearest predicted value (or a random draw of observed values from among the subjects with the nearest predicted values).
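
To make the PMM description concrete, here is a minimal base-R sketch of a single imputation for one incomplete variable. It is an illustration under simplifying assumptions, not the algorithm used by `proc mi` or `mice()`: the function name `pmm_impute` and its `donors` argument are hypothetical, and the regression coefficients are used as estimated rather than drawn from their posterior, so this is not a proper multiple imputation.

```r
# Simplified predictive mean matching for one incomplete variable y
# with a fully observed covariate x. Hypothetical helper, for illustration only.
pmm_impute <- function(y, x, donors = 3) {
  obs  <- !is.na(y)
  fit  <- lm(y ~ x, subset = obs)                     # model fit to the observed cases
  pred <- predict(fit, newdata = data.frame(x = x))   # predictions for every subject
  y_imp <- y
  for (i in which(!obs)) {
    # distance between this subject's prediction and each observed subject's
    d <- abs(pred[obs] - pred[i])
    nearest <- order(d)[seq_len(min(donors, sum(obs)))]
    # borrow the observed value of one randomly chosen near neighbour
    pick <- nearest[sample.int(length(nearest), 1)]
    y_imp[i] <- y[obs][pick]
  }
  y_imp
}

set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, 0.25)
y[runif(200) < 0.5] <- NA
y_completed <- pmm_impute(y, x)
table(y_completed, useNA = "ifany")   # every imputed value is a value that was observed
```

Because each imputed value is copied from an observed subject, the completed variable can only contain 0s and 1s, which is the sense in which PMM "ensures that imputed values are plausible."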

How does this play out in practice? Can the PMM method overcome the theoretical rounding bias while still generating only plausible imputed values?

**SAS**

We begin by simulating dichotomous data, choosing the value of `p` (probability of 1) = .25, a value with a large absolute bias, according to the paper. We set values to missing with probability 0.5, using an MCAR mechanism. Then we use `proc mi` (section 6.5, example 9.4) to impute the missing values, assuming normality. The mean and standard error of the mean of y are calculated in `proc summary` (section 2.1.1) and combined in `proc mianalyze`. Then the values are rounded manually and the analysis repeated. Next, we impute separately with PMM. Finally, we impute again with a logistic imputation. We use 5 imputations throughout, though 50 would likely be preferable.

Note that a Poisson regression imputation is not yet available for `proc mi`, so the exercise is not wholly academic: if you needed to impute count values, you'd have to choose among implausible values, rounding, and PMM. Also note our use of the `fcs` imputation method, though it is not needed here with an obviously monotone missingness pattern. Finally, note that `proc mi` here requires at least two variables, for no reason we know of. We generate a normally-distributed and uncorrelated covariate.

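    /* simulate a dichotomous y (p = .25) and an unrelated normal covariate x */
    data testpmm;
      do i = 1 to 5000;
        x = normal(0);
        y = rand('BINOMIAL', .25, 1);
        missprob = ranuni(0);
        if missprob le .5 then y = .;  /* MCAR: about half of y set to missing */
        output;
      end;
    run;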

    title "Normal imputation";
    proc mi data=testpmm out=normal nimpute=5;
      var x y;
      fcs reg;
    run;

    title2 "Implausible values";
    /* mean and standard error of y within each imputed data set */
    proc summary data = normal mean stderr;
      by _imputation_;
      var y;
      output out=outnormal mean=meany stderr=stderry;
    run;

    /* combine the per-imputation results */
    proc mianalyze data = outnormal;
      modeleffects meany;
      stderr stderry;
    run;

    title2 "Rounded";
    /* make the rounded data */
    data normalrnd;
      set normal;
      if y lt .5 then y=0;
      else y=1;
    run;

    proc summary data = normalrnd mean stderr;
      by _imputation_;
      var y;
      output out=outnormalrnd mean=meany stderr=stderry;
    run;

    proc mianalyze data = outnormalrnd;
      modeleffects meany;
      stderr stderry;
    run;

    title "regpmm imputation";
    proc mi data=testpmm out=pmm nimpute=5;
      var x y;
      fcs regpmm;
    run;
    ...

    title "logistic imputation";
    proc mi data=testpmm out=logistic nimpute=5;
      class y;
      var x y;
      fcs logistic;
    run;
    ...

We omit the `summary` and `mianalyze` procedures for the latter imputations. Ordinarily, it would be easiest to do this kind of repetitive task with a macro, but we leave it in open code here for legibility.

The results are shown below.

    Normal imputation-- Implausible values
    Parameter    Estimate    Std Error    95% Confidence Limits
    meany        0.249105    0.008634     0.230849    0.267362

    Normal imputation-- Rounded
    meany        0.265280    0.006408     0.252710    0.277850

    regpmm imputation
    meany        0.246320    0.006642     0.233204    0.259436

    logistic imputation
    meany        0.255120    0.008428     0.237449    0.272791

As theory suggests, rounding the normally imputed values leads to bias, while using the normal imputations does not (though it results in implausible values). Neither PMM imputation nor direct logistic imputation appears to be biased.
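
A rough back-of-the-envelope calculation suggests where the rounding bias comes from. This sketch assumes the imputed values are drawn from a normal distribution with the true mean p and variance p(1 - p), ignoring both the parameter uncertainty that a proper imputation adds and the (uncorrelated) covariate. The variable names are ours, not from the paper.

```r
p      <- 0.25   # true probability of a 1
p_miss <- 0.5    # proportion of values set to missing

# Probability that a normal draw with mean p and variance p(1 - p)
# is at least 0.5, and is therefore rounded up to 1
p_round <- pnorm(0.5, mean = p, sd = sqrt(p * (1 - p)), lower.tail = FALSE)

# Expected mean of the completed data: observed values average p,
# rounded imputations average p_round
expected_mean <- (1 - p_miss) * p + p_miss * p_round

round(c(p_round = p_round, expected_mean = expected_mean), 3)
```

Under these assumptions roughly 28% of the imputed values round up to 1 rather than the 25% we would want, so the completed-data mean is pulled up to about 0.27, in line with the rounded estimate above.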

**R**

We will use the mice package written by Stef van Buuren, one of the key developers of chained imputation. Stef also has a new book describing the package and demonstrating its use in many applied examples. We use 5 imputations throughout, though 50 would likely be preferable.

We begin by creating the data. Note that `mice()`, like `proc mi`, requires at least two columns of data. To do the logistic regression imputation, `mice()` wants the missing data to be a factor, so we make a copy of the data as a data frame object as well.

    library(mice)
    n = 5000                # number of observations
    m = 5                   # number of imputations (should be 25-50 in practice)
    x = rnorm(n)
    y = rbinom(n, 1, .25)   # interesting point according to Horton and Lipsitz (TAS 2004)
    unif = runif(n)
    y[unif < .5] = NA       # make half of the Y's be missing
    ds = cbind(y, x)
    ds2 = data.frame(factor(y), x)

The mice package works analogously to `proc mi`/`proc mianalyze`. The `mice()` function performs the imputation, while the `pool()` function summarizes the results across the completed data sets. The `method` option to `mice()` specifies an imputation method for each column in the input object. Here we fit the simplest linear regression model (intercept only).

    # normal model with implausible values
    impnorm = mice(ds, method="norm", m=m)
    summary(pool(with(impnorm, lm(y ~ 1))))
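
The pooling step can be sketched with Rubin's rules in base R. This illustrates the standard combining formulas, not the internals of `pool()` or `proc mianalyze` (which may, for instance, apply a small-sample adjustment to the degrees of freedom). The function name `pool_rubin` is hypothetical, and the inputs in the usage line are made-up numbers rather than results from the imputations in this post.

```r
# Combine m completed-data estimates and standard errors with Rubin's rules.
pool_rubin <- function(est, se, level = 0.95) {
  m    <- length(est)
  qbar <- mean(est)                 # pooled point estimate
  ubar <- mean(se^2)                # within-imputation variance
  b    <- var(est)                  # between-imputation variance
  tvar <- ubar + (1 + 1 / m) * b    # total variance
  # Rubin's (1987) degrees of freedom for the t reference distribution
  df   <- (m - 1) * (1 + ubar / ((1 + 1 / m) * b))^2
  half <- qt(1 - (1 - level) / 2, df) * sqrt(tvar)
  c(est = qbar, se = sqrt(tvar), lo = qbar - half, hi = qbar + half)
}

# Made-up estimates and standard errors from m = 5 completed data sets
pool_rubin(est = c(0.25, 0.26, 0.24, 0.25, 0.27),
           se  = rep(0.006, 5))
```

The pooled standard error is larger than any single completed-data standard error whenever the estimates vary across imputations, which is how the uncertainty due to the missing values enters the interval.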

Rounding could be done by tampering with the mids-type object that `mice()` produces, but there is a more direct way to do this through the `post=` option. It accepts text strings with R commands that will be applied to the imputed values. Here we use the `ifelse()` function to make the normal values equal to 0 or 1. The code for the predictive mean matching and logistic regression follows.

    impnormround = mice(ds, method="norm", m=m,
      post= c("imp[[j]][,i] = ifelse(imp[[j]][,i] < .5, 0, 1)",""))
    imppmm = mice(ds, method="pmm", m=m)
    implog = mice(ds2, method="logreg", m=m)

The results of the `summary(pool())` calls are shown below.

    > summary(pool(with(impnorm, lm(y ~ 1))))
                     est          se     lo 95     hi 95
    (Intercept) 0.272912 0.007008458 0.2589915 0.2868325

    > summary(pool(with(impnormround, lm(y ~ 1))))
                    est         se     lo 95     hi 95
    (Intercept) 0.28544 0.00854905 0.2676263 0.3032537

    > summary(pool(with(imppmm, lm(y ~ 1))))
                     est         se     lo 95     hi 95
    (Intercept) 0.277636 0.03180604 0.2145564 0.3407156

    > summary(pool(with(implog, lm(y ~ 1))))
                      est         se     lo 95     hi 95
    (Intercept) 0.2652899 0.00879988 0.2480342 0.2825457

The message on bias is similar, though there is some hint of trouble in the CI for the PMM method (it seems to have a bias towards 0.5). The default option of 3 donors may be too few (this can be tweaked by use of the `donors = NUMBER` option).
