Dummy Is As Dummy Does

[This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In the 1975 edition of “Applied multiple regression/correlation analysis for the behavioral sciences” by Jacob Cohen, an interesting approach of handling missing values in numeric variables was proposed with the purpose to improve the traditional single-value imputation, as described below:

– First of all, impute missing values by the value of mean or median
– And then create a dummy variable to flag out imputed values

In the setting of a regression model, both imputed and dummy variables would be included and therefore the number of independent variables are doubled.

Although the aforementioned approach has long been criticized and eventually abandoned by Cohen himself in the recent edition of the book, I was told that this obsolete technique is still being actively used.

Out of my own curiosity, I applied this dummy imputation approach to the data used in https://statcompute.wordpress.com/2019/05/04/why-use-weight-of-evidence and then compared it with the WoE imputation in the context of Logistic Regression.

Below are my observations:

– Since the dummy approach converts each numeric variable with missing values, the final model tends to have more independent variables, which is not desirable in terms of the model parsimony. For instance, there are 7 independent variables in the model with dummy imputation and only 5 in the model with WoE approach.

– The model performance doesn’t seem to justify the use of more independent variables in the regression with the dummy imputation. As shown in the output below, ROC statistic from the model with WoE approach is significantly better than the one with the dummy imputation based on the DeLong’s test, which is also consistent with the result of Vuong test.

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)