Welcome to Missing Data Club.
There are only three rules.
Rule #1 is: There is no missing data.
Rule #2 is: THERE IS NO MISSING DATA!
Rule #3: If you’ve never built a model using missing data – you must do it now.
The most consistent difference I see between students and professionals in statistics and data science is a simple one: students have almost no exposure to real data.
But it surprises me how often I run into professionals who fall back on deletion or mean/median imputation when confronted with missing data. Real data almost always contains missing values. When I say “missing” I specifically mean that an observation in a data set has a feature that is NULL or NA (or a similar value) even though other observations contain more obviously meaningful data.
I say there is no such thing as missing data because it is so rare to find a data set where values are missing completely at random that it isn’t even worth mentioning. A missing value almost always means something, even with data collected from scientific apparatus, and that is only more true for sociological data. Bioinformatics is one of the few areas where I’ve seen data sets in which the missing values in a feature or observation truly look pretty close to random.
Because missing data almost always has real meaning, model-based imputation methods are often worthless too. I’ll grant that a model-based imputation will typically yield a boost in model performance over a mean/median imputation, but you’re still trampling all over your problem when you do that. The data is trying to tell you something. Shut up and listen.
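One quick way to listen is to check whether the outcome looks different for rows where a feature is missing. The sketch below is only an illustration: the `income` and `default` columns and their values are made up, and a real check would need far more rows than this toy frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: `income` is sometimes missing, `default` is the outcome.
df = pd.DataFrame({
    "income":  [52.0, np.nan, 61.0, np.nan, 48.0, np.nan, 75.0, 58.0],
    "default": [0,    1,      0,    1,      0,    0,      0,    1],
})

# Flag missingness as its own 0/1 signal before touching the values.
df["income_missing"] = df["income"].isna().astype(int)

# Compare the outcome rate between rows with and without a value.
summary = df.groupby("income_missing")["default"].agg(["mean", "count"])
print(summary)
```

If the two rates differ by more than noise would explain, the missingness itself is carrying information, and deleting or imputing those rows throws it away.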
I’m no ivory tower academic. If something works, I’ll do it. Each problem is unique and I have no scruples about deleting or imputing a value if it’s the best way to go. That said, here are a few of the techniques that work better in my experience:
Check Yo’ Self Before You Wreck Yo’ Self
You know it, I know it, and Ice Cube knows it – but it has to be said. The first thing to do is learn about the domain of your model. The worst case scenario is there may be an issue with data collection or data preparation. Best case scenario? You might learn something that completely changes how you think about the system you intend to model.
One option is to treat “missing” as a value in its own right. If you’re working with a feature that is a factor, this is an easy choice: missing simply becomes one more level. If the data is continuous, the decision requires a bit of thought, because the feature has to be discretized before missing can sit alongside the other buckets. How rich is the relationship between the feature and the outcome? Does the feature have important interactions with other variables? Do you know enough about the data to bucket it out yourself? Do you have a supervised or unsupervised discretization method you’re happy with?
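For a factor, a minimal sketch of giving missing values their own level might look like this. The `employment` feature and the `"MISSING"` label are hypothetical names chosen for the example, and one-hot encoding is just one way to hand the result to a model.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical feature with gaps.
employment = pd.Series(["salaried", np.nan, "self-employed", "salaried", np.nan])

# Give missing values an explicit level instead of dropping or imputing them.
employment_filled = employment.fillna("MISSING").astype("category")

# One-hot encode; "MISSING" becomes a column the model can learn from.
encoded = pd.get_dummies(employment_filled, prefix="employment")
print(encoded.columns.tolist())
```

`pd.get_dummies` also accepts `dummy_na=True`, which adds an indicator column for NaN without the explicit `fillna` step.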
Typically, if I’m trying to squeeze every bit of accuracy I can get out of a model, this is an iterative process. I’ll start with something simple like an equal-width or equal-frequency discretization and later try an MDL or ChiMerge method. Simple guesses on equal-width or equal-frequency bucket sizes often work surprisingly well though, especially if you create some visualizations to guide your choice.
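Here is a rough sketch of the simple starting point: equal-width and equal-frequency buckets with missing kept as a bucket of its own. The `age` values, the `bucket_with_missing` helper and the `"MISSING"` label are all assumptions made for illustration. MDL and ChiMerge are supervised methods and are not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical continuous feature with gaps.
age = pd.Series([23, 35, np.nan, 47, 52, np.nan, 61, 29, 40, 58], name="age")

def bucket_with_missing(values, bins, equal_frequency=False):
    """Discretize a numeric Series and keep NaN as its own 'MISSING' bucket."""
    cutter = pd.qcut if equal_frequency else pd.cut
    binned = cutter(values, bins)      # NaN inputs stay NaN after binning
    labels = binned.astype(str)        # interval labels as plain strings
    # Overwrite whatever the NaN rows became with an explicit label.
    return labels.where(values.notna(), "MISSING")

equal_width = bucket_with_missing(age, 3)
equal_freq = bucket_with_missing(age, 3, equal_frequency=True)

print(equal_width.value_counts())
print(equal_freq.value_counts())
```

Plotting the outcome rate per bucket is a cheap way to see whether the bucket edges are sensible and where the missing bucket sits relative to the others.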
If there’s enough data (which is almost always the case for the types of problems I work on, but isn’t in some arenas like health care), modeling the missing and non-missing observations separately can be a good way to go. Kaggle recently ran a contest on credit scoring where modeling the data separately was a good choice.
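As a sketch of what modeling separately can mean, the example below fits one model to rows where a feature is missing and another to rows where it is present, then routes each row to the matching model. Everything in it is an assumption for illustration: the data is simulated, the features `x1` and `x2` are hypothetical, and plain least squares stands in for whatever model you would actually use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: x1 is always observed, x2 is missing for roughly 30% of rows.
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
missing = rng.random(n) < 0.3
# Assumption for the sketch: the outcome behaves differently when x2 is missing.
y = np.where(missing, 2.0 - 1.0 * x1, 0.5 + 1.0 * x1 + 2.0 * x2)
y = y + rng.normal(scale=0.1, size=n)
x2 = np.where(missing, np.nan, x2)

def fit_linear(features, target):
    """Ordinary least squares with an intercept; returns the coefficients."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

def predict_linear(coef, features):
    """Apply coefficients from fit_linear to a feature matrix."""
    design = np.column_stack([np.ones(len(features)), features])
    return design @ coef

# One feature matrix per missingness pattern; the "missing" one never uses x2.
feats_missing = x1[missing].reshape(-1, 1)
feats_observed = np.column_stack([x1[~missing], x2[~missing]])

coef_missing = fit_linear(feats_missing, y[missing])
coef_observed = fit_linear(feats_observed, y[~missing])

# Route each row to the model that matches its pattern.
pred = np.empty(n)
pred[missing] = predict_linear(coef_missing, feats_missing)
pred[~missing] = predict_linear(coef_observed, feats_observed)
```

The cost of this approach is that each model sees only part of the data, which is why it needs plenty of rows in both groups.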
Stepping Off My Soap Box …
Every problem is different. Every rule of thumb is flawed.
What will always be true is that missing data IS data. Love it. Respect it.