I recently entered the Kaggle Titanic learning competition for fun and to see where my out-of-the-box use of random forest would rank me (303 out of 5,882). It was interesting to see that much of the scoring differentiation came from imputation, that is, filling missing values based on other data. For example, we might have a row where we know a female passenger’s class but not her fare, but we could infer what her fare might be from the averages or medians of other first-class female passengers. Below are some quick notes on imputation that might be helpful to others.
Common methods of dealing with unknown or missing values:
- Removing observations with unknown values.
- Filling in unknowns with the most frequent values.
- Filling in unknown values by exploring correlations.
- Filling in unknown values by exploring similarities between cases.
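The second option, filling with the most frequent value (the mode), is a one-liner in most languages. A minimal sketch in plain Python (the data here is made up for illustration):

```python
from collections import Counter

ages = [22, 38, None, 35, 35, None, 54]

# Most frequent observed value among the non-missing entries.
mode_age = Counter(a for a in ages if a is not None).most_common(1)[0][0]

# Replace each missing entry with that mode.
filled = [mode_age if a is None else a for a in ages]
# -> [22, 38, 35, 35, 35, 35, 54]
```

This is cheap but crude: it ignores everything else we know about the observation, which is what the correlation- and similarity-based options try to fix.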
Some approaches to missing data discard an entire row if it has an NA in any column. This technique is easy but can lead to biased estimates, since the “missingness” of the data can itself describe the data (more on this in the available-case analysis section). Simply discarding data can also produce estimates with larger standard errors due to the reduced sample size.
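To make the sample-size cost concrete, here is a small complete-case (listwise deletion) sketch in plain Python, with invented rows:

```python
rows = [
    {"age": 22,   "fare": 7.25},
    {"age": None, "fare": 8.05},    # dropped: missing age
    {"age": 35,   "fare": None},    # dropped: missing fare
    {"age": 54,   "fare": 51.86},
]

# Keep only rows where every field is observed.
complete = [r for r in rows if all(v is not None for v in r.values())]
# Sample shrinks from 4 rows to 2, even though each dropped row
# was only missing one of its two fields.
```

Half the information in the dropped rows was perfectly usable, which is exactly the waste the later imputation approaches avoid.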
R automatically excludes any rows where the outcome or inputs are missing in classical regression, but there are two potential problems with this:
- Units with missing values can differ systematically from the complete observed cases, which can bias the complete-case analysis.
- If a model contains many variables, there may be a lack of complete cases. This would cause most of the collected data to be disregarded for the sake of a simple analysis.
Let’s consider a household survey where the researcher presents estimates of response rates in subgroups of the target population such as age, race, gender, and area (urban, rural, etc.). If the response rates are similar across subgroups, the researcher might assert that there is no evidence of nonresponse bias. A flaw presents itself when there are low-response-rate groups and the researcher states the groups are unimportant or attempts to control for this lack of completeness. For this specific example of a household study the researcher is lucky, as they could attempt to fill specific data points from the known distributions in the most recent census data.
How can we systematically fill in missing values? Random forest, for example, cannot handle missing values, so you have to impute the data before running the function. There are many options here; for example, you might fill missing age values with the median of that column. One could get more specific and fill it with a median conditioned on an attribute of that observation, i.e., a more specific median based on the gender, region, and/or ethnicity of the observation.
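The group-specific median idea above can be sketched in a few lines of plain Python. The passenger rows and the `group_median` helper here are invented for illustration, not taken from the competition data:

```python
from statistics import median

passengers = [
    {"sex": "female", "pclass": 1, "fare": 80.0},
    {"sex": "female", "pclass": 1, "fare": 120.0},
    {"sex": "female", "pclass": 1, "fare": None},   # to impute
    {"sex": "male",   "pclass": 3, "fare": 7.75},
]

def group_median(rows, key_fields, target):
    """Median of `target` within each combination of `key_fields`,
    computed from the non-missing rows only."""
    groups = {}
    for r in rows:
        if r[target] is None:
            continue
        key = tuple(r[f] for f in key_fields)
        groups.setdefault(key, []).append(r[target])
    return {k: median(v) for k, v in groups.items()}

medians = group_median(passengers, ("sex", "pclass"), "fare")

# Fill each missing fare with the median of its (sex, pclass) group.
for r in passengers:
    if r["fare"] is None:
        r["fare"] = medians[(r["sex"], r["pclass"])]
# The missing first-class female fare becomes median(80, 120) = 100.0
```

In practice you would also want a fallback (e.g. the overall column median) for groups with no observed values at all.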
Since randomForest does not handle missing data, a user has to address it before building a model. There are a couple of field-specific packages for imputing data in R, so check them out after you’ve prepped your data or feel you’ve maxed out your manual imputation abilities. I’ve used the missForest and imputation packages with success.