There's growing awareness that the data we collect, and in particular the variables we include as factors in our predictive models, can lead to unwanted bias in outcomes, in areas ranging from loan applications to law enforcement. In some instances, such bias is even directly regulated by laws like the Fair Housing Act in the US. But even if we explicitly remove "obvious" variables like sex, age or ethnicity from predictive models, unconscious bias might still be a factor in our predictions as a result of highly correlated proxy variables that remain in the model.
As a result, we need to be aware of the biases in our models and take steps to address them. For an excellent general overview of the topic, I highly recommend watching the recent presentation by Rachel Thomas, "Analyzing and Preventing Bias in ML". And for a practical demonstration of one way you can go about detecting proxy bias in R, take a look at the vignette created by my colleague Paige Bailey for the rOpenSci conference, "Ethical Machine Learning: Spotting and Preventing Proxy Bias".
The vignette details general principles you can follow to identify proxy bias in an analysis, in the context of a case study analyzed using R. The case study considers data and a predictive model that might be used by a bank manager to determine the creditworthiness of a loan applicant. Even though race was not explicitly included in the adaptive boosting model (from the C5.0 package), the predictions are still biased by race.
That's because zipcode, a variable highly associated with race, was included in the model. Read the complete vignette, linked below, to see how Paige modified the model to mitigate that bias while still maintaining its predictive power. All of the associated R code is available in the IPython Notebook.
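To make the idea of a proxy variable concrete, here is a minimal sketch of two checks one might run: how strongly a candidate proxy is associated with a protected attribute (using Cramér's V), and how approval rates differ across groups. This is not the vignette's code, which is written in R; it is a standalone Python illustration, and the function names, the toy data, and the stand-in "model" that approves on zipcode alone are all assumptions made up for this example.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two categorical variables: 0 (no association) to 1 (perfect)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(x_counts), len(y_counts)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


def approval_rate_by_group(groups, approved):
    """Share of approved applicants within each group."""
    totals = Counter(groups)
    positives = Counter(g for g, a in zip(groups, approved) if a)
    return {g: positives[g] / totals[g] for g in totals}


# Hypothetical toy data: race is NOT a model input, but zipcode is.
race = ["A", "A", "A", "A", "B", "B", "B", "B"]
zipcode = ["10001", "10001", "10001", "10002",
           "10002", "10002", "10002", "10001"]
# Stand-in for a model that approves purely on zipcode (an assumption).
approved = [z == "10001" for z in zipcode]

print("Cramér's V (race vs. zipcode):", cramers_v(race, zipcode))
print("Approval rate by race:", approval_rate_by_group(race, approved))
```

Even in this tiny example, the approval rates diverge by race although race never enters the decision rule, because zipcode carries that information. A high association score is a prompt to look more closely at a variable, not proof of bias on its own.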
GitHub (ropenscilabs): Ethical Machine Learning: Spotting and Preventing Proxy Bias (Paige Bailey)