by Nina Zumel
Principal Consultant Win-Vector LLC
We've just finished off a series of articles on some recent research results applying differential privacy to improve machine learning. Some of these results are pretty technical, so we thought it was worth working through concrete examples. And some of the original results are locked behind academic journal paywalls, so we've tried to touch on the highlights of the papers, and to play around with variations of our own.
- A Simpler Explanation of Differential Privacy: Quick explanation of epsilon-differential privacy, and an introduction to an algorithm for safely reusing holdout data, recently published in Science (Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth, “The reusable holdout: Preserving validity in adaptive data analysis”, Science, vol 349, no. 6248, pp. 636-638, August 2015). Note that Cynthia Dwork, one of the inventors of differential privacy, originally used it in the analysis of sensitive information.
- Using differential privacy to reuse training data: Specifically, how differential privacy helps you build efficient encodings of categorical variables with many levels from your training data without introducing undue bias into downstream modeling.
- A simple differentially private procedure: The bootstrap as an alternative to Laplace noise to introduce differential privacy.
The R code includes an example of using vtreat, a package for preparing and cleaning data frames based on level-based feature pruning.