**Win-Vector Blog » R**, and kindly contributed to R-bloggers)

Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It’s back in the news again now, with exciting results from Cynthia Dwork, et. al. (see references at the end of the article) that apply results from differential privacy to machine learning.

In this article we’ll work through the definition of differential privacy and demonstrate how Dwork et.al.’s recent results can be used to improve the model fitting process.

The Voight-Kampff Test: Looking for a difference. Scene from

*Blade Runner*

## The Definition of Differential Privacy

We can define epsilon-differential privacy through the following game:

- A learner implements a summary statistic called A().
- A (notional) adversary proposes two data sets S and S’ that differ by only one row or example, and a test set Q.
- A() is called
*epsilon-differentially private*iff:

| log( Prob[A(S) in Q] / Prob[A(S’) in Q] ) | ≤ epsilon

for all of the adversary’s choices of S, S’ and Q. The probabilities are defined over coin flips in the implementation of A(), not over the data or the adversary’s choices.

The adversary’s goal is to use A() to tell between S and S’, representing a failure of privacy. The learner wants to extract useful statistics from S and S’ without violating privacy. Identifying a unique row (the one which changed markings) violates privacy. If the adversary can tell which set (S or S’) the learner is working on by the value of A(), then privacy is violated.

Notice S and S’ are “data sets” in the machine learning sense (collections of rows carrying information). Q is a set in the mathematic sense: a subset of the possible values that A() can return. To simplify the demonstration, we will take Q to be an interval.

In the following examples, A() returns the (approximate) expected value of a set *s*. The adversary has chosen two sets S and S’ of size *n* = 100:

- S = {0,0,0,…,0} (100 zeros)
- S’ = {1,0,0,…,0} (1 one and 99 zeroes)

The set Q will be an interval [T, 1], where the adversary picks the threshold T.

The adversary’s goal is to pick T such that when he sees that A(*s*) ≥ T, he knows that A() has just evaluated S’. The learner has two (competing) goals:

- To pick an algorithm A() such that A(S) and A(S’) are so “close” that the adversary can’t pick a reliable T, to preserve differential privacy. “Close” is defined by epsilon.
- To have A() be a good estimate of the expectation, for performing useful analysis.

## The Deterministic Case

Here, A(*s*) simply returns the expected value `mean(s)`

, so it always returns A(S) = 0 when evaluating S, and A(S’) = 0.01 when evaluating S’. This is clearly not differentially private for any value of epsilon. If the adversary picks T = 1/200 = 0.005, then he can reliably identify when A() has evaluated the set S’, every time they play a round of the game.

In the figures, *set1* in green represents the distribution of values returned by calls to A(S), and *set2* in orange returns the distribution of values returned by calls to A(S’). I’ve plotted *set2* upside down for clarity.

## Adding Noise to give Privacy

One way for the learner to obscure which set she is evaluating is to add a little random noise to the estimate, to “blur” it. Following Dwork et.al.’s methods, we’ll add noise from a Laplace distribution, which is symmetric and falls off slower than gaussian noise would. Here we show adding just a little bit of noise (of “width” sigma = 1/3n):

The shaded green region represents the chance that A(S) will return a value greater than the adversary’s threshold T — in other words, the probability that the adversary will mistake S for S’ in a round of the game. We’ve now made that probability non-zero, but it’s still much more likely that if A(*s*) > T, then *s* = S’. We need more noise.

In particular, we need the noise to be bigger than the gap A(S’)-A(S) = 1/n, or 0.01. Let’s pick sigma = 3/n = 0.03:

Now we see that the shaded green region has almost the same area as the shaded orange region — you can think of epsilon as expressing the difference between the shaded green region and the shaded orange region. In fact, the absolute value of the logratio of the two areas is epsilon. In other words, observing that A(*s*) > T is no longer strong evidence that *s* = S’, and we have achieved differential privacy, to the degree epsilon.

Of course, A(*s*) is also no longer a precise estimate of `mean(s)`

. We’ll return to that point in a bit.

### Simulating the Game

We can simulate the game I described above in R. I’ve put the code on github, here.

The code sweeps through a range of values of epsilon, and for each value selects a suitable noise width and runs 1000 rounds of the differential privacy game, returning a value for A(S) and A(S’). Each round we record whether A(S) and A(S’) are greater than T = 1/200.

## From Differential Privacy to Machine Learning

Differential privacy aims to make the answers to “snooping queries” too vague to distinguish closely related sets (in this case, it makes the probability that A(S) ≥ T about the same as the probability that A(S’) ≥ T). But for machine learning, we are also interested in the output of A(). We can use the simulation above to estimate what happens to A(S) and A(S’) for different values of epsilon. Here we plot the (normalized) gap between the expected values of A(S) and A(S’) as a function of epsilon, from the simulation:

As epsilon gets smaller (implying stricter privacy), the relative gap between the expected values of A(S) and A(S’) gets smaller. The discrepancies in the plot are probably due to poor choices of the noise parameter (I picked them heuristically), but the trend is clear. This makes sense, since privacy implies that the A(S) should look a lot like A(S’).

However, as epsilon gets stricter, the estimates of `mean(S) = 0`

and `mean(S') = 0.01`

— which is what A() is supposed to estimate — also become poorer.

This is the trade-off: how can we preserve privacy and still perform useful analysis?

### Differential Privacy and Adaptive Data Analysis

The recent papers by Dwork, et.al. apply differential privacy to the problem of adaptive data analysis, or reusable test sets. In standard machine learning practice, we use two data sets to model: a training set to fit the model, and a test set to evaluate the model. Ideally we only use the test set once; but in practice we go back to it again and again, as we try to improve our model. We already know that performance estimates of a model over its training set are upwardly biased (the model looks more accurate on training than it really is, because it “knows” the training set); if we go back to the test set too many times, then performance estimates of the model on the test set are *also* upwardly biased, because we also “know” the test set. This problem is exacerbated when we also use the test set to tune the model: for example to pick the variables we use, or tune modeling parameters. In other words, even if we observe the best practice of using a training set to fit models, a calibration set to tune models, and a holdout set to evaluate the model, we can also contaminate the calibration set if we look at it too many times.

Dwork, et.al., describe how (and how many times) we can go back to a test set without biasing its evaluations of model performance. To do this, we make sure that the modeling procedure only interacts with the test set in a differentially private manner. Because it has limited access to the test set, the modeling procedure can’t learn the test set, so evaluations of a model over the test set still accurately reflect a model’s performance on unseen data.

Describing the proofs and techniques in these papers is outside the scope of this article, but we can demonstrate the results in a simple example, similar to the example application shown in Dwork, et.al.’s *Science* paper.

### Using Differential Privacy for Stepwise Regression

For our example, suppose we have 2000 rows of data with a binary outcome *y* (50% prevalence of the positive class), and 110 possible input variables to choose from. In our tests we used synthetic data with ten variables (`x1...x10`

) that carry signal and one hundred noise variables (`n1...n100`

).

We decide to use forward-stepwise logistic regression to choose a suitable model for predicting *y*. We split the sample data into 1000 rows for the training set, and 1000 rows for the test set. Given that we have selected k-1 variables and have fit a model, *M(k-1)*, that uses those variables, we pick the kth variable by fitting a model *Mnew* on the training set using the (k-1) previously selected variables plus one new variable from the remaining candidates. We then evaluate the accuracy of *M(k-1)* and *Mnew* on the test set, and pick as the kth variable the one for which *Mnew* showed the most improvement. In other words, we fit candidate models on the training set, and evaluate them for improvement on the test set. This is a fairly common way of tuning models. Standard implementations of stepwise regression often use training set AIC as the criterion for picking the next model; we use accuracy improvement to keep the code straightforward, and closer to the assumptions in the paper.

Normally, the stepwise procedure would terminate when the model stops improving, but for this example we let it run to pick fifty variables. At each step we recorded the accuracy of the selected model on the training set (green circles), the test set (orange triangles) and on a fresh holdout set of 10,000 rows (purple squares). We picked this additional large holdout set (called “fresh”) to get a good estimate of the true accuracy of the model. Here’s the performance of the naive procedure:

The test set performance estimates were are more upwardly biased than the training set estimates! This is not too surprising, since we are using the test set to select variables. Despite the optimistic curve of the test scores, we can see from the true out-of-sample performance (`freshScore`

) that the model performance is not much better than random; in fact, the algorithm only picked one of the ten signal carrying variables (the first one selected). The test set (and to a lesser extent, the training set) believe that performance continues to improve, when in fact performance degrades as more noise variables swamp out the single signal variable.

We want to use the test set to pick variables, while also using it to accurately estimate out-of-sample error. In their recent *Science* article, Dwork, et.al. propose an algorithm called *Thresholdout* to let us safely reuse the test set. We implemented our `diffPrivMethod`

based on *Thresholdout*. `diffPrivMethod`

evaluates proposed models on both the training and test sets. If the model’s accuracies on the test and the training sets are within a certain tolerance, then `diffPrivMethod`

returns the training set accuracy with a bit of Laplacian noise as the model score. If the two accuracies are far apart, then `diffPrivMethod`

adds a bit of Laplacian noise to the test set accuracy, and returns that as the model score. In R, the code would look something like this:

# Return an estimate of model performance in a # differentially private way # v is the new candidate variable # varSet is the set of already selected variables diffPrivMethod = function(v) { # Original model was fit on # training data using varSet; # Fit a new model using varSet+v # using the training data # and evaluate the difference # between the new models on the test set testScore <- modelScoreImprovement(varSet,v, trainData, testData) # Do the same thing, but evaluate # the improvement on the training set trainScore <- modelScoreImprovement(varSet,v, trainData, trainData) if(abs(testScore-trainScore)<=eps/2) { # if the scores are close, # return trainScore with a little noise sc <- trainScore + rlaplace(1,eps/2) } else { # otherwise, return testScore # with a little more noise sc <- testScore + rlaplace(1,eps) } # make sure the score is in range [0,1] pmax(0,pmin(1,sc)) }

We picked the tolerance and the widths of the Laplace noise somewhat arbitrarily.

The beauty of this scheme is that we never look directly at the test set. If the test and training set performances are similar, we see the (noisy) training performance; when we do look at the test set, we only see a noisy view of it, so it leaks information more slowly.

The original *Thresholdout* algorithm does not add noise to the training set performance, probably under the reasoning that we know it already. We decided to add noise anyway, on the principle that even knowing *when* you are using the test set performance leaks a little information, and we wanted to reduce the leakage a little more.

Here are the results of the stepwise regression using `diffPrivMethod`

(we set the tolerance/noise parameter eps=0.04 for this run):

Now the test set estimates the true out-of-sample accuracies quite well. The chosen models achieve a peak accuracy of about 61%, at around 36 variables. This method did a better job of picking variables — it found all ten signal variables — though it did start picking noise variables fairly early on.

However, there is still money left on the table. What if we used a really large test set? Here, we show the results of running the naive stepwise regression (no differential privacy), with the same training set of 1000 rows and a test set of 10,000 rows.

Now the learning algorithm picks nine of the signal variables immediately, and achieves an accuracy of 62%. We can see that the model’s performance on the test set is still upwardly biased, but much less so than with the naive method. Because the test set is larger, we don’t contaminate it as much, even when looking at it directly.

This suggests that we can think of differential privacy as letting us simulate having a larger test set, though obviously we still need our actual test set to be a reasonable size, especially since stepwise regression requires a lot of queries to the test set. Our results also show that while these new differential-privacy based schemes let us do a good job of estimating out-of-sample performance, that alone is not enough to insure best possible model performance.

All code, data, and examples for this experiment can be found here.

## References

Dwork, Cynthia, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth. “Preserving Statistical Validity in Adaptive Data Analysis”, arXiv:1411.2664, April 2015.

Dwork, Cynthia, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth. “The reusable holdout: Preserving validity in adaptive data analysis”, *Science*, vol 349, no 6248 pp 636-638, August 2015. [link to abstract]

Blum, Avrim and Moritz Hardt. “The Ladder: A Reliable Leaderboard for Machine Learning Competitions”, arXiv:1502.04585, February 2015.

**leave a comment**for the author, please follow the link and comment on their blog:

**Win-Vector Blog » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...