I take my title here from the “too clever by half” paper, “What’s Not What with Statistics” of many years ago. Or I just as appropriately could have borrowed from the title of the old Leo Breiman classic, “Statistical Modeling: The Two Cultures,” comparing the statistics and computer science (CS) communities.
Differential privacy (DP) is an approach to maintaining the privacy of individual records in a database, while still allowing statistical analysis. It is now perceived as the go-to method in the data privacy area, enjoying adoption by the US Census Bureau and several major firms in industry, as well as a highly visible media presence. DP has developed a vast research literature. On the other hand, it is also the subject of controversy, and now, of lawsuits.
Here is a summary of this essay:
- DP overpromises, presenting itself as a “solution to the data privacy problem, which provides a quantified guarantee of privacy.”
- On the contrary, DP fails to deliver on the guarantee for a large class of queries on quantities typically arising in business, industry, medicine, education and so on. The promise is illusory, with the method often producing biased results and inaccurate guarantees.
- The problems are in large part due to DP having been developed by CS researchers, rather than by statisticians or other get-your-hands-dirty data analysis professionals.
Some preparatory remarks:
I’ve been doing research in the data privacy area off and on for many years, e.g. IEEE Symposium on Security and Privacy; ACM Trans. on Database Systems; several book chapters; and current work in progress, arXiv. I was an appointed member of the IFIP Working Group 11.3 on Database Security in the 1990s. The ACM TODS paper was funded in part by the Census Bureau.
I will take as a running example one that was popular in the classic privacy literature. Say we have an employee database, and an intruder knows there is just one female electrical engineer. The intruder may then submit a query for the mean salary of all female EEs, and thus illicitly obtain this worker’s salary.
Notation: the database consists of n records on p variables.
The R package diffpriv makes standard DP methods easy to use, and is recommended for any reader who wishes to investigate these issues further.
What is DP?
Technically DP is just a criterion, not a method, but the term generally is taken to mean methods whose derivation is motivated by that criterion.
DP is actually based on a very old and widely-used approach to data privacy, random noise perturbation. It’s quite simple. Say we have a database that includes a salary variable, which is considered confidential. We add random, mean-0 noise to hide a person’s real salary from intruders.
The motivation is that, since the added noise has mean 0, researchers doing legitimate statistical analysis can still do their work. They work with averages, and the average salary in the database in noisy form should be pretty close to that of the original data, with the noise mostly “canceling out.” (We will see below, though, that this view is overly simple.)
In DP methods, the noise is typically added to the final statistic, e.g. to a mean of interest, rather than directly to the variables.
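A minimal sketch of this output-perturbation idea, using the standard Laplace mechanism (the function name, dataset and parameter values here are invented for illustration, not taken from any particular library):

```python
import numpy as np

def dp_mean(values, lo, hi, epsilon, rng):
    """Laplace mechanism applied to the final statistic: clip each value
    to [lo, hi], so one record can change the mean by at most (hi - lo)/n,
    then add Laplace noise with scale sensitivity/epsilon."""
    clipped = np.clip(values, lo, hi)
    sensitivity = (hi - lo) / len(values)
    return clipped.mean() + rng.laplace(scale=sensitivity / epsilon)

rng = np.random.default_rng(42)
salaries = rng.uniform(40_000, 120_000, size=1_000)
release = dp_mean(salaries, 0, 200_000, epsilon=0.5, rng=rng)
```

Note that the noise is added once, to the released statistic, rather than to the individual salary records.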
One issue is whether to add a different noise value each time a query arrives at the data server, vs. adding noise just once and then making that perturbed data open to public use. DP methods tend to do the former, while classically the latter approach is used.
A related problem is that a DP version needs to be developed for every statistical method. If a user wants, say, to perform quantile regression, she must check whether a DP version has been developed and code made available for it. With classical privacy methods, once the dataset has been perturbed, users can apply any statistical method they wish. I liken it to an amusement park. Classical methods give one a “day pass” which allows one to enjoy any ride; DP requires a separate ticket for each ride.
With any data privacy method, DP or classical, there is no perfect solution. One can only choose a “dial setting” in a range of tradeoffs. The latter come in two main types:
- There is a tradeoff between protecting individual privacy on the one hand, and preserving the statistical accuracy for researchers. The larger the variance of added noise, the greater the privacy but the larger the standard errors in statistical quantities computed from the perturbed data.
- Equally important, though rarely mentioned, there is the problem of attenuation of relationships between the variables. This is the core of most types of data analysis, finding and quantifying relationships; yet the more noise we add to the data, the weaker the reported relationships will be. This problem arises in classical noise addition, and occurs in some DP methods, such as ones that add noise to counts in contingency tables. So here we have not only a variance problem but also a bias problem; the absolute values of correlations, regression coefficients and so on are biased downward. A partial solution is to set the noise correlation structure equal to that of the data, but that doesn’t apply to categorical variables (where the noise addition approach doesn’t make much sense anyway).
Other classical statistical disclosure control methods:
Two other major data privacy methods should be mentioned here.
- Cell suppression: Any query whose conditions are satisfied by just one record in the database is disallowed. In the example of the female EE above, for instance, that intruder’s query simply would not be answered. One problem with this approach is that it is vulnerable to set-differencing attacks. The intruder could query the total salaries of all EEs, then query the male EEs, and then subtract to illicitly obtain the female EE’s salary. Elaborate methods have been developed to counter such attacks.
- Data swapping: For a certain subset of the data — either randomly chosen, or chosen according to a record’s vulnerability to attack — some of the data for one record is swapped with that of a similar record. In the female EE example, we might swap occupation or salaries, say.
Note that neither of these methods avoids the problem of privacy/accuracy tradeoffs. In cell suppression, the more suppression we impose, the greater the problems of variance and bias in stat analyses. Data swapping essentially adds noise, again causing variance and bias.
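The set-differencing attack on cell suppression described above takes only a few lines to carry out (a toy table; all names and numbers are invented):

```python
# Hypothetical employee table: (sex, occupation, salary); exactly one female EE.
employees = [
    ("M", "EE", 95_000),
    ("M", "EE", 88_000),
    ("F", "EE", 102_000),   # the lone female EE
    ("M", "ACCT", 70_000),
]

def total_salary(rows, pred):
    """Answer an aggregate query: total salary over records matching pred."""
    return sum(s for (sex, occ, s) in rows if pred(sex, occ))

# Neither query conditions on a single record, so naive cell suppression
# allows both -- yet their difference reveals one individual's salary.
all_ees = total_salary(employees, lambda sex, occ: occ == "EE")
male_ees = total_salary(employees, lambda sex, occ: occ == "EE" and sex == "M")
leaked = all_ees - male_ees   # the female EE's exact salary
```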
The DP privacy criterion:
Since DP adds random noise, the DP criterion is couched in probabilistic terms. Consider two datasets, D and D’, with the same variables and the same number of records n, but differing in 1 record. Consider a given query Q. Denote the responses by Q(D) and Q(D’). Then the DP criterion is, for any set S in the image of Q,
P(Q(D) in S) ≤ exp(ε) P(Q(D′) in S)
for all possible (D,D’) pairs and for a small tuning parameter ε. The smaller ε, the greater the privacy.
Note that the definition involves all possible (D,D’) pairs; D here is NOT just the real database at hand (though there is a concept of local sensitivity in which D is indeed our actual database). On the other hand, in processing a query, we ARE using the database at hand, and we compute the noise level for the query based on n for this D.
DP-compliant methods have been developed for various statistical quantities, producing formulas for the noise variance as a function of ε and an anticipated upper bound on |Q(D) – Q(D′)|. Again, that upper bound must apply to all possible (D,D′) pairs. For human height, for instance, we know that no one is taller than, say, 300 cm; for a mean, that bound is divided by n. It’s a rather sloppy bound, but it would work.
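Under the standard Laplace mechanism, that works out to a noise scale of (bound/n)/ε for a mean. A small sketch of the height example (the function name and numbers are mine):

```python
def laplace_scale(value_bound, n, epsilon):
    """Noise scale under the Laplace mechanism for a mean: the worst-case
    change from altering one record is value_bound / n, and the noise
    scale is that sensitivity divided by epsilon."""
    return (value_bound / n) / epsilon

# Height example, with the loose worst-case bound of 300 cm:
b_small = laplace_scale(300, n=10, epsilon=1.0)       # 30.0  -> lots of noise
b_big_n = laplace_scale(300, n=100_000, epsilon=1.0)  # 0.003 -> tiny noise
```

Note how quickly the noise scale shrinks as n grows; this is central to the conditional-query problem discussed next.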
Problems with DP’s claims of quantifiable guaranteed privacy:
(Many flaws have been claimed for DP, but to my knowledge, this analysis is new.)
Again consider the female EE example. One problem that arises right away is that, since this is a conditional mean, Q(D) and/or Q(D′) will often be undefined.
Say we try to address that issue by expressing Q() as an unconditional mean, E(YA), divided by an unconditional probability, E(A), where Y is salary and A is the indicator variable for the condition. At the data level, these are both averages, and we take Q() to mean querying the two averages separately. We then run into a more formidable problem, as follows.
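At the data level, this decomposition amounts to a ratio of two sample averages, and the ratio recovers the lone record exactly. A quick check on synthetic data (all values invented):

```python
import numpy as np

n = 10_000
salary = np.full(n, 90_000.0)    # everyone else's salary (value irrelevant here)
is_female_ee = np.zeros(n)       # indicator variable A
salary[0] = 102_000.0            # the lone female EE
is_female_ee[0] = 1.0

num = np.mean(salary * is_female_ee)  # sample analog of E(YA)
den = np.mean(is_female_ee)           # sample analog of E(A)
recovered = num / den                 # the conditional mean: her exact salary
```

Both averages are unconditional, so they are always defined; the question is then how much noise each receives.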
For even moderately large n, the amount of noise added to each of these averages will be small — not what we want at all. If there is only 1 female EE, we want the amount of noise added to her salary to be large.
Another approach would be to consider all possible pairs of databases D and D’, each consisting of female EEs. Remember, in the definition of DP, we are anticipating all potential (D,D’) pairs. Then, in computing noise level, we would have n = 1, so the noise level would be appropriately large.
But that won’t work. We would still have the problem described for the cell suppression method above: An intruder could employ set-differencing, say querying total salaries for all EE workers, then total for all male EEs — and here the added noise would definitely be small (after dividing by n). Then the intruder wins.
The fact that the above analysis involves just a single record is irrelevant. A similar analysis can be made for any conditional mean or probability. Again consider the employee database. Say only 10% of the workers are women. Then the same reasoning shows that the noise added to a query involving women will either be too small or vulnerable to a set-differencing attack.
So, the much-vaunted claims of DP advocates that DP gives “quantifiable guarantees of privacy” are simply not true. Yes, they are fine for univariate statistics, but they fall apart in conditional cases. And conditional cases are the bread and butter of virtually any statistical study.
As noted, DP methods that work on contingency tables by adding independent noise values to cell counts can attenuate correlation and thus produce bias. The bias will be substantial for small tables.
Another issue, also in DP contingency table settings, is that bias may occur from post-processing. If Laplacian noise is added to counts, some counts may be negative. As shown in Zhu et al, post-processing to achieve nonnegativity can result in bias.
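The post-processing bias is easy to reproduce by simulation (toy numbers; the cell count and noise scale here are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(7)
true_count = 2                   # a small contingency-table cell
trials = 200_000
noisy = true_count + rng.laplace(scale=2.0, size=trials)
clamped = np.maximum(noisy, 0.0)  # naive post-processing: force nonnegativity

bias_raw = noisy.mean() - true_count        # near 0: Laplace noise is unbiased
bias_clamped = clamped.mean() - true_count  # positive: clamping shifts mass up
```

Clamping discards the negative tail but keeps the positive one, so small counts are systematically inflated.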
The US Census Bureau’s adoption of DP:
The Census Bureau’s DP methodology replaces the swapping-based approach used in past census reports. Though my goal in this essay has been mainly to discuss DP in general, I will make a few comments.
First, what does the Bureau intend to do? They will take a 2-phase approach. They view the database as one extremely large contingency table (“histogram,” in their terminology). Then they add noise to the cell counts. Next, they modify the cell counts to satisfy nonnegativity and certain other constraints, e.g. taking the total number of residents in a census block to be invariant. The final perturbed histogram is released to the public.
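A toy sketch of this noise-then-adjust pipeline follows. This is emphatically not the Bureau's actual algorithm; a crude clamp-and-rescale step stands in here for their constrained optimization, just to illustrate the two phases:

```python
import numpy as np

rng = np.random.default_rng(3)
counts = np.array([40, 3, 0, 7, 150], dtype=float)  # toy "histogram" of cells
total = counts.sum()                                # held invariant, as with
                                                    # block-level populations

# Phase 1: perturb the cell counts.
noisy = counts + rng.laplace(scale=2.0, size=counts.size)

# Phase 2: post-process -- clamp to nonnegative, then rescale so the
# invariant total is preserved.
adjusted = np.maximum(noisy, 0.0)
adjusted *= total / adjusted.sum()
```

The released histogram satisfies the constraints, but every cell has been altered, which is why the result is called synthetic data.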
Why are they doing this? The Bureau’s simulations indicate that, with very large computing resources and possibly external data, an intruder could reconstruct much of the original, unperturbed data.
The Bureau concedes that the product is synthetic data. Isn’t any perturbed data synthetic? Yes, but here ALL of the data is perturbed, as opposed to swapping, where only a small fraction of the data changes.
Needless to say, then, use of synthetic data has many researchers up in arms. They don’t trust it, and have offered examples of undesirable outcomes, substantial distortions that could badly affect research work in business, industry and science. There is also concern that there will be serious impacts on next year’s congressional redistricting, which strongly relies on census data, though one analysis is more optimistic.
There has already been one lawsuit against the Bureau’s use of DP. Expect a flurry more, after the Bureau releases its data — and after redistricting is done based on that data. So it once again boils down to the privacy/accuracy tradeoff. Critics say the Bureau’s reconstruction scenarios are unlikely and overblown. Again, add to that the illusory nature of DP’s privacy guarantees, and the problem gets even worse.
How did we get here? As seen above, DP has some very serious flaws. Yet it has largely become entrenched in the data privacy field. In addition to being chosen as the basis of the census data, it is used somewhat in industry. Apple, for instance, uses classic noise addition, applied to raw data, but with a DP privacy budget.
As noted, early DP development was done mainly by CS researchers. CS people view the world in terms of algorithms, so that for example they feel very comfortable with investigating the data reconstruction problem and applying mathematical optimization techniques. But the CS people tend to have poor insight into what problems the users of statistical databases pursue in their day-to-day data analysis activities.
Some mathematical statisticians entered the picture later, but by then DP had acquired great momentum, and some intriguing theoretical problems had arisen for the math stat people to work on. Major practical issues, such as that of conditional quantities, were overlooked.
In other words, in my view, the statistician input into the development of DP came too little, too late. Also, to be frank, the DP community has not always been open to criticism, such as the skeptical material in Bambauer, Muralidhar and Sarathy.
Statistical disclosure control is arguably one of the most important data science issues we are facing today. Bambauer et al in the above link sum up the situation quite well:
The legal community has been misled into thinking that differential privacy can offer the benefits of data research without sacrificing privacy. In fact, differential privacy will usually produce either very wrong research results or very useless privacy protections. Policymakers and data stewards will have to rely on a mix of approaches: perhaps differential privacy where it is well-suited to the task, and other disclosure prevention techniques in the great majority of situations where it isn’t.
A careful reassessment of the situation is urgently needed.