Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Differential privacy (DP) is an approach to maintaining the privacy of individual records in a database, while still allowing statistical analysis. It is now perceived as the go-to method in the data privacy area, enjoying adoption by the US Census Bureau and several major firms in industry, as well as a highly visible media presence. DP has developed a vast research literature. On the other hand, it is also the subject of controversy, and now, of lawsuits.
Some preparatory remarks:
I’ve been doing research in the data privacy area off and on for many years, e.g. IEEE Symposium on Security and Privacy; ACM Trans. on Database Systems; several book chapters; and current work in progress, arXiv. I was an appointed member of the IFIP Working Group 11.3 on Database Security in the 1990s. The ACM TODS paper was funded in part by the Census Bureau.
Notation: the database consists of n records on p variables.
The R package diffpriv makes use of standard DP methods easy to implement, and is recommended for any reader who wishes to investigate these issues further.
Here is a classical example of the privacy issue. Say we have an employee database, and an intruder knows that there is just one female electrical engineer in the form. The intruder submits a query on the mean salary of all female electrical engineers, and thus illicitly obtains her salary.
What is DP?
Technically DP is just a criterion, not a method, but the term generally is taken to mean methods whose derivation is motivated by that criterion.
DP is actually based on a very old and widely-used approach to data privacy, random noise perturbation. It’s quite simple. Say we have a database that includes a salary variable, which is considered confidential. We add random, mean-0 noise, to hide a person’s real income from intruders. DP differs from other noise-based methods, though, in that it claims a quantitative measure of privacy.
Why add noise?
The motivation is that, since the added noise has mean 0, researchers doing legitimate statistical analysis can still do their work. They work with averages, and the average salary in the database in noisy form should be pretty close to that of the original data, with the noise mostly “canceling out.” (We will see below, though, that this view is overly simple, whether in DP or classical noise-based methods.)
In DP methods, the noise is typically added to the final statistic, e.g. to a mean of interest, rather than directly to the variables.
One issue is whether to add a different noise value each time a query arrives at the data server, vs. adding noise just once and then making that perturbed data open to public use. DP methods tend to do the former, while classically the latter approach is used.
Utility:
A related problem is that for most DP settings, a separate DP version needs to be developed for every statistical method. If a user wants, say, to perform quantile regression, she must check whether a DP version has been developed and code made available for it. With classical privacy methods, once the dataset has been perturbed, users can apply any statistical method they wish. I liken it to an amusement park. Classical methods give one a “day pass” which allows one to enjoy any ride; DP requires a separate ticket for each ride.
Privacy/Accuracy Tradeoffs:
With any data privacy method, DP or classical, there is no perfect solution. One can only choose a “dial setting” in a range of tradeoffs. The latter come in two main types:
- There is a tradeoff between protecting individual privacy on the one hand, and preserving the statistical accuracy for researchers. The larger the variance of added noise, the greater the privacy but the larger the standard errors in statistical quantities computed from the perturbed data.
- Equally important, though rarely mentioned, there is the problem of attenuation of relationships between the variables. This is the core of most types of data analysis, finding and quantifying relationships; yet the more noise we add to the data, the weaker the reported relationships will be. This problem arises in classical noise addition, and occurs in some DP methods, such as ones that add noise to counts in contingency tables. So here we have not only a variance problem but also a bias problem; the absolute values of correlations, regression coefficients and so on are biased downward. A partial solution is to set the noise correlation structure equal to that of the data, but that doesn’t apply to categorical variables (where the noise addition approach doesn’t make much sense anyway).
Other classical statistical disclosure control methods:
Two other major data privacy methods should be mentioned here.
- Cell suppression: Any query whose conditions are satisfied by just one record in the database is disallowed. In the example of the female EE above, for instance, that intruder’s query simply would not be answered. One problem with this approach is that it is vulnerable to set-differencing attacks. The intruder could query the total salaries of all EEs, then query the male EEs, and then subtract to illicitly obtain the female EE’s salary. Elaborate methods have been developed to counter such attacks.
- Data swapping: For a certain subset of the data — either randomly chosen, or chosen according to a record’s vulnerability to attack — some of the data for one record is swapped with that of a similar record. In the female EE example, we might swap occupation or salaries, say.
Note that neither of these methods avoids the problem of privacy/accuracy tradeoffs. In cell suppression, the more suppression we impose, the greater the problems of variance and bias in stat analyses. Data swapping essentially adds noise, again causing variance and bias.
The DP privacy criterion:
Since DP adds random noise, the DP criterion is couched in probabilistic terms. Consider two datasets, D and D’, with the same variables and the same number of records n, but differing in 1 record. Consider a given query Q. Denote the responses by Q(D) and Q(D’). Then the DP criterion is, for any set S in the image of Q,
P(Q(D) in S) < P(Q(D’) in S) exp(ε)
for all possible (D,D’) pairs and for a small tuning parameter ε. The smaller ε, the greater the privacy.
Note that the definition involves all possible (D,D’) pairs; D here is NOT just the real database at hand (though there is a concept of local sensitivity in which D is indeed our actual database). On the other hand, in processing a query, we ARE using the database at hand, and we compute the noise level for the query based on n for this D.
DP-compliant methods have been developed for various statistical quantities, producing formulas for the noise variance as a function of ε and an anticipated upper bound on Δ = |Q(D) – Q(D’)|. Again, that upper bound must apply to all possible (D,D’) pairs. For human height, say, we know that no one will have height, say, 300 cm, which could take for Δ If our query Q() is for the mean height, we divide Δ by n; it’s a rather sloppy bound, but it would work.
Problems with DP’s claims of quantifiable guaranteed privacy:
(Many flaws have been claimed for DP, but to my knowledge, this analysis is new.)
Again consider the female EE example. One problem that arises right away is that, since this is a conditional mean, Q(D) and/or Q(D’) will be often be undefined.
It’s not clear that existing implementations of DP are prepared to handle this. E.g. consider the diffpriv package. It appears to do nothing to deal with this problem. The Google Differential Privacy Library uses SQL, we have trouble:
If database access is via SQL, the problem would result in returning NULL if the conditioning set is empty for some D or D’. Since such queries are central to data analysis, this would be a serious problem.
It’s not clear that the Census Bureau’s TopDown algorithm handles the problem either. They treat the data as a giant contingency table, adding noise to each cell. All queries are then processed as functions of the cell counts, with no further noise added.
A major issue appears to be that many cells that had non-0 counts originally will now be 0s in the noisy version of the data. The added random noise will produce negative numbers in many cases, and though the Bureau’s procedure changes those to at least 0, many will stay at 0. One then has the “0 denominator” issue described above.
Another issue is queries for totals, say the total salaries for all female workers. The noise that would be added, say based on maximum salary, would be the same regardless of whether the there are 10 women in the firm or 1000. While DP would give mathematically correct noise here, having the noise be the same regardless of the overall size of the total seems unwarranted.
Bias problems:
As noted, DP methods that work on contingency tables by adding independent noise values to cell counts can attenuate correlation and thus produce bias. The bias will be substantial for small tables.
Another issue, also in DP contingency table settings, is that bias may occur from post-processing. If Laplacian noise is added to counts, some counts may be negative. As shown in Zhu et al, post-processing to achieve nonnegativity can result in bias.
The US Census Bureau’s adoption of DP:
The Census Bureau’s DP methodology replaces the swapping-based approach used in past census reports. Though my goal in this essay has been mainly to discuss DP in general, I will make a few comments.
First, what does the Bureau intend to do? They will take a 2-phase approach. They view the database as one extremely large contingency table (“histogram,” in their terminology). Then they add noise to the cell counts. Next, they modify the cell counts to satisfy nonnegativity and certain other constraints, e.g. taking the total number of residents in a census block to be invariant. The final perturbed histogram is released to the public.
Why are they doing this? The Bureau’s simulations indicate that, with very large computing resources and possibly external data, an intruder could reconstruct much of the original, unperturbed data.
Criticisms of the Census Bureau’s DP Plan:
In addition to the general concerns about DP, there are also ones specific to the Bureau’s DP methods.
The Bureau concedes that the product is synthetic data. Well, isn’t any perturbed data synthetic? Yes, but here ALL of the data is perturbed, as opposed to swapping, where only a small fraction of the data changes.
Needless to say, then, use of synthetic data has many researchers up in arms. They don’t trust it, and have offered examples of undesirable outcomes, substantial distortions that could badly effect research work in business, industry and science. There is also concern that there will be serious impacts on next year’s congressional redistricting, which strongly relies on census data, though one analysis is more optimistic.
There has already been one lawsuit against the Bureau’s use of DP. Expect a flurry more, after the Bureau releases its data — and after redistricting is done based on that data. So it once again boils down to the privacy/accuracy tradeoff. Critics say the Bureau’s reconstruction scenarios are unlikely and overblown. Again, add to that the illusory nature of DP’s privacy guarantees, and the problem gets even worse.
Final comments:
How did we get here? DP has some serious flaws. Yet it has largely become entrenched in the data privacy field. In addition to being chosen as the basis of the census data, it is used somewhat in industry. Apple, for instance, uses classic noise addition, applied to raw data, but with a DP privacy budget.
As noted, early DP development was done mainly by CS researchers. CS people view the world in terms of algorithms, so that for example they feel very comfortable with investigating the data reconstruction problem and applying mathematical optimization techniques. But the CS people tend to have poor insight into what problems the users of statistical databases pursue in their day-to-day data analysis activities.
Some mathematical statisticians entered the picture later, but by then DP had acquired great momentum, and some intriguing theoretical problems had arisen for the math stat people to work on. Major practical issues, such as that of conditional quantities, were overlooked.
In any case, the major players in DP continue to be in CS. For example, all 10 of the signatories in a document defending the Census Bureau DP plan are in CS.
In other words, in my view, the statistician input into the development of DP came too little, too late. Also, to be frank, the DP community has not always been open to criticism, such as the skeptical material in Bambauer, Muralidhar and Sarathy.
Statistical disclosure control is arguably one of the most important data science issues we are facing today. Bambauer et al in the above link sum up the situation quite well:
The legal community has been misled into thinking that differential privacy can offer the benefits of data research without sacrificing privacy. In fact, differential privacy will usually produce either very wrong research results or very useless privacy protections. Policymakers and data stewards will have to rely on a mix of approaches: perhaps differential privacy where it is well-suited to the task, and other disclosure prevention techniques in the great majority of situations where it isn’t.
A careful reassessment of the situation is urgently needed.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.