The Statistical Disclosure Limitation (SDL) problem involves modifying a data set in such a manner that statistical analysis on the modified data is reasonably close to that performed on the original data, while preserving the privacy of individuals in the data set. For instance, we might have a medical data set on which we want to allow researchers to do their statistical analyses but not violate the privacy of the patients in the study.
In this posting, I’ll briefly explain what SDL is, and then describe a new method that Pat Tendick and I are proposing. Our paper is available as arXiv:1510.04406, and R code to implement the method is available on GitHub. See the paper for details.
SDL is a very difficult problem, one that arguably has not been fully solved, in spite of decades of work by some really sharp people. Some of the common methods are: adding mean-0 noise to each variable; finding pairs of similar records and then swapping their values of the sensitive variables; and (in the case in which all variables are categorical) suppressing cells that contain just one or a few cases.
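To make the last of these methods concrete, here is a minimal sketch of cell suppression on purely categorical data. The data are synthetic, and the variable names and the threshold `min_count` are my own assumptions for illustration, not taken from any particular package.

```r
# Synthetic, purely categorical data (all values are made up).
set.seed(5)
n <- 300
d <- data.frame(
  sex    = sample(c("F", "M"), n, replace = TRUE, prob = c(0.3, 0.7)),
  degree = sample(c("BS", "MS", "PhD"), n, replace = TRUE,
                  prob = c(0.6, 0.35, 0.05)),
  agegrp = sample(c("<31", "31-50", ">50"), n, replace = TRUE,
                  prob = c(0.2, 0.6, 0.2))
)

counts <- table(d)   # 3-way contingency table of cell counts

# min_count is a hypothetical threshold: nonempty cells with fewer
# cases than this are withheld from the released table.
min_count <- 3
released <- counts
released[counts > 0 & counts < min_count] <- NA

sum(is.na(released))   # number of suppressed cells
released               # the table as it would be published
```

This shows only the primary suppression step. If marginal totals are also published, a suppressed cell can sometimes be recovered by subtraction, so a real implementation would need further (complementary) suppression as well.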
As an example of the noise addition method, consider a patient data set that includes the variables Age and Income. Suppose a nefarious user of the data happens to have external knowledge that James is the oldest patient in the study. The Bad Guy can then issue a query asking for the income of the oldest patient (not mentioning James), thus revealing James’s income. But if the public version of the data has had noise added, James’s listed income will not be his real one, and he may well not be the oldest listed patient anymore anyway.
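The oldest-patient scenario can be sketched in a few lines of R. Everything here is synthetic, and `add_noise` and its `noise_frac` argument are hypothetical names I made up for the illustration.

```r
# Toy patient data; all values are synthetic.
set.seed(1)
n <- 200
patients <- data.frame(
  age    = round(runif(n, 20, 80), 1),
  income = round(rlnorm(n, meanlog = 11, sdlog = 0.5))
)

# Add independent mean-0 noise to each variable.
# noise_frac (hypothetical) sets the noise sd as a fraction of each
# variable's own sd.
add_noise <- function(d, noise_frac = 0.3) {
  as.data.frame(lapply(d, function(x) {
    x + rnorm(length(x), mean = 0, sd = noise_frac * sd(x))
  }))
}
released <- add_noise(patients)

# The "oldest patient" query, run on each version of the data.
oldest_true     <- which.max(patients$age)
oldest_released <- which.max(released$age)

patients$income[oldest_true]       # true income of the oldest patient
released$income[oldest_released]   # what the query returns on the released data
```

The record that the query picks out in the released data need not be the truly oldest patient, and even if it is, the income shown has been perturbed.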
Given the importance of this topic (JSM 2015 had 3 separate sessions devoted to it), it is surprising that rather little public-domain software is available. The only R package I know of is sdcMicro on CRAN (which by the way includes an excellent vignette from which you can learn a lot about SDL). NISS has the Java-based WebSwap (from whose docs you can also learn about SDL).
Aside from the availability of software, one big concern with many SDL methods is that the multivariate structure of the data may be distorted in the modification process. This is crucial, since most statistical analyses are multivariate in nature, e.g. regression, PCA etc., and thus a major distortion in the multivariate structure can result in seriously misleading estimates.
In the noise addition method, the multivariate structure can be preserved by setting the noise covariance matrix to that of the original data, but for the other methods maintaining the proper multivariate structure is a challenge.
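Here is a rough sketch of why the noise covariance matters. The data are synthetic, and `noise_scale` is a hypothetical knob of my own: the noise covariance is `noise_scale` times the sample covariance of the data, so `noise_scale <- 1` corresponds to using the data's covariance matrix itself. For comparison, the sketch also adds independent noise with the same per-variable variances.

```r
set.seed(2)
n <- 5000

# Synthetic correlated data (assumed for illustration).
age    <- rnorm(n, 40, 10)
income <- 20000 + 1500 * age + rnorm(n, 0, 20000)
x <- cbind(age = age, income = income)

noise_scale <- 0.25            # hypothetical tuning value
S <- cov(x)
z <- matrix(rnorm(n * ncol(x)), n, ncol(x))

# chol(A) returns upper-triangular R with t(R) %*% R == A, so the rows
# of z %*% R have covariance A. Here A = noise_scale * S.
noise_corr <- z %*% chol(noise_scale * S)
x_corr <- x + noise_corr

# Independent noise with the same variances but zero covariances.
noise_ind <- z %*% diag(sqrt(noise_scale * diag(S)))
x_ind <- x + noise_ind

cor(x)[1, 2]        # correlation in the original data
cor(x_corr)[1, 2]   # roughly unchanged
cor(x_ind)[1, 2]    # pulled toward 0
```

With noise covariance proportional to that of the data, the covariance matrix of the released data is inflated by a common factor, which leaves correlations approximately intact. Independent noise inflates only the variances, so correlations are attenuated.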
While arguably noise addition works well for data consisting only of continuous variables, and data swapping and cell suppression are often acceptable for the purely-categorical case, the mixed continuous-categorical setting is tough.
Our new method achieves both of the above goals. It (a) is applicable to any kind of data, including the mixed continuous-categorical case, and (b) maintains the correct multivariate structure. Rather counterintuitively, our method achieves (b) while actually treating the variables as (conditionally) independent.
The method has several tuning parameters. In some modern statistical methods, tuning parameters are a real pain, but in SDL, the more tuning parameters the better! The database administrator needs to have as many ways as possible to develop a public form of the database that has both good statistical accuracy and good privacy protection.
As an example, I took some Census data for the year 2000 (5% PUMS), involving programmers in Silicon Valley. In order to simulate an employee database, I sampled 5000 records, keeping the variables WageIncome, Age, Gender, WeeksWorked, MSDegree and PhD. See our paper for details, but here is a quick overview.
First, to see that goal (b) above has been maintained reasonably well, I ran a linear regression analysis, predicting WageIncome from the other variables. I did this twice, once for the original data and once for the modified set, for a given combination of values of the tuning parameters. Here are the estimated regression coefficients:
This is not bad. Each pair of coefficients is within one original standard error of the other (not shown). The database administrator could try lots of other combinations of the tuning parameters, and likely get even closer. But what about privacy?
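The kind of comparison described here can be sketched as follows. Note the assumptions: the data are synthetic rather than the Census data, and the perturbation is plain noise addition used as a stand-in, not the method from our paper.

```r
set.seed(3)
n <- 5000

# Synthetic stand-in for an employee data set (not the PUMS data).
orig <- data.frame(
  age     = runif(n, 22, 65),
  wkswrkd = sample(30:52, n, replace = TRUE),
  ms      = rbinom(n, 1, 0.3)
)
orig$wageinc <- 10000 + 900 * orig$age + 600 * orig$wkswrkd +
  12000 * orig$ms + rnorm(n, 0, 25000)

# Stand-in perturbation: mean-0 noise on the continuous variables only.
# This is NOT the method from the paper, just a placeholder.
modified <- orig
for (v in c("age", "wageinc")) {
  modified[[v]] <- orig[[v]] + rnorm(n, 0, 0.1 * sd(orig[[v]]))
}

fit_orig <- lm(wageinc ~ ., data = orig)
fit_mod  <- lm(wageinc ~ ., data = modified)

# Coefficients side by side, plus the original standard errors and a
# flag for whether each modified estimate lies within one original
# standard error of the original estimate.
se_orig <- summary(fit_orig)$coefficients[, "Std. Error"]
comparison <- data.frame(
  original = coef(fit_orig),
  modified = coef(fit_mod),
  se_orig  = se_orig
)
comparison$within_1se <-
  abs(comparison$original - comparison$modified) <= comparison$se_orig
comparison
```

The `within_1se` column is the same check I applied above: a row of `TRUE` means the modified-data estimate is within one original standard error of the original-data estimate.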
In the original data set, there was exactly one female worker with a PhD and age under 31:
> p1[p1$sex==2 & p1$phd==1 & p1$age < 31,]
          age sex wkswrkd ms phd wageinc
7997 30.79517   2      52  0   1  100000
Which such workers, if any, are listed in the modified data?
> p1pc[p1pc$sex==2 & p1pc$phd==1 & p1pc$age < 31,]
          age sex wkswrkd ms phd wageinc
12522 30.5725   2      52  0   1   50000
There is only one person listed in the released data of the given description (female, PhD, age under 31). But she is listed as having an income of $50,000 rather than $100,000. In fact, it is a different person, worker number 12522, not 7997. (Of course, ID numbers would be suppressed.)
So what happened to worker number 7997?
> p1p[rownames(p1p) == 7997,]
         age sex wkswrkd ms phd wageinc
7997 31.9746   1      52  0   1  100000
Ah, she became a man! That certainly hides her. With a different luck of the draw, her record might have become all NA values.
In this way, the database administrator can set up a number of statistical analysis test cases, and a number of records at high risk of identification, and then try various combinations of the tuning parameters in order to obtain a modified data set that achieves a desired balance between statistical fidelity and privacy.
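A workflow of this kind might look roughly like the sketch below. Everything in it is an assumption made for illustration: the data are synthetic, `perturb` is a simple stand-in (noise addition plus random replacement of a categorical value) rather than our method, and the tuning parameters `noise_frac` and `flip_prob`, the utility measure and the risk measure are all hypothetical choices.

```r
set.seed(4)
n <- 2000
orig <- data.frame(age = runif(n, 22, 65),
                   sex = sample(1:2, n, replace = TRUE))
orig$wageinc <- 20000 + 1000 * orig$age + rnorm(n, 0, 20000)

# Stand-in perturbation with two hypothetical tuning parameters:
#   noise_frac: noise sd as a fraction of each continuous variable's sd
#   flip_prob:  probability that sex is replaced by a random draw from
#               its marginal distribution
perturb <- function(d, noise_frac, flip_prob) {
  out <- d
  for (v in c("age", "wageinc")) {
    out[[v]] <- d[[v]] + rnorm(nrow(d), 0, noise_frac * sd(d[[v]]))
  }
  flip <- runif(nrow(d)) < flip_prob
  out$sex[flip] <- sample(d$sex, sum(flip), replace = TRUE)
  out
}

# Utility test case: largest change in any regression coefficient,
# measured in units of the original standard error.
utility_loss <- function(d, m) {
  f0 <- lm(wageinc ~ age + sex, data = d)
  f1 <- lm(wageinc ~ age + sex, data = m)
  max(abs(coef(f0) - coef(f1)) / summary(f0)$coefficients[, "Std. Error"])
}

# Risk test case: fraction of designated high-risk records whose released
# values stay close to the truth (same sex, age within 1 year, income
# within 5%).
risk <- function(d, m, rows) {
  close <- m$sex[rows] == d$sex[rows] &
    abs(m$age[rows] - d$age[rows]) < 1 &
    abs(m$wageinc[rows] - d$wageinc[rows]) < 0.05 * abs(d$wageinc[rows])
  mean(close)
}

# Records deemed at high risk, e.g. the 20 oldest workers.
high_risk <- order(orig$age, decreasing = TRUE)[1:20]

# Try each combination of tuning parameters and record both measures.
grid <- expand.grid(noise_frac = c(0.05, 0.2, 0.5), flip_prob = c(0, 0.1))
grid$utility_loss <- NA_real_
grid$risk <- NA_real_
for (i in seq_len(nrow(grid))) {
  m <- perturb(orig, grid$noise_frac[i], grid$flip_prob[i])
  grid$utility_loss[i] <- utility_loss(orig, m)
  grid$risk[i] <- risk(orig, m, high_risk)
}
grid
```

The administrator would then pick a row of `grid` with an acceptable trade-off. Since each perturbation is random, it would be sensible to repeat each setting several times before drawing conclusions.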