Discussion of the Papers in the Session
``Addressing Risks to Confidentiality Using Microdata Files''

Stephen F. Roehrig
The Heinz School
Carnegie Mellon University

The three papers in this session really do deal with the topic "Addressing Risks to Confidentiality Using Microdata Files"; the first and third from the "bad-guy" perspective of restricting release, and the second from the "good-guy" perspective of enhancing access. So, congratulations to Ginny deWolfe, Steve Cohen, and the other organizers for getting papers that fit the session topic perfectly.

For the participants of this conference, the issue of confidentiality is near and dear. This session clearly shows us the two sides of the confidentiality coin. The paper by Linda Piccinino and William Mosher, about access, was conveniently, and no doubt deliberately, sandwiched between John Eltinge's and Paul Massell's papers, which are in some sense about reduction of access. I don't mean to suggest, however, that any of the authors is ignoring one or the other side of the coin.

John's paper, while primarily about an interesting new technique for reducing the risk of identification of primary sample units in a stratified survey, goes to some pains to pinpoint the loss of information to the data user as a consequence of adopting his ideas. Similarly, Paul is concerned here with risk measurement, but as his talk at the Joint Statistical Meetings in August 1999 showed, he is no stranger to the tradeoff between risk reduction and information loss. And finally, Linda and William's effort to make the National Survey of Family Growth available to a vastly wider audience includes, as it must, a somewhat elaborate mechanism for protecting confidentiality.

I needn't give a summary of these papers; they are available on this Web site. What I might do, though, is raise a few points that occurred to me as I was reading the papers.

Something that consistently caught my attention when reading John's paper on stratum mixing was his ability to juggle both the reduction of matching probabilities and the consequent loss of inferential efficiency. He suggests that a goal is to reduce the matching probability to that of pure chance, that is, so a data intruder would do no better than pulling PSU data out of one hat, and its identifier out of another.
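The "two hats" baseline can be made concrete with a small sketch. It is only an illustration of pure-chance matching, not John's procedure: the function names and the choice of 50 PSUs are hypothetical. With n PSUs, a blind guess for any one PSU is right with probability 1/n, and a full random assignment of identifiers gets about one PSU right on average, whatever n is.

```python
import random

def chance_match_probability(n_psus):
    """Probability that a blind guess identifies one given PSU correctly."""
    return 1.0 / n_psus

def expected_chance_matches(n_psus, trials=20000, seed=0):
    """Simulate drawing PSU data from one hat and identifiers from another.

    Returns the average number of PSUs that end up paired with their
    true identifier under a uniformly random assignment.
    """
    rng = random.Random(seed)
    true_ids = list(range(n_psus))
    total_correct = 0
    for _ in range(trials):
        guessed_ids = true_ids[:]
        rng.shuffle(guessed_ids)  # random pairing of data with identifiers
        total_correct += sum(
            1 for t, g in zip(true_ids, guessed_ids) if t == g
        )
    return total_correct / trials

if __name__ == "__main__":
    n = 50  # hypothetical number of PSUs
    print(chance_match_probability(n))
    print(expected_chance_matches(n))
```

An intruder whose matching rate is noticeably above this baseline is extracting information from the released PSU profiles, which is the quantity stratum mixing is meant to drive down.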

This is countered by the desire to minimize the distortion of estimators (primarily variance estimators). This looks like a classic goal programming problem, and it would be interesting to see it played out with a real data set, perhaps the NHIS data. In this context, at least, it looks as though the relevant constraints are well quantified, as are the possible outcomes (the distinctiveness of the PSU profile vectors and the consequent identification risk, as well as the loss to the legitimate data user).

Pushing this just a little further, it might be interesting to see where current practices fall on this tradeoff curve. In the conventional risk literature, one way of determining society's value of a human life is to look at a broad spectrum of life-threatening risks, and the funds allocated to reduce them. One might imagine a similar program, looking at disclosure risks across survey releases. This naturally assumes we have ways of quantifying both the risks and the benefits, which is just what John is doing.

Moving to Linda and William's paper, my eye was first drawn to the really wonderful way in which data users are accommodated, specifically through the use of an initial dummy data file and documentation. With this, a user can work the bugs out of an analysis on his or her own time. For agencies and programs that work closely with outside researchers, and that likely waste lots of time (whether billed or not) making that "final change to get just the right data," this seems like an ideal approach.

The natural questions are, of course, how much was invested in building the dummy data set, and how well it can anticipate the range of analyses to which the real data might be put. The paper mentions that honest-to-goodness external users were involved in the design of the access system, and this can only build confidence with respect to the latter question. At any rate, systems managed this carefully are bound to reduce the kind of horror stories I sometimes hear from my economist colleagues at the public policy school where I work.

In this same paper, my eye was also drawn to the confidentiality aspects of the access system. Three points in particular: First, how much overhead does the real-time processing of primary and complementary suppressions impose? If this can be managed effectively (that is, with efficient suppression patterns and thorough coverage), it opens up new possibilities for other existing and proposed access systems, particularly Web-based ones.

Second, the flip side is the danger of attack from multiple queries. To what extent are the real-time suppressions generated with only the current query in mind? Can trackers be made to work? This leads to a third question, which is to what extent can an actual disclosure, perhaps even one made public, be traced back to the queries issued by bona fide users?

Moving to Paul's paper, the focus is again on re-identification, in this case for both the individual and group levels. At the individual level, sample and population uniques are the usual suspects. Paul is right to point out that it is difficult to place a confidence interval around an estimate of population uniques from sample data. For Paul, the desired measure is precisely defined, but the accuracy of its estimate is uncertain. For John, on the other hand, his "incremental identification risk" (equation 3.2 in his paper) has a variety of summary evaluations: its maximum over all questionnaire elements for a PSU, or some quantile, or... Thus both are hard to pin down exactly.

For group re-identification, Paul turns to an entropy measure. The idea is to measure the difference P(X) - P(X | i), where X is some attribute and i is the property of belonging to a group. In the special case where X is discrete, a Shannon-type entropy measure is used to summarize P(X) - P(X | i). A natural question here is what gets lost in this summarization. Is there useful information in P(X) - P(X | i) that is lost in examining just the entropy? I don't know how to answer this, but I suspect Paul and others could provide some insight.
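A toy calculation shows one way a scalar summary can hide structure. The sketch below is not Paul's measure, whose exact form is not reproduced here. It assumes, purely for illustration, that the summary is the relative entropy (Kullback-Leibler divergence) of P(X | i) from P(X), and all the distributions are invented.

```python
import math

def relative_entropy(conditional, marginal):
    """Relative entropy D(P(X|i) || P(X)) in bits, for discrete X.

    Assumed stand-in for a Shannon-type summary; terms with zero
    conditional probability contribute nothing.
    """
    return sum(
        q * math.log2(q / p)
        for q, p in zip(conditional, marginal)
        if q > 0
    )

# Invented example: X takes three values, uniform in the population.
population = [1 / 3, 1 / 3, 1 / 3]
group_a = [0.7, 0.2, 0.1]  # group concentrated on the first value
group_b = [0.1, 0.2, 0.7]  # group concentrated on the last value

if __name__ == "__main__":
    print(relative_entropy(group_a, population))
    print(relative_entropy(group_b, population))
```

Under these assumptions the two groups receive the same score, although they depart from the population in opposite directions. The scalar records how far P(X | i) is from P(X), not where the difference lies, and the location may matter if some values of X are more sensitive than others.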

A final point raised by Paul concerns the availability of external data files that match official statistical releases on one or more keys. For the American Housing Survey, and the key (age, sex, salary($2000)), there were 847 uniques in the set of 102,761 records. Such a key may be widely available from proprietary data sets. But what about other keys? Latanya Sweeney's well-known report of 95% uniques for the key (age and zip code) in Cambridge, MA voter registration is rather chilling. Paul's percentage of uniques also goes way up when looking at households, and he notes that publicly available driver's license information is perfect for household re-identification.
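Counting uniques on a key is simple to do, which is part of what makes it a useful first screen: 847 uniques in 102,761 records is a rate of roughly 0.8%. The following is a minimal sketch of such a count. The records are invented, the field names are hypothetical, and it assumes that "salary($2000)" means salary coded in $2000 bands.

```python
from collections import Counter

def sample_uniques(records, key_fields):
    """Return the records whose combination of key values occurs exactly once."""
    keys = [tuple(r[f] for f in key_fields) for r in records]
    counts = Counter(keys)
    return [r for r, k in zip(records, keys) if counts[k] == 1]

# Invented toy file; salary_band is salary divided into $2000 bands (assumed coding).
records = [
    {"age": 34, "sex": "F", "salary_band": 20},
    {"age": 34, "sex": "F", "salary_band": 20},
    {"age": 51, "sex": "M", "salary_band": 18},
    {"age": 29, "sex": "F", "salary_band": 11},
]

uniques = sample_uniques(records, ("age", "sex", "salary_band"))
unique_rate = len(uniques) / len(records)

if __name__ == "__main__":
    print(len(uniques), unique_rate)
```

Rerunning the same count with a different key, or with records grouped by household, is a one-line change, so the sensitivity of the unique rate to the choice of key is cheap to explore. The count says nothing about population uniques, which is the harder estimation problem noted earlier.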

Is there a uniform understanding among statistical agencies on whether disclosure resulting from linking to external files violates their privacy mandates? Or are disclosure rules framed solely in terms of a specific publication as a stand-alone unit? This is a critical question for the future, as the Web and our collective thirst for information are here to stay.

In sum, I was pleased to have the opportunity to read and discuss the papers in this session. Quite properly, they both answered and raised questions.