De-Identification: A Critical Debate
Ann Cavoukian and Dan Castro recently published a report titled Big Data and Innovation, Setting the Record Straight: De-Identification Does Work. Arvind Narayanan and Edward Felten wrote a critique of this report, which they highlighted on Freedom to Tinker. Today Khaled El Emam and Luk Arbuckle respond on the FPF blog with this guest post.
Why de-identification is a key solution for sharing data responsibly
Khaled El Emam (University of Ottawa, CHEO Research Institute & Privacy Analytics Inc.)
Luk Arbuckle (CHEO Research Institute, Privacy Analytics Inc.)
Arvind Narayanan and Edward Felten have responded to a recent report by Ann Cavoukian and Dan Castro (Big Data and Innovation, Setting the Record Straight: De-Identification Does Work) by claiming that de-identification is “not a silver bullet” and “still does not work.” The authors are misleading on both counts. First, no one, certainly not Cavoukian or Castro, claims that de-identification is a silver bullet, if by that you mean the modern equivalent of a medieval magic weapon that could always and inexplicably defeat otherwise unconquerable foes like werewolves and vampires. Second, and to get away from unhelpful metaphors, de-identification does work, both in theory and in practice, and there is ample evidence to support this. Done properly, de-identification is a reliable and indispensable technique for sharing data in a responsible way that protects individuals.
Narayanan and Felten assert viewpoints that are not shared by the larger disclosure control community. Assuming the reader has already read both reports, we’ll respond to some of Narayanan’s and Felten’s claims and look at the evidence.
It’s important to highlight that we take an evidence-based approach: we support our statements with evidence and systematic reviews rather than simply expressing opinions. This matters because the evidence does not support the Narayanan and Felten perspective on de-identification.
Real-world evidence shows that the risk of re-identifying properly anonymized data is very small
Established, published, and peer-reviewed evidence shows that following contemporary good practices for de-identification ensures that the risk of re-identification is very small [1]. In that systematic review (the gold-standard methodology for summarizing evidence on a given topic) we found 14 known re-identification attacks. Two of those were conducted on data sets that were de-identified with defensible methods (i.e., methods that followed existing standards). The success rate of re-identification for these two attacks was very small.
It is possible to de-identify location data
The authors claim that there are no good methods for de-identifying location data. In fact, there is relevant work on the de-identification of different types of location data [2]–[4]. The challenge we are facing is that many of these techniques are not being deployed in practice. We have a knowledge dissemination problem rather than a knowledge problem: sound techniques are known and available, but not used often enough. We should be putting our energy into translating best practices within the analytics community.
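To give a flavour of what such techniques look like, below is a minimal sketch of one simple family of approaches, spatial generalization, in which exact coordinates are snapped to a coarse grid cell so that any reported location is shared by many trips or individuals. The cell size and sample coordinates are illustrative choices, not parameters from the cited papers.

```python
# Minimal sketch of spatial generalization (illustrative only): snap exact
# GPS points to a coarse grid so a reported location is shared by many people.
# The 0.05-degree cell (~5 km) is an arbitrary choice for the example.

def generalize_point(lat: float, lon: float, cell_deg: float = 0.05):
    """Return the centre of the grid cell containing (lat, lon)."""
    snap = lambda x: (int(x // cell_deg) * cell_deg) + cell_deg / 2
    return round(snap(lat), 4), round(snap(lon), 4)

visits = [(45.4215, -75.6972), (45.4230, -75.6950), (45.4101, -75.7300)]
print([generalize_point(lat, lon) for lat, lon in visits])
```

In a real de-identification, the cell size would be chosen so that enough trips or individuals fall within each cell to keep the measured re-identification risk below the chosen threshold.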
Computing re-identification probabilities is not only possible, but necessary
The authors criticize the computation of re-identification probabilities and characterize it as “silly”. They ignore the well-established literature on the computation of re-identification risk [5], [6]. These measurement and estimation techniques have been used for decades to share census data, other population data, and national surveys. For example, the Journal of Official Statistics has been publishing papers on risk measurement for a few decades. There is no evidence that these published risk probabilities were “silly” or, more importantly, that any data anonymized in reliance upon such risk measurements was re-identified.
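For readers unfamiliar with this literature, here is a toy sketch of what the simplest of these measurements looks like (a simplified illustration, not the full estimators described in [5], [6]): group records by their quasi-identifiers and treat one over the size of a record’s group as the probability that the record can be re-identified.

```python
from collections import Counter

# Toy illustration of equivalence-class-based risk measurement. Each record is
# reduced to its quasi-identifiers (age band, sex, city); the records are
# made up for the example.
records = [
    ("30-39", "F", "Ottawa"), ("30-39", "F", "Ottawa"),
    ("30-39", "M", "Ottawa"),
    ("40-49", "F", "Toronto"), ("40-49", "F", "Toronto"), ("40-49", "F", "Toronto"),
]

class_sizes = Counter(records)  # size of each equivalence class

max_risk = max(1 / size for size in class_sizes.values())           # worst-case record
avg_risk = sum(1 / class_sizes[r] for r in records) / len(records)  # average record

print(f"maximum re-identification risk: {max_risk:.2f}")  # 1.00 (one record is unique)
print(f"average re-identification risk: {avg_risk:.2f}")  # 0.50
```

A record that sits alone in its class (a risk of 1.0) would then be generalized or suppressed until every class is large enough to bring the measured risk below the threshold set for the release.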
Second, the authors argue that a demonstration attack where a single individual in a database is re-identified is sufficient to show that a whole database can be re-identified. There is a basic fault here. Re-identification is probabilistic. If the probability of re-identification is 1 in 100, the re-identification of a single record does not mean that it is possible to re-identify all hundred records. That’s not how probabilities work.
The authors then go on to compare hacking the security of a system to re-identification by saying that if they hack one instance of a system (i.e., a demonstration of the hack) then all instances are hackable. But there is a fundamental difference. Hacking a system is deterministic. Re-identification is not deterministic – re-identifying a record does not mean that all records in the data set are re-identifiable. For example, in clinical research, if we demonstrate that we can cure a single person by giving him a drug (i.e., a demonstration) that does not mean that the drug will cure every other person—that would be nonsense. An effect on an individual patient is just that—an effect on an individual person. As another analogy, an individual being hit by lightning does not mean that everyone else in the same city is going to be hit by lightning. Basically, demonstrating an effect on a single person or a single record does not mean that the same effect will be replicated with certainty for all the others.
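A toy calculation makes the point (the 1-in-100 figure is purely illustrative): if each of 100 records independently carries a 1% chance of being re-identified, an adversary is expected to re-identify about one record, while the chance of re-identifying the entire data set is vanishingly small.

```python
import random

# Toy simulation: each of 100 records has an (assumed) independent 1-in-100
# chance of being re-identified. A single success says nothing about the rest.
p, n, trials = 0.01, 100, 100_000
random.seed(0)

hits = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]
print(f"average records re-identified per trial: {sum(hits) / trials:.2f}")  # about 1
print(f"trials in which every record was re-identified: {hits.count(n)}")    # 0
print(f"probability of re-identifying all {n} records: {p ** n:.0e}")        # 1e-200
```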
We should consider realistic threats
The authors emphasize the importance of considering realistic threats and give some examples of considering acquaintances as potential adversaries. We have developed a methodology that addresses exactly the realistic threats that Narayanan and Felten note [4], [7], and we agree that everyone should be using such a robust methodology to perform a proper risk assessment. Full methodologies for de-identification have been developed (please see our O’Reilly book on this topic [4]); the failure to use them broadly is the challenge society should be tackling.
The NYC Taxi data set was poorly de-identified – it is not an example of practices that anyone should follow
The re-identification attack on the NYC taxi data was cited as an example of how easy it is to re-identify data. That data set was poorly de-identified, which makes it a great example of the need for a robust de-identification methodology. The NYC Taxi data used a one-way hash without a salt, which is simply poor practice, and this takes us back to the earlier point that known methods need to be better disseminated. Using the NYC taxi example to make a general point about the discipline of de-identification is misleading.
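To see why the unsalted hash failed (a simplified sketch; the real medallion and hack-licence formats are more varied than shown here): the space of possible licence numbers is so small that anyone can pre-compute the hash of every candidate and invert the “anonymized” values. A keyed hash with a secret key, or replacement with random pseudonyms, removes that shortcut.

```python
import hashlib, hmac

# Simplified sketch of the NYC taxi failure: an unsalted hash over a tiny
# identifier space is reversible by enumerating every candidate value.
# The licence-number format here is simplified for illustration.

def weak_pseudonym(licence: str) -> str:
    return hashlib.md5(licence.encode()).hexdigest()   # hash with no salt or key

# Adversary pre-computes the hash of every plausible licence number...
lookup = {weak_pseudonym(f"5A{n:02d}"): f"5A{n:02d}" for n in range(100)}
# ...and inverts any "anonymized" value in the released data.
print(lookup[weak_pseudonym("5A23")])                  # -> "5A23", re-identified

# One better practice: a keyed hash (HMAC) with a secret the adversary never
# sees, so the lookup table cannot be built. The key below is hypothetical.
SECRET_KEY = b"held-only-by-the-data-custodian"
def keyed_pseudonym(licence: str) -> str:
    return hmac.new(SECRET_KEY, licence.encode(), hashlib.sha256).hexdigest()
```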
Computing correct probabilities for the Heritage Health Prize data set
One example that is mentioned by the authors is the Heritage Health Prize (HHP). This was a large clinical data set that was de-identified and released to a broad community [8]. To verify that the data set had been properly and securely de-identified, HHP’s sponsor commissioned Narayanan to perform a re-identification attack on the HHP data before it was released. It was based on the results of that unsuccessful attack that the sponsor made the decision to release the data for the competition.
In describing his re-identification attack on the HHP data set, Narayanan estimated the risk of re-identification to be 12.5%, using very conservative assumptions. This was materially different from the approximately 1% risk that was computed in the original de-identification analysis [8]. To get to 12.5%, he had to assume that the adversary would know seven different diagnosis codes (not common colloquial terms, but ICD-9 codes) that belong to a particular patient. He states: “roughly half of members with 7 or more diagnosis codes are unique if the adversary knows 7 of their diagnosis codes. This works out to be half of 25% or 12.5% of members” (A. Narayanan, “An Adversarial Analysis of the Reidentifiability of the Heritage Health Prize Dataset”, 2011). That, by most standards, is quite a conservative assumption, especially when he also notes that diagnosis codes are not correlated in this data set – i.e., seven unrelated conditions for a single patient. It is not realistic to assume that an adversary knows so much medical detail about a patient; most patients themselves do not know many of the diagnosis codes in their own records. But even if such an adversary did exist, he would learn very little from the data (the more the adversary already knows, the smaller the information gain from a re-identification). None of the known re-identification attacks that used diagnosis codes involved that much detailed background information.
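For readers who want to see how a uniqueness figure of that kind is obtained, here is a toy sketch of the style of calculation (the patients and ICD-9 codes are invented, and this is not the actual HHP analysis): take the diagnosis codes the adversary is assumed to know and check whether any other patient’s record also contains all of them.

```python
# Toy sketch of a uniqueness check on known diagnosis codes (invented data,
# not the actual HHP analysis): a patient is "unique" if no other patient's
# record contains every code the adversary is assumed to know.
patients = {
    "p1": {"250.00", "401.9", "530.81", "272.4", "715.90", "300.00", "786.50"},
    "p2": {"250.00", "401.9", "272.4", "493.90", "786.50", "729.5", "300.00"},
    "p3": {"401.9", "530.81", "715.90", "300.00", "786.50", "729.5", "493.90"},
}

def unique_on_known_codes(target: str, known: set) -> bool:
    return all(not known <= codes for pid, codes in patients.items() if pid != target)

known = set(patients["p1"])                 # adversary assumed to know 7 codes of p1
print(unique_on_known_codes("p1", known))   # True: no other patient shares all 7
```

The 12.5% figure then comes from repeating that check across the membership: roughly 25% of members have seven or more codes, and about half of those are unique on seven known codes, i.e. 0.5 × 0.25 = 0.125.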
The re-identification attack made some other broad claims without supporting evidence—for example, that it would be easy to match the HHP data with the California hospital discharge database. We did that! We matched the individual records in the de-identified HHP data set with the California State Inpatient Database over the relevant period, and demonstrated empirically that the match rate was very small.
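As a rough illustration of what such a matching experiment involves (a toy sketch with hypothetical fields, not the actual analysis we performed): join the two data sets on the quasi-identifiers they share and count how many de-identified records match exactly one external record.

```python
from collections import Counter

# Toy sketch of a linkage experiment between a de-identified data set and an
# external database, joined on shared quasi-identifiers. Fields are hypothetical:
# (age band, sex, discharge quarter, diagnosis group).
deidentified = [
    ("40-49", "F", "2011-Q2", "428"),
    ("30-39", "M", "2011-Q3", "493"),
]
external = [
    ("40-49", "F", "2011-Q2", "428"),
    ("40-49", "F", "2011-Q2", "428"),   # two candidates -> ambiguous, not a re-identification
    ("30-39", "M", "2011-Q3", "493"),   # exactly one candidate -> a potential match
]

external_counts = Counter(external)
unique_matches = sum(1 for rec in deidentified if external_counts.get(rec, 0) == 1)
print(f"records matching exactly one external record: {unique_matches / len(deidentified):.0%}")
```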
It should also be noted that this data set was released under terms of use, which all individuals who access the data must agree to. An adversary who knows a lot about a patient is likely to be living in the US or Canada (i.e., an acquaintance), and therefore the terms of use would be enforceable if there were a deliberate re-identification.
The bottom line from the HHP is that the commissioned re-identification attack (whose purpose was to re-identify individuals in the de-identified data) did not re-identify a single person. You could therefore argue that Narayanan made the empirical case for sound de-identification!
The authors do not propose alternatives
The process of re-identification is probabilistic. There is no such thing as zero risk. If relevant data holders deem any risk to be unacceptable, it will not be possible to share data. That would not make sense – we make risk-based decisions in our personal and business lives every day. Asking for consent or authorization for all data sharing is not practical, and consent introduces bias in the data because specific groups will not provide consent [9], [10]. For the data science community, the line of argument that any risk is too much risk is dangerous and should be very worrisome because it will adversely affect the flow of data.
The authors pose a false dichotomy for the future
The authors conclude that the only alternatives are (a) the status quo, where one de-identifies and, in their words, “hopes for the best”; or (b) using emerging technologies that involve some trade-offs in utility and convenience and/or using legal agreements to limit the use and disclosure of sensitive data.
We strongly disagree with that presentation of the alternatives. First, the overall concept of trade-offs between data utility and privacy is already built into sound de-identification methodologies [7]. What is acceptable in a tightly controlled, contractually bound situation is quite different from what is acceptable when data will be released publicly – and such trade-offs are and should be quantified.
Second, de-identification is definitely not an alternative to using contracts to protect data. To the contrary, contractual protections are one part (of many) of the risk analyses done in contemporary de-identification methodologies. The absence of a contract always means that more changes to the data are required to achieve responsible de-identification (e.g., generalization, suppression, sub-sampling, or adding noise).
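To make those transformations concrete, here is a minimal sketch (field names, thresholds, and the noise scale are illustrative choices, not prescriptions from any particular methodology) of applying stronger generalization and a little noise when no contract is in place:

```python
import random

# Minimal sketch: apply stronger transformations to a record when the data
# will be released without contractual controls. All choices are illustrative.
def transform(record: dict, under_contract: bool) -> dict:
    out = dict(record)
    decade = (record["age"] // 10) * 10
    out["age"] = f"{decade}-{decade + 9}"                 # generalize age to a band
    out["zip"] = record["zip"][:3] + "**"                 # generalize ZIP code
    if not under_contract:                                # public release: go further
        out["zip"] = record["zip"][:1] + "****"           # much coarser geography
        out["visits"] = max(0, record["visits"] + random.randint(-1, 1))  # add noise
    return out

patient = {"age": 34, "zip": "90210", "visits": 7}
print(transform(patient, under_contract=True))
print(transform(patient, under_contract=False))
```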
Most of all, we strongly object to the idea that proper de-identification means “hoping for the best.” We ourselves are strongly critical of any aspect of the status quo whereby data holders use untested, sloppy methods to anonymize sensitive data. We agree with privacy advocates that such an undisciplined approach is doomed to result in successful re-identification attacks and the growing likelihood of real harm to individuals if badly anonymized data becomes re-identified. Instead, we maintain, on the basis of decades of both theory and real-world evidence, that careful, thorough de-identification using well-tested methodologies achieves crucial data protection and produces a very small risk of re-identification. The challenge that we, as a privacy community, need to rise to is transitioning these approaches into practice and increasing the maturity level of de-identification in the real world.
A call to action
It is important to encourage data custodians to use best current practices to de-identify their data. Repeatedly attacking poorly de-identified data captures attention, and it can be constructive if the lesson learned is that better de-identification methods should be used.
References
[1] K. El Emam, E. Jonker, L. Arbuckle, and B. Malin, “A Systematic Review of Re-Identification Attacks on Health Data,” PLoS ONE, vol. 6, no. 12, p. e28071, Dec. 2011.
[2] A. Monreale, G. L. Andrienko, N. V. Andrienko, F. Giannotti, D. Pedreschi, S. Rinzivillo, and S. Wrobel, “Movement Data Anonymity through Generalization,” Transactions on Data Privacy, vol. 3, no. 2, pp. 91–121, 2010.
[3] S. C. Wieland, C. A. Cassa, K. D. Mandl, and B. Berger, “Revealing the spatial distribution of a disease while preserving privacy,” Proc. Natl. Acad. Sci. U.S.A., vol. 105, no. 46, pp. 17608–17613, Nov. 2008.
[4] K. El Emam and L. Arbuckle, Anonymizing Health Data: Case Studies and Methods to Get You Started. O’Reilly, 2013.
[5] L. Willenborg and T. de Waal, Statistical Disclosure Control in Practice. New York: Springer-Verlag, 1996.
[6] L. Willenborg and T. de Waal, Elements of Statistical Disclosure Control. New York: Springer-Verlag, 2001.
[7] K. El Emam, Guide to the De-Identification of Personal Health Information. CRC Press (Auerbach), 2013.
[8] K. El Emam, L. Arbuckle, G. Koru, B. Eze, L. Gaudette, E. Neri, S. Rose, J. Howard, and J. Gluck, “De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset,” Journal of Medical Internet Research, vol. 14, no. 1, p. e33, Feb. 2012.
[9] K. El Emam, F. Dankar, R. Issa, E. Jonker, D. Amyot, E. Cogo, J.-P. Corriveau, M. Walker, S. Chowdhury, R. Vaillancourt, T. Roffey, and J. Bottomley, “A Globally Optimal k-Anonymity Method for the De-identification of Health Data,” Journal of the American Medical Informatics Association, vol. 16, no. 5, pp. 670–682, 2009.
[10] K. El Emam, E. Jonker, E. Moher, and L. Arbuckle, “A Review of Evidence on Consent Bias in Research,” American Journal of Bioethics, vol. 13, no. 4, pp. 42–44, 2013.