De-Identification: A Critical Debate

Ann Cavoukian and Dan Castro recently published a report titled Big Data and Innovation, Setting the Record Straight: De-Identification Does Work. Arvind Narayanan and Edward Felten wrote a critique of this report, which they highlighted on Freedom to Tinker. Today Khaled El Emam and Luk Arbuckle respond on the FPF blog with this guest post.

Why de-identification is a key solution for sharing data responsibly

Khaled El Emam (University of Ottawa, CHEO Research Institute & Privacy Analytics Inc.)

Luk Arbuckle (CHEO Research Institute, Privacy Analytics Inc.)

Arvind Narayanan and Edward Felten have responded to a recent report by Ann Cavoukian and Dan Castro (Big Data and Innovation, Setting the Record Straight: De-Identification Does Work) by claiming that de-identification is “not a silver bullet” and “still does not work.” The authors are misleading on both counts. First, no one, certainly not Cavoukian or Castro, claims that de-identification is a silver bullet, if by that one means the modern equivalent of the medieval magic weapon that could always and inexplicably defeat otherwise unconquerable foes like werewolves and vampires. Second, and to get away from unhelpful metaphors, de-identification does work, both in theory and in practice, and there is ample evidence to that effect. Done properly, de-identification is a reliable and indispensable technique for sharing data in a responsible way that protects individuals.

Narayanan and Felten assert viewpoints that are not shared by the larger disclosure control community. Assuming the reader has already read both reports, we’ll respond to some of Narayanan’s and Felten’s claims and look at the evidence.

It’s important to highlight that we take an evidence-based approach—we support our statements with evidence and systematic reviews rather than expressing opinions. This matters because the evidence does not support Narayanan and Felten’s perspective on de-identification.

Real-world evidence shows that the risk of re-identifying properly anonymized data is very small

Established, published, and peer-reviewed evidence shows that following contemporary good practices for de-identification ensures that the risk of re-identification is very small [1]. In that systematic review (the gold-standard methodology for summarizing evidence on a given topic), we found 14 known re-identification attacks. Two of those were conducted on data sets that had been de-identified with defensible methods (i.e., methods that followed existing standards), and the success rate of re-identification in those two attacks was very small.

It is possible to de-identify location data

The authors claim that there are no good methods for de-identifying location data. In fact, there is relevant work on the de-identification of different types of location data [2]–[4]. The challenge we are facing is that many of these techniques are not being deployed in practice. We have a knowledge dissemination problem rather than a knowledge problem – i.e., sound techniques are known and available, but not used often enough. We should be putting our energy into translating best practices within the analytics community.
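
To make the idea concrete, here is a minimal sketch (our illustration, not the specific methods of [2]–[4]) of one common building block of location de-identification, spatial generalization: precise coordinates are snapped to a coarse grid so that many individuals share each cell.

```python
import math

# Minimal sketch of spatial generalization. The cell size of 0.01 degrees
# (roughly 1 km) is an arbitrary assumption for illustration, not a
# recommended parameter.

def generalize_point(lat, lon, cell_deg=0.01):
    """Snap a precise (lat, lon) pair to the centre of its grid cell."""
    snap = lambda x: math.floor(x / cell_deg) * cell_deg + cell_deg / 2
    return round(snap(lat), 6), round(snap(lon), 6)

# Two nearby pickup locations become indistinguishable after generalization:
print(generalize_point(45.42421, -75.69871))  # -> (45.425, -75.695)
print(generalize_point(45.42487, -75.69793))  # -> (45.425, -75.695)
```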

Computing re-identification probabilities is not only possible, but necessary

The authors criticize the computation of re-identification probabilities and characterize it as “silly”. They ignore the well-established literature on the computation of re-identification risk [5], [6]. These measurement and estimation techniques have been used for decades to share census data as well as other population data and national surveys. For example, the Journal of Official Statistics has been publishing papers on risk measurement for a few decades. There is no evidence that these published risk probabilities were “silly” or, more importantly, that any data anonymized in reliance on such risk measurements was re-identified.
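
As a rough illustration of what such risk measurement looks like (a simplified sketch only, not the full methodology described in [5]–[7]), records can be grouped by their quasi-identifiers, and a record in a group of size f is assigned a matching probability of 1/f, from which maximum and average risk are then summarized.

```python
from collections import Counter

def risk_summary(records, quasi_identifiers):
    """Simplified per-record risk based on equivalence-class (group) sizes."""
    key = lambda r: tuple(r[q] for q in quasi_identifiers)
    groups = Counter(key(r) for r in records)
    per_record = [1.0 / groups[key(r)] for r in records]
    return {"max_risk": max(per_record),
            "avg_risk": sum(per_record) / len(per_record)}

# Hypothetical toy data for illustration only.
data = [
    {"yob": 1970, "sex": "F", "region": "East"},
    {"yob": 1970, "sex": "F", "region": "East"},
    {"yob": 1982, "sex": "M", "region": "West"},
]
print(risk_summary(data, ["yob", "sex", "region"]))
# max_risk 1.0, avg_risk ~0.67: the single unique record drives the maximum
```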

Second, the authors argue that a demonstration attack where a single individual in a database is re-identified is sufficient to show that a whole database can be re-identified. There is a basic fault here. Re-identification is probabilistic. If the probability of re-identification is 1 in 100, the re-identification of a single record does not mean that it is possible to re-identify all hundred records. That’s not how probabilities work.
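
A toy simulation makes the point concrete, under the simplifying and purely hypothetical assumption that each record is independently re-identifiable with a probability of 1 in 100: demonstrating one success says very little about the rest of the data set.

```python
import random

random.seed(0)
n_records, p, trials = 100, 0.01, 10_000

# Number of records re-identified in each simulated attack on a 100-record data set.
hits = [sum(random.random() < p for _ in range(n_records)) for _ in range(trials)]

print(sum(hits) / trials)  # about 1 record per attack on average
print(max(hits))           # even the worst simulated attack is nowhere near 100
```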

The authors then go on to compare hacking the security of a system to re-identification by saying that if they hack one instance of a system (i.e., a demonstration of the hack) then all instances are hackable. But there is a fundamental difference. Hacking a system is deterministic. Re-identification is not deterministic – re-identifying a record does not mean that all records in the data set are re-identifiable. For example, in clinical research, if we demonstrate that we can cure a single person by giving him a drug (i.e., a demonstration) that does not mean that the drug will cure every other person—that would be nonsense. An effect on an individual patient is just that—an effect on an individual person. As another analogy, an individual being hit by lightning does not mean that everyone else in the same city is going to be hit by lightning. Basically, demonstrating an effect on a single person or a single record does not mean that the same effect will be replicated with certainty for all the others.

We should consider realistic threats

The authors emphasize the importance of considering realistic threats and give some examples of considering acquaintances as potential adversaries. We have developed a methodology that addresses the exact realistic threats that Narayanan and Felten note [4], [7]. Clearly everyone should be using such a robust methodology to perform a proper risk assessment—we agree. Full methodologies for de-identification have been developed (please see our O’Reilly book on this topic [4]) – the failure to use them broadly is the challenge society should be tackling.

The NYC Taxi data set was poorly de-identified – it is not an example of practices that anyone should follow

The re-identification attack on the NYC taxi data was cited as an example of how easy it is to re-identify data. That data set was poorly de-identified, which makes it a great example of the need for a robust de-identification methodology. The NYC taxi data used a one-way hash without a salt, which is simply poor practice, and that takes us back to the earlier point that known methods need to be better disseminated. Using the NYC taxi example to make a general point about the discipline of de-identification is just misleading.
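
A short sketch illustrates the unsalted-hash problem: an unsalted hash of a small, structured identifier space can be reversed simply by hashing every candidate value. The MD5 function and the "digit + letter + two digits" medallion format below are simplifying assumptions for illustration, not a claim about the exact scheme used in the released data.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits

def hash_id(s):
    # Unsalted one-way hash (MD5 used here only for illustration).
    return hashlib.md5(s.encode()).hexdigest()

# Enumerate the small candidate space and precompute every hash.
candidates = (f"{a}{b}{c}{d}"
              for a, b, c, d in product(digits, ascii_uppercase, digits, digits))
reverse_table = {hash_id(x): x for x in candidates}   # only 26,000 entries

released_hash = hash_id("5A23")        # a "de-identified" value in a hypothetical record
print(reverse_table[released_hash])    # -> '5A23': the hash is trivially reversed
```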

Computing correct probabilities for the Heritage Health Prize data set

One example that is mentioned by the authors is the Heritage Health Prize (HHP). This was a large clinical data set that was de-identified and released to a broad community [8]. To verify that the data set had been properly and securely de-identified, HHP’s sponsor commissioned Narayanan to perform a re-identification attack on the HHP data before it was released. It was based on the results of that unsuccessful attack that the sponsor made the decision to release the data for the competition.

In describing his re-identification attack on the HHP data set, Narayanan estimated the risk of re-identification to be 12.5%, using very conservative assumptions. This was materially different from the approximately 1% risk that was computed in the original de-identification analysis [8]. To get to 12.5%, he had to assume that the adversary would know seven different diagnosis codes (not common colloquial terms, but ICD-9 codes) belonging to a particular patient. He states “roughly half of members with 7 or more diagnosis codes are unique if the adversary knows 7 of their diagnosis codes. This works out to be half of 25% or 12.5% of members” (A. Narayanan, “An Adversarial Analysis of the Reidentifiability of the Heritage Health Prize Dataset”, 2011). That, by most standards, is quite a conservative assumption, especially when he also notes that diagnosis codes are not correlated in this data set – i.e., seven unrelated conditions for a single patient! It is not realistic to assume that an adversary knows so much medical detail about a patient; most patients do not know many of the diagnosis codes in their own records. But even if such an adversary does exist, he would learn very little from the data (i.e., the more the adversary already knows, the smaller the information gain from a re-identification). None of the known re-identification attacks that used diagnosis codes had that much detailed background information.
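
For reference, the arithmetic behind the quoted 12.5% figure is simply the product of the two estimates in Narayanan's report, as the short calculation below reproduces.

```python
# Reproducing the arithmetic quoted above.
share_with_7_plus_codes = 0.25       # members with 7 or more diagnosis codes
share_unique_given_7_known = 0.50    # of those, the share unique on 7 known codes
estimated_risk = share_with_7_plus_codes * share_unique_given_7_known
print(f"{estimated_risk:.1%}")       # 12.5%, versus the roughly 1% computed in [8]
```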

The re-identification attack made some other broad claims without supporting evidence—for example, that it would be easy to match the HHP data with the California hospital discharge database. We did that! We matched the individual records in the de-identified HHP data set with the California State Inpatient Database over the relevant period and demonstrated empirically that the match rate was very small.

It should also be noted that this data set was released under terms of use, and all individuals who access the data have to agree to those terms. An adversary who knows a lot about a patient is likely to be living in the US or Canada (i.e., an acquaintance), and therefore the terms of use would be enforceable if there were a deliberate re-identification.

The bottom line from the HHP is that the commissioned re-identification attack, whose purpose was to re-identify individuals in the de-identified data, did not re-identify a single person. You could therefore argue that Narayanan made the empirical case for sound de-identification!

The authors do not propose alternatives

The process of re-identification is probabilistic. There is no such thing as zero risk. If relevant data holders deem any risk to be unacceptable, it will not be possible to share data. That would not make sense – we make risk-based decisions in our personal and business lives every day. Asking for consent or authorization for all data sharing is not practical, and consent introduces bias in the data because specific groups will not provide consent [9], [10]. For the data science community, the line of argument that any risk is too much risk is dangerous and should be very worrisome because it will adversely affect the flow of data.

The authors pose a false dichotomy for the future

The authors conclude that the only alternatives are (a) the status quo, where one de-identifies and, in their words, “hopes for the best”; or (b) using emerging technologies that involve some trade-offs in utility and convenience and/or using legal agreements to limit the use and disclosure of sensitive data.

We strongly disagree with that presentation of the alternatives.  First, the overall concept of trade-offs between data utility and privacy is already built into sound de-identification methodologies [7]. What is acceptable in a tightly controlled, contractually bound situation is quite different from what is acceptable when data will be released publicly – and such trade-offs are and should be quantified.

Second, de-identification is definitely not an alternative to using contracts to protect data. To the contrary, contractual protections are one part (of many) of the risk analyses done in contemporary de-identification methodologies. The absence of a contract always means that more changes to the data are required to achieve responsible de-identification (e.g., generalization, suppression, sub-sampling, or adding noise).

Most of all, we strongly object to the idea that proper de-identification means “hoping for the best.”  We ourselves are strongly critical of any aspect of the status quo whereby data holders use untested, sloppy methods to anonymize sensitive data.  We agree with privacy advocates that such an undisciplined approach is doomed to result in successful re-identification attacks and the growing likelihood of real harm to individuals if badly anonymized data becomes re-identified. Instead, we maintain, on the basis of decades of both theory and real-world evidence, that careful, thorough de-identification using well-tested methodologies achieves crucial data protection and produces a very small risk of re-identification. The challenge that we, as a privacy community, need to rise up to is to transition these approaches into practice and increase the maturity level of de-identification in the real world.

A call to action

It is important to encourage data custodians to use best current practices to de-identify their data. Repeatedly attacking poorly de-identified data captures attention, and it can be constructive if the lesson learned is that better de-identification methods should be used.

References

[1]          K. El Emam, E. Jonker, L. Arbuckle, and B. Malin, “A Systematic Review of Re-Identification Attacks on Health Data,” PLoS ONE, vol. 6, no. 12, p. e28071, Dec. 2011.

[2]          A. Monreale, G. L. Andrienko, N. V. Andrienko, F. Giannotti, D. Pedreschi, S. Rinzivillo, and S. Wrobel, “Movement Data Anonymity through Generalization,” Transactions on Data Privacy, vol. 3, no. 2, pp. 91–121, 2010.

[3]          S. C. Wieland, C. A. Cassa, K. D. Mandl, and B. Berger, “Revealing the spatial distribution of a disease while preserving privacy,” Proc. Natl. Acad. Sci. U.S.A., vol. 105, no. 46, pp. 17608–17613, Nov. 2008.

[4]          K. El Emam and L. Arbuckle, Anonymizing Health Data: Case Studies and Methods to Get You Started. O’Reilly, 2013.

[5]          L. Willenborg and T. de Waal, Statistical Disclosure Control in Practice. New York: Springer-Verlag, 1996.

[6]          L. Willenborg and T. de Waal, Elements of Statistical Disclosure Control. New York: Springer-Verlag, 2001.

[7]          K. El Emam, Guide to the De-Identification of Personal Health Information. CRC Press (Auerbach), 2013.

[8]          K. El Emam, L. Arbuckle, G. Koru, B. Eze, L. Gaudette, E. Neri, S. Rose, J. Howard, and J. Gluck, “De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset,” Journal of Medical Internet Research, vol. 14, no. 1, p. e33, Feb. 2012.

[9]          K. El Emam, F. Dankar, R. Issa, E. Jonker, D. Amyot, E. Cogo, J.-P. Corriveau, M. Walker, S. Chowdhury, R. Vaillancourt, T. Roffey, and J. Bottomley, “A Globally Optimal k-Anonymity Method for the De-identification of Health Data,” Journal of the American Medical Informatics Association, vol. 16, no. 5, pp. 670–682, 2009.

[10]        K. El Emam, E. Jonker, E. Moher, and L. Arbuckle, “A Review of Evidence on Consent Bias in Research,” American Journal of Bioethics, vol. 13, no. 4, pp. 42–44, 2013.

Privacy Chutzpah: A Story for the Onion?

I recently received an email promoting a campaign by a group called Some of Us, an organization that generates petitions opposing various activities of large companies. This campaign was directed at Facebook, calling on the social network not to sell user data to advertisers. Facebook recently announced plans to allow advertisers to target ads to Facebook users based on the web sites those users have visited. Facebook is not selling user data to advertisers, but I can understand the confusion. Behavioral advertising is complicated, and although selling user data to advertisers is very different from choosing ads for users based on their web surfing, it’s not uncommon for critics to use broad language to blast targeted ads in general.

(Video: “How Ads Work on Facebook,” from Facebook on Vimeo.)

The surprise was what I found when I examined the privacy policy for the Some of Us site. In a move worthy of an Onion fake news story, the Some of Us policy discloses that it works with ad networks to retarget ads to users across the web after they visit the Some of Us site. Yup! Some of Us does exactly what it is calling on users to protest Facebook for doing. A quick scan of the site using the popular tracking-cookie scanner Ghostery finds the code of several ad companies, including leading data broker Acxiom.

Some of Us also complains that the Facebook opt-out process, in which Facebook links users to the industry’s central opt-out site at aboutads.info, is too tedious. But Some of Us doesn’t even bother to provide its visitors with a link or a URL to opt out, as the behavioral advertising code enforced by the Better Business Bureau requires. Some of Us just tells visitors they can visit the Network Advertising Initiative opt-out page, leaving them to find it on their own.

It gets better. Some of Us solicits users’ names and emails for petitions, but only if you read the site’s privacy policy will you learn that signing a petition adds you to an email list for future messages from Some of Us about other causes. The privacy policy also explains the use of email web bugs that enable Some of Us to track when, and whether, individual recipients open and read its emails.
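
For readers unfamiliar with the term, a web bug is typically a tiny image whose URL encodes a per-recipient token; when the email is opened, the image request tells the sender who opened it and when. The sketch below illustrates only the general technique; the domain, parameter name, and code are hypothetical and are not Some of Us's actual implementation.

```python
import uuid

def tracking_pixel_html(recipient_email, base_url="https://example.org/open"):
    # Hypothetical generic "web bug": a 1x1 image whose URL embeds a stable
    # per-recipient token, so the sender's server logs each open.
    token = uuid.uuid5(uuid.NAMESPACE_URL, recipient_email)
    return f'<img src="{base_url}?r={token}" width="1" height="1" alt="">'

print(tracking_pixel_html("alice@example.com"))
```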

I am used to reading media stories blasting behavioral ads that run on the home pages of newspapers themselves embedded with dozens of web trackers. Reporters don’t run their newspapers’ web sites, and although they might want to consider whether the ad tracking they find odious is funding their salaries, they can credibly argue that the business side of the media and reporting are separate worlds. But how can an advocacy group blast behavioral ads while targeting behavioral ads at users who come to sign a petition against behavioral ads?!!!

I signed the petition and was immediately taken to a page where Some of Us encouraged me to share the news with my friends on Facebook.

-Jules Polonetsky, Executive Director

This post originally appeared on LinkedIn.

Mexico Takes Step Toward Data Privacy Interoperability

Last week, the Mexican Institute for Federal Access to Information (IFAI) hosted an event in Mexico City to discuss the recently announced “Parameters of Self-Regulation for the Protection of Personal Data.” FPF participated in this workshop along with representatives from the Mexican government, TRUSTe, EuroPriSe and the Better Business Bureau.

As described in opening remarks by the Secretary for Data Protection, IFAI now has the authority under the new regulation to recognize codes of conduct for data protection and has developed a process through which an organization can be recognized as a certifying body for these codes. Under the new regulation, the Mexican Accreditation Agency will evaluate applicant organizations against a set of recognition criteria. Successful applicants will then receive formal recognition as certifying entities from the Ministry of the Economy.

This approach mirrors the process developed as part of the Asia Pacific Economic Cooperation’s (APEC) Cross Border Privacy Rules (CBPR) system in several key ways. First, the certifying organizations contemplated under this approach serve the same function as “Accountability Agents” under the CBPR system. Second, both approaches require formal recognition based on established criteria. Third, the standards to which these organizations will certify companies are both keyed to Mexico’s Federal Law on the Protection of Personal Information (the legal basis for Mexico’s participation in the CBPR system). Given these parallels in both process and substance, a company that receives CBPR certification in Mexico should also be able to attain recognition under this approach. But perhaps most importantly, CBPR certification should allow a company to avail itself of the incentives offered under Mexican law.

Article 68 of the implementing regulations of the privacy law encourages the development of self-regulatory frameworks and states that participation in a recognized framework (such as the CBPR system) will be taken into account in order to determine any reduction in sanctions determined by IFAI in the event of a violation of the privacy law.

What makes this development so critical to global interoperability is that it serves as a model for other APEC member economies to consider how an enforceable code of conduct based on an international standard can be successfully incorporated into a legal regime – including extending express benefits to certified companies.  It remains to be seen how other APEC economies  will manage this task – but Mexico’s approach offers a promising start.

-Josh Harris, Policy Director 

"Gambling? In This Casino?" Jules and Omer on the Facebook Experiment

Today, Re/code ran an essay by Jules Polonetsky and Omer Tene, offering their take on Facebook’s now-infamous experiment looking at the effects of tweaking the amount of positive or negative content in users’ News Feeds:

As the companies that serve us play an increasingly intimate role in our lives, understanding how they shape their services to influence users has become a vexing policy issue. Data can be used for control and discrimination or utilized to support fairness and freedom. Establishing a process for ethical decision-making is key to ensuring that the benefits of data exceed their costs.