Making Perfect De-Identification the Enemy of Good De-Identification


This week, Ann Cavoukian and Dan Castro waded into the de-identification debate with a new whitepaper, arguing that the risk of re-identification has been greatly exaggerated and that de-identification will play a central role in the age of big data. FPF has repeatedly called for informed conversations about what practical de-identification requires. Part of the challenge is that terms like “de-identification” and “anonymization” have come to mean very different things to different stakeholders, but the larger problem is that privacy advocates have effectively made perfection the enemy of the good when it comes to de-identifying data.

Cavoukian and Castro highlight the oft-cited re-identification of Netflix users as an example of how re-identification risks have been overblown. Researchers were able to compare data released by Netflix with records available on the Internet Movie Database in order to uncover the identities of Netflix users. While this example highlights the challenges facing organizations when they release large public datasets, it is easy to overlook that only two out of 480,189 Netflix users were successfully identified in this fashion. That is a 0.0004 percent re-identification rate, only a little bit worse than anyone’s odds of being struck by lightning.*
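For concreteness, that rate works out as follows. This is just a quick arithmetic sketch using the figures already cited above:

```python
# Figures from the Netflix/IMDb re-identification study cited above:
# 2 users confirmed re-identified out of 480,189 in the released dataset.
reidentified = 2
total_users = 480_189

rate = reidentified / total_users
print(f"{rate * 100:.4f}%")               # prints 0.0004% -- the rate quoted in the text
print(f"about 1 in {round(1 / rate):,}")  # roughly 1 in 240,000
```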

De-identification’s limitations are often conflated with a lack of trust in how organizations handle data in general. Most of the big examples of re-identification, like the Netflix example, focus on publicly released datasets. When data is released into the wild, organizations need to be extremely careful; once data is out there, anyone with the time, energy, or technological capability can try to re-identify it. There’s no question that companies have made mistakes when it comes to making their data widely available to the public.

But publicly released information does not describe the entire universe of data that exists today. In reality, much data is never released publicly. Instead, de-identification is often paired with a variety of administrative and procedural safeguards that govern how individuals and organizations can use data. When these are used in combination, bad actors must (1) circumvent the administrative restraints and then (2) re-identify the data before getting any value from their malfeasance. As a matter of simple statistics, the probability of breaching both sets of controls and successfully re-identifying data in a non-public database is low.
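As a rough sketch of that compounding effect: if the two hurdles are independent, their probabilities multiply. The administrative-breach probability below is purely hypothetical, chosen only for illustration; the re-identification rate is the one from the Netflix example above.

```python
# Hypothetical illustration of layered controls -- not empirical estimates.
p_admin_breach = 0.01       # assumed chance of circumventing administrative safeguards
p_reidentify = 2 / 480_189  # re-identification rate from the Netflix example

# Assuming independence, a bad actor must clear both hurdles:
p_both = p_admin_breach * p_reidentify
print(f"{p_both:.2e}")  # prints 4.17e-08 -- far smaller than either risk alone
```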

De-identification critics remain skeptical. Some have argued that any potential ability to reconnect information to an individual’s personal identity suggests inadequate de-identification. Perfect unlinkability may be an impossible standard, but this argument is less an attack on the efficacy of de-identification than it is a manifestation of a lack of trust. When some in industry suggest that privacy can be ignored, it becomes easier for critics to distrust how businesses protect data. Fights about de-identification thus become a proxy for how much to trust industry.

In the process, discussions about how to advance practical de-identification are lost. As a privacy community, we should fight over exactly what de-identification means. FPF is currently engaged in just such a scoping project. Recognizing that academics, advocates, and industry hold many different standards for what counts as “de-identified” data should be the start of a serious discussion about what we expect out of de-identification, not a reason to cast the concept aside altogether. Perfect de-identification may be impossible, but good de-identification isn’t.

-Joseph Jerome, Policy Counsel

* Daniel Barth-Jones notes that I’ve compared the Netflix re-identification study to the annual risk of being hit by lightning and responds as follows:

This was an excellent and timely piece, but there’s a fact that should be corrected because this greatly diminishes the actual impact of the statistic you’ve cited. The article cites the fact that only two out of 480,189 Netflix users were successfully identified using the IMDb data, which rounds to a 0.0004 percent (i.e., 0.000004 or 1/240,000) re-identification risk. This is correct, but then the piece goes on to say “that’s only a little bit worse than anyone’s odds of being struck by lightning.” Which, without further explanation, is likely to be misconstrued.

The blog author cites the annual risk for being hit by lightning (which is, of course, exceedingly small). However, the way most people probably think about lightning risk is not “what’s the risk of being hit in the next year”, but rather “what’s my risk of ever being hit by lightning?” While estimates of the lifetime risk of being hit by lightning vary slightly (according to the precision of the formulas used to calculate this estimate), one’s lifetime odds of being hit by lightning are somewhere between 1 in 6,250 and 1 in 10,000, so even if you went with the more conservative number here, the risk of being re-identified by the Netflix attack was only 1/24 of your lifetime risk of being hit by lightning (assuming you’ll make it to age 80 without something else getting you). This is truly a risk at a magnitude that no one rationally worries about.

Although the evidence-base provided by the Netflix re-identification was extremely thin, the algorithm is intelligently designed and it will be helpful to the furtherance of sound development of public policy to see what the re-identification potential is for such an algorithm with a real-world sparse dataset (perhaps medical data?) for a randomly selected data sample when examined with some justifiable starting assumptions regarding the extent of realistic data intruder background knowledge (which should reasonably account for practical data divergence issues).