Cross Border Privacy Rules Advance at Beijing Meetings

APEC’s Data Privacy Subgroup concluded its 2014 meetings in Beijing, China, earlier this week. The Future of Privacy Forum participated in these meetings as a member of the U.S. delegation. The biggest development of the week was Canada’s submission of its Notice of Intent to participate in the Cross Border Privacy Rules (CBPR) system. After a favorable determination by APEC’s Joint Oversight Panel, Canada will become the fourth country to join the system, along with the United States, Mexico and Japan. In addition, TRUSTe, an APEC-approved Accountability Agent, announced that 14 companies are in the process of seeking certification. Taken together, these developments, along with Mexico’s recent steps toward interoperability, have provided promising momentum in the establishment of an international privacy framework.

Still, much work remains before the true potential of the system can be fully realized. In July, FPF hosted officials from Privacy Thailand, a university-based consortium that advises the Thai Prime Minister’s office on data privacy and security issues. During their week-long visit, FPF and Privacy Thailand met with representatives from the Department of Commerce, the Federal Trade Commission and the U.S. Department of State to consider Thailand’s accession to the system. FPF will continue to work with interested APEC members to provide capacity-building assistance.

On August 8, APEC Economies and representatives from the EU’s Article 29 Working Party met to discuss next steps on the jointly developed Common Referential.  This document identifies points of commonality between the CBPR system and the EU’s system of Binding Corporate Rules (BCRs).  APEC members agreed to take this work forward by developing case studies that demonstrate the practical interoperability of these two systems and a checklist outlining the combined obligations for a company seeking certification under both.

On August 10, APEC Economies agreed to establish a working group to consider the applicability of the APEC Privacy Framework to Big Data.  This group will consider, among other things, appropriate administrative and policy safeguards when de-identifying personal information.  FPF plans to participate in this working group.

Participants continued the development of a CBPR certification system for data processors. In July, FPF hosted a meeting of this working group to develop the program requirements under this certification. Completion of this project is expected in advance of the next APEC Data Privacy Subgroup meetings in Clark, Philippines, in January 2015.

Comments to NTIA on Big Data and Privacy

Today, FPF submitted comments to the NTIA as it begins its exploration of how big data impacts the Consumer Privacy Bill of Rights. While the NTIA sought comment on over a dozen key questions, our filing focuses largely on four issues: (1) the need for additional clarity surrounding the flexible application of the Consumer Privacy Bill of Rights’ privacy principles, (2) challenges to the “notice and choice” model and using context to inform a use-based approach to data use, (3) practical de-identification, and (4) what internal review boards might look like and consider in the age of big data.

Much of our filing builds upon FPF’s thinking on how to develop a benefit-risk analysis for data projects, with big data concerns of particular importance. Industry increasingly faces ethical considerations over how to minimize data risks while maximizing benefits to all parties. As the White House’s earlier Big Data Report acknowledged, there is a potential tension between socially beneficial and privacy-invasive uses of information in everything from educational technology to consumer-generated health data. The advent of big data requires active engagement by both internal and external stakeholders to increase transparency, accountability and trust.

FPF believes that a documented review process could serve as an important tool to infuse ethical considerations into data analysis without requiring radical changes to the business practices of innovators or industry in general. Institutional review boards (IRBs), which remain the chief regulatory response to decades of questionable ethical decisions in the field of human subject testing, provide a useful precedent for focusing on good process controls as a way to address potential privacy concerns. While IRBs have become a rigid compliance device and would be inappropriate for wholesale use in big data decision-making, they could provide a useful template for how projects can be evaluated based on prevailing community standards and subjective determinations of risks and benefits, particularly in cases involving greater privacy risks. Using an IRB model as inspiration, big data may warrant the creation of new advisory processes within organizations to more fully consider ethical questions posed by big data.

Moving forward, broader big data ethics panels could provide a commonsense response to public concerns about data misuse. While these institutions could provide a further expansion of the role of privacy professionals within organizations, they might also provide a forum for a diversity of viewpoints inside and outside organizations. Ethics reviews could include members with different backgrounds, training, and experience, and could seek input from outside actors including consumer groups and regulators. While these panels will vary between the public and private sector, businesses and researchers, they could provide an important check on any data misuse.

Organizations and privacy professionals have become experienced at evaluating risk, but they should also engage in a rigorous data benefit analysis in conjunction with traditional privacy risk assessments. FPF suggests that organizations could develop procedures to assess the “raw value” of a data project, which would require organizations to identify the nature of a project, its potential beneficiaries, and the degree to which those beneficiaries would benefit from the project. Our guidance for this process is included in our filing for the first time.

Of course, big data hasn’t changed all the rules. And not every use of big data implicates our privacy. Many uses of big data are machine-to-machine or highly aggregated. Many new uses of data are marginal, which our current processes for mitigating risks can well address.

De-Identification: A Critical Debate

Ann Cavoukian and Dan Castro recently published a report titled Big Data and Innovation, Setting the Record Straight: De-Identification Does Work. Arvind Narayanan and Edward Felten wrote a critique of this report, which they highlighted on Freedom to Tinker. Today Khaled El Emam and Luk Arbuckle respond on the FPF blog with this guest post.

Why de-identification is a key solution for sharing data responsibly

Khaled El Emam (University of Ottawa, CHEO Research Institute & Privacy Analytics Inc.)

Luk Arbuckle (CHEO Research Institute, Privacy Analytics Inc.)

Arvind Narayanan and Edward Felten have responded to a recent report by Ann Cavoukian and Dan Castro (Big Data and Innovation, Setting the Record Straight:  De-Identification Does Work)  by claiming that de-identification is “not a silver bullet” and “still does not work.” The authors are misleading on both counts. First, no one, certainly not Cavoukian or Castro, claims that de-identification is a silver bullet, if by that you mean that de-identification is the modern equivalent of the medieval, magic weapon that could always and inexplicably defeat otherwise unconquerable foes like werewolves and vampires. Second, and to get away from unhelpful metaphors, de-identification does work, both in theory and in practice, and there is ample evidence that that’s true.  Done properly, de-identification is a reliable and indispensable technique for sharing data in a responsible way that protects individuals.

Narayanan and Felten assert viewpoints that are not shared by the larger disclosure control community. Assuming the reader has already read both reports, we’ll respond to some of Narayanan’s and Felten’s claims and look at the evidence.

It’s important to highlight that we take an evidence-based approach: we support our statements with evidence and systematic reviews, rather than express opinions. This is important because the evidence does not support the Narayanan and Felten perspective on de-identification.

Real-world evidence shows that the risk of re-identifying properly anonymized data is very small

Established, published, and peer-reviewed evidence shows that following contemporary good practices for de-identification ensures that the risk of re-identification is very small [1]. In that systematic review (which is the gold standard methodology for summarizing evidence on a given topic) we found that there were 14 known re-identification attacks. Two of those were conducted on data sets that were de-identified with methods that would be defensible (i.e., they followed existing standards). The success rate of the re-identification for these two was very small.

It is possible to de-identify location data

The authors claim that there are no good methods for de-identifying location data. In fact, there is relevant work on the de-identification of different types of location data [2]–[4]. The challenge we are facing is that many of these techniques are not being deployed in practice. We have a knowledge dissemination problem rather than a knowledge problem – i.e., sound techniques are known and available, but not often enough used. We should be putting our energy into translating best practices within the analytics community.

Computing re-identification probabilities is not only possible, but necessary

The authors criticize the computation of re-identification probabilities and characterize that as “silly”. They ignore the well-established literature on the computation of re-identification risk [5], [6]. These measurement and estimation techniques have been used for decades to share census as well as other population data and national surveys. For example, the Journal of Official Statistics has been publishing papers on risk measurement for a few decades. There is no evidence that these published risk probabilities were “silly” or, more importantly, that any of the data anonymized in reliance on such risk measurements was re-identified.

Second, the authors argue that a demonstration attack where a single individual in a database is re-identified is sufficient to show that a whole database can be re-identified. There is a basic fault here. Re-identification is probabilistic. If the probability of re-identification is 1 in 100, the re-identification of a single record does not mean that it is possible to re-identify all hundred records. That’s not how probabilities work.
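To make the 1-in-100 example concrete, here is a minimal sketch of the arithmetic. It assumes, purely for illustration, that each of the 100 records can be re-identified independently with the same probability; real data sets rarely satisfy this exactly, and the variable names are ours.

```python
# Illustrative assumption: 100 records, each independently re-identifiable
# with probability 1/100. Independence is a simplification for the sketch.
n_records = 100
p_single = 1 / 100

# Expected number of records an adversary re-identifies.
expected_reidentified = n_records * p_single

# Probability that at least one record is re-identified.
p_at_least_one = 1 - (1 - p_single) ** n_records

# Probability that every record is re-identified.
p_all = p_single ** n_records

print(f"Expected re-identified records: {expected_reidentified:.1f}")
print(f"P(at least one re-identified):  {p_at_least_one:.3f}")
print(f"P(all re-identified):           {p_all:.1e}")
```

Under these assumptions a single successful demonstration is unsurprising, while re-identifying the whole data set is vanishingly unlikely.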

The authors then go on to compare hacking the security of a system to re-identification by saying that if they hack one instance of a system (i.e., a demonstration of the hack) then all instances are hackable. But there is a fundamental difference. Hacking a system is deterministic. Re-identification is not deterministic – re-identifying a record does not mean that all records in the data set are re-identifiable. For example, in clinical research, if we demonstrate that we can cure a single person by giving him a drug (i.e., a demonstration) that does not mean that the drug will cure every other person—that would be nonsense. An effect on an individual patient is just that—an effect on an individual person. As another analogy, an individual being hit by lightning does not mean that everyone else in the same city is going to be hit by lightning. Basically, demonstrating an effect on a single person or a single record does not mean that the same effect will be replicated with certainty for all the others.

We should consider realistic threats

The authors emphasize the importance of considering realistic threats and give some examples of considering acquaintances as potential adversaries. We have developed a methodology that addresses the exact realistic threats that Narayanan and Felten note [4], [7]. Clearly everyone should be using such a robust methodology to perform a proper risk assessment—we agree. Full methodologies for de-identification have been developed (please see our O’Reilly book on this topic [4]) – the failure to use them broadly is the challenge society should be tackling.

The NYC Taxi data set was poorly de-identified – it is not an example of practices that anyone should follow

The re-identification attack on the NYC taxi data was cited as an example of how easy it is to re-identify data. That data set was poorly de-identified, which makes for a great example of the need for a robust de-identification methodology. The NYC Taxi data used a one-way hash without a salt, which is just poor practice, and takes us back to the earlier point that known methods need to be better disseminated. Using the NYC taxi example to make a general point about the discipline of de-identification is just misleading.
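The weakness of an unsalted hash over a small identifier space can be sketched as follows. The identifier format (one letter plus three digits), the choice of SHA-256, and the key are hypothetical and chosen only for illustration; they are not the format or algorithm of the actual data set.

```python
import hashlib
import hmac
import itertools
import string


def unsalted_hash(identifier: str) -> str:
    """Plain one-way hash with no salt or key."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def keyed_hash(identifier: str, secret_key: bytes) -> str:
    """Keyed hash (HMAC); the key is held only by the data custodian."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def all_identifiers():
    """Enumerate every identifier in the hypothetical format: letter + 3 digits."""
    for letter in string.ascii_uppercase:
        for digits in itertools.product(string.digits, repeat=3):
            yield letter + "".join(digits)


# An adversary can hash every possible identifier ahead of time.
lookup = {unsalted_hash(i): i for i in all_identifiers()}

# The unsalted pseudonym is reversed with a single dictionary lookup.
published = unsalted_hash("K042")
recovered = lookup.get(published)

# With a secret key, the precomputed table no longer matches anything.
secret_key = b"hypothetical-secret-kept-by-the-custodian"
protected = keyed_hash("K042", secret_key)
recovered_protected = lookup.get(protected)

print(len(lookup), recovered, recovered_protected)
```

The table covers only 26,000 candidates, so building it takes a fraction of a second. A secret key or salt is one ingredient of sound pseudonymization, not a complete de-identification methodology.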

Computing correct probabilities for the Heritage Health Prize data set

One example that is mentioned by the authors is the Heritage Health Prize (HHP). This was a large clinical data set that was de-identified and released to a broad community [8]. To verify that the data set had been properly and securely de-identified, HHP’s sponsor commissioned Narayanan to perform a re-identification attack on the HHP data before it was released. It was based on the results of that unsuccessful attack that the sponsor made the decision to release the data for the competition.

In describing his re-identification attack on the HHP data set, Narayanan estimated the risk of re-identification to be 12.5%, using very conservative assumptions. This was materially different from the approximately 1% risk that was computed in the original de-identification analysis [8]. To get to 12.5%, he had to assume that the adversary would know seven different diagnosis codes (not common colloquial terms, but ICD-9 codes) that belong to a particular patient. He states “roughly half of members with 7 or more diagnosis codes are unique if the adversary knows 7 of their diagnosis codes. This works out to be half of 25% or 12.5% of members” (A. Narayanan, “An Adversarial Analysis of the Reidentifiability of the Heritage Health Prize Dataset”, 2011). That, by most standards, is quite a conservative assumption, especially when he also notes that diagnosis codes are not correlated in this data set – i.e., seven unrelated conditions for a patient! It’s not realistic to assume that an adversary knows so much medical detail about a patient. Most patients themselves do not know many of the diagnosis codes in their own records. But even if such an adversary does exist, he would learn very little from the data (i.e., the more the adversary already knows, the smaller the information gain from a re-identification). None of the known re-identification attacks that used diagnosis codes had that much detailed background information.
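The arithmetic behind the two figures can be laid out explicitly. This sketch takes the 25% share of members with seven or more diagnosis codes from the quoted passage and the roughly 1% figure from the original analysis; the variable names are ours.

```python
# Figures taken from the quoted passage and from the original analysis [8].
share_with_7_or_more_codes = 0.25   # "half of 25%" in the quoted passage
share_unique_given_7_known = 0.50   # "roughly half ... are unique"
original_estimate = 0.01            # "approximately 1%"

# Risk under the assumption that the adversary knows 7 diagnosis codes.
adversarial_estimate = share_with_7_or_more_codes * share_unique_given_7_known

# How far apart the two estimates are.
ratio = adversarial_estimate / original_estimate

print(f"Adversarial estimate: {adversarial_estimate:.1%}")
print(f"Ratio to original estimate: {ratio:.1f}x")
```

The entire gap between the two estimates comes from the background knowledge the adversary is assumed to hold.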

The re-identification attack made some other broad claims without supporting evidence, for example that it would be easy to match the HHP data with the California hospital discharge database. We did that! We matched the individual records in the de-identified HHP data set with the California State Inpatient Database over the relevant period, and demonstrated empirically that the match rate was very small.

It should also be noted that this data set had a terms-of-use on it. All individuals who have access to the data have to agree to these terms-of-use. An adversary who knows a lot about a patient is likely to be living in the US or Canada (i.e., an acquaintance) and therefore the terms-of-use would be enforceable if there was a deliberate re-identification.

The bottom line from the HHP is that the commissioned re-identification attack (whose purpose was to re-identify individuals in the de-identified data) did not re-identify a single person. You could therefore argue that Narayanan made the empirical case for sound de-identification!

The authors do not propose alternatives

The process of re-identification is probabilistic. There is no such thing as zero risk. If relevant data holders deem any risk to be unacceptable, it will not be possible to share data. That would not make sense – we make risk-based decisions in our personal and business lives every day. Asking for consent or authorization for all data sharing is not practical, and consent introduces bias in the data because specific groups will not provide consent [9], [10]. For the data science community, the line of argument that any risk is too much risk is dangerous and should be very worrisome because it will adversely affect the flow of data.

The authors pose a false dichotomy for the future

The authors conclude that the only alternatives are (a) the status quo, where one de-identifies and, in their words, “hopes for the best”; or (b) using emerging technologies that involve some trade-offs in utility and convenience and/or using legal agreements to limit use and disclosure of sensitive data.

We strongly disagree with that presentation of the alternatives.  First, the overall concept of trade-offs between data utility and privacy is already built into sound de-identification methodologies [7]. What is acceptable in a tightly controlled, contractually bound situation is quite different from what is acceptable when data will be released publicly – and such trade-offs are and should be quantified.

Second, de-identification is definitely not an alternative to using contracts to protect data. To the contrary, contractual protections are one part (of many) of the risk analyses done in contemporary de-identification methodologies. The absence of a contract always means that more changes to the data are required to achieve responsible de-identification (e.g., generalization, suppression, sub-sampling, or adding noise).
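As a rough illustration of two of the transformations named above, the sketch below generalizes two quasi-identifiers and then suppresses any record whose generalized combination is shared by fewer than k records. The field names, the banding rules, and the threshold are hypothetical choices for the example, not a description of any particular published methodology.

```python
from collections import Counter


def generalize(record):
    """Generalize age to a 10-year band and ZIP code to its first 3 digits."""
    band_start = (record["age"] // 10) * 10
    age_band = f"{band_start}-{band_start + 9}"
    zip_prefix = record["zip"][:3] + "**"
    return (age_band, zip_prefix)


def generalize_and_suppress(records, k):
    """Keep only generalized records whose group contains at least k members."""
    generalized = [generalize(r) for r in records]
    group_sizes = Counter(generalized)
    return [g for g in generalized if group_sizes[g] >= k]


# Hypothetical input records.
records = [
    {"age": 34, "zip": "10001"},
    {"age": 37, "zip": "10002"},
    {"age": 31, "zip": "10003"},
    {"age": 62, "zip": "94110"},  # unique after generalization, so suppressed
]

released = generalize_and_suppress(records, k=3)
print(released)
```

Under a public release with no contract, one would typically pick a larger k (more generalization and suppression) than for a release to a contractually bound recipient.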

Most of all, we strongly object to the idea that proper de-identification means “hoping for the best.”  We ourselves are strongly critical of any aspect of the status quo whereby data holders use untested, sloppy methods to anonymize sensitive data.  We agree with privacy advocates that such an undisciplined approach is doomed to result in successful re-identification attacks and the growing likelihood of real harm to individuals if badly anonymized data becomes re-identified. Instead, we maintain, on the basis of decades of both theory and real-world evidence, that careful, thorough de-identification using well-tested methodologies achieves crucial data protection and produces a very small risk of re-identification. The challenge that we, as a privacy community, need to rise up to is to transition these approaches into practice and increase the maturity level of de-identification in the real world.

A call to action

It is important to encourage data custodians to use best current practices to de-identify their data. Repeatedly attacking poorly de-identified data captures attention, and it can be constructive if the lesson learned is that better de-identification methods should be used.

References

[1] K. El Emam, E. Jonker, L. Arbuckle, and B. Malin, “A Systematic Review of Re-Identification Attacks on Health Data,” PLoS ONE, vol. 6, no. 12, p. e28071, Dec. 2011.

[2] A. Monreale, G. L. Andrienko, N. V. Andrienko, F. Giannotti, D. Pedreschi, S. Rinzivillo, and S. Wrobel, “Movement Data Anonymity through Generalization,” Transactions on Data Privacy, vol. 3, no. 2, pp. 91–121, 2010.

[3] S. C. Wieland, C. A. Cassa, K. D. Mandl, and B. Berger, “Revealing the spatial distribution of a disease while preserving privacy,” Proc. Natl. Acad. Sci. U.S.A., vol. 105, no. 46, pp. 17608–17613, Nov. 2008.

[4] K. El Emam and L. Arbuckle, Anonymizing Health Data: Case Studies and Methods to Get You Started. O’Reilly, 2013.

[5] L. Willenborg and T. de Waal, Statistical Disclosure Control in Practice. New York: Springer-Verlag, 1996.

[6] L. Willenborg and T. de Waal, Elements of Statistical Disclosure Control. New York: Springer-Verlag, 2001.

[7] K. El Emam, Guide to the De-Identification of Personal Health Information. CRC Press (Auerbach), 2013.

[8] K. El Emam, L. Arbuckle, G. Koru, B. Eze, L. Gaudette, E. Neri, S. Rose, J. Howard, and J. Gluck, “De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset,” Journal of Medical Internet Research, vol. 14, no. 1, p. e33, Feb. 2012.

[9] K. El Emam, F. Dankar, R. Issa, E. Jonker, D. Amyot, E. Cogo, J.-P. Corriveau, M. Walker, S. Chowdhury, R. Vaillancourt, T. Roffey, and J. Bottomley, “A Globally Optimal k-Anonymity Method for the De-identification of Health Data,” Journal of the American Medical Informatics Association, vol. 16, no. 5, pp. 670–682, 2009.

[10] K. El Emam, E. Jonker, E. Moher, and L. Arbuckle, “A Review of Evidence on Consent Bias in Research,” American Journal of Bioethics, vol. 13, no. 4, pp. 42–44, 2013.

Privacy Chutzpah: A Story for the Onion?

I recently received an email promoting a campaign by a group called Some of Us, an organization that generates petitions opposing various activities of large companies. This campaign was directed at Facebook, calling on the social network to not sell user data to advertisers. Facebook has recently announced plans to allow advertisers to target ads to Facebook users based on the web sites users have visited. Facebook is not selling user data to advertisers, but I can understand the confusion. Behavioral advertising is complicated, and although selling user data to advertisers is very different from choosing ads for users based on their web surfing, it’s not uncommon for critics to use broad language to blast targeted ads in general.

The surprise was what I found when I examined the privacy policy for the Some of Us site. In a move worthy of an Onion fake news story, the Some of Us policy discloses that it works with ad networks to retarget ads to users on the web after they visit the Some of Us site. Yup! Some of Us does exactly what it is calling on users to protest to Facebook. A quick scan of the site using popular tracking cookie scanner Ghostery finds the code for several ad companies, including leading data broker Acxiom.

Some of Us also complains that the Facebook opt-out process, where Facebook links users to the industry central opt-out site found at aboutads.info, is too tedious. But Some of Us doesn’t even bother to provide its visitors with a link or a URL to opt out, as the behavioral advertising code enforced by the Better Business Bureau requires. Some of Us just tells visitors they can visit the Network Advertising Initiative opt-out page, leaving them to research how to find the opt-out page on their own.

It gets better. Some of Us solicits users’ emails and names for petitions, but only if you read the site privacy policy will you learn that signing the petition adds you to the email list for future emails from Some of Us about other causes. The site privacy policy also explains the use of email web bugs that enable Some of Us to personally track when and if the recipients of emails open and read the emails.

I am used to reading stories in the media blasting behavioral ads on the home pages of newspapers embedded with dozens of web trackers. Reporters don’t run the web sites of newspapers, and although they might want to consider whether the ad tracking they consider odious is funding their salaries, they can credibly argue that the business side of media and reporting are separate worlds. But how can an advocacy group blast behavioral ads while targeting behavioral ads to users who come to sign a petition against behavioral ads?!!!

I signed the petition and was immediately taken to a page where Some of Us encouraged me to share the news with my friends on Facebook.

-Jules Polonetsky, Executive Director

This post originally appeared on LinkedIn.

Mexico Takes Step Toward Data Privacy Interoperability

Last week, the Mexican Institute for Federal Access to Information (IFAI) hosted an event in Mexico City to discuss the recently-announced “Parameters of Self-Regulation for the Protection of Personal Data.”  FPF participated in this workshop along with representatives from the Mexican government, TRUSTe, EuroPriSe and the Better Business Bureau.

As described in opening remarks by the Secretary for Data Protection, under the new regulation, IFAI now has the authority to recognize codes of conduct for data protection and has developed a process through which an organization can be recognized as a certifying body for these codes. Under the new regulation, the Mexican Accreditation Agency will make a determination on applicant organizations against a set of recognition criteria. Successful applicants will then receive formal recognition as certifying entities from the Ministry of the Economy.

This approach mirrors the process developed as part of the Asia Pacific Economic Cooperation’s (APEC) Cross Border Privacy Rules (CBPR) system in several key ways. First, the certifying organizations contemplated under this approach serve the same function as “Accountability Agents” under the CBPR system. Second, both approaches require a formal recognition based on established criteria. Third, the standards to which these organizations will be certifying companies are both keyed to Mexico’s Federal Law on the Protection of Personal Information (the legal basis for Mexico’s participation in the CBPR system). Given these parallels in both process and substance, a company that receives CBPR certification in Mexico should also be able to attain recognition under this approach. But perhaps most importantly, CBPR certification should allow a company to avail itself of the incentives offered under Mexican law.

Article 68 of the implementing regulations of the privacy law encourages the development of self-regulatory frameworks and states that participation in a recognized framework (such as the CBPR system) will be taken into account in order to determine any reduction in sanctions determined by IFAI in the event of a violation of the privacy law.

What makes this development so critical to global interoperability is that it serves as a model for other APEC member economies to consider how an enforceable code of conduct based on an international standard can be successfully incorporated into a legal regime – including extending express benefits to certified companies.  It remains to be seen how other APEC economies  will manage this task – but Mexico’s approach offers a promising start.

-Josh Harris, Policy Director 

"Gambling? In This Casino?" Jules and Omer on the Facebook Experiment

Today, Re/code ran an essay by Jules Polonetsky and Omer Tene, offering their take on Facebook’s now-infamous experiment looking at the effects of tweaking the number of positive or negative comments on a user’s News Feed:

As the companies that serve us play an increasingly intimate role in our lives, understanding how they shape their services to influence users has become a vexing policy issue. Data can be used for control and discrimination or utilized to support fairness and freedom. Establishing a process for ethical decision-making is key to ensuring that the benefits of data exceed their costs.

FPFcast: Stalking and the Location Privacy Protection Act with Cindy Southworth

June 30, 2014: Stalking and the Location Privacy Protection Act

In this podcast, FPF Policy Counsel Joseph Jerome talks with Cindy Southworth from the National Network to End Domestic Violence about stalking apps and how Senator Franken’s proposed bill might curtail their use.

Synopsis: Education Privacy Hearing—How Data Mining Threatens Student Privacy

Yesterday, the House of Representatives Education Subcommittee on Early Childhood, Elementary, and Secondary Education and the Homeland Security Committee’s Subcommittee on Cybersecurity, Infrastructure Protection, and Security Technologies held a joint hearing to discuss “How Data Mining Threatens Student Privacy.”

Four witnesses presented testimony from a number of perspectives:

(1) Joel R. Reidenberg, Chair and Professor of Law and Founding Academic Director of the Center on Law and Information Policy at Fordham University School of Law; (2) Mark MacCarthy, Vice President of Public Policy at the Software and Information Industry Association; (3) Joyce Popp, Chief Information Officer at the Idaho State Department of Education; and (4) Thomas Murray, State and District Digital Learning Policy and Advocacy Director at the Alliance for Excellent Education.

Rep. Patrick Meehan, Chairman of the Cybersecurity Subcommittee, opened the hearing by noting that technology is increasingly used in a positive way to enhance student learning both in and out of the classroom, a point echoed by Rep. Todd Rokita in his opening remarks. The Subcommittees’ Ranking Members, Reps. Yvette Clarke and Dave Loebsack, homed in on privacy concerns. They specifically cited the need to ensure that companies contracted to examine student data for the purpose of improving individual learning are not also scanning the data for improper commercial gain.

Each witness made strong points in their opening testimony. Joel Reidenberg highlighted many of the themes in his December 2013 study about school contracts with third-party service providers, and regulatory gaps in FERPA and COPPA. Mark MacCarthy provided the industry point of view, noting that there are presently significant protections in place for student data through existing federal laws, state efforts, and contract protections between schools and vendors. He explained that although FERPA is an old law, it has been updated a number of times with additional guidance by the Department of Education, which industry members abide by. Joyce Popp brought a unique and practical perspective to the hearing. She discussed practices that have been well received and effective in Idaho, such as a state policy demanding that schools document student data collection and provide notice to parents through their websites. Additionally, Popp emphasized Idaho Senate Bill 1372, which ended the practice of allowing public education vendors to use the word “own” in relation to student data. Finally, Thomas Murray began by explaining that too few students graduate high school on time, and argued that this could be combated by enabling teachers to use individual student data to keep more students on track.

During the question-and-answer phase of the hearing, the Chairs and Ranking Members asked a number of questions. Chairman Meehan was concerned about just how much information is getting into the hands of third parties, and wondered what could be done to ensure this information was not used to make potential hiring decisions about students after graduation. Ranking Member Clarke acknowledged that most vendor companies are probably not doing anything wrong, but broached the difficult topic of how to regulate potential bad actors. Chairman Rokita returned to Idaho Senate Bill 1372, and the potential of using it as model language. He also sought more information about using Title II funds to support oversight and enforcement of student privacy rules and regulations within schools. Ranking Member Loebsack focused his questions on finding a balance between innovation and privacy. He pressed on the issue that this is not an “either/or,” but rather an “and” game, and that because we must use data to improve education, we must also demand greater accountability from teachers, schools, and third-party vendors. In other words, increased data collection and use requires increased data protection and security.

Representatives Roe and Bonamici also joined the conversation.  Rep. Roe noted that data mining takes place everywhere, citing his own supermarket saver card as an example. Rep. Bonamici responded that even though data is collected everywhere today, the education space presents special considerations. Student data collection should be treated differently because it is not always clear that collection is occurring, and the information is highly sensitive and concerns minors.

Additionally, Members sought to resolve differences in the testimony given by Joel Reidenberg and Mark MacCarthy.  While MacCarthy stated that no new federal legislation is necessary because plenty of penalties already exist, Reidenberg noted that protections under FERPA are limited. He pointed to the fact that the law’s penalties have never been used against a single school. On the issue of contracts, both witnesses seemed to agree that many school vendor contracts do not expressly prohibit third party commercial use or sharing, but both also agreed that there is no evidence that vendors are actually using student information inappropriately.

In closing remarks, some suggested that the clear disagreement about what present law covers is evidence that Congress has a role in reviewing the state of student privacy regulation to determine what, if anything, needs to be done.  Thomas Murray got the final word: he reminded everyone of the enormous benefits already emerging from education tech, and pleaded that whatever the next step is, it should not stifle innovation.

FPF Statement on Today's Joint Subcommittee Hearing on Education Privacy

One of the most important sections of the Administration’s recent report on Big Data focused on education technology and privacy. The report noted the need to ensure that innovations in educational technology, including new approaches and business models, have ample opportunity to flourish.

The benefits of these innovations include robust tools to improve teaching and instructional methods; diagnose students’ strengths and weaknesses and adjust materials and approaches for individual learners; identify at-risk students so teachers and counselors can intervene early; and rationalize resource allocation and procurement decisions. Today, students can access materials, collaborate with each other, and complete homework all online.

Some of these new technologies and uses of data raise privacy concerns. Schools may not have the proper contracts in place to protect data and restrict uses of information by third parties. Many school officials may not even have an understanding of all the data they hold. As privacy expert Daniel Solove has noted, privacy infrastructure in K-12 schools is lacking. Without this support, some schools and vendors may not understand their obligations under student privacy laws such as COPPA, FERPA, and PPRA.

The Future of Privacy Forum believes it is critical that schools are provided with the help needed to build the capacity for data governance, training of essential personnel, and basic auditing. Schools must ensure additional data transparency to engender trust, tapping into innovative solutions such as digital backpacks, and providing parent friendly communications that explain how technology and data are used in schools.

Representatives Jared Polis and Luke Messer have called for bipartisan action on student data privacy, and the Future of Privacy Forum looks forward to working with them on their efforts.

Without measures to help parents see clearly how data are used to help their children succeed, the debate about data in education will remain polarized. With such measures in place, ed tech can be further harnessed to bridge educational inequalities, better tailor solutions for individual student needs, and provide objective metrics for measurement and improvement.

Striking a nuanced and thoughtful balance between harnessing digital innovation in education and protecting student privacy will help ensure trust, transparency, and progress in our education paradigm for years to come.

-Jules Polonetsky, Executive Director

Making Perfect De-Identification the Enemy of Good De-Identification

This week, Ann Cavoukian and Dan Castro waded into the de-identification debate with a new whitepaper, arguing that the risk of re-identification has been greatly exaggerated and that de-identification will play a central role in the age of big data. FPF has repeatedly called for informed conversations about what practical de-identification requires. Part of the challenge is that terms like “de-identification” or “anonymization” have come to mean very different things to different stakeholders, but privacy advocates have also effectively made perfection the enemy of the good when it comes to de-identifying data.

Cavoukian and Castro highlight the oft-cited re-identification of Netflix users as an example of how re-identification risks have been overblown. Researchers were able to compare data released by Netflix with records available on the Internet Movie Database in order to uncover the identities of Netflix users.  While this example highlights the challenges facing organizations when they release large public datasets, it is easy to ignore that only two out of 480,189 Netflix users were successfully identified in this fashion. That’s a 0.0004 percent re-identification rate – that’s only a little bit worse than anyone’s odds of being struck by lightning.*

De-identification’s limitations are often conflated with a lack of trust in how organizations handle data in general. Most of the big examples of re-identification, like the Netflix example, focus on publicly-released datasets. When data is released into the wild, organizations need to be extremely careful; once data is out there, anyone with the time, energy, or technological capability has the opportunity to try to re-identify the dataset. There’s no question that companies have made mistakes when it comes to making their data widely available to the public.

But focusing on publicly-released information does not describe the entire universe of data that exists today. In reality, much data is never released publicly. Instead, de-identification is often paired with a variety of administrative and procedural safeguards that govern how individuals and organizations can use data. When the two are used in combination, bad actors must (1) circumvent administrative restraints and (2) then re-identify any data before getting any value from their malfeasance. As a matter of simple statistics, the probability of breaching both sets of controls and successfully re-identifying data in a non-public database is low.
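The "simple statistics" point above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `combined_risk` and the input values are hypothetical and are not drawn from any study or from the whitepaper. The sketch treats the second step as conditional on the first, so the combined probability can never exceed the weaker of the two safeguards.

```python
from fractions import Fraction


def combined_risk(p_circumvent: Fraction, p_reidentify_given_access: Fraction) -> Fraction:
    """Chance that an attacker both gets past administrative controls
    and then re-identifies a record: P(A and B) = P(A) * P(B | A)."""
    for p in (p_circumvent, p_reidentify_given_access):
        if not 0 <= p <= 1:
            raise ValueError("probabilities must lie between 0 and 1")
    return p_circumvent * p_reidentify_given_access


# Purely illustrative inputs (hypothetical, chosen only to show the shape
# of the calculation): a 1-in-20 chance of circumventing controls and a
# 1-in-100 chance of re-identifying a record once access is gained.
p_access = Fraction(1, 20)
p_reid = Fraction(1, 100)
example = combined_risk(p_access, p_reid)

print(f"combined risk: {example} (about {float(example):.4%})")
```

Whatever values are plugged in, the product is no larger than either input, which is the sense in which layering safeguards lowers the overall risk.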

De-identification critics remain skeptical. Some have argued that any potential ability to reconnect information to an individual’s personal identity suggests inadequate de-identification. Perfect unlinkability may be an impossible standard, but this argument is less an attack on the efficacy of de-identification than it is a manifestation of a lack of trust. When some suggest we ignore privacy, it makes it easier for critics to not trust how businesses protect data. Fights about de-identification thus become a proxy for how much to trust industry.

In the process, discussions about how to advance practical de-identification are lost. As a privacy community, we should fight over exactly what de-identification means. FPF is currently engaged in just such a scoping project. Recognizing that there are many different standards for how academics, advocates, and industry understand “de-identified” data should be the start of a serious discussion about what we expect out of de-identification, not a reason to cast aside the concept altogether. Perfect de-identification may be impossible, but good de-identification isn’t.

-Joseph Jerome, Policy Counsel

* Daniel Barth-Jones notes that I’ve compared the Netflix re-identification study to the annual risk of being hit by lightning and responds as follows:

This was an excellent and timely piece, but there’s a fact that should be corrected because this greatly diminishes the actual impact of the statistic you’ve cited. The article cites the fact that only two out of 480,189 Netflix users were successfully identified using the IMDb data, which rounds to a 0.0004 percent (i.e., 0.000004 or 1/240,000) re-identification risk. This is correct, but then the piece goes on to say “that’s only a little bit worse than anyone’s odds of being struck by lightning.” Which, without further explanation, is likely to be misconstrued.

The blog author cites the annual risk for being hit by lightning (which is, of course, exceedingly small). However, the way most people probably think about lightning risk is not “what’s the risk of being hit in the next year”, but rather “what’s my risk of ever being hit by lightning”? While estimates of the lifetime risk of being hit by lightning vary slightly (according to the precision of the formulas used to calculate this estimate), one’s lifetime odds of being hit by lightning are somewhere between 1 in 6,250 and 1 in 10,000, so even if you went with the more conservative number here, the risk of being re-identified by the Netflix attack was only 1/24 of your lifetime risk of being hit by lightning (assuming you’ll make it to age 80 without something else getting you). This is truly a risk at a magnitude that no one rationally worries about.

Although the evidence-base provided by the Netflix re-identification was extremely thin, the algorithm is intelligently designed and it will be helpful to the furtherance of sound development of public policy to see what the re-identification potential is for such an algorithm with a real-world sparse dataset (perhaps medical data?) for a randomly selected data sample when examined with some justifiable starting assumptions regarding the extent of realistic data intruder background knowledge (which should reasonably account for practical data divergence issues).
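For readers who want to check the arithmetic in the comment above, here is a minimal sketch. It uses only the figures quoted in the text (two of 480,189 users, the rounded 1-in-240,000 rate, and lifetime lightning odds of 1 in 6,250 to 1 in 10,000); the variable names are ours, for illustration.

```python
from fractions import Fraction

# Figures quoted in the post and in the comment above.
identified_users = 2
total_users = 480_189

# Re-identification rate, exact and as a percentage.
rate = Fraction(identified_users, total_users)
rate_percent = float(rate) * 100

# The comment rounds the rate to 1 in 240,000 before comparing it
# with the lifetime odds of being hit by lightning.
rounded_rate = Fraction(1, 240_000)
lightning_low = Fraction(1, 10_000)   # lower lifetime estimate
lightning_high = Fraction(1, 6_250)   # higher lifetime estimate

ratio_low = rounded_rate / lightning_low
ratio_high = rounded_rate / lightning_high

print(f"re-identification rate: {rate_percent:.4f}%")
print(f"share of 1-in-10,000 lifetime lightning risk: {ratio_low}")
print(f"share of 1-in-6,250 lifetime lightning risk: {ratio_high}")
```

Against the 1-in-10,000 estimate the ratio works out to the 1/24 cited in the comment; against the 1-in-6,250 estimate it is smaller still.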