Lessons from Fair Lending Law for Fair Marketing and Big Data

Where discrimination presents a real threat, big data need not necessarily lead us to a new frontier. Existing laws, including the Equal Credit Opportunity Act and other fair lending laws, provide a number of protections that are relevant when big data is used for online marketing related to lending, housing, and employment. In comments to be presented at the FTC public workshop, Professor Peter Swire will discuss his work in progress entitled Lessons from Fair Lending Law for Fair Marketing and Big Data. Swire explains that fair lending laws already provide guidance as to how to approach discrimination that allegedly has an illegitimate, disparate impact on protected classes. Data itself plays an important role in assessing whether a disparate impact exists. Once a disparate impact is shown, the burden shifts to creditors to show that their actions serve a legitimate business need and that no less discriminatory alternative exists. Fair lending enforcement has encouraged the development of rigorous compliance mechanisms, self-testing procedures, and a range of proactive measures by creditors.

Big Data: A Tool for Fighting Discrimination and Empowering Groups

Even as big data uses are examined for evidence of facilitating unfair and unlawful discrimination, data can help to fight discrimination. It is already being used in myriad ways to protect and to empower vulnerable groups in society. In partnership with the Anti-Defamation League, FPF prepared a report that looked at how businesses, governments, and civil society organizations are leveraging data to provide access to job markets, to uncover discriminatory practices, and to develop new tools to improve education and provide public assistance.  

Big Data: A Tool for Fighting Discrimination and Empowering Groups explains that although big data can introduce hidden biases into information, it can also help dispel existing biases that impair access to good jobs, good education, and opportunity.

Big Data: A Benefit and Risk Analysis

On September 11, 2014, FPF released a whitepaper we hope will help frame the big data conversation moving forward and promote better understanding of how big data can shape our lives. Big Data: A Benefit and Risk Analysis provides a practical guide for how benefits can be assessed in the future, and it also shows how data is already being used in the present.

Privacy professionals have become experts at evaluating risk, but moving forward with big data will require rigorous analysis of project benefits to go along with traditional privacy risk assessments. We believe companies and researchers need tools that can help them evaluate the case for the benefits of significant new data uses. Big Data: A Benefit and Risk Analysis is intended to help companies assess the “raw value” of new uses of big data. Particularly where data projects involve health information or location data, organizations need more detailed benefit analyses that clearly identify the beneficiaries of a data project, its size and scope, and that take into account the probability of success and evolving community standards. We hope this guide will be a helpful tool to ensure that projects go through a process of careful consideration.

Identifying both benefits and risks is a concept grounded in existing law. For example, the Federal Trade Commission weighs the benefits to consumers when evaluating whether business practices are unfair. Similarly, the EU’s Article 29 Data Protection Working Party has applied a balancing test to evaluate the legitimacy of data processing under the European Data Protection Directive. Big data promises to be a challenging balancing act.

Read the full document: Big Data: A Benefit and Risk Analysis.

Data Protection Law Errors in Google Spain SL, Google Inc. v. Agencia Espanola de Proteccion de Datos, Mario Costeja Gonzalez

The following is a guest post by Scott D. Goss, Senior Privacy Counsel, Qualcomm Incorporated, addressing the recent “Right to be Forgotten” decision by the European Court of Justice.

There has been quite a bit of discussion surrounding the European Court of Justice’s judgment in Google Spain SL, Google Inc. v. Agencia Espanola de Proteccion de Datos (AEPD), Mario Costeja Gonzalez. In particular, some interesting perspectives have been shared by Daniel Solove, Ann Cavoukian and Christopher Wolf, and Martin Husovec. The ruling has been so controversial that newly appointed EU Justice Commissioner Martine Reicherts delivered a speech defending it. I’d like to add to the discussion.[1] Rather than focusing on the decision’s policy implications or on the practicalities of implementing the Court’s ruling, I’d like to offer thoughts on a few points of data protection law.

To start, I don’t think “right to be forgotten” is an apt description of the decision; the label distorts the discussion. Even if Google were to follow the Court’s ruling to the letter, the information doesn’t cease to exist on the Internet. Rather, the implementation of the Court’s ruling just makes internet content linked to people’s names harder to find. The ruling, therefore, could be thought of as “the right to hide.” Alternatively, the decision could be described as “the right to force search engines to inaccurately generate results.” I recognize that such a description doesn’t roll off the tongue quite so simply, but I’ll explain below why it is appropriate.

I believe the Court made a few important legal errors that should be of interest to all businesses that process personal data. The first was the Court’s determination that Google is a “controller” as defined under EU data protection law; the second was the application of the information relevance question. I’ll then explain why “the right to force search engines to inaccurately generate results” may be a more appropriate description of the Court’s ruling.

1. “Controller” status must be determined from the activity giving rise to the complaint

To understand how the Court erred in determining that Google is a “controller” in this case, it helps to understand how search engines work. At a conceptual level, search comprises three primary data processing activities: (i) caching all the available content, (ii) indexing the content, and (iii) ranking the content. During the initial caching phase, a search engine’s robot minions scour the Internet, noting the content that exists and where it is located. The cache can be copies of all or parts of the web pages on the Internet. The cache is then indexed, sorting the content to enable much faster searching. Indexing is important because without it, every page of the Internet would have to be examined for each search query, which would take immense computing power and significant time. Finally, the content within the index is ranked for relevance.
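To make those three phases concrete, here is a toy sketch of my own (not a description of how Google or any real search engine is implemented): a small cache of hypothetical pages, an inverted index built from it, and a trivially simple ranking step.

```python
# Toy illustration only: hypothetical URLs and page text.
from collections import defaultdict

# 1. "Cache": copies of pages keyed by where they were found.
cache = {
    "example.com/a": "mario costeja gonzalez property auction notice",
    "example.com/b": "biography of dr martin luther king jr",
    "example.com/c": "gettysburg address full text",
}

# 2. "Index": an inverted index mapping each word to the pages that contain it.
index = defaultdict(set)
for url, text in cache.items():
    for word in text.split():
        index[word].add(url)

# 3. "Rank": order matching pages, here by a naive count of matched query terms.
def search(query):
    terms = query.lower().split()
    hits = defaultdict(int)
    for term in terms:
        for url in index.get(term, ()):
            hits[url] += 1
    return sorted(hits, key=hits.get, reverse=True)

print(search("costeja gonzalez"))  # ['example.com/a']
```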

From a data processing perspective, I believe that caching and indexing achieve two objective goals: determining the available content of the internet and where it can be found. Tellingly, the only time web pages are not cached and indexed is when website publishers, not search engines, instruct search engines to ignore their pages, most commonly through the robots exclusion protocol (a robots.txt file or an equivalent on-page directive).
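As a minimal sketch of how that opt-out works (using Python’s standard robots.txt parser; the directives and URLs below are hypothetical), it is the publisher’s file, not the search engine, that determines what may be crawled and indexed:

```python
# Minimal sketch using Python's standard library robots.txt parser.
# The directives and URLs are hypothetical.
from urllib import robotparser

robots_txt = """User-agent: *
Disallow: /private/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(robots_txt)

# The publisher's instructions, not the crawler's preferences, decide the outcome.
print(rp.can_fetch("Googlebot", "https://example.com/private/auction-notice.html"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/news/story.html"))              # True
```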

The web pages that are cached and indexed could be the text of the Gettysburg address, the biography of Dr. Martin Luther King, Jr., the secret recipe for Coca-Cola, or newspaper articles that include peoples’ names.  It is simply a fact that the letters comprising the name “Mario Costeja Gonzalez” could be found on certain web pages.  Search engines cannot control that fact any more than they could take a picture of the sky and be said to control the clouds in the picture.

After creating the cache and index, the next step involves ranking the content.  Search engine companies employ legions of the world’s best minds and immense resources to determine rank order.  Such ranking is subjective and takes judgment.  Arguably, ranking search results could be considered a “controller” activity, but the ranking of search results was never at issue in the Costeja Gonzalez case.  This is a key point underlying the Court’s errors.  Mr. Costeja Gonzalez’s complaint was not that Google ranked search results about him too high (i.e., Google’s search result ranking activity), but rather that the search engine indexed the information at all.  The appropriate question, therefore, is whether Google is the “controller” of the index.  The question of whether Google’s process of ranking search results confers “controller” status on Google is irrelevant.  The Court’s error was to conflate Google’s activity of ranking search results with its caching and indexing of the Internet.

Some may defend the Court by arguing that controller status for some activities automatically confers controller status for all activities. That would be an error. The Article 29 Working Party opined,

[T]he role of processor does not stem from the nature of an entity processing data but from its concrete activities in a specific context.  In other words, the same entity may act at the same time as a controller for certain processing operations and as a processor for others, and the qualification as controller or processor has to be assessed with regard to specific sets of data or operations.

Opinion 1/2010 on the concepts of “controller” and “processor”, page 25, emphasis added. In this case, Mr. Costeja Gonzalez’s complaint focused on the presence of certain articles about him in the index. Therefore, the “concrete activities in a specific context” are the act of creating the index, and the “specific sets of data” are the index itself. The Article 29 WP went on to give an example of an entity acting as both a controller and a processor of the same data set:

An ISP providing hosting services is in principle a processor for the personal data published online by its customers, who use this ISP for their website hosting and maintenance.  If however, the ISP further processes for its own purposes the data contained on the websites then it is the data controller with regard to that specific processing.

I submit that creation of the index is analogous to an ISP hosting service. In creating an index, search engines create a copy of everything on the Internet, sort it, and identify its location. These are objective, computational exercises, not activities in which personal data is noted as such and subjected to some separate set of processing. Following the Article 29 Working Party opinion, search engines could be considered processors in the caching and indexing of Internet content, because such activities are merely objective, computational exercises, but controllers in the ranking of the content, due to the subjective and independent analysis involved.

Further, as argued in the Opinion of Advocate General Jaaskinen, a controller needs to recognize that they are processing personal data and have some intention to process it as personal data. (See paragraph 82).  It is the web publishers who decide what content goes into the index.  Not only do they have discretion in deciding to publish the content on the Internet in the first instance, but they also have the ability to add the robots.txt code to their web pages which directs search engines to not cache and index.  The mark of a controller is one who “determines the purposes and means of the processing of personal data.” (Art. 2, Dir. 95/46 EC).  In creation of the index, rather than “determining”, search engines are identifying the activities of others (website publishers) and heeding their instructions (use or non-use of robots.txt).  I believe such processing cannot, as a matter of law, rise to the level of “controller” activities.

Finding Google to be a “controller” might have been correct if either the facts or the complaint had been different. Had Mr. Costeja Gonzalez produced evidence that (i) the web pages he wanted removed contained the robots.txt instruction, or (ii) the particular web pages had been removed from the Internet by the publisher but still appeared in Google’s search results, then it might have been appropriate to treat Google as a “controller” on the basis of those independent activities. Such facts would be similar to the example given by the Article 29 Working Party of an ISP’s independent use of personal data maintained by its web hosting customers. Similarly, had Mr. Costeja Gonzalez’s complaint been that search results regarding his prior bankruptcy were ranked too high, then I could understand (albeit I might still disagree) that Google would be found to be a controller. But that was not his complaint. His complaint was that certain information was included in the index at all – and that is something over which, I believe, Google has no more control than it has over the content of the Internet itself.

2. “Relevance” of personal data must be evaluated in light of the purpose of the processing.

The Court’s second error arose in the application of the controller’s obligations. Interestingly, after finding that Google is the controller of the index, the Court incorrectly applied the relevancy question. To be processed legitimately, personal data must be “relevant and not excessive in relation to the purposes for which they are collected and/or further processed.” Directive 95/46 EC, Article 6(c) (emphasis added). Relevancy is thus a question in relation to the purpose of the controller – not as to the data subject, a customer, or anyone else. The purpose of the index, in Google’s own words, is to “organize the world’s information and make it universally accessible and useful.” (https://www.google.com/about/company/). With that purpose in mind, all information on the Internet is, by definition, relevant. While there are clearly legal boundaries to the information that Google can make available, the issue is whether privacy law supplies one of those boundaries. I suggest that, in the context of caching and creating an index of the Internet, it does not.

The Court found that Google legitimizes its data processing under the legitimate interest test of Article 7(f) of the Directive. Google’s legitimate interests must be balanced against the data subjects’ fundamental rights under Article 1(1). Since Article 1(1) provides no guidance as to what those rights are (other than “fundamental”), the Court looks to subparagraph (a) of the first paragraph of Article 14. That provision gives data subjects a right to object to the processing of their personal data, but offers little guidance as to when controllers must oblige. Specifically, it provides that, in cases of legitimate interest processing, a data subject may,

“object at any time on compelling legitimate grounds relating to his particular situation to the processing of data relating to him, save where otherwise provided by national legislation.  Where there is a justified objection, the processing instigated by the controller may no longer involve those data.”

What are those “compelling legitimate grounds” for a “justified objection”? The Court relies on Article 12(b): “the rectification, erasure, or blocking of data the processing of which does not comply with the provisions of this Directive, in particular because of the incomplete or inaccurate nature of the data”. It is here that the Court erred.

The Court took the phrase “incomplete or inaccurate nature of the data” and erroneously applied it to the interests of the data subject.  Specifically, the Court held that the question is whether the search results were “incomplete or inaccurate” representations of the data subject as he/she exists today.  I submit that was not the intent of Article 12(b).  Rather, Article 12(b) was referring back to the same use of that phrase in Article 6 providing that:

“personal data must be: . . . (d) accurate and, where necessary, kept up to date; every reasonable step must be taken to ensure that data which are inaccurate or incomplete, having regard to the purposes for which they were collected or for which they are further processed, are erased or rectified.”

The question is not whether the search results are “incomplete or inaccurate” representations of Mr. Costeja Gonzalez, but whether the search results are inaccurate as to the purpose of the processing. The purpose of the processing is to copy, sort, and organize the information on the internet. In this case, queries for the characters “Mario Costeja Gonzalez” displayed articles that he admits were actually published on the Internet. Such results, therefore, are by definition not incomplete or inaccurate as to the purpose of the data processing activity. To put it simply, the Court applied the relevancy test to the wrong party (Mr. Costeja Gonzalez) as opposed to Google and the purpose of its index.

To explain by analogy, examine the same legal tests applied to a credit reporting agency. A credit reporting analogy is helpful because it also has at least three parties involved in the transaction. In the case of the search engine, those parties are the search engine, the data subject, and the end user conducting the search. In the case of credit reporting, the three parties involved are the credit ratings businesses, the consumers who are rated (i.e., the data subjects), and the lenders and other institutions that purchase the reports. It is well-established law that consumers can object to information used by credit ratings businesses as being outdated, irrelevant, or inaccurate. The rationale for this right is found in Article 12(b) and Article 6, in relation to the purpose of the credit reporting processing activity.

The purpose of credit reporting is to provide lenders an opinion on the credit worthiness of the data subject.  The credit ratings business must take care that the information they use is not “inaccurate or incomplete” or they jeopardize the purpose of their data processing by generating an erroneous credit score.  For example, if a credit reporting agency collected information about consumers’ height or weight, consumers would be able to legitimately object.  Consumers’ objections would not be founded on the fact that the information is not representative of who they are – indeed such data may be completely accurate and current.  Instead, consumers’ objections would be founded on the fact that height or weight are not relevant for the purpose of assessing consumers’ credit worthiness.

Returning to the Costeja Gonzalez case, the issue was whether the index (not the ranking of search results) should include particular web pages containing the name of Mr. Costeja Gonzalez. Since the Court previously determined that Google was the “controller” of the index (which I contend was error), the Court should have determined Google’s purpose for the index and then asked whether the contested web pages were incorrect, inadequate, irrelevant or excessive as to Google’s purpose. As discussed above, Google’s professed goal is to enable the discovery of the world’s information, and to that end the purpose of the index is to catalog, as far as technologically possible, the entire Internet – all the good, bad, and ugly. For that purpose, any content on the Internet about the key words “Mario Costeja Gonzalez” is, by definition, not incorrect, inadequate, irrelevant or excessive, because the goal is to index everything. Instead, however, the Court erred by asking whether the web pages were incorrect, inadequate, irrelevant or excessive as to Mr. Costeja Gonzalez, the data subject. Applying the relevancy question as to Mr. Costeja Gonzalez is, well, not relevant.

Some may argue that the Court recognized the purpose of the processing when making the relevancy determination by finding that Mr. Costeja Gonzalez’s rights must be balanced against the public’s right to know.  By including the public’s interest in the relevancy evaluation, some may argue, the Court has appropriately directed the relevancy inquiry to the right parties.  I disagree.  First, I do not believe it was appropriate to inquire as to the relevancy of the links vis-a-vis Mr. Costeja Gonzalez in the first instance and, therefore, to balance it against other interests (in this case the public) does not cure the error.  Secondly, to weigh the interests of the public, one must presume that the purpose of searching for individuals’ names is to obtain correct, relevant, not inadequate and not excessive information.  I do not believe such presumptions are well-founded.  For example, someone searching for “Scott Goss” may be searching for all current, relevant, and non-excessive information about me.  On the other hand, fifteen years from now perhaps someone is searching for all privacy articles written in 2014 and they happened to know that I wrote one and so searched using my name.  One cannot presume to know the purpose of an individual’s search query other than a desire to have access to all the information on the Internet containing the query term.

If not the search engines, where would it be pertinent to ask whether information on the Internet about Mr. Costeja Gonzalez was incorrect, inadequate, irrelevant or excessive? The answer is clear: the entities that have undertaken the purpose of publishing information about Mr. Costeja Gonzalez. Specifically, website publishers process personal data for the purpose of informing their readers about those individuals. The website publishers, therefore, have the burden to ensure that such information is not incorrect, inadequate, irrelevant or excessive as to Mr. Costeja Gonzalez. That there may be an exception in data protection law for web publishers does not mean that courts should be free to foist obligations onto search engines.

3. The right to force search engines to inaccurately generate results

Finally, the “right to force search engines to inaccurately generate results” is, I believe, an apt description of the ruling. A search engine’s cache and index are supposed to contain all the information on the web that publishers want the world to know. Users expect that search engines will identify all information responsive to their queries when they search. Users further expect that search engines will rank all the results based upon their determination of the relevancy of the results in relation to the query. The Court’s ruling forces search engines to generate an incomplete list of search results by gathering all information relevant to the search and then pretending that certain information on the Internet doesn’t exist at all. The offending content is still on the Internet; people just cannot rely on finding it by entering individuals’ names into search engines (at least the search engines on European country-coded domains).

 


[1] These thoughts are my own and not those of the company for which I work. I do not profess to be an expert in search technologies or in the arguments made by the parties in the case.

FERPA | SHERPA: Providing a Guide to Education Privacy Issues

Education is changing. New technologies are allowing information to flow within classrooms, schools, and beyond, enabling new learning environments and new tools to understand and improve the way teachers teach and students learn. At the same time, however, the confluence of enhanced data collection with highly sensitive information about children and teens makes for a combustible mix from a privacy perspective. Even the White House recognizes this challenge! Its recent Big Data Review specifically highlighted the need for responsible innovation in education.

There are many organizations – many of which we’ve partnered with – working tirelessly to address privacy issues in education and to provide the best experience for students. So is the Department of Education. Yet these resources are scattered. The need for an education privacy resource clearinghouse is clear. With “back to school” now in full swing, we thought it a great time to launch FERPA|SHERPA. The site – named after the core federal law that governs education privacy – aims to provide a one-stop shop for education privacy-related offerings of interest to parents and schools, as well as education service providers and the policymakers struggling to grapple with the legal landscape.

Everyone in the educational ecosystem has a role to play here, lest legitimate privacy concerns combine with other worries to overwhelm the benefits of education technologies and the expanded use of student data. One need only look at the recent collapse of inBloom – a new technology platform that school systems were clamoring for until a combination of poor communication and privacy fears came to dominate any and all conversations about the underlying technologies – as an example of the need for schools and the companies they partner with to better address education privacy issues.

To ensure parents have a voice in the ongoing privacy debate, the site will also host a blog written by parent privacy advocate Olga Garcia-Kaplan, a Brooklyn, NY public school parent of three children.

Additionally, we’re releasing an education privacy whitepaper by Jules Polonetsky, our executive director, and Omer Tene, Vice President, Research & Education, IAPP, that analyzes the opportunities and challenges of data-driven education technologies and how key stakeholders should address them. The piece – “The Ethics of Student Privacy: Building Trust for Ed Tech” – was recently published in a special issue of the International Review of Information Ethics, “The Digital Future of Education.”

We hope FERPA | SHERPA will help get everyone on the same page when it comes to privacy issues around student data.  We would love your feedback and thoughts on the new site, and we look forward to helping to jump start conversations about education privacy in the new school year.  If we’ve missed something or you’d like to join our effort, please reach out to [email protected].

Future of Privacy Forum Launches One-Stop Shop Website for Student Privacy

FPF Urges Parents, Teachers and Policymakers to Follow the “ABC’s” of Education Privacy & Make “D” for Data Protection

WASHINGTON, D.C. – August 21, 2014 – As schools increasingly rely on data to improve education, and as teachers increasingly rely on technology in the classroom to improve the learning experience, privacy concerns are being raised about the collection and use of student data. With ‘back to school’ now in full swing, and to address both the promise and challenges surrounding privacy and data in education, the Future of Privacy Forum (FPF) today unveiled a first-of-its-kind, one-stop shop resource website providing parents, school officials, policymakers, and service providers easy access to the laws, standards and guidelines that are essential to understanding student privacy issues and navigating a responsible path to managing student data with trust, integrity, and transparency.

More than at any other time in the evolution of education, data-driven innovations and use of emerging technologies – such as online textbooks, apps, tablets and mobile devices, and internet-based learning – are bringing advances and critical improvements in teaching and learning, with profound implications.

At the same time, the increased use of vendors and data is matched by the need for heightened responsibility to manage and safeguard student data and implement policies that benefit education and minimize risk. Concerns are being raised about how student data is collected and used in a next-stage learning ecosystem buzzing with social media, mobile devices, central databases, student records, Big Data, and an array of vendors and software.

The new, resource-rich FERPA|SHERPA website – named after the core federal law that governs education privacy – seeks to address these opportunities and concerns. The unique site hosts a comprehensive, digital dashboard of quality education privacy-related offerings for four distinct audiences: parents, service providers, schools, and policymakers.

To ensure parents have a voice in the ongoing privacy debate, the site will also host a blog written by parent privacy advocate Olga Garcia-Kaplan, a Brooklyn, NY public school parent of three children.

Some of the assets available at FERPA|SHERPA include:

“Getting privacy right in student education requires a partnership of trust between families, teachers and schools, technology companies and education officials,” said Jules Polonetsky, executive director, FPF. “Any weak link in this chain of responsibility could undermine education and risk student data. With FERPA|SHERPA, we are making sure that the laws and best practices are easy to find.”

“Since our creation, Edmodo has been focused on safeguarding user privacy, and we’re excited to partner with FPF on this effort to provide schools, teachers, and parents with great resources about student privacy issues,” said Aden Fine, chief privacy officer of Edmodo. “Education is critical to addressing questions about privacy, and we think the FERPA|SHERPA website will really help the public better understand these complicated issues.”

“Parents have to sort through a tremendous amount of information issued about student data privacy to learn how and why data compiled pertaining to their children may be used. As a parent, the FERPA|SHERPA site is an invaluable resource for obtaining timely, accurate and impartial information necessary to understand this evolving landscape,” said Olga Garcia-Kaplan, parent and advocate for student data privacy.

“Educational leaders, service providers, parents and policy makers increasingly need accurate and reliable information on privacy issues.  For too long, it has been a real challenge to find that information.  The Future of Privacy Forum’s new FERPA|SHERPA is a great starting place to find what you need,” said Keith Krueger, CEO, Consortium for School Networking.

“Protecting student data and privacy involves navigating myriad regulations, policies, and practices,” said Marsali Hancock, CEO & President, iKeepSafe. “iKeepSafe has worked with schools, parents, students, and industry to promote safe and effective use of technology, and we are thrilled that FERPA|SHERPA is providing these stakeholders with additional resources on important laws and best practices to protect student data.”

The FERPA|SHERPA website initiative – which began in the fall of 2013 – is the first of many FPF offerings on education privacy. The effort grew out of FPF’s decision to invest its privacy expertise and staff talent in education issues, and it has since developed into a comprehensive education privacy campaign with wide stakeholder engagement – including parents, teachers, school administrators, trade associations, and leading education and technology companies in the private sector.

In addition, the FPF today released an education privacy whitepaper that has been published in a special issue of the International Review of Information Ethics, “The Digital Future of Education.” The piece – “The Ethics of Student Privacy: Building Trust for Ed Tech” – is authored by Polonetsky and Omer Tene, Vice President, Research & Education, IAPP, and analyzes the opportunities and challenges of data-driven education technologies and how key stakeholders should address them.

About the Future of Privacy Forum

The Future of Privacy Forum (FPF) is a Washington, DC-based think tank that seeks to advance responsible data practices. The forum is led by internet privacy experts Jules Polonetsky and Christopher Wolf and includes an advisory board comprised of leading figures from industry, academia, law and advocacy groups. Visit fpf.org.

Media Contact:

Nicholas Graham, for Future of Privacy Forum

571-291-2967

[email protected]

FPF Statement on Today's Safe Harbor Complaint

Today, the Center for Digital Democracy filed a complaint with the Federal Trade Commission, alleging that companies are violating the U.S.-EU Safe Harbor agreement. CDD’s filing came with a report criticizing the practices of thirty companies.

“We are carefully reviewing the report’s claims, but the dozen we have examined so far seem to reflect the authors’ distaste for marketing, rather than legal Safe Harbor violations,” said Jules Polonetsky, Executive Director, Future of Privacy Forum.

The Future of Privacy Forum has long focused on the value of the Safe Harbor agreement, and issued a comprehensive report on the framework last fall.

Cross Border Privacy Rules Advance at Beijing Meetings

APEC’s Data Privacy Subgroup concluded its 2014 meetings in Beijing, China earlier this week. The Future of Privacy Forum participated in these meetings as a member of the U.S. delegation. The biggest development of the week was Canada’s submission of its Notice of Intent to participate in the Cross Border Privacy Rules (CBPR) system. After a favorable determination by APEC’s Joint Oversight Panel, Canada will become the fourth country to join the system, along with the United States, Mexico and Japan. In addition, TRUSTe, an APEC-approved Accountability Agent, announced that 14 companies are in the process of seeking certification. Taken together, these developments, along with Mexico’s recent steps toward interoperability, have provided promising momentum in the establishment of an international privacy framework.

Still, much work remains before the true potential of the system can be fully realized. In July, FPF hosted officials from Privacy Thailand, a university-based consortium that advises the Thai Prime Minister’s office on data privacy and security issues. During their week-long visit, FPF and Privacy Thailand met with representatives from the Department of Commerce, the Federal Trade Commission and the U.S. Department of State to consider Thailand’s accession to the system. FPF will continue to work with interested APEC members to provide capacity building assistance.

On August 8, APEC Economies and representatives from the EU’s Article 29 Working Party met to discuss next steps on the jointly developed Common Referential.  This document identifies points of commonality between the CBPR system and the EU’s system of Binding Corporate Rules (BCRs).  APEC members agreed to take this work forward by developing case studies that demonstrate the practical interoperability of these two systems and a checklist outlining the combined obligations for a company seeking certification under both.

On August 10, APEC Economies agreed to establish a working group to consider the applicability of the APEC Privacy Framework to Big Data.  This group will consider, among other things, appropriate administrative and policy safeguards when de-identifying personal information.  FPF plans to participate in this working group.

Participants also continued the development of a CBPR certification system for data processors. In July, FPF hosted a meeting of the working group developing the program requirements for this certification. Completion of this project is expected in advance of the next APEC Data Privacy Subgroup meetings in Clark, Philippines, in January 2015.

Comments to NTIA on Big Data and Privacy

Today, FPF submitted comments to the NTIA as it begins its exploration of how big data impacts the Consumer Privacy Bill of Rights. While the NTIA sought comment on over a dozen key questions, our filing focuses largely on four issues: (1) the need for additional clarity surrounding the flexible application of the Consumer Privacy Bill of Rights’ privacy principles, (2) challenges to the “notice and choice” model and using context to inform a use-based approach to data use, (3) practical de-identification, and (4) what internal review boards might look like and consider in the age of big data.

Much of our filing builds upon FPF’s thinking on how to develop a benefit-risk analysis for data projects, with big data concerns of particular importance. Industry increasingly faces ethical considerations over how to minimize data risks while maximizing benefits to all parties. As the White House’s earlier Big Data Report acknowledged, there is a potential tension between socially beneficial and privacy-invasive uses of information in everything from educational technology to consumer-generated health data. The advent of big data requires active engagement by both internal and external stakeholders to increase transparency, accountability and trust.

FPF believes that a documented review process could serve as an important tool to infuse ethical considerations into data analysis without requiring radical changes to the business practices of innovators or industry in general. Institutional review boards (IRBs), which remain the chief regulatory response to decades of questionable ethical decisions in the field of human subject testing, provide a useful precedent for focusing on good process controls as a way to address potential privacy concerns. While IRBs have become a rigid compliance device and would be inappropriate for wholesale use in big data decision-making, they could provide a useful template for how projects can be evaluated based on prevailing community standards and subjective determinations of risks and benefits, particularly in cases involving greater privacy risks. Using an IRB model as inspiration, big data may warrant the creation of new advisory processes within organizations to more fully consider the ethical questions it poses.

Moving forward, broader big data ethics panels could provide a commonsense response to public concerns about data misuse. While these institutions could further expand the role of privacy professionals within organizations, they might also provide a forum for a diversity of viewpoints inside and outside of organizations. Ethics reviews could include members with different backgrounds, training, and experience, and could seek input from outside actors including consumer groups and regulators. While these panels will vary between the public and private sector, and between businesses and researchers, they could provide an important check on data misuse.

Organizations and privacy professionals have become experienced at evaluating risk, but they should also engage in a rigorous data benefit analysis in conjunction with traditional privacy risk assessments. FPF suggests that organizations could develop procedures to assess the “raw value” of a data project, which would require organizations to identify the nature of a project, its potential beneficiaries, and the degree to which those beneficiaries would benefit from the project. Our guidance for this process is included in our filing for the first time.

Of course, big data hasn’t changed all the rules, and not every use of big data implicates our privacy. Many uses of big data are machine-to-machine or highly aggregated. Many new uses of data are marginal, and our current processes for mitigating risks can address them well.

De-Identification: A Critical Debate

Ann Cavoukian and Dan Castro recently published a report titled Big Data and Innovation, Setting the Record Straight: De-Identification Does Work. Arvind Narayanan and Edward Felten wrote a critique of this report, which they highlighted on Freedom to Tinker. Today Khaled El Emam and Luk Arbuckle respond on the FPF blog with this guest post.

Why de-identification is a key solution for sharing data responsibly

Khaled El Emam (University of Ottawa, CHEO Research Institute & Privacy Analytics Inc.)

Luk Arbuckle (CHEO Research Institute, Privacy Analytics Inc.)

Arvind Narayanan and Edward Felten have responded to a recent report by Ann Cavoukian and Dan Castro (Big Data and Innovation, Setting the Record Straight:  De-Identification Does Work)  by claiming that de-identification is “not a silver bullet” and “still does not work.” The authors are misleading on both counts. First, no one, certainly not Cavoukian or Castro, claims that de-identification is a silver bullet, if by that you mean that de-identification is the modern equivalent of the medieval, magic weapon that could always and inexplicably defeat otherwise unconquerable foes like werewolves and vampires. Second, and to get away from unhelpful metaphors, de-identification does work, both in theory and in practice, and there is ample evidence that that’s true.  Done properly, de-identification is a reliable and indispensable technique for sharing data in a responsible way that protects individuals.

Narayanan and Felten assert viewpoints that are not shared by the larger disclosure control community. Assuming the reader has already read both reports, we’ll respond to some of Narayanan’s and Felten’s claims and look at the evidence.

It’s important to highlight that we take an evidence-based approach—we support our statements with evidence and systematic reviews rather than expressing opinions. This matters because the evidence does not support the Narayanan and Felten perspective on de-identification.

Real-world evidence shows that the risk of re-identifying properly anonymized data is very small

Established, published, and peer-reviewed evidence shows that following contemporary good practices for de-identification ensures that the risk of re-identification is very small [1]. In that systematic review (which is the gold standard methodology for summarizing evidence on a given topic) we found that there were 14 known re-identification attacks. Two of those were conducted on data sets that were de-identified with methods that would be defensible (i.e., they followed existing standards). The success rate of the re-identification for these two was very small.

It is possible to de-identify location data

The authors claim that there are no good methods for de-identifying location data. In fact, there is relevant work on the de-identification of different types of location data [2]–[4]. The challenge we are facing is that many of these techniques are not being deployed in practice. We have a knowledge dissemination problem rather than a knowledge problem – i.e., sound techniques are known and available, but not often enough used. We should be putting our energy into translating best practices within the analytics community.

Computing re-identification probabilities is not only possible, but necessary

The authors criticize the computation of re-identification probabilities and characterize it as “silly”. They ignore the well-established literature on the computation of re-identification risk [5], [6]. These measurement and estimation techniques have been used for decades to share census data as well as other population data and national surveys. For example, the Journal of Official Statistics has been publishing papers on risk measurement for a few decades. There is no evidence that these published risk probabilities were “silly” or, more importantly, that any of the data anonymized in reliance upon such risk measurements was re-identified.
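As a simple illustration of the kind of measurement involved (a sketch of my own with invented quasi-identifier values, not the specific estimators used in [5] or [6]), one common approach is to group records by their quasi-identifiers and treat the reciprocal of each group’s size as that record’s re-identification probability:

```python
# Illustrative only: invented quasi-identifier values, not real data.
from collections import Counter

records = [
    ("30-39", "F", "East"),   # (age band, gender, region)
    ("30-39", "F", "East"),
    ("30-39", "F", "East"),
    ("40-49", "M", "West"),
    ("40-49", "M", "West"),
    ("50-59", "F", "North"),  # a unique record -- the riskiest one
]

group_size = Counter(records)  # size of each quasi-identifier equivalence class

max_risk = max(1 / n for n in group_size.values())                 # worst-case (unique) record
avg_risk = sum(1 / group_size[r] for r in records) / len(records)  # average across all records

print(f"maximum per-record risk: {max_risk:.2f}")  # 1.00
print(f"average per-record risk: {avg_risk:.2f}")  # 0.50
```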

Second, the authors argue that a demonstration attack where a single individual in a database is re-identified is sufficient to show that a whole database can be re-identified. There is a basic fault here. Re-identification is probabilistic. If the probability of re-identification is 1 in 100, the re-identification of a single record does not mean that it is possible to re-identify all hundred records. That’s not how probabilities work.
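To put hypothetical numbers on that point (the figures are mine, chosen only to illustrate the arithmetic): if each record in a 100-record data set carries an independent 1-in-100 chance of re-identification, then about one record is expected to be re-identifiable, and the chance that every record can be re-identified is vanishingly small.

```python
# Hypothetical figures for illustration only.
n_records = 100
p_per_record = 0.01                        # 1-in-100 chance per record

expected_hits = n_records * p_per_record   # expected number of re-identifiable records
p_all_records = p_per_record ** n_records  # probability that every record is re-identifiable

print(expected_hits)   # 1.0
print(p_all_records)   # 1e-200 -- effectively zero
```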

The authors then go on to compare hacking the security of a system to re-identification by saying that if they hack one instance of a system (i.e., a demonstration of the hack) then all instances are hackable. But there is a fundamental difference. Hacking a system is deterministic. Re-identification is not deterministic – re-identifying a record does not mean that all records in the data set are re-identifiable. For example, in clinical research, if we demonstrate that we can cure a single person by giving him a drug (i.e., a demonstration) that does not mean that the drug will cure every other person—that would be nonsense. An effect on an individual patient is just that—an effect on an individual person. As another analogy, an individual being hit by lightning does not mean that everyone else in the same city is going to be hit by lightning. Basically, demonstrating an effect on a single person or a single record does not mean that the same effect will be replicated with certainty for all the others.

We should consider realistic threats

The authors emphasize the importance of considering realistic threats and give some examples of considering acquaintances as potential adversaries. We have developed a methodology that addresses the exact realistic threats that Narayanan and Felten note [4], [7]. Clearly everyone should be using such a robust methodology to perform a proper risk assessment—we agree. Full methodologies for de-identification have been developed (please see our O’Reilly book on this topic [4]) – the failure to use them broadly is the challenge society should be tackling.

The NYC Taxi data set was poorly de-identified – it is not an example of practices that anyone should follow

The re-identification attack on the NYC taxi data was cited as an example of how easy it is to re-identify data. That data set was poorly de-identified, which makes it a great example of the need for a robust de-identification methodology. The NYC taxi data used a one-way hash without a salt, which is simply poor practice, and which takes us back to the earlier point that known methods need to be better disseminated. Using the NYC taxi example to make a general point about the discipline of de-identification is just misleading.
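The following sketch (my own illustration; the medallion format and values are invented) shows why an unsalted hash of a small, enumerable identifier space offers almost no protection, and how a keyed hash such as an HMAC with a secret key avoids the problem:

```python
# Illustrative only: the identifier format below is simplified and hypothetical.
import hashlib
import hmac
import secrets

def weak_pseudonym(medallion: str) -> str:
    # Unsalted one-way hash: the same input always yields the same output.
    return hashlib.md5(medallion.encode()).hexdigest()

# An attacker can enumerate every plausible identifier and build a reverse lookup table.
candidates = [f"{letter}{n:04d}" for letter in "ABC" for n in range(10000)]
reverse_table = {weak_pseudonym(m): m for m in candidates}

print(reverse_table[weak_pseudonym("B0573")])  # 'B0573' -- pseudonym reversed instantly

# A keyed hash prevents this: without the secret key, no lookup table can be precomputed.
key = secrets.token_bytes(32)

def keyed_pseudonym(medallion: str) -> str:
    return hmac.new(key, medallion.encode(), hashlib.sha256).hexdigest()
```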

Computing correct probabilities for the Heritage Health Prize data set

One example that is mentioned by the authors is the Heritage Health Prize (HHP). This was a large clinical data set that was de-identified and released to a broad community [8]. To verify that the data set had been properly and securely de-identified, HHP’s sponsor commissioned Narayanan to perform a re-identification attack on the HHP data before it was released. It was based on the results of that unsuccessful attack that the sponsor made the decision to release the data for the competition.

In describing his re-identification attack on the HHP data set, Narayanan estimated the risk of re-identification to be 12.5%, using very conservative assumptions. This was materially different from the approximately 1% risk that was computed in the original de-identification analysis [8]. To get to 12.5%, he had to assume that the adversary would know seven different diagnosis codes (not common colloquial terms, but ICD-9 codes) that belong to a particular patient. He states that “roughly half of members with 7 or more diagnosis codes are unique if the adversary knows 7 of their diagnosis codes. This works out to be half of 25% or 12.5% of members” (A. Narayanan, “An Adversarial Analysis of the Reidentifiability of the Heritage Health Prize Dataset”, 2011). That, by most standards, is quite a conservative assumption, especially when he also notes that diagnosis codes are not correlated in this data set – i.e., seven unrelated conditions for a single patient. It is not realistic to assume that an adversary knows so much medical detail about a patient. Most patients do not themselves know many of the diagnosis codes in their own records. But even if such an adversary does exist, he would learn very little from the data (i.e., the more the adversary already knows, the smaller the information gain from a re-identification). None of the known re-identification attacks that used diagnosis codes had that much detailed background information.

The re-identification attack made some other broad claims without supporting evidence—for example, that it would be easy to match the HHP data with the California hospital discharge database. We did that! We matched the individual records in the de-identified HHP data set with the California State Inpatient Database over the relevant period, and demonstrated empirically that the match rate was very small.

It should also be noted that this data set had terms of use attached to it. All individuals who access the data have to agree to those terms of use. An adversary who knows a lot about a patient is likely to be living in the US or Canada (i.e., an acquaintance), and therefore the terms of use would be enforceable if there were a deliberate re-identification.

The bottom line from the HHP is that the commissioned re-identification attack (whose purpose was to re-identify individuals in the de-identified data) did not re-identify a single person. You could therefore argue that Narayanan made the empirical case for sound de-identification!

The authors do not propose alternatives

The process of re-identification is probabilistic. There is no such thing as zero risk. If relevant data holders deem any risk to be unacceptable, it will not be possible to share data. That would not make sense – we make risk-based decisions in our personal and business lives every day. Asking for consent or authorization for all data sharing is not practical, and consent introduces bias in the data because specific groups will not provide consent [9], [10]. For the data science community, the line of argument that any risk is too much risk is dangerous and should be very worrisome because it will adversely affect the flow of data.

The authors pose a false dichotomy for the future

The authors conclude that the only alternatives are (a) the status quo, where one de-identifies and, in their words, “hopes for the best”; or (b) using emerging technologies that involve some trade-offs in utility and convenience, and/or using legal agreements to limit the use and disclosure of sensitive data.

We strongly disagree with that presentation of the alternatives.  First, the overall concept of trade-offs between data utility and privacy is already built into sound de-identification methodologies [7]. What is acceptable in a tightly controlled, contractually bound situation is quite different from what is acceptable when data will be released publicly – and such trade-offs are and should be quantified.

Second, de-identification is definitely not an alternative to using contracts to protect data. To the contrary, contractual protections are one part (of many) of the risk analyses done in contemporary de-identification methodologies. The absence of a contract always means that more changes to the data are required to achieve responsible de-identification (e.g., generalization, suppression, sub-sampling, or adding noise).
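As a small sketch of what two of those transformations can look like (a toy example of my own with invented records and thresholds, not the authors’ methodology), quasi-identifiers are first generalized, and any record whose group still falls below a chosen minimum size k is suppressed:

```python
# Toy example: invented records and thresholds, for illustration only.
from collections import Counter

people = [
    {"age": 34, "zip": "10012", "diagnosis": "asthma"},
    {"age": 36, "zip": "10013", "diagnosis": "diabetes"},
    {"age": 35, "zip": "10014", "diagnosis": "asthma"},
    {"age": 61, "zip": "90210", "diagnosis": "flu"},
]

def generalize(rec):
    # Generalization: age -> decade band, ZIP code -> first three digits.
    return {"age": f"{rec['age'] // 10 * 10}s",
            "zip": rec["zip"][:3] + "**",
            "diagnosis": rec["diagnosis"]}

generalized = [generalize(r) for r in people]
group_size = Counter((r["age"], r["zip"]) for r in generalized)

k = 2  # minimum acceptable group size
# Suppression: drop records whose quasi-identifier group is still too small.
released = [r for r in generalized if group_size[(r["age"], r["zip"])] >= k]

print(released)  # the lone "60s"/"902**" record is suppressed; the "30s"/"100**" group (size 3) is kept
```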

Most of all, we strongly object to the idea that proper de-identification means “hoping for the best.”  We ourselves are strongly critical of any aspect of the status quo whereby data holders use untested, sloppy methods to anonymize sensitive data.  We agree with privacy advocates that such an undisciplined approach is doomed to result in successful re-identification attacks and the growing likelihood of real harm to individuals if badly anonymized data becomes re-identified. Instead, we maintain, on the basis of decades of both theory and real-world evidence, that careful, thorough de-identification using well-tested methodologies achieves crucial data protection and produces a very small risk of re-identification. The challenge that we, as a privacy community, need to rise up to is to transition these approaches into practice and increase the maturity level of de-identification in the real world.

A call to action

It is important to encourage data custodians to use best current practices to de-identify their data. Repeatedly attacking poorly de-identified data captures attention, and it can be constructive if the lesson learned is that better de-identification methods should be used.

References

[1]          K. El Emam, E. Jonker, L. Arbuckle, and B. Malin, “A Systematic Review of Re-Identification Attacks on Health Data,” PLoS ONE, vol. 6, no. 12, p. e28071, Dec. 2011.

[2]          A. Monreale, G. L. Andrienko, N. V. Andrienko, F. Giannotti, D. Pedreschi, S. Rinzivillo, and S. Wrobel, “Movement Data Anonymity through Generalization,” Transactions on Data Privacy, vol. 3, no. 2, pp. 91–121, 2010.

[3]          S. C. Wieland, C. A. Cassa, K. D. Mandl, and B. Berger, “Revealing the spatial distribution of a disease while preserving privacy,” Proc. Natl. Acad. Sci. U.S.A., vol. 105, no. 46, pp. 17608–17613, Nov. 2008.

[4]          K. El Emam and L. Arbuckle, Anonymizing Health Data: Case Studies and Methods to Get You Started. O’Reilly, 2013.

[5]          L. Willenborg and T. de Waal, Statistical Disclosure Control in Practice. New York: Springer-Verlag, 1996.

[6]          L. Willenborg and T. de Waal, Elements of Statistical Disclosure Control. New York: Springer-Verlag, 2001.

[7]          K. El Emam, Guide to the De-Identification of Personal Health Information. CRC Press (Auerbach), 2013.

[8]          K. El Emam, L. Arbuckle, G. Koru, B. Eze, L. Gaudette, E. Neri, S. Rose, J. Howard, and J. Gluck, “De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset,” Journal of Medical Internet Research, vol. 14, no. 1, p. e33, Feb. 2012.

[9]          K. El Emam, F. Dankar, R. Issa, E. Jonker, D. Amyot, E. Cogo, J.-P. Corriveau, M. Walker, S. Chowdhury, R. Vaillancourt, T. Roffey, and J. Bottomley, “A Globally Optimal k-Anonymity Method for the De-identification of Health Data,” Journal of the American Medical Informatics Association, vol. 16, no. 5, pp. 670–682, 2009.

[10]        K. El Emam, E. Jonker, E. Moher, and L. Arbuckle, “A Review of Evidence on Consent Bias in Research,” American Journal of Bioethics, vol. 13, no. 4, pp. 42–44, 2013.