Comments for the FTC's Workshop on "Internet of Things"

FPF today offered comments to the FTC in advance of a public workshop on new security and privacy issues presented by growing networks of connected devices.  Commonly referred to as the “Internet of Things,” these physical devices range from appliances and vehicles to our smartphones, presenting an elaborate array of objects that capture, share, and use data.

The Internet of Things has been a focus of FPF’s work since our founding, starting with our original project on the Smart Grid and continuing to our recent projects on Connected Cars and Smart Stores.  While connected smart devices provide many benefits, new ways to protect consumer privacy may need to be explored.  Connected devices present circumstances where our traditional Fair Information Practice Principles (FIPPs) may not be practical or even possible to apply.  Codes of conduct, seals, and other public-facing, enforceable commitments are examples of how to address the privacy issues raised by the Internet of Things.

Our full set of comments is available to read here.

New Report Shows Cybersecurity Risks from FBI “Going Dark” Proposal

Today’s New York Times discusses a major new report by 20 technologists about the cybersecurity risks that would result from an FBI plan to expand wiretapping capabilities on the Internet.  The administration is reportedly close to sending the FBI proposal to Capitol Hill to amend the Communications Assistance for Law Enforcement Act of 1994.

FPF Senior Fellow Peter Swire blogs about this issue today at the International Association of Privacy Professionals website.  His post draws on work he has done at FPF with Kenesa Ahmad.  Swire writes:

The FBI argues that new wiretapping mandates on the Internet are needed because it is “going dark,” because new and evolving Internet technologies mean that government may not have a way to get the content of communications with a wiretap order.  In a 2011 paper, Kenesa Ahmad and I argued that “going dark” is the wrong image, and that today should instead be understood as a “golden age of surveillance.”  As members of the IAPP know, law enforcement and national security agencies today have far greater data gathering capabilities than ever before, such as: (1) location information; (2) information about contacts and confederates; and (3) an array of new databases that create digital dossiers about individuals’ lives.

As the debate heats up about expanding CALEA requirements to the Internet, there are thus strong privacy and cybersecurity reasons for concern about the FBI’s proposed approach.

What's Scary About Big Data, and How to Confront It

Any discussion of the benefits, and the risks, presented by Big Data often focuses on the far-off future.  The world of Minority Report is frequently invoked, but in the wake of April’s “Big Data Week,” it is time to recognize that Big Data is already here.  In their recent book, Big Data: A Revolution That Will Transform How We Live, Work, and Think, Viktor Mayer-Schönberger and Kenneth Cukier act as heralds of Big Data and suggest that the real phenomenon is the “datafication” of our world.  They describe the transformation of our entire world into “oceans of data that can be explored,” offering a new perspective on reality.  The language and rhetoric of the book highlight Big Data’s potential: its scale, they suggest, allows us to “extract new insights” and “create new forms of value” in ways that will fundamentally change how we interact with one another.

These new insights can be used for good or for ill, but that’s true of any new piece of knowledge.  What exactly is it then that some find so disconcerting about Big Data?

Mayer-Schönberger and Cukier recognize that Big Data is on a “direct collision course” with our traditional privacy paradigms, and further, that it opens the door to create the sort of propensity models seen in Minority Report.  However, the pair are more concerned with what they term the “dictatorship of data.”  They fear that well-meaning organizations may “become so fixated on the data, and so obsessed with the power and promise it offers, that [they] fail to appreciate its limitations.”

And these limitations are very real.  The popular statistician Nate Silver argues that it is time to admit that “we have a prediction problem.  We love to predict things–and we aren’t very good at it.” It is this dynamic that presents the biggest worries about Big Data.  Its promise is that by transforming our entire world, our whole experience, into data points, the numbers will be able to speak for themselves; but this alone will not cure our prediction predilection.  As Kate Crawford of Microsoft Research recently pointed out, Big Data is full of hidden biases. “Data and data sets are not objective,” she states. “They are creations of human design.”

Google Flu Trends is often held out as something that can only be done on the scale provided by Big Data.  Using aggregated Internet searches to chart the spread of a disease demonstrates how seemingly mundane web browsing can produce new insights, but it is important to recognize the limitations behind the project’s underlying algorithms.  Google Flu Trends got things wrong this year. Why?  As Google admits, not everyone who searches for “flu” is actually sick. This year, due to extensive media coverage, more people than anticipated were using Google to learn more.  The result was that the algorithms behind the scenes began to see signs of the flu’s spread where it didn’t actually exist. Google Flu Trends’ mistake can be excused for a number of reasons: not only is the tool largely a data experiment, but it also has a generally benevolent purpose.  Had a similar algorithm informed a decision by the CDC to quarantine a community or otherwise directly impact individuals, it would be a different conversation. Organizations and individuals need to become more aware of the biases and assumptions that underlie our datafied world.
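To make that failure mode concrete, here is a deliberately simplified, hypothetical sketch in Python.  The per-case search rate and the weekly numbers are invented for illustration and have nothing to do with Google’s actual model; the point is only that a model assuming searches track sickness will overestimate when people search for other reasons.

```python
# Hypothetical sketch (not Google's actual model): a naive estimator that
# assumes flu-related searches scale linearly with real flu cases.
SEARCHES_PER_CASE = 4.0          # assumed calibration from earlier seasons

def estimate_cases(flu_searches: int) -> float:
    """Naively infer weekly flu cases from weekly search volume."""
    return flu_searches / SEARCHES_PER_CASE

# A typical week: most searches come from people who are actually sick.
print(estimate_cases(40_000))                        # -> 10000.0

# A week of heavy media coverage: healthy-but-curious people search too,
# and the same model "sees" flu that is not there.
curiosity_searches = 60_000
print(estimate_cases(40_000 + curiosity_searches))   # -> 25000.0, an overestimate
```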

Building that awareness requires establishing a data conversation among users. To strengthen individual privacy without cutting off technological innovation, individuals need to be educated about how their data is used. To start this conversation, we need more transparency. Jules Polonetsky and Omer Tene suggest that organizations should disclose the logic underlying their decision-making processes to the extent possible without compromising their algorithmic “secret sauce.” Such disclosure has two key benefits: it allows us to monitor how data is used, and it helps individuals become more active participants in that use.

Today, the data deluge that Big Data presents encourages passivity and misguided efforts to get off the grid.  With an “Internet of Things” ranging from our cars to our appliances, even to our carpets, retreating to our homes and turning off our phones will do little to stem the datafication tide. Transparency for transparency’s sake is meaningless; we need mechanisms to achieve transparency’s benefits. We need to encourage users to see their data as a feature that can be turned on or off, and toggled at will. Letting users declare their own data preferences will encourage individuals to care about what their data says about them and to engage actively in how their information is processed.

The challenge will be making this process both easily accessible and fun for users. The BlueKai Registry suggests one possible avenue by allowing consumers to see what data companies think about their computer, and Google and Yahoo already offer settings managers that let users select who sees what data. More organizations must think carefully about how best to strike a balance between user-friendly and comprehensive controls.

At the same time, transparency allows experts to police companies in order to monitor, expose, and prevent practices we do not want. Mayer-Schönberger and Cukier call for the rise of the “algorithmist,” a new professional who would evaluate the selection of data sources, the choice of analytical tools, and the algorithms themselves. While offering individuals opportunities to understand and challenge how decisions about them are made is important, internal algorithmists, alongside the watchful eyes of regulators and privacy advocates, can help ensure that companies are held accountable. This could go a long way toward alleviating fears about Big Data and providing an environment where society can safely maximize its benefits.

New Study Shows Need for De-identification Best Practices

Publicly releasing sensitive information is risky.  In 1997, Latanya Sweeney used full date of birth, five-digit ZIP code, and gender to show that seemingly anonymous medical data could be linked to an actual person when she uncovered the health information of William Weld, the former governor of Massachusetts.  In a new study, Sweeney analyzes the data available in the Personal Genome Project (PGP) and shows once again that many people can be re-identified using date of birth, ZIP code, and gender when other data, such as a voter registration list, is available.
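To illustrate the kind of linkage attack at issue, here is a minimal Python sketch using pandas.  The records, column names, and the assumption that the quasi-identifiers match exactly are invented for illustration; this is not Sweeney’s actual data or method.

```python
# Hypothetical illustration of a linkage (re-identification) attack:
# join "de-identified" records to a public voter list on the shared
# quasi-identifiers (date of birth, ZIP code, gender).
import pandas as pd

medical = pd.DataFrame([
    {"dob": "1945-07-31", "zip": "02138", "sex": "M", "diagnosis": "hypertension"},
    {"dob": "1982-03-02", "zip": "90210", "sex": "F", "diagnosis": "asthma"},
])

voters = pd.DataFrame([
    {"name": "J. Example", "dob": "1945-07-31", "zip": "02138", "sex": "M"},
    {"name": "A. Sample",  "dob": "1982-03-02", "zip": "90210", "sex": "F"},
])

# Anyone unique on (dob, zip, sex) in both tables is re-identified.
linked = medical.merge(voters, on=["dob", "zip", "sex"])
print(linked[["name", "diagnosis"]])
```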

Sweeney’s work is important, but we don’t think it should be considered an indictment of de-identification.  The cases so often cited as proof that de-identification doesn’t work – the AOL search data release, the Netflix Prize, the Weld example, and the PGP data – are all examples of barely or very poorly de-identified data.  De-identification experts do NOT consider a publicly disclosed database containing full date of birth, five-digit ZIP code, and gender to be de-identified.  Full date of birth alone divides a population into more than 36,000 groups, the roughly 43,000 US ZIP codes divide it further, and adding gender yields over 3 billion possible combinations – far more than the US population.  Publicly releasing a database with that many unique combinations allows additional databases to be linked and gives attackers all the time in the world to examine the data. Public disclosure thus greatly increases the risk of identifying individuals from a database.
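A back-of-the-envelope check of those figures (the inputs are assumptions: roughly a century of possible birth dates and roughly 43,000 active US ZIP codes):

```python
# Rough arithmetic behind the uniqueness claim (assumed inputs).
birth_dates = 100 * 365      # ~36,500 possible dates of birth across a century
zip_codes = 43_000           # roughly the number of US ZIP codes
genders = 2

combinations = birth_dates * zip_codes * genders
print(f"{combinations:,}")   # 3,139,000,000 -- about 3.1 billion, far more than the US population
```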

Sweeney’s study shows the importance of very strong de-identification practices when data is disclosed publicly.  With public data, organizations should use very strong de-identification techniques, such as the Privacy Analytics Risk Assessment Tool developed by Dr. Khaled El Emam or the use of differential privacy as proposed by Dr. Cynthia Dwork.
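For readers unfamiliar with the differential privacy approach mentioned above, here is a minimal sketch of its basic building block, the Laplace mechanism, applied to a simple count query.  The epsilon value and the example query are illustrative choices only, not a recommendation from the study or from FPF.

```python
# Minimal sketch of the Laplace mechanism: add calibrated random noise to an
# aggregate query so that any single individual's presence or absence has a
# bounded effect on the published result.
import numpy as np

def noisy_count(true_count: int, epsilon: float) -> float:
    """Return a differentially private count (a count query has sensitivity 1)."""
    scale = 1.0 / epsilon                        # noise scale = sensitivity / epsilon
    return true_count + np.random.laplace(loc=0.0, scale=scale)

# Example: publicly release how many records in a dataset share some attribute.
print(noisy_count(128, epsilon=0.5))             # e.g. 126.4 -- close, with plausible deniability
```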

For nonpublic databases, however, strong de-identification techniques may not strike the right balance between data utility and privacy.  When nonpublic databases are protected by both technical and administrative controls, reasonable de-identification techniques, as opposed to very strong ones, may be appropriate.  In that setting, attackers do not have unlimited time to break the technical de-identification protections, third-party data is not readily available for linkage, and measures are in place that provide legal commitments against misuse.  Data breaches can of course occur, but we need to recognize the very different status of protected versus unprotected data and appreciate the range of protections that can support a de-identification promise.

FPF staff are conducting research exploring the different risk profiles of nonpublic and publicly released databases and the relevant best practices for “pretty good” de-identification of restricted databases.  Please contact us if you are interested.