White House/MIT Big Data Privacy Workshop Recap
Speaking for everyone snowed in in DC, White House Counselor John Podesta remarked that “big snow trumped big data” as he opened, by phone, the first of the Obama Administration’s three big data and privacy workshops. This first workshop focused on advancing the “state of the art” in technology and practice. While these workshops are ultimately the product of Edward Snowden’s NSA leaks last year, Mr. Podesta explained that his big data review group was conducting a broad review on a “somewhat separate track” from an ongoing review of the intelligence community. His remarks focused on several specific examples of the social value of data, but he cautioned that “we need to be conscious of the implications for individuals. How should we think about individuals’ sense of their identity when data reveals things they didn’t even know about themselves?”
To that end, he noted that “we can’t wait to get privacy perfect to get going,” and, because this workshop was designed to focus on the technology around data, he hoped it would help inform the Administration about the current state of data privacy.
Cynthia Dwork, from Microsoft Research, followed Mr. Podesta with a deep dive into differential privacy. In English, as she put it, differential privacy works to ensure that the outcome of any analysis is essentially equally likely whether an individual joins or does not join a database. The goal is to limit the range of potential harms to any individual from participating in data analysis. The challenge posed by big data is that multiple uses of data create a cumulative harm to privacy, which is difficult to measure. Overly accurate estimates of too much information are “blatantly non-private,” Dwork argued.
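To make the idea concrete: the formal guarantee is that the probability of any output changes by at most a small factor whether or not any one person’s record is included. The sketch below, which is an illustration rather than anything presented at the workshop, shows the standard Laplace mechanism for a simple counting query; the dataset, query, and parameter values are hypothetical.

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Differentially private count of records matching `predicate`.

    A counting query has sensitivity 1 (adding or removing any one person
    changes the true answer by at most 1), so Laplace noise with scale
    1/epsilon satisfies epsilon-differential privacy for this query.
    """
    true_count = sum(1 for record in data if predicate(record))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical data: ages of participants in a small survey.
ages = [23, 45, 31, 67, 52, 38, 29, 71]
# The true answer is 4; the published answer is randomized around it.
print(laplace_count(ages, lambda age: age > 40, epsilon=0.5))
```

Smaller values of epsilon mean more noise and stronger privacy; repeated queries consume more of the privacy “budget,” which is the cumulative-harm problem Dwork described.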
While Dwork focused on new technologies to advance privacy, a slate of MIT professors presented brief examples of how big data is providing big social benefits in health care, transportation, and education:
- John Guttag discussed the importance of large-scale data for clinical studies. He pushed back against requiring very specific consents for patient data use, suggesting they would do a lot of harm. “We find a lot of data for one purpose that can be used for another. It’s important not to be too specific.” He suggested meaningful consent could be gained simply by educating patients about the value of their data. “I think we underestimate the members of our society,” he said. “I think most people fear death or the death of a loved one more than a loss of privacy.”
- Manolis Kellis explored how large numbers of data sets are essential to advance discoveries in human disease genomics. He argued that much of our discussion is caught up in the mere illusion of privacy: “Every time you take your coat off, you’re providing your DNA to someone.” Thus, we need to implement restrictions that would mitigate negative uses, such as insurers using genomic data to discriminate against individuals.
- Sam Madden connected the challenges posed by big data to the parallel phenomenon of the Internet of Things. He noted that societal apps and societal applications of data both raise privacy concerns, and argued that very compelling societal goods come from “societal roll-ups of data.” For example, he discussed how risky driver behavior could be mitigated through surveillance: the riskiest category of male drivers will reduce bad driving habits by up to 72% if monitored. “We can argue that this is creepy, but it’s societally compelling,” he said. “We — as a society — have to decide what we’re comfortable with.”
- Anant Agarwal, president of edX, the massive open online course (MOOC) platform created by Harvard and MIT, described big data as a “particle accelerator” for learning. Noting that edX has students in every country in the world, MOOCs can provide interesting insights into how students learn and how they interact with peers. He described data showing how students over time began tackling homework prior to lectures, and suggested that data could eliminate subjective guesswork in education. The challenge is that many of the data benefits in education can only be derived through information sharing, yet adequately protecting individual student information is difficult. Mr. Agarwal noted that his daughter used the same username on edX as she does on Facebook. “We can omit that information,” but students also use identifiers in forums and in other formats, he said. “We’d like to share the data we get with everyone,” he said, but he wondered how that could be done safely. “What is de-identification?” he asked.
When the floor was opened for questions, a skeptic in the crowd noted that one of the biggest drivers of data collection is not social benefit but making money. Mr. Agarwal suggested that was the very reason edX was founded as a non-profit: so that its use of sensitive data could “be judged by different criteria than maximizing return on investment.”
Secretary of Commerce Penny Pritzker suggested that harnessing the potential of data would hinge upon user trust. Highlighting Commerce’s efforts to advance multistakeholder codes of conduct and ensure the efficacy of the U.S.-EU Safe Harbor, Ms. Pritzker suggested government needed to continually evaluate and work with companies to uncover the technologies and practices that promote trust. She expressed hope that efforts like the day’s workshop could help to show that confidence placed in American companies should remain “rock solid.”
The program’s afternoon shifted to a broad discussion of privacy enhancing technologies (PETs), specifically developments in differential privacy, encryption, and accountability systems. There was a recognition that with any computer system, compromises in security and privacy are inevitable: complex software will have bugs, many different people will need access to infrastructure, and interesting data processing will require the use of confidential information or PII.
Danny Weitzner lamented the lack of a better definition of privacy for computer designers and engineers to build toward. Alan Westin’s original definition, that privacy is the claim of individuals, groups, or institutions to determine for themselves when, how, and to what extent their information can be communicated to others, has “led us astray,” Prof. Weitzner argued. He noted that multiple substantive definitions of privacy had come up throughout the day’s discussions, and argued that we “need a way to know what’s going on” in order to “allow data for some purposes, but won’t be misused for others.”
Quoting Judge Reggie B. Walton about the challenges facing the FISA Court, Weitzner noted that “we don’t currently have the tools in everyday systems to assess how information is used.” Weitzner discussed his work on information accountability.
Weitzner then led an extended hypothetical discussion in which MIT in the near future “embrace[s] the potential of data-powered, analytics-driven systems in all aspects of campus life, from education to health care to community sustainability.” Weitzner asked a slate of panelists what they would do as the future chief privacy officer of MIT, and Intel’s David Hoffman suggested that we all need to understand “that a lot of the data about us that’s now out there is not coming from us.” As a result, meaningful transparency must mean more than notice to individuals. Panelists then hit a wide gamut of issues, from the ethical challenges around predictive analysis to the need to get serious about addressing questions of use, teeing up the Administration’s next workshop on the ethics of big data.