Data Sharing … By Any Other Name (Would Still Be a Complex, Multi-stakeholder Situation)
“It is widely agreed that (data) should be shared, but deciding when and with whom raises questions that are sometimes difficult to answer.”[1]
Data sets held by commercial and government organizations are an increasingly necessary and valuable resource for researchers. Such data may become the evidence in evidence-based policymaking[2] or the data used to train artificial intelligence.[3] Some large data sets are controlled by government agencies, non-governmental organizations, or academic institutions, but many are being accumulated within the private sector. Academic researchers value access to this data as a way to investigate consumer, commercial, and scientific questions at a scale they are unable to reach using conventional research data gathering techniques alone. Such data gives researchers access to information that allows them to answer questions on topics ranging from bias in targeted advertising, to the influence of misinformation on election outcomes, to early diagnosis of diseases through use of health and physiological data collected by fitness and health apps.
Recent attention on platform data sharing for research is only one conversation in the cacophony of cross-talk on data sharing. The term “data sharing” is used in many different ways to describe a relationship in which data held by one organization is shared with another organization for a new purpose. Some uses of the term relate to academic and scientific research, and some relate to the transfer of data for commercial or government purposes. At a moment when various types of data sharing have been elevated even to the attention of the US Congress and the European Commission,[5] it is imperative that we be more precise about which forms of sharing we are referencing, so that the interests of the parties are adequately considered and the various risks and benefits are appropriately contextualized and managed. In the table below, we outline a taxonomy for the multiplicity of data sharing relationships.
Ultimately, the relationships between these entities are complex. In many cases, the relationship is one-to-many, with a single agency or corporation sharing data with multiple researchers and civil society organizations or, as in the case of data trusts or data donation platforms, potentially one person sharing data with many research or commercial organizations through a trusted intermediary steward.[6] Likewise, researchers and civil society organizations may concurrently pursue data from multiple corporate or government organizations, often in order to address challenges that require extremely large quantities of data (Big Data) or complex networks of related data. This data flow is never just along a single channel, nor does it often stop after a single transfer. Governments and corporations share data with researchers; researchers return that data, generate new data, and share analysis, new questions, and outcomes back around.
Managing these complex relationships requires multi-layered contracts, defined procedures, accountability mechanisms, and other technical and policy controls. The terms for data sharing cover the obligations of both parties, including privacy, ethics, governance, and other good stewardship protocols. Changes in the legislative landscape around data protection, privacy, and security mean that these relationships must adjust periodically to meet legal compliance obligations, whether on the data-sharing or the data-using side.
At the Future of Privacy Forum, we are working to add context, nuance, and a considered evaluation of the needs of these many players in order to create guidelines and best practices to support data sharing, particularly for the conduct of scientific and evidence-based policy research. What data is shared, and under what conditions, controls, contracts, and use environments, all have important privacy and governance implications. We have been actively working in this area since 2015, and we continue to engage with various interested organizations around the challenges in today’s digital environment. With respect to the sharing of data itself, FPF is focused on finding ways to incorporate proportionate precautions so that any sharing activities adequately protect privacy and are designed with a full understanding of potential harms to the people whose data is transferred or the communities of which they are a part.
Data Sharing Relationships
Data sharing organization: Government Agencies
Data using organization: Researchers and Research Institutions
Outcome of data sharing: Researchers conduct evidence-based evaluations of public programs.

Data sharing organization: Researchers and Research Institutions
Data using organization: Corporate sponsors, government agencies, and participant communities
Outcome of data sharing: Researchers whose work is sponsored by corporations, or who have privileged access to corporate data assets, return data gathered for future corporate research and, in many cases, retain copies of that data for future scientific work. Researchers whose work is sponsored by or conducted under a government contract return data gathered for future agency research and, in many cases, retain copies of that data for future scientific work. Citizen groups, journalists, and communities of interest (e.g., patient advocacy groups) can gain access to data about themselves gathered during the research process so that they can use it for future treatment, advocacy, or research participation.
Terms to describe the relationship: Return of research data and/or research results[18]

Data sharing organization: Researchers and Research Institutions
Data using organization: Researchers and Research Institutions
Outcome of data sharing: Researchers can reuse other researchers’ data, or combine their primary data with others’ secondary data, to answer novel questions without putting people at risk of research harms by conducting further research with them. Archives can collect the primary data from multiple researchers, streamlining the process of acquiring data to answer novel questions by re-examining existing data rather than conducting further research with human participants.

Data sharing organization: Data stewardship bodies, such as Data Trusts or Data Donation Platforms
Data using organization: Researchers and Research Institutions; Government Agencies; Private Companies or Corporations
Outcome of data sharing: Individuals and groups share their data with others according to their interests, as specified to and protected by a trusted fiduciary actor.
Terms to describe the relationship: Data Trusts, Data Donation
[1] HHS Office of Research Integrity, ORI Introduction to RCR. https://ori.hhs.gov/content/Chapter-6-Data-Management-Practices-Data-sharing
[2] H.R.4174 – 115th Congress (2017-2018): Foundations for Evidence-Based Policymaking Act of 2018. (2019, January 14). https://www.congress.gov/bill/115th-congress/house-bill/4174
[3] “The Biden Administration Launches the National Artificial Intelligence Research Resource Task Force”. https://www.whitehouse.gov/ostp/news-updates/2021/06/10/the-biden-administration-launches-the-national-artificial-intelligence-research-resource-task-force/
[4] Goroff, Daniel, Jules Polonetsky, and Omer Tene. (2018). Privacy Protective Research: Facilitating Ethically Responsible Access to Administrative Data. The ANNALS of the American Academy of Political and Social Science, Vol. 675, Issue 1, pp. 46-66. https://doi.org/10.1177/0002716217742605.
Harris, Leslie and Chinmayi Sharma. (2017). Understanding Corporate Data Sharing Decisions: Practices, Challenges, And Opportunities for Sharing Corporate Data with Researchers. Future of Privacy Forum. https://fpf.org/wp-content/uploads/2017/11/FPF_Data_Sharing_Report_FINAL.pdf.
[5] European Commission. (2021). “A European Strategy for Data” https://digital-strategy.ec.europa.eu/en/policies/strategy-data
[6] Open Data Institute. (2020). “Data Trusts in 2020”. https://theodi.org/article/data-trusts-in-2020
Note: This article is part of a larger series focused on managing the risks of artificial intelligence (AI) and analytics, tailored toward legal and privacy personnel. The series is a joint collaboration between bnh.ai, a boutique law firm specializing in AI and analytics, and the Future of Privacy Forum, a non-profit focusing on data governance for emerging technologies.
Behind all the hype, AI is an early-stage, high-risk technology that creates complex grounds for discrimination while also posing privacy, security, and other liability concerns. Given recent EU proposals and FTC guidance, AI is fast becoming a major topic of concern for lawyers. Because AI has the potential to transform industries and entire markets, those at the cutting edge of legal practice are naturally bullish about the opportunity to help their clients capture its economic value. Yet to act effectively as counsel, lawyers must also be vigilant of the very real challenges of AI. Lawyers are trained to respond to risks that threaten the market position or operating capital of their clients. However, when it comes to AI, it can be difficult for lawyers to provide the best guidance without some basic technical knowledge. This article shares some key insights from our shared experiences to help lawyers feel more at ease responding to AI questions when they arise.
I. AI Is Probabilistic, Complex, and Dynamic
There are many different types of AI, but over the past few decades, machine learning (ML) has become the dominant paradigm.[1] ML algorithms identify patterns in recorded data and apply those patterns to new data to try to make accurate decisions. This means that ML-based decisions are probabilistic in nature. Even if an ML system could be perfectly designed and implemented, it is statistically certain that at some point it will produce a wrong result. All ML systems incorporate probabilistic statistics, and all of them can produce incorrect classifications, recommendations, or other outputs.
ML systems are also fantastically complex. Contemporary ML systems can learn billions of rules or more from data and apply those rules to myriad interacting data inputs to arrive at an output recommendation. Embed that billion-rule ML system into an already-complex enterprise software application and even the most skilled engineers can lose track of precisely how the system works. To make matters worse, ML systems decay over time, losing the fitness for their use case that they had when first trained. Most ML systems are trained on a snapshot of a dynamic world as represented by a static training dataset. When events in the real world drift, change, or crash (as in the case of COVID-19) away from the patterns reflected by that training dataset, ML systems are likely to be wrong more frequently and to cause issues that require legal and technical attention. Even in the moment of the “snapshot,” there are other qualifiers for the reliability, effectiveness, and appropriateness of training data: how it is collected, processed, and labeled all bear on whether it is sufficient to inform an AI system in a way fit for a given application or population.
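To make these characteristics concrete, the following sketch (our illustration, using synthetic data, scikit-learn, SciPy, and an arbitrary threshold, not any particular production system) trains a simple classifier, shows that its output is a probability rather than a certainty, and runs a basic statistical check for the kind of input drift a model risk management program would monitor.

```python
# Minimal sketch: probabilistic outputs plus a simple input-drift check.
# All data, thresholds, and numbers here are synthetic and illustrative.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# "Snapshot" training data: one feature, binary outcome.
X_train = rng.normal(loc=0.0, scale=1.0, size=(1000, 1))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)

# The output is a probability, not a certainty: a 0.95 score still implies
# roughly one error in twenty similar cases.
print(model.predict_proba([[1.2]]))

# Later, live inputs drift away from the training distribution.
X_live = rng.normal(loc=0.8, scale=1.3, size=(1000, 1))

# A simple two-sample test flags the shift so the model can be reviewed or
# retrained before error rates quietly climb.
result = ks_2samp(X_train[:, 0], X_live[:, 0])
if result.pvalue < 0.01:  # illustrative threshold
    print(f"Input drift detected (KS statistic = {result.statistic:.2f}); review the model.")
```

Real monitoring programs track many more signals, but even this much makes the point: an ML system’s correctness is a statistical property that has to be measured continuously, not assumed.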
While all this may sound intimidating, an existing regulatory framework addresses many of these basic performance risks. Large financial institutions have been deploying complex decision-making models for decades, and the Federal Reserve’s model risk management guidance (SR 11-7) lays out specific process and technical controls that are a useful starting point for handling the probabilistic, complex, and dynamic characteristics of AI systems. Most commercial AI projects would benefit from some aspects of model risk management, whether or not they are monitored by federal regulators. Lawyers at firms and in-house alike who find themselves needing to consider AI-based systems would do well to understand the options and best practices for model risk management, starting with understanding and generalizing the guidance offered by SR 11-7.
II. Make Transparency an Actionable Priority
Immense complexity and inherent statistical uncertainty in ML systems make transparency a difficult task. Alas, parties deploying—and thereby profiting from—AI can nonetheless be held liable for issues relating to a lack of transparency. Governance frameworks should include steps to promote transparency, whether preemptively or as required by industry- or jurisdiction-specific regulations. For example, the Equal Credit Opportunity Act (ECOA) and the Fair Credit Reporting Act (FCRA) mandate customer-level explanations known as “adverse action notices” for automated decisions in the consumer finance space. These laws set an example for the content and timing of notifications relating to AI decisions that could adversely affect customers, as well as for establishing the terms of an appeals process against those decisions. Explanations that include a logical consumer recourse process dramatically decrease the risks associated with AI-based products and help prepare organizations for future AI transparency requirements. New laws, like the California Privacy Rights Act (CPRA) and the proposed EU rules for high-risk AI systems, will likely require high levels of transparency, even for applications outside of financial services.
Some AI system decisions may be sufficiently interpretable to nontechnical stakeholders today, like the written adverse action notices mentioned above, in which reasons for certain decisions are spelled out in plain English to consumers. But oftentimes the more realistic goal for an AI system is to be explainable to its operators and direct overseers.[2]
When a system is not fully understood by its operators, it becomes much harder to identify and sufficiently mitigate its risks. One of the best strategies for promoting transparency, particularly in light of the challenges around the “black-box” systems that are unfortunately common in the US today, is to rigorously pursue best practices for AI system documentation. This is good news for lawyers, who are adept in the skill and attention to detail required to institute and enforce such documentation practices. Standardized documentation of AI systems, with emphasis on development, measurement, and testing processes, is crucial to enable ongoing and effective governance of AI systems. Attorneys can help by creating templates for such documentation and by assuring that documented technology and development processes are legally defensible.
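To give a flavor of how such explanations are often generated in practice, the sketch below (a simplified illustration with invented feature names and weights, not the methodology any particular lender uses) derives adverse-action-style reason codes from a hypothetical linear credit-scoring model by ranking each feature’s contribution to a declined applicant’s score.

```python
# Minimal sketch: adverse-action-style "reason codes" from a linear model.
# Feature names, weights, and applicant values are hypothetical.
import numpy as np

feature_names = ["utilization", "late_payments", "credit_age_years", "income_k"]
coefficients = np.array([-2.0, -1.5, 0.8, 0.6])    # illustrative model weights
applicant = np.array([0.9, 3.0, 1.2, 38.0])        # one declined applicant
population_mean = np.array([0.3, 0.5, 8.0, 55.0])  # baseline for comparison

# Contribution of each feature to this applicant's score relative to baseline.
contributions = coefficients * (applicant - population_mean)

# The most negative contributions become the stated reasons for the denial.
order = np.argsort(contributions)
for i in order[:2]:
    print(f"Principal reason for adverse action: {feature_names[i]}")
```

Documenting exactly how such reasons are computed, and verifying that they match the deployed model, is precisely the kind of artifact attorneys can help standardize.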
III. Bias is a Major Problem—But Not the Only Problem
Algorithmic bias can generally be thought of as outputs of an AI system that exhibit unjustified differential treatment between two groups. AI systems learn from data, including the biases embedded in that data, and can perpetuate those biases on a massive scale. The racism, sexism, ageism, and other biases that permeate our culture also permeate the data collected about us, and in turn the AI systems trained on that data.
On a conceptual level, it is important to note that although algorithmic bias often reflects unlawful discrimination, it does not constitute unlawful discrimination per se. Bias also includes the broader category of unfair or unexpected inequitable outcomes. While these may not amount to illegal discrimination of protected classes, they may still be problematic for organizations, leading to other types of liability or significant reputational damage. And unlawful algorithmic bias puts companies at risk of serious liability under cross-jurisdictional anti-discrimination laws.[3] This highlights the need for organizations to adopt methods that test for and mitigate bias on the basis of legal precedent.
Because today’s AI systems learn from data generated—in some way—by people and existing systems, there can be no unbiased AI system. If an organization is using AI systems to make decisions that could potentially be discriminatory under law, attorneys should be involved in the development process alongside data scientists. Anti-discrimination laws, while imperfect, provide some of the clearest guidance available for AI bias problems. While data scientists might find the stipulations in those laws burdensome, the law offers some answers in a space where answers are very hard to find. Moreover, academic research and open-source software addressing algorithmic bias are often published without serious consideration of applicable laws. So, organizations should take care to ensure that their code and governance practices for identifying and mitigating bias have a firm basis in applicable law.
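One widely cited screening test for such disparities borrows the “four-fifths rule” from U.S. employment-selection guidance: compare each group’s rate of favorable outcomes to that of the highest-rate group and flag ratios below 0.8 for review. The sketch below (hypothetical decisions and an illustrative threshold; a first-pass indicator, not a legal finding) shows the calculation.

```python
# Minimal sketch: adverse impact ratio ("four-fifths rule") on hypothetical
# model decisions, grouped by a protected attribute.
from collections import Counter

# Hypothetical (group, approved) outcomes from a deployed model.
decisions = ([("group_a", True)] * 60 + [("group_a", False)] * 40
             + [("group_b", True)] * 42 + [("group_b", False)] * 58)

approved = Counter(g for g, ok in decisions if ok)
totals = Counter(g for g, _ in decisions)
rates = {g: approved[g] / totals[g] for g in totals}

reference = max(rates, key=rates.get)            # highest-approval group
for group, rate in rates.items():
    ratio = rate / rates[reference]
    flag = "review" if ratio < 0.8 else "ok"     # four-fifths threshold
    print(f"{group}: approval rate {rate:.2f}, impact ratio {ratio:.2f} ({flag})")
```

Counsel should confirm which tests and thresholds are appropriate for a given jurisdiction and use case before any such metric is relied upon.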
Organizations are also at risk of over-indexing on bias while overlooking other important types of risk. Issues of data privacy, information security, product liability, and third-party risks, as well as the performance and transparency problems discussed in previous sections, are all critical risks that firms should, and eventually must, address in bringing robust AI systems to market. Is the system secure? Is the system using data without consent? Many organizations are operating AI systems without clear answers to these questions. Look for bias problems first, but don’t get outflanked by privacy and security concerns or an unscrupulous third party.
IV. There Is More to AI System Performance Than Accuracy
Over decades of academic research and countless hackathons and Kaggle competitions, demonstrating accuracy on public benchmark datasets has become the gold standard by which a new AI algorithm’s quality is measured. ML performance contests such as the KDD Cup, Kaggle, and MLPerf have played an outsized role in setting the parameters for what constitutes “data science.”[4] These contests have undoubtedly contributed to the breakneck pace of innovation in the field. But they’ve also led to a doubling-down on accuracy as the yardstick by which all applied data science and AI projects are measured.
In the real world, however, using accuracy to measure all AI is like using a yardstick to measure the ocean. It is woefully inadequate to capture the broad risks associated with making impactful decisions quickly and at web-scale. The industry’s current conception of accuracy tells us nothing about a system’s transparency, fairness, privacy, or security, in addition to presenting a limited representation of what the construction of “accuracy” itself claims to measure. In a seemingly shocking admission, forty research scientists added their names to a paper demonstrating that accuracy on test data benchmarks often does not translate to accuracy on live data.
What does this mean for attorneys? Attorneys and data scientists need to work together to create more robust ways of benchmarking AI performance that focus on real-world performance and harm. While AI performance and legality will not always be the same, both professions can revise current thinking to imagine performance beyond high scores for accuracy on benchmark datasets.
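As a small illustration of looking beyond a single score, the sketch below (invented labels and predictions) reports precision, recall, and the full confusion matrix alongside accuracy; a real-world review would add error rates broken out by subgroup, calibration, stability over time, and performance on out-of-distribution data.

```python
# Minimal sketch: reporting more than one accuracy number for a classifier.
# The labels and predictions are invented for illustration.
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]   # illustrative real-world outcomes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]   # illustrative model decisions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))   # cost of false alarms
print("recall   :", recall_score(y_true, y_pred))      # cost of missed cases
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
```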
V. The Hard Work Is Just Beginning
Unfortunately, at this stage of the industry’s development, there are few professional standards for AI practitioners. Although AI has been the subject of academic research since at least the 1950s, and it has been used commercially for decades in financial services, telecommunications, and e-commerce, AI is still in its infancy throughout the broader economy. This too presents an opportunity for lawyers. Your organization probably needs AI documentation templates, policies that govern the development and use of AI, and ad hoc guidance to ensure different types of AI systems comply with existing and near-future regulations. If you’re not providing this counsel, technical practitioners are likely operating in the dark when it comes to their legal obligations.
Some researchers, practitioners, journalists, activists, and even attorneys have started the work of mitigating the risks and liabilities posed by today’s AI systems. Indeed, there are statistical tests to detect algorithmic discrimination and even hope for future technical wizardry to help mitigate against it. Businesses are beginning to define and implement AI principles and make serious attempts at diversity and inclusion for tech teams. And laws like ECOA, GDPR, CPRA, the proposed EU AI regulation, and others form the legal foundation for regulating AI. However, technical mitigation attempts still falter, many fledgling risk mitigations have proven ineffective, and the FTC and other regulatory agencies are still relying on general antitrust and unfair and deceptive practice (UDAP) standards to keep the worst AI offenders in line. As more organizations begin to entrust AI with high-stakes decisions, there is a reckoning on the horizon.
Author Information
Aaina Agarwal is Counsel at bnh.ai, where she works across the board on matters of business guidance and client representation. She began her career as a corporate lawyer for emerging companies at a boutique Silicon Valley law firm. She later trained in international law at NYU Law, to focus on global markets for data-driven technologies. She helped to build the AI policy team at the World Economic Forum and was a part of the founding team at the Algorithmic Justice League, which spearheads research on facial recognition technology.
Patrick Hall is the Principal Scientist and Co-Founder of bnh.ai, a DC-based law firm specializing in AI and data analytics. Patrick also serves as visiting faculty at the George Washington University School of Business. Prior to co-founding bnh.ai, Patrick led responsible AI efforts at the high-profile machine learning software firm H2O.ai, where his work resulted in one of the world’s first commercial solutions for explainable and fair machine learning.
Sara Jordan is Senior Researcher of AI and Ethics at the Future of Privacy Forum. Her profile includes privacy implications of data sharing, data and AI review boards, privacy analysis of AI/ML technologies, and analysis of the ethics challenges of AI/ML. Sara is an active member of the IEEE Global Initiative on Ethics for Autonomous and Intelligent Systems. Prior to working at FPF, Sara was faculty in the Center for Public Administration and Policy at Virginia Tech and in the Department of Politics and Public Administration at the University of Hong Kong. She is a graduate of Texas A&M University and University of South Florida.
Brenda Leong is Senior Counsel and Director of AI and Ethics at the Future of Privacy Forum. She oversees development of privacy analysis of AI and ML technologies, and manages the FPF portfolio on biometrics and digital identity, particularly facial recognition and facial analysis. She works on privacy and responsible data management by partnering with stakeholders and advocates to reach practical solutions for consumer and commercial data uses. Prior to working at FPF, Brenda served in the U.S. Air Force. She is a 2014 graduate of George Mason University School of Law.
Disclaimer: bnh.ai leverages a unique blend of legal and technical expertise to protect and advance clients’ data, analytics, and AI investments. Not all firm personnel, including named partners, are authorized to practice law.
[1] Commentators have often used the image of Russian nesting (Matryoshka) dolls to illustrate these relationships: AI includes machine learning, and machine learning, in turn, includes deep learning. Machine learning and deep learning have risen to the forefront of commercial adoption of AI in applications areas such as fraud detection, e-commerce, and computer vision. See, e.g., The Definitive Glossary of Higher Mathematical Jargon, MATH VAULT (last accessed Mar. 4, 2021), https://mathvault.ca/math-glossary/#algo; Eda Kavlakoglu, AI vs. Machine Learning vs. Deep Learning vs. Neural Networks: What’s the Difference?, IBM BLOG (May 27, 2020), https://www.ibm.com/cloud/blog/ai-vs-machine-learning-vs-deep-learning-vs-neural-networks.
[2] In recent work by the National Institute for Standards and Technology (NIST), interpretation is defined as a high-level, meaningful mental representation that contextualizes a stimulus and leverages human background knowledge. An interpretable AI system should provide users with a description of what a data point or model output means. An explanation is a low-level, detailed mental representation that seeks to describe some complex process. An AI system explanation is a description of how some system mechanism or output came to be. See David A. Broniatowski, Psychological Foundations of Explainability and Interpretability in Artificial Intelligence (2021), https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=931426.
[3] For example, The Equal Credit Opportunity Act (ECOA), The Fair Credit Reporting Act (FCRA), The Fair Housing Act (FHA), and regulatory guidance, such as the Interagency Guidance on Model Risk Management (Federal Reserve Board, SR Letter 11–7). The EU Consumer Credit Directive, Guidance on Annual Percentage Rates (APR), and General Data Protection Regulation (GDPR) serve to provide similar protections for European consumers.
[4] “Data science” tends to refer to the practice of using data to train ML algorithms, and the phrase has become common parlance for companies implementing AI. The term dates back to 1974 (or perhaps further), coined then by the prominent Danish computer scientist Peter Naur. Data science, despite the moniker, is yet to be fully established as a distinct academic discipline.
The Spectrum of AI: Companion to the FPF AI Infographic
In December of 2020, FPF published the Spectrum of Artificial Intelligence – An Infographic Tool, designed to visually display the variety and complexity of Artificial Intelligence (AI) systems, the fields this science is based on, and a small sample of the use cases these technologies support for consumers. Today, we are releasing the white paper The Spectrum of Artificial Intelligence – Companion to the FPF AI Infographic to expand on the information included in this educational resource and to describe in more detail how the graphic can be used as an aid in education or in developing legislation or other regulatory guidance around AI-based systems. We identify additional, specific use cases for various AI technologies and explain how differing algorithmic architectures and data demands present varying risks and benefits. We discuss the spectrum of algorithmic technology and demonstrate how design factors, data use, and model training processes should be considered for specific regulatory approaches.
Artificial intelligence is a term with a long history. Meant to denote those systems which accomplish tasks otherwise understood to require human intelligence, AI is directly connected to the development of computer science but is based on a myriad of academic fields and disciplines, including philosophy, social science, physics, mathematics, logic, statistics, and ethics. AI, as it is designed and used today, is made possible by the recent advent of unprecedentedly large datasets, increased computational power, and advances in data science, machine learning, and statistical modeling. AI models include programming and system design based on a number of sub-categories, such as robotics, expert systems, scheduling and planning systems, natural language processing, neural networks, computer sensing, and machine learning. In many cases of consumer-facing AI, multiple forms of AI are used together to accomplish the overall performance goal specified for the system. In addition to considerations of algorithmic design, data flows, and programming languages, AI systems are most robust for equitable and stable consumer use when human designers also consider limitations of machine hardware, cybersecurity, and user-interface design.
This paper outlines the spectrum of AI technology, from rules-based and symbolic AI to advanced, developing forms of neural networks, and seeks to put them in the context of other sciences and disciplines, as well as emphasize the importance of security, user interface, and other design factors. Additionally, we seek to make this understandable through providing specific use cases for the various types of AI and by showing how the different architecture and data demands present specific risks and benefits.
Across the spectrum, AI is a combination of various types of reasoning. Rules-based or Symbolic AI is the form of algorithmic design wherein humans draft a complete program of logical rules for a computer to follow. Newer AI advances, particularly in machine learning systems based on neural networks, are able to power computers that carry out the programmer’s initial design but then adapt based on what the system can glean from patterns in the data. These systems can score the accuracy of their results and then connect those outcomes back into the code in order to improve the success of succeeding iterations of the program.
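The difference can be seen in a few lines of code. In the toy sketch below (an invented spam-filtering example), the symbolic approach encodes a human-written rule directly, while the machine-learning approach infers its own rule from labeled examples; scoring the learned model’s mistakes on new data and retraining on them is the feedback loop described above.

```python
# Minimal sketch: a hand-written rule versus a learned model on the same
# toy task. Messages, keywords, and labels are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Rules-based / symbolic approach: a human writes the complete logic.
def rule_based_is_spam(message: str) -> bool:
    return any(word in message.lower() for word in ("free", "winner", "prize"))

# Machine-learning approach: the pattern is learned from labeled examples.
messages = ["free prize inside", "lunch at noon?", "you are a winner", "meeting notes attached"]
labels = [1, 0, 1, 0]  # 1 = spam

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(messages), labels)

new_message = "claim your free ticket"
print("hand-written rule says spam:", rule_based_is_spam(new_message))
print("learned model says spam:", bool(model.predict(vectorizer.transform([new_message]))[0]))
```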
AI systems operate across a broad spectrum of scale. Processes using these technologies can be designed to seek solutions to macro-level problems like environmental challenges: undetected earthquakes, pollution control, and other natural disaster responses. They are also incorporated into personal-level systems for greater access to commercial, educational, economic, and professional opportunities. If regulation is to be effective, it should focus on both technical details and the underlying values and rights that must be protected from adverse uses of AI, to ensure that AI is ultimately used to promote human dignity and welfare.
A .pdf version of the printed paper is available here.
Now, On the Internet, EVERYONE Knows You’re a Dog
An Introduction to Digital Identity
By Noah Katz and Brenda Leong
What is Digital Identity?
As you go through your day, everyone wants to know something about you. The bouncer at a bar needs to know you are over 21, the park ranger wants to see your fishing license, your doctor has to review your medical history, a lender needs to know your credit score, the police must check your driver’s license, and airport security has to confirm your ticket and passport. In the past, you would have had a separate piece of paper or plastic for each of these exchanges, but the Information Revolution has caused a dramatic shift to digital and virtual channels. Shopping, banking, healthcare, gaming, even therapy and exercise, are all activities that can now be performed partially or entirely using online platforms and services. However, systems handling digital transactions struggle to establish trust around personal identification because login credentials vary for every account and passwords are easily forgotten and frequently insecure. Largely because of this “trust gap,” the equivalents of personal identity credentials like a passport and driver’s license have notably lagged other services in moving to an online format. That is starting to change.
Potentially, all these tasks can be accomplished with a single “digital identity,” a system of electronically stored attributes and/or credentials that uniquely identify an individual person. Digital identity systems vary in complexity. At its most basic, a digital ID would simply recreate a physical ID in a digital format. For instance, digital driver’s licenses are coming to augment, and possibly eventually replace, the physical driver’s license or state-issued ID we carry now. Available via an app that provides the platform for verification and security, these digital versions can be used in the same way as a physical ID, to authenticate individual attributes like a unique ID number (Social Security number), birthdate (driver’s license), citizenship (passport), or other government-issued, legal aspects of personhood.
At the other end of the spectrum, a fully integrated digital identity system would provide a platform for a complete wallet and verification process, usable both online and in the physical world. That is, it would authenticate you as an individual, as above, but also tie to all the accounts and access rights you hold, including the credentials represented by those attributes. Such a system would enable you to share or verify your school transcripts or awarded degrees, provide your health records, or access your online accounts and stored data. This sort of credentialing program can also act as an electronic signature, timestamp, or seal for financial and legal transactions.
There are a variety of technologies being explored to provide this type of platform, although there is no clear consensus or standard at this time. There are those who advocate for “self-sovereign identity,” wherein a blockchain-based platform allows individuals to directly control the release of their own information to designated recipients. There are also mobile-based systems that use a combination of cloud and local storage via a mobile device in conjunction with an app to offer a single identity verification platform.
These proposed identification systems are being designed for use in commercial circumstances as well as for accessing government systems and benefits. In countries with national identification cards (most countries other than the U.S. and the UK), the national ID may come to be issued digitally even sooner. Estonia has the most advanced example of such a system: everyone there who has been issued a government ID can provide digital signatures and authentication via a mobile platform, and can use that ID as a driver’s license, a health service identifier, a pass for public transport, a travel document, a voting credential, or a means of banking.
The concept of named spaces and unique identifiers is older than the internet itself. Started in 1841, and fully computerized by the 1970s, Dun and Bradstreet operates a database containing detailed credit information on over 285 million businesses, making it one of the key providers of commercial data and analytics for over a century. Its unique 9-digit identifier is the foundation of the entire system.
The UK’s Companies House, the state registrar for public companies, traces back to the Joint Stock Companies Act of 1844 and the formation of shareholder enterprises. As with D&B, companies are recorded on a public register, but with the added requirement to include the personal data that the Registrar maintains on company personnel; for example, Directors must record name, address, occupation, nationality, and date of birth. The advent of mandatory passports in the twentieth century, along with pseudonymous identification of individuals by governments, such as with Social Security numbers, furthered this trend of personal records based on unique individual identities (and not without controversy).
With the advent of the internet, online identities exploded into every facet of financial, commercial, entertainment, and educational or professional lives, and today many people have tens, if not hundreds, of personal accounts and affiliations, each with a unique, numbered, or assigned digital record. Maintaining awareness of all our accounts has become almost impossible, much less having adequate and accurate oversight as to the security of each vendor, site, or set of login credentials. The possibility of transitioning these accounts to be interoperable with a single, secure digital ID is now becoming more feasible due to advances in mobile technology, faster and less expensive biometric systems, and the availability of cloud services and fast processing capabilities.
How Digital Identity Works
In the past, a new patient at the doctor’s office had to provide at least three separately sourced documents: a driver’s license, a health insurance card, and a medical history. Even now, many offices take a physical license or insurance card and make a digital copy for their files. A digital wallet would allow a new patient to identify themselves, provide proof of insurance, and share their medical history all at once, via their smartphone or other access option.
Importantly, by digitally sending a one-time identity token directly to the vendor or health provider, these systems can be designed to provide the authentication or verification of a status or credential (e.g., an awarded degree), without physically handing over a smartphone and without providing the underlying data (the full transcript). By granularly releasing identity data as necessary for authorization, an ID holder will not have to include or provide more information than is needed to complete the transaction. That bouncer at the bar simply must know you are “over 21,” not your actual birthdate, much less your name and address.
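A highly simplified sketch of that idea appears below. It is illustrative only: real systems rely on asymmetric signatures, issuer public-key infrastructure, and emerging standards such as verifiable credentials or mobile driver’s license specifications, whereas this toy version uses a single shared secret. The point it demonstrates is that the verifier receives a signed claim (“over 21”) and never sees a name or birthdate.

```python
# Minimal sketch: an issuer signs a one-time token carrying a single claim,
# so a verifier can check "over 21" without receiving the underlying data.
# Illustrative only; not a production identity protocol.
import hashlib
import hmac
import json
import secrets
import time

ISSUER_KEY = secrets.token_bytes(32)  # in practice, issuer keys live in managed key infrastructure

def issue_token(claim: dict) -> dict:
    # Bind the claim to a nonce and timestamp so the token is single-use.
    payload = {**claim, "nonce": secrets.token_hex(8), "issued_at": int(time.time())}
    body = json.dumps(payload, sort_keys=True).encode()
    signature = hmac.new(ISSUER_KEY, body, hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}

def verify_token(token: dict) -> bool:
    # In this toy sketch the verifier shares the issuer key; real systems
    # verify an asymmetric signature against the issuer's public key.
    body = json.dumps(token["payload"], sort_keys=True).encode()
    expected = hmac.new(ISSUER_KEY, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, token["signature"])

# The bar's verifier learns only the minimum claim, not a name or birthdate.
token = issue_token({"over_21": True})
print(verify_token(token), token["payload"])
```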
An effective digital ID must be able to perform at least four main tasks:
authentication – ensure that a person is the “true” owner of an identity,
verification – verify specific identity attributes or determine the authenticity of credentials,
authorization – determine what actions may be performed or services accessed on the basis of the authenticated identity, and lastly,
federation – a process for the conveyance of authentication credentials and subscriber attributes across networked systems without sharing excess or unnecessary information.
To authenticate an individual, the system must ensure that a person is who they claim to be, protecting sufficiently against both false negatives (not allowing access to the legitimate account holder) and false positives (wrongly allowing access to unauthorized individuals). Security best practices require that authentication be accomplished via a multi-factor system, requiring at least two of the three factor types: something you know (a password, PIN code, or security question), something you have (a smart card, specific mobile device, or USB token), or something you are (a biometric).
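The sketch below (illustrative only, with invented secrets and passwords, and not production authentication code) checks two of those factors together: a salted password hash for “something you know” and an RFC 6238-style time-based one-time code for “something you have.”

```python
# Minimal sketch: a two-factor check. Secrets and passwords are invented;
# real identity providers use hardened, audited implementations.
import base64
import hashlib
import hmac
import os
import struct
import time

def hash_password(password, salt):
    # "Something you know": store only a salted, stretched hash.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)

def totp(secret, step=30):
    # "Something you have": RFC 6238-style code generated by a device
    # (e.g., a phone app) that holds the shared secret.
    counter = struct.pack(">Q", int(time.time() // step))
    digest = hmac.new(secret, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F
    code = (struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF) % 1_000_000
    return f"{code:06d}"

# Enrollment: values held by the identity provider and the user's device.
salt = os.urandom(16)
stored_hash = hash_password("correct horse battery staple", salt)
device_secret = base64.b32decode("JBSWY3DPEHPK3PXP")  # provisioned to the user's phone

# Login attempt: both factors must pass.
entered_password = "correct horse battery staple"  # typed by the user
entered_code = totp(device_secret)                 # read from the authenticator app
knows = hmac.compare_digest(stored_hash, hash_password(entered_password, salt))
has = hmac.compare_digest(entered_code, totp(device_secret))
print("authenticated:", knows and has)
```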
[NOTE: A biometric is a unique, measurable physical characteristic which can be used to recognize or identify a specific individual. Facial images, fingerprints, and iris scans are all examples of biometrics. For authentication purposes, such as in the digital identity systems under discussion, biometrics are matched on a 1:1 or 1:few basis against an enrolled template. The template, specific to the system provider and not interoperable with other systems, may be stored locally on the device or in cloud storage. However, since operational or circumstantial considerations may preclude the use of biometrics in all cases, systems intended for mass access must offer alternatives as well. The details of biometric systems and the particular risks and benefits thereof are beyond the scope of this article, but while not all digital identity systems are based on biometrics, most will likely include some form of biometric within their authentication processing.]
Once an ID holder is authenticated, the specific attributes or credentials must be verified. This involves confirming that the ID holder has earned or been issued the credentialed attributes they are claiming, whether from a financial institution, an employer, an educational institution, or a government agency.
Authentication and verification may be all that is required for some transactions, but where needed, the system must also be able to confirm authorization, that is, to determine what the person is allowed to see or do within a given system. Successful privacy and security for businesses, organizations, and governments require the enforcement of rigorous access controls. The person who can see certain data is not always the same person authorized to change or manipulate it, and the person authorized to manipulate or process it may not be entitled to share or delete it. Successfully setting and enforcing these controls is one of the most challenging tasks for any organization that collects and uses personal data.
While the first three steps in digital identity systems exist in various forms already, a truly universal digital identity is likely to be successful at a mass scale only if it is federated, meaning that the ID must be usable across institutional, sectoral, and geographic boundaries. A federated identity system would be the most significant departure from the account-specific login and access processes that exist today. Accomplishing such wide-ranging compatibility will require a common set of open standards that institutions, sectors, and countries establish collaboratively and implement globally. A digital wallet will need to seamlessly grant access across many networks: a movie theater verifying that entrants are over 17, banks processing loan applications, hospitals establishing patient status and access records, airports boarding passengers, or amusement parks and stadiums providing scheduled performances and perks.
Global banking and financial services are leading the way on this sort of broad implementation. Therefore, online banking is a constructive digital ID use case:
Authenticate – yes, this is Samir,
Verify – Samir has an account at this bank,
Authorize – Samir has a credit line authorizing him to borrow, which he accesses and e-signs, and
Federate – Samir’s mortgage lender accepts the money transfer from the bank via the digital ID platform.
Banks are motivated to forge ahead on such digital identity systems to improve fraud detection, streamline “know your customer” compliance processes, increase their ability to stop money laundering and other finance-related crimes, and offer superior customer experiences. But by creating secure, standardized digital identity access for online banking, they may also extend access to the large portions of the global population that are currently un- or under-banked, and/or who have minimal governmental infrastructure around legal identity systems.
The Challenges and Opportunities
Privacy, security, equity, data abuse, and user control all raise unique challenges and opportunities for digital ID.
Digital identity, if not deployed correctly, may undermine privacy rights. If not implemented responsibly, and carefully controlled with both technical and legal safeguards, digital IDs might allow for increased location tracking and user profiling, already a concern with cell phone technology. Blockchain technology, if not designed carefully, creates a public, immutable record of information exchanges, regarding where, when, and why a digital ID was requested. And a given digital ID provider may have too much power, with the ability to prevent ID holders from accessing their digital accounts. However, digital IDs also offer the possibility of increased privacy protection if systems are effectively designed to share only the minimum necessary information, and if identification is only established up to the level necessary for the particular exchange or transaction. “Privacy by design,” as well as appropriate defaults and system controls, can prohibit any of the network participants, including the operator, from gaining complete access to users’ transactions, particularly if accompanied by appropriate legislative or regulatory boundaries.
Digital ID likewise has both pros and cons for security. While not perfect, digital IDs are generally harder to lose or counterfeit than a physical document, and they offer significantly greater security than an individual’s hundreds of separate login credentials spread across sites with uncertain levels of protection. However, poor adherence to best practices may result in a centralized store of personal and sensitive information, which may become a more appealing target for hackers and increase the risk of a mass compromise of information. Reliance on centralized databases can be minimized by storing authenticating factors like biometrics locally and by distributing storage of other data, with appropriate security measures and controls.
Inequities can occur along a number of different axes. Since digital identity designs may reflect society’s biases, it is important to mandate and continually measure inclusion and performance. For instance, the UK’s digital ID framework requires ID issuers to submit an annual exclusion report. In addition, because not everyone has a smartphone or internet access, digital IDs risk deepening inequities for those with limited connectivity. Without reliable digital access, groups that have traditionally struggled may continue to lack the privileges that digital IDs promise to provide. On the other hand, according to the World Bank, an estimated 1.1 billion people worldwide cannot officially prove or establish their legal identity. In countries or situations without clear legal processes, or lacking information infrastructures, digital identity systems have the potential to give people who do have smartphones or internet access the ability to receive healthcare, education, finance, and other essential services. Even those without access to a digital device could use a secure physical form, like a barcode, to maintain their digital identity.
Policy Impacts and Conclusion
Individuals are used to the ability to easily control the use of their physical documents. When you hand your passport to a TSA agent, you observe who is seeing it and how it is being used. A digital ID holder will need these same assurances, understanding, and trust. Therefore, ideally, users should be able to identify every instance that their identity was accessed by a vendor. Early systems, like the Australian Digital License App, give citizens some control over their credentials by enabling users themselves to specify the information to share or display. Legislative bodies and regulatory agencies designing or controlling such systems should work closely with industry representatives, security experts, consumer protection organizations, civil rights advocates, and other stakeholders to ensure fair systems are established and monitored appropriately.
Transparency of development, public adoption processes, and procurement systems will be vital to public trust in any such systems, whether privately or publicly operated. In some cases, such systems may even help educate and increase awareness among users of the information that is already collected and held about them, and where and how it is being used, as well as make it easier for them to exert control over the necessary sharing of their information.
Digital identification, integrated to greater or lesser degrees, seems an almost inevitable next step in our digital lives, and overall it offers promising opportunities to improve our access to, and control over, the information about us that already spans the internet. But it is crucial that, moving forward, digital ID systems be responsibly designed, implemented, and regulated to ensure the necessary privacy and security standards are met, and to prevent the abuse of individuals or the perpetuation of inequities against vulnerable populations. While there are important cautions, digital identity has the potential to transform the way we interact with the world, as our “selves” take on new dimensions and opportunities.