Advanced algorithms, machine learning (ML), and artificial intelligence (AI) are appearing across sectors from healthcare to financial services, in applications ranging from voice-activated digital assistants and traffic routing to identifying at-risk students and generating purchase recommendations on online platforms. AI is embedded in new technologies such as autonomous cars and smartphones to enable cutting-edge features, and it is equally being applied to established industries such as agriculture and telecommunications to increase accuracy and efficiency. Machine learning is already becoming the foundation of many of the products and services in our daily lives. It is becoming an underlying structure, in much the same way that electricity faded from novelty to background during the industrialization of modern life 100 years ago.
Understanding AI and its underlying algorithmic processes presents new challenges for privacy officers and others responsible for data governance in companies ranging from retailers to cloud service providers. In the absence of targeted legal or regulatory obligations, AI poses new ethical and practical challenges for companies that strive to maximize consumer benefits while preventing potential harms.
Along with the benefits of the increased use of the AI and ML models underlying new technology, we have also seen public examples of the ways in which these algorithms can reflect some of the most glaring biases within society. Chatbots that “learn” to be racist, policing algorithms with questionable results, and cameras that do not recognize people of certain races have shown over the past few years that AI is not immune to problems of discrimination and bias. However, AI also has many potential benefits, including promising outlooks for the disability community and more accurate diagnoses and other improvements in healthcare. Because AI's potential is so great, it is important to address concerns about its implementation in order to ensure consumer trust and safety. Problems of bias and fairness in ML systems are a key challenge in achieving that trust and safety. The issue is complex because fairness is not a fixed concept: what is “fair” by one measure might not be equitable by another. While many industry leaders have identified controlling bias as a goal in their published AI policies, there is no consensus on exactly how this can be achieved.
In one of the most notable cases of apparent AI bias, ProPublica published a report claiming that an algorithm designed to predict the likelihood that a defendant would reoffend displayed racial bias. The algorithm assigned each defendant a score from 1 to 10 intended to assess the risk that he or she would go on to reoffend, and this number was often used as a factor in determining eligibility for bail. Notably, “race” was not among the inputs used to determine the risk level. Nevertheless, ProPublica found that among defendants who did not go on to reoffend, Black defendants were more than twice as likely as white defendants to have received a mid- or high-risk score. ProPublica correctly highlighted the unfairness of such disparate outcomes, but whether the scores were simply racially biased turns out to be a more complicated question.
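The disparity ProPublica measured is, in effect, a difference in false positive rates: among people who did not reoffend, the share who were nonetheless scored as mid or high risk. A minimal Python sketch of that calculation follows. The record format, the group labels "A" and "B", the toy data, and the cutoff of 5 for "mid or high risk" are all assumptions for illustration, not details of the actual tool or of ProPublica's dataset.

```python
from collections import defaultdict

# Hypothetical cutoff: scores of 5 and above are treated as "mid or high risk".
MID_OR_HIGH = 5


def false_positive_rates(records, threshold=MID_OR_HIGH):
    """For each group, return the share of defendants who did NOT reoffend
    but were still scored at or above the threshold (the false positive rate).

    records: iterable of (group, score, reoffended) tuples.
    """
    flagged = defaultdict(int)          # non-reoffenders scored mid/high
    non_reoffenders = defaultdict(int)  # all non-reoffenders
    for group, score, reoffended in records:
        if not reoffended:
            non_reoffenders[group] += 1
            if score >= threshold:
                flagged[group] += 1
    return {g: flagged[g] / n for g, n in non_reoffenders.items()}


# Toy, made-up records: (group, score from 1 to 10, reoffended?)
records = [
    ("A", 6, False), ("A", 3, False), ("A", 8, True), ("A", 5, False),
    ("B", 2, False), ("B", 4, False), ("B", 7, True), ("B", 6, False),
]
print(false_positive_rates(records))
```

On this toy data, group A's false positive rate (2 of 3 non-reoffenders flagged) is twice group B's (1 of 3), even though each group has exactly one reoffender scored as high risk.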
The algorithm had been calibrated so that a given risk level “meant the same thing” from one defendant to another. Thus, of the defendants who were given a level 7, 60% of white defendants and 61% of Black defendants went on to reoffend, a statistically similar outcome. However, designing the program to achieve this kind of equity (a “7” means roughly a 60% chance of reoffending, across the board) meant that, because recorded reoffense rates differed between the two groups, the distribution across low-, mid-, and high-risk categories placed more Black defendants in the higher scores. When groups' underlying rates differ, there is no mathematical way to equalize both of these measures (a consistent meaning for each score, and equal error rates among those who do not reoffend) at the same time within the same model, except in trivial cases such as perfect prediction. Data scientists have shown that multiple measures of “fairness” may be impossible to achieve simultaneously.
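The tension can be stated precisely for a simplified, binary version of the score (flagged as high risk or not). Write p for a group's base rate of reoffending, PPV for the share of flagged defendants who do reoffend (the binary analogue of a score "meaning the same thing"), FNR for the share of reoffenders who are not flagged, and FPR for the share of non-reoffenders who are flagged. Simple counting of true and false positives then gives the identity below. It is a standard result from the fairness literature, offered as an illustration rather than as a description of the actual tool's design:

```latex
\frac{1-\mathrm{PPV}}{\mathrm{PPV}}
  = \frac{\text{false positives}}{\text{true positives}}
  = \frac{(1-p)\,\mathrm{FPR}}{p\,(1-\mathrm{FNR})}
\quad\Longrightarrow\quad
\mathrm{FPR} = \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\left(1-\mathrm{FNR}\right)
```

If two groups share the same PPV and FNR but have different base rates p, their false positive rates cannot be equal. With hypothetical values PPV = 0.6 and FNR = 0.3, a group with p = 0.5 has FPR ≈ 0.47, while a group with p = 0.4 has FPR ≈ 0.31.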
Just as importantly, how the humans within the system use these scores is impossible to quantify. There is no way to ensure that a judge will weigh the score for one defendant in the same way as the score for another. Because of these tensions, it is important that AI and ML designers and providers are transparent in their interpretation of fairness (what factors are considered, how they are weighted, and how they interact) and that they sufficiently educate their customers about what their technology does and does not do. This is especially important in sensitive fields such as the criminal justice system, financial services, or other applications with legally significant impacts on individual customers.
However, even companies whose systems operate outside such highly charged environments must remain aware of the potential for bias and discrimination. In 2016, the “first international beauty contest judged by machines” premiered. The program was supposed to select the few faces that “most closely resembled human beauty” from more than 6,000 entries. It overwhelmingly selected white faces. This is almost certainly because the training or test datasets included more white faces than others, or because those datasets more often associated images of white faces with “beauty” or “beautiful” in some context. The algorithmic model thus “learned” that “whiteness” was one of the factors contributing to the conclusion “beautiful.”
With many types of machine learning, including deep learning, the exact process by which an algorithm reaches a recommendation is ultimately unclear, even to its programmers. It is therefore all the more important to be able to evaluate outcomes objectively, testing for patterns or trends that reveal undesirable bias. There is no such thing as a system without bias. Instead, a commitment to fairness means designing a system that can be evaluated for illegal, discriminatory, or simply undesirable outcomes. Algorithms trained on existing data from systems historically run by humans will mirror some level of human bias; the goal should be to establish baseline practices for managing or mitigating this risk.
The most basic requirement is ensuring that the datasets on which the system is trained and tested are appropriately representative. The chief science officer of the AI beauty contest mentioned above confirmed that one of the algorithm's problems was that it had not been trained on a sufficient sample of non-white faces. When one race is more highly correlated with the idea of “beauty” in the training data, the algorithm will reflect this bias in its outputs. (For example, face recognition systems developed in Asia distinguish Asian faces better than white ones, while the opposite is true for systems developed in the United States.)
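One way to act on this requirement is to compare each group's share of the training data with its share of the population the system is meant to serve. The Python sketch below assumes that per-example demographic labels are available, which is often not the case in practice; the function name, group names, reference shares, and the 0.8 tolerance are all hypothetical.

```python
from collections import Counter


def representation_gaps(train_labels, reference_shares, min_ratio=0.8):
    """Flag groups whose share of the training data falls below
    min_ratio times their share of a reference population.

    train_labels: one (hypothetical) demographic label per training example.
    reference_shares: dict mapping group -> expected share (summing to ~1).
    min_ratio: hypothetical tolerance; 0.8 flags a group once its observed
        share drops below 80% of its expected share.
    Returns a dict mapping each flagged group -> (observed share, expected share).
    """
    counts = Counter(train_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("training set is empty")
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total
        if observed < min_ratio * expected:
            gaps[group] = (observed, expected)
    return gaps


# Made-up example: 1,000 training images across three hypothetical groups.
train = ["group_x"] * 800 + ["group_y"] * 150 + ["group_z"] * 50
reference = {"group_x": 0.60, "group_y": 0.25, "group_z": 0.15}
for group, (obs, exp) in representation_gaps(train, reference).items():
    print(f"{group}: {obs:.0%} of training data vs {exp:.0%} expected")
```

Here group_y (15% vs 25%) and group_z (5% vs 15%) are flagged, while the over-represented group_x is not. A check like this only catches imbalances along attributes that are actually labeled, so it is a baseline rather than a complete safeguard.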
Similarly, in law enforcement, training datasets are likely to reflect the historically disproportionate incarceration of non-white populations, and models trained on them will produce outputs that reflect those systemic biases. However, identifying potential flaws in datasets can be difficult: some biases are less obvious than those associated with race, gender, or other high-visibility factors. Because unconscious or unintended bias can be present in subtle ways, AI/ML developers must have processes in place to preempt, prevent, or correct it.
One strategy responds to research showing that ensuring the humans behind an algorithm are sufficiently diverse can have a significant impact. Studies have shown that the racial and cultural diversity of the creators of facial recognition software influences the accuracy of the system, which implies that who trains a system is an important consideration. By promoting diversity within their workforces, companies are also more likely to increase the accuracy and value of their systems.
Finally, statistical tools (additional mathematical models) can be used to systematically evaluate program outputs and measure the validity of their recommendations. These auditing programs apply further math to the existing process, identifying patterns that human evaluators might not be able to detect.
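As one simple example of such a check (not a full auditing program), the Python sketch below computes each group's rate of favorable outcomes and compares it with the highest group's rate, flagging ratios below 0.8. That threshold borrows the "four-fifths" rule of thumb from US employment-discrimination guidance; whether it suits a given application is an assumption each auditor must evaluate. The function names, data format, and toy decisions are hypothetical.

```python
def selection_rates(decisions):
    """decisions: iterable of (group, favorable: bool) pairs.
    Returns a dict mapping each group to its share of favorable outcomes."""
    totals, favorable = {}, {}
    for group, was_favorable in decisions:
        totals[group] = totals.get(group, 0) + 1
        favorable[group] = favorable.get(group, 0) + int(was_favorable)
    return {g: favorable[g] / totals[g] for g in totals}


def adverse_impact_audit(decisions, threshold=0.8):
    """Return group -> (impact ratio, flagged?), where the impact ratio is the
    group's favorable-outcome rate divided by the highest group's rate.
    The 0.8 default mirrors the four-fifths rule of thumb (an assumption here)."""
    rates = selection_rates(decisions)
    best = max(rates.values())
    if best == 0:
        raise ValueError("no group received any favorable outcomes")
    return {g: (r / best, r / best < threshold) for g, r in rates.items()}


# Made-up decisions: 50 of 100 approved in group A, 30 of 100 in group B.
decisions = [("A", i < 50) for i in range(100)] + [("B", i < 30) for i in range(100)]
for group, (ratio, flagged) in adverse_impact_audit(decisions).items():
    print(f"{group}: impact ratio {ratio:.2f}{'  <- below threshold' if flagged else ''}")
```

With these made-up decisions, group B's ratio is 0.30 / 0.50 = 0.6, below 0.8, so it is flagged. A ratio test like this says nothing about why a gap exists, so a flag is a prompt for human review rather than a verdict.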
Companies, both those that develop these technologies and the customers that implement them in different areas, have a responsibility to use all the tools in their power to address bias in their machine learning models. Through policy requirements, development guidance, diverse hiring, and sufficient training, they must be able to assure their customers that products and services based on ML models are sufficiently equitable for their particular applications.
The unique features of AI and ML include not just big data's defining characteristic of tremendous volumes of data, but also the additional uses of that data and, most importantly, the multi-layered processing models developed to harness and operationalize it. AI-driven applications offer beneficial services and research opportunities, but they can harm individuals and groups when they are not implemented with a clear focus on controlling for, and managing, bias. Because these systems have such broad impact, it is critical that associated concerns be addressed early in the design cycle, since lock-in effects make it harder to modify harmful design choices later. The design must also include long-term monitoring and review functions, as these systems are literally built to morph and adapt over time. As AI and ML programs are applied across new and existing industries, platforms, and applications, policymakers and corporate privacy officers will want to ensure that the programs they design and implement provide the full benefits of this advancing technology while controlling for, and avoiding, the negative impacts of unfair outputs, with the ultimate goal that all individuals are treated with respect and dignity.
By Maria Nava and Brenda Leong