Chapter 1 & 2 PDF
Summary
This document introduces machine learning as a basis for consequential decision-making and explains why fairness is a central concern. It traces how demographic disparities and biases can enter at each stage of the machine learning loop (measurement, learning, action, and feedback) and asks when deploying an automated decision-making system is legitimate at all.
Full Transcript
1 Introduction

Our success, happiness, and wellbeing are never fully of our own making. Others’ decisions can profoundly affect the course of our lives: whether to admit us to a particular school, offer us a job, or grant us a mortgage. Arbitrary, inconsistent, or faulty decision-making thus raises serious concerns because it risks limiting our ability to achieve the goals that we have set for ourselves and access the opportunities for which we are qualified. So how do we ensure that these decisions are made the right way and for the right reasons?

While there’s much to value in fixed rules, applied consistently, good decisions take available evidence into account. We expect admissions, employment, and lending decisions to rest on factors that are relevant to the outcome of interest. Identifying details that are relevant to a decision might happen informally and without much thought: employers might observe that people who study math seem to perform particularly well in the financial industry. But they could test these observations against historical evidence by examining the degree to which one’s major correlates with success on the job. This is the traditional work of statistics—and it promises to provide a more reliable basis for decision-making by quantifying how much weight to assign certain details in our determinations.

A body of research has compared the accuracy of statistical models to the judgments of humans, even experts with years of experience. In many head-to-head comparisons on fixed tasks, data-driven decisions are more accurate than those based on intuition or expertise. As one example, in a 2002 study, automated underwriting of loans was both more accurate and less racially disparate.1 These results have been welcomed as a way to ensure that the high-stakes decisions that shape our life chances are both accurate and fair.

Machine learning promises to bring greater discipline to decision-making because it offers to uncover factors that are relevant to decision-making that humans might overlook, given the complexity or subtlety of the relationships in historical evidence. Rather than starting with some intuition about the relationship between certain factors and an outcome of interest, machine learning lets us defer the question of relevance to the data themselves: which factors—among all that we have observed—bear a statistical relationship to the outcome.

Uncovering patterns in historical evidence can be even more powerful than this might seem to suggest. Breakthroughs in computer vision—specifically object recognition—reveal just how much pattern-discovery can achieve. In this domain, machine learning has helped to overcome a strange fact of human cognition: while we may be able to effortlessly identify objects in a scene, we are unable to specify the full set of rules that we rely upon to make these determinations. We cannot hand code a program that exhaustively enumerates all the relevant factors that allow us to recognize objects from every possible perspective or in all their potential visual configurations. Machine learning aims to solve this problem by abandoning the attempt to teach a computer through explicit instruction in favor of a process of learning by example. By exposing the computer to many examples of images containing pre-identified objects, we hope the computer will learn the patterns that reliably distinguish different objects from one another and from the environments in which they appear.
This can feel like a remarkable achievement, not only because computers can now execute complex tasks but also because the rules for deciding what appears in an image seem to emerge from the data themselves. But there are serious risks in learning from examples. Learning is not a process of simply committing examples to memory. Instead, it involves generalizing from examples: homing in on those details that are characteristic of (say) cats in general, not just the specific cats that happen to appear in the examples. This is the process of induction: drawing general rules from specific examples—rules that effectively account for past cases, but also apply to future, as yet unseen cases, too. The hope is that we’ll figure out how future cases are likely to be similar to past cases, even if they are not exactly the same.

This means that reliably generalizing from historical examples to future cases requires that we provide the computer with good examples: a sufficiently large number of examples to uncover subtle patterns; a sufficiently diverse set of examples to showcase the many different types of appearances that objects might take; a sufficiently well-annotated set of examples to furnish machine learning with reliable ground truth; and so on. Thus, evidence-based decision-making is only as reliable as the evidence on which it is based, and high quality examples are critically important to machine learning.

The fact that machine learning is “evidence-based” by no means ensures that it will lead to accurate, reliable, or fair decisions. This is especially true when using machine learning to model human behavior and characteristics. Our historical examples of the relevant outcomes will almost always reflect historical prejudices against certain social groups, prevailing cultural stereotypes, and existing demographic inequalities. And finding patterns in these data will often mean replicating these very same dynamics.

Something else is lost in moving to automated, predictive decision making. Human decision makers rarely try to maximize predictive accuracy at all costs; frequently, they might consider factors such as whether the attributes used for prediction are morally relevant. For example, although younger defendants are statistically more likely to re-offend, judges are loath to take this into account in deciding sentence lengths, viewing younger defendants as less morally culpable. This is one reason to be cautious of comparisons seemingly showing the superiority of statistical decision making.2 Humans are also unlikely to make decisions that are obviously absurd, but this could happen with automated decision making, perhaps due to erroneous data. These and many other differences between human and automated decision making are reasons why decision making systems that rely on machine learning might be unjust.

We write this book as machine learning begins to play a role in especially consequential decision-making. In the criminal justice system, as alluded to above, defendants are assigned statistical scores that are intended to predict the risk of committing future crimes, and these scores inform decisions about bail, sentencing, and parole. In the commercial sphere, firms use machine learning to analyze and filter resumes of job applicants. And statistical methods are of course the bread and butter of lending, credit, and insurance underwriting.
We now begin to survey the risks in these and many other applications of machine learning, and provide a critical review of an emerging set of proposed solutions. We will see how even well-intentioned applications of machine learning might give rise to objectionable results.

Demographic disparities

Amazon uses a data-driven system to determine the neighborhoods in which to offer free same-day delivery. A 2016 investigation found stark disparities in the demographic makeup of these neighborhoods: in many U.S. cities, White residents were more than twice as likely as Black residents to live in one of the qualifying neighborhoods.3

Now, we don’t know the details of how Amazon’s system works, and in particular we don’t know to what extent it uses machine learning. The same is true of many other systems reported on in the press. Nonetheless, we’ll use these as motivating examples when a machine learning system for the task at hand would plausibly show the same behavior.

In Chapter 3 we’ll see how to make our intuition about demographic disparities mathematically precise, and we’ll see that there are many possible ways of measuring these inequalities. The pervasiveness of such disparities in machine learning applications is a key concern of this book.

When we observe disparities, it doesn’t imply that the designer of the system intended for such inequalities to arise. Looking beyond intent, it’s important to understand when observed disparities can be considered to be discrimination. In turn, two key questions to ask are whether the disparities are justified and whether they are harmful. These questions rarely have simple answers, but the extensive literature on discrimination in philosophy and sociology can help us reason about them.

To understand why the racial disparities in Amazon’s system might be harmful, we must keep in mind the history of racial prejudice in the United States, its relationship to geographic segregation and disparities, and the perpetuation of those inequalities over time. Amazon argued that its system was justified because it was designed based on efficiency and cost considerations and that race wasn’t an explicit factor. Nonetheless, it has the effect of providing different opportunities to consumers at racially disparate rates. The concern is that this might contribute to the perpetuation of long-lasting cycles of inequality. If, instead, the system had been found to be partial to ZIP codes ending in an odd digit, it would not have triggered a similar outcry.

The term bias is often used to refer to demographic disparities in algorithmic systems that are objectionable for societal reasons. We’ll minimize the use of this sense of the word bias in this book, since different disciplines and communities understand the term differently, and this can lead to confusion.

There’s a more traditional use of the term bias in statistics and machine learning. Suppose that Amazon’s estimates of delivery dates/times were consistently too early by a few hours. This would be a case of statistical bias. A statistical estimator is said to be biased if its expected or average value differs from the true value that it aims to estimate. Statistical bias is a fundamental concept in statistics, and there is a rich set of established techniques for analyzing and avoiding it. There are many other measures that quantify desirable statistical properties of a predictor or an estimator, such as precision, recall, and calibration.
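As a minimal illustration of these standard notions (a sketch with made-up numbers, not drawn from the text), the snippet below estimates the statistical bias of a delivery-time predictor that runs consistently early, and computes precision and recall for a toy classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

# Statistical bias: the estimator's average error relative to the true value.
true_delivery_hours = rng.uniform(24, 72, size=10_000)
estimated_hours = true_delivery_hours - 3 + rng.normal(0, 1, size=10_000)  # runs ~3 hours early
bias = np.mean(estimated_hours - true_delivery_hours)
print(f"estimated bias: {bias:.2f} hours")  # close to -3

# Precision and recall for a toy binary classifier (e.g., spam vs. not spam).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
print("precision:", tp / (tp + fp))  # 3/4
print("recall:   ", tp / (tp + fn))  # 3/4
```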
These are similarly well understood; none of them require any knowledge of social groups and are relatively straightforward to measure. The attention to demographic criteria in statistics and machine learning is a relatively new direction. This reflects a change in how we conceptualize machine learning systems and the responsibilities of those building them. Is our goal to faithfully reflect the data? Or do we have an obligation to question the data, and to design our systems to conform to some notion of equitable behavior, regardless of whether or not that’s supported by the data currently available to us? These perspectives are often in tension, and the difference between them will become clearer when we delve into the stages of machine learning.

The machine learning loop

Let’s study the pipeline of machine learning and understand how demographic disparities propagate through it. This approach lets us glimpse into the black box of machine learning and will prepare us for the more detailed analyses in later chapters. Studying the stages of machine learning is crucial if we want to intervene to minimize disparities. The figure below shows the stages of a typical system that produces outputs using machine learning. Like any such diagram, it is a simplification, but it is useful for our purposes.

The first stage is measurement, which is the process by which the state of the world is reduced to a set of rows, columns, and values in a dataset. It’s a messy process, because the real world is messy. The term measurement is misleading, evoking an image of a dispassionate scientist recording what she observes, whereas we’ll see that it requires subjective human decisions.

Figure 1: The machine learning loop

The ‘learning’ in machine learning refers to the next stage, which is to turn that data into a model. A model summarizes the patterns in the training data; it makes generalizations. A model could be trained using supervised learning via an algorithm such as Support Vector Machines, or using unsupervised learning via an algorithm such as k-means clustering. It could take many forms: a hyperplane or a set of regions in n-dimensional space, or a set of distributions. It is typically represented as a set of weights or parameters.

The next stage is the action we take based on the model’s predictions, which are applications of the model to new, unseen inputs. By the way, ‘prediction’ is another misleading term—while it does sometimes involve trying to predict the future (“is this patient at high risk for cancer?”), sometimes it doesn’t (“is this social media account a bot?”). Prediction can take the form of classification (determining whether a piece of email is spam), regression (assigning risk scores to defendants), or information retrieval (finding documents that best match a search query).

The actions in these three applications might be: depositing the email in the user’s inbox or spam folder, deciding whether to set bail for the defendant’s pre-trial release, and displaying the retrieved search results to the user. They may differ greatly in their significance to the individual, but they have in common that the collective responses of individuals to these decisions alter the state of the world—that is, the underlying patterns that the system aims to model.

Some machine learning systems record feedback from users (how users react to actions) and use it to refine the model. For example, search engines track what users click on as an implicit signal of relevance or quality.
Feedback can also occur unintentionally, or even adversarially; these are more problematic, as we’ll explore later in this chapter.

The state of society

In this book, we’re concerned with applications of machine learning that involve data about people. In these applications, the available training data will likely encode the demographic disparities that exist in our society.

Figure 2: A sample of occupations in the United States in decreasing order of the percentage of women. The area of the bubble represents the number of workers.

For example, the figure shows the gender breakdown of a sample of occupations in the United States, based on data released by the Bureau of Labor Statistics for the year 2017. Unsurprisingly, many occupations have stark gender imbalances. If we’re building a machine learning system that screens job candidates, we should be keenly aware that this is the baseline we’re starting from. It doesn’t necessarily mean that the outputs of our system will be inaccurate or discriminatory, but throughout this chapter we’ll see how it complicates things.

Why do these disparities exist? There are many potentially contributing factors, including a history of explicit discrimination, implicit attitudes and stereotypes about gender, and differences in the distribution of certain characteristics by gender. We’ll see that even in the absence of explicit discrimination, stereotypes can be self-fulfilling and persist for a long time in society. As we integrate machine learning into decision-making, we should be careful to ensure that ML doesn’t become a part of this feedback loop.

What about applications that aren’t about people? Consider “Street Bump,” a project by the city of Boston to crowdsource data on potholes. The smartphone app automatically detects potholes using data from the smartphone’s sensors and sends the data to the city. Infrastructure seems like a comfortably boring application of data-driven decision-making, far removed from the ethical quandaries we’ve been discussing. And yet! Kate Crawford points out that the data reflect the patterns of smartphone ownership, which are higher in wealthier parts of the city compared to lower-income areas and areas with large elderly populations.4

The lesson here is that it’s rare for machine learning applications to not be about people. In the case of Street Bump, the data is collected by people, and hence reflects demographic disparities; besides, the reason we’re interested in improving infrastructure in the first place is its effect on people’s lives.

To drive home the point that most machine learning applications involve people, we analyzed Kaggle, a well-known platform for data science competitions. We focused on the top 30 competitions sorted by prize amount. In 14 of these competitions, we observed that the task is to make decisions about individuals. In most of these cases, there exist societal stereotypes or disparities that may be perpetuated by the application of machine learning. For example, the Automated Essay Scoring5 task seeks algorithms that attempt to match the scores of human graders of student essays. Students’ linguistic choices are signifiers of social group membership, and human graders are known to sometimes have prejudices based on such factors.6, 7 Thus, because human graders must provide the original labels, automated grading systems risk enshrining any such discriminatory patterns that are captured in the training data.
In a further 5 of the 30 competitions, the task did not call for making decisions about people, but decisions made using the model would nevertheless directly impact people. For example, one competition sponsored by real-estate company Zillow calls for improving the company’s “Zestimate” algorithm for predicting home sale prices. Any system that predicts a home’s future sale price (and publicizes these predictions) is likely to create a self-fulfilling feedback loop in which homes predicted to have lower sale prices deter future buyers, suppressing demand and lowering the final sale price.

In 9 of the 30 competitions, we did not find an obvious, direct impact on people, such as a competition on predicting ocean health (of course, even such competitions have indirect impacts on people, due to actions that we might take on the basis of the knowledge gained). In two cases, we didn’t have enough information to make a determination.

To summarize, human society is full of demographic disparities, and training data will likely reflect these. We’ll now turn to the process by which training data is constructed, and see that things are even trickier.

The trouble with measurement

The term measurement suggests a straightforward process, calling to mind a camera objectively recording a scene. In fact, measurement is fraught with subjective decisions and technical difficulties.

Consider a seemingly straightforward task: measuring the demographic diversity of college campuses. A 2017 New York Times article aimed to do just this, and was titled “Even With Affirmative Action, Blacks and Hispanics Are More Underrepresented at Top Colleges Than 35 Years Ago”.8 The authors argue that the gap between enrolled Black and Hispanic freshmen and the Black and Hispanic college-age population has grown over the past 35 years. To support their claim, they present demographic information for more than 100 American universities and colleges from the year 1980 to 2015, and show how the percentages of Black, Hispanic, Asian, White, and multiracial students have changed over the years.

Interestingly, the multiracial category was only recently introduced in 2008, but the comparisons in the article ignore the introduction of this new category. How many students who might have checked the “White” or “Black” box checked the “multiracial” box instead? How might this have affected the percentages of “White” and “Black” students at these universities? Furthermore, individuals’ and society’s conception of race changes over time. Would a person with Black and Latino parents be more inclined to self-identify as Black in 2015 than in the 1980s?

The point is that even a seemingly straightforward question about trends in demographic diversity is impossible to answer without making some assumptions, and illustrates the difficulties of measurement in a world that resists falling neatly into a set of checkboxes. Race is not a stable category; how we measure race often changes how we conceive of it, and changing conceptions of race may force us to alter what we measure. To be clear, this situation is typical: measuring almost any attribute about people is similarly subjective and challenging. If anything, things are more chaotic when machine learning researchers have to create categories, as is often the case.

One area where machine learning practitioners often have to define new categories is in defining the target variable.9 This is the outcome that we’re trying to predict – will the defendant recidivate if released on bail?
Will the candidate be a good employee if hired? And so on. Biases in the definition of the target variable are especially critical, because they are guaranteed to bias the predictions relative to the actual construct we intended to predict, as is the case when we use arrests as a measure of crime, or sales as a measure of job performance, or GPA as a measure of academic success. This is not necessarily so with other attributes. But the target variable is arguably the hardest from a measurement standpoint, because it is often a construct that is made up for the purposes of the problem at hand rather than one that is widely understood and measured. For example, “creditworthiness” is a construct that was created in the context of the problem of how to successfully extend credit to consumers;9 it is not an intrinsic property that people either possess or lack.

If our target variable is the idea of a “good employee”, we might use performance review scores to quantify it. This means that our data inherits any biases present in managers’ evaluations of their reports. Another example: the use of computer vision to automatically rank people’s physical attractiveness.10, 11 The training data consists of human evaluation of attractiveness, and, unsurprisingly, all these classifiers showed a preference for lighter skin.

In some cases we might be able to get closer to a more objective definition for a target variable, at least in principle. For example, in criminal risk assessment, the training data is not judges’ decisions about bail, but rather records of who actually went on to commit a crime. But there’s at least one big caveat—we can’t really measure who committed a crime, so we use arrests as a proxy. This means that the training data contain distortions not due to the prejudices of judges but due to discriminatory policing. On the other hand, if our target variable is whether the defendant appears or fails to appear in court for trial, we would be able to measure it directly with perfect accuracy. That said, we may still have concerns about a system that treats defendants differently based on predicted probability of appearance, given that some reasons for failing to appear are less objectionable than others (trying to hold down a job that would not allow for time off versus trying to avoid prosecution).12

In hiring, instead of relying on performance reviews for (say) a sales job, we might rely on the number of sales closed. But is that an objective measurement or is it subject to the prejudices of the potential customers (who might respond more positively to certain salespeople than others) and workplace conditions (which might be a hostile environment for some, but not others)?

In some applications, researchers repurpose an existing scheme of classification to define the target variable rather than creating one from scratch. For example, an object recognition system can be created by training a classifier on ImageNet, a database of images organized in a hierarchy of concepts.13 ImageNet’s hierarchy comes from Wordnet, a database of words, categories, and the relationships among them.14 Wordnet’s authors in turn imported the word lists from a number of older sources, such as thesauri.
As a result, WordNet (and ImageNet) categories contain numerous outmoded words and associations, such as occupations that no longer exist and stereotyped gender associations.15 We think of technology changing rapidly and society being slow to adapt, but at least in this instance, the categorization scheme at the heart of much of today’s machine learning technology has been frozen in time while social norms have changed.

Our favorite example of measurement bias has to do with cameras, which we referenced at the beginning of the section as the exemplar of dispassionate observation and recording. But are they? The visual world has an essentially infinite bandwidth compared to what can be captured by cameras, whether film or digital, which means that photography technology involves a series of choices about what is relevant and what isn’t, and transformations of the captured data based on those choices. Both film and digital cameras have historically been more adept at photographing lighter-skinned individuals.16 One reason is the default settings such as color balance which were optimized for lighter skin tones. Another, deeper reason is the limited “dynamic range” of cameras, which makes it hard to capture brighter and darker tones in the same image. This started changing in the 1970s, in part due to complaints from furniture companies and chocolate companies about the difficulty of photographically capturing the details of furniture and chocolate respectively! Another impetus came from the increasing diversity of television subjects at this time.

When we go from individual images to datasets of images, we introduce another layer of potential biases. Consider the image datasets that are used to train today’s computer vision systems for tasks such as object recognition. If these datasets were representative samples of an underlying visual world, we might expect that a computer vision system trained on one such dataset would do well on another dataset. But in reality, we observe a big drop in accuracy when we train and test on different datasets.17 This shows that these datasets are biased relative to each other in a statistical sense, and is a good starting point for investigating whether these biases include cultural stereotypes.

It’s not all bad news: machine learning can in fact help mitigate measurement biases. Returning to the issue of dynamic range in cameras, computational techniques, including machine learning, are making it possible to improve the representation of tones in images.18, 19, 20 Another example comes from medicine: diagnoses and treatments are sometimes personalized by race. But it turns out that race is used as a crude proxy for ancestry and genetics, and sometimes environmental and behavioral factors.21, 22 If we can measure the factors that are medically relevant and incorporate them—instead of race—into statistical models of disease and drug response, we can increase the accuracy of diagnoses and treatments while mitigating racial disparities.

To summarize, measurement involves defining variables of interest, the process for interacting with the real world and turning observations into numbers, and then actually collecting the data. Often machine learning practitioners don’t think about these steps, because someone else has already done those things. And yet it is crucial to understand the provenance of the data. Even if someone else has collected the data, it’s almost always too messy for algorithms to handle, hence the dreaded “data cleaning” step.
But the messiness of the real world isn’t just an annoyance to be dealt with by cleaning. It is a manifestation of a diverse world in which people don’t fit neatly into categories. Being inattentive to these nuances can particularly hurt marginalized populations.

From data to models

We’ve seen that training data reflects the disparities, distortions, and biases from the real world and the measurement process. This leads to an obvious question: when we learn a model from such data, are these disparities preserved, mitigated, or exacerbated?

Predictive models trained with supervised learning methods are often good at calibration: ensuring that the model’s prediction subsumes all features in the data for the purpose of predicting the outcome. But calibration also means that by default, we should expect our models to faithfully reflect disparities found in the input data.

Here’s another way to think about it. Some patterns in the training data (smoking is associated with cancer) represent knowledge that we wish to mine using machine learning, while other patterns (girls like pink and boys like blue) represent stereotypes that we might wish to avoid learning. But learning algorithms have no general way to distinguish between these two types of patterns, because they are the result of social norms and moral judgments. Absent specific intervention, machine learning will extract stereotypes, including incorrect and harmful ones, in the same way that it extracts knowledge.

A telling example of this comes from machine translation. The screenshot on the right shows the result of translating sentences from English to Turkish and back.23 The same stereotyped translations result for many pairs of languages and other occupation words in all translation engines we’ve tested.

Figure 3: Translating from English to Turkish, then back to English injects gender stereotypes.

It’s easy to see why. Turkish has gender neutral pronouns, and when translating such a pronoun to English, the system picks the sentence that best matches the statistics of the training set (which is typically a large, minimally curated corpus of historical text and text found on the web). When we build a statistical model of language from such text, we should expect the gender associations of occupation words to roughly mirror real-world labor statistics. In addition, because of the male-as-norm bias24 (the use of male pronouns when the gender is unknown) we should expect translations to favor male pronouns. It turns out that when we repeat the experiment with dozens of occupation words, these two factors—labor statistics and the male-as-norm bias—together almost perfectly predict which pronoun will be returned.23

Here’s a tempting response to the observation that models reflect data biases. Suppose we’re building a model for scoring resumes for a programming job. What if we simply withhold gender from the data? Is that a sufficient response to concerns about gender discrimination? Unfortunately, it’s not that simple, because of the problem of proxies9 or redundant encodings,25 as we’ll discuss in Chapter 3. There are any number of other attributes in the data that might correlate with gender. For example, in our society, the age at which someone starts programming is correlated with gender. This illustrates why we can’t just get rid of proxies: they may be genuinely relevant to the decision at hand. How long someone has been programming is a factor that gives us valuable information about their suitability for a programming job, but it also reflects the reality of gender stereotyping.
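One simple diagnostic for redundant encodings is to check how well the withheld attribute can be predicted from the attributes that remain. The sketch below (synthetic data and invented feature names, purely illustrative) trains a classifier to predict gender from résumé-style features; if it does much better than chance, the “removed” attribute is still encoded in the data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Synthetic data: gender is withheld from the model, but "years of programming
# experience" is correlated with it (the numbers here are invented).
gender = rng.integers(0, 2, size=n)                        # 0 or 1
years_programming = rng.normal(6 + 3 * gender, 2, size=n)  # correlated proxy
gpa = rng.normal(3.2, 0.4, size=n)                         # roughly unrelated

X = np.column_stack([years_programming, gpa])              # gender itself is excluded
clf = LogisticRegression().fit(X, gender)

# If gender can be recovered from the remaining features, it was never really removed.
print("accuracy of predicting the withheld attribute:", clf.score(X, gender))
```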
Another common reason why machine learning might perform worse for some groups than others is sample size disparity. If we construct our training set by sampling uniformly from the training data, then by definition we’ll have fewer data points about minorities. Of course, machine learning works better when there’s more data, so it will work less well for members of minority groups, assuming that members of the majority and minority groups are systematically different in terms of the prediction task.25 Worse, in many settings minority groups are underrepresented relative to population statistics. For example, minority groups are underrepresented in the tech industry. Different groups might also adopt technology at different rates, which might skew datasets assembled from social media. If training sets are drawn from these unrepresentative contexts, there will be even fewer training points from minority individuals.

When we develop machine-learning models, we typically only test their overall accuracy; so a “5% error” statistic might hide the fact that a model performs terribly for a minority group. Reporting accuracy rates by group will help alert us to problems like the above example. In Chapter 3, we’ll look at metrics that quantify the error-rate disparity between groups.

There’s one application of machine learning where we find especially high error rates for minority groups: anomaly detection. This is the idea of detecting behavior that deviates from the norm as evidence of abuse against a system. A good example is the Nymwars controversy, where Google, Facebook, and other tech companies aimed to block users who used uncommon (hence, presumably fake) names.

Further, suppose that in some cultures, most people receive names from a small set of names, whereas in other cultures, names might be more diverse, and it might be common for names to be unique. For users in the latter culture, a popular name would be more likely to be fake. In other words, the same feature that constitutes evidence towards a prediction in one group might constitute evidence against the prediction for another group.25

If we’re not careful, learning algorithms will generalize based on the majority culture, leading to a high error rate for minority groups. Attempting to avoid this by making the model more complex runs into a different problem: overfitting to the training data, that is, picking up patterns that arise due to random noise rather than true differences. One way to avoid this is to explicitly model the differences between groups, although there are both technical and ethical challenges associated with this.

The pitfalls of action

Any real machine-learning system seeks to make some change in the world. To understand its effects, then, we have to consider it in the context of the larger socio-technical system in which it is embedded.

In Chapter 3, we’ll see that if a model is calibrated—it faithfully captures the patterns in the underlying data—predictions made using that model will inevitably have disparate error rates for different groups, if those groups have different base rates, that is, rates of positive or negative outcomes. In other words, understanding the properties of a prediction requires understanding not just the model, but also the population differences between the groups on which the predictions are applied.
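A small worked example (with numbers invented for illustration) shows how this can happen. Suppose a risk score is perfectly calibrated within each group, and anyone with a score of at least 0.5 is classified as positive; if the groups have different base rates, the false positive and false negative rates can still differ sharply:

```python
# Two groups, each receiving a calibrated score: P(outcome = 1 | score) equals the score.
# Group A: half have score 0.8, half have score 0.2 (base rate 0.50).
# Group B: 20% have score 0.8, 80% have score 0.2 (base rate 0.32).
# Everyone with score >= 0.5 is classified as positive.

def error_rates(share_high, p_high=0.8, p_low=0.2):
    positives = share_high * p_high + (1 - share_high) * p_low   # base rate
    false_neg = (1 - share_high) * p_low                         # true positives scored low
    false_pos = share_high * (1 - p_high)                        # true negatives scored high
    return positives, false_pos / (1 - positives), false_neg / positives

for name, share_high in [("Group A", 0.5), ("Group B", 0.2)]:
    base, fpr, fnr = error_rates(share_high)
    print(f"{name}: base rate {base:.2f}, "
          f"false positive rate {fpr:.2f}, false negative rate {fnr:.2f}")
# Group A: base rate 0.50, false positive rate 0.20, false negative rate 0.20
# Group B: base rate 0.32, false positive rate 0.06, false negative rate 0.50
```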
Further, population characteristics can shift over time; this is a well-known machine learning phenomenon known as drift. If sub-populations change differently over time, but the model isn’t retrained, that can introduce disparities. An additional wrinkle: whether or not disparities are objectionable may differ between cultures, and may change over time as social norms evolve.

When people are subject to automated decisions, their perception of those decisions depends not only on the outcomes but also the process of decision-making. An ethical decision-making process might require, among other things, the ability to explain a prediction or decision, which might not be feasible with black-box models.

A major limitation of machine learning is that it only reveals correlations, but we often use its predictions as if they reveal causation. This is a persistent source of problems. For example, an early machine learning system in healthcare famously learned the seemingly nonsensical rule that patients with asthma had lower risk of developing pneumonia. This was a true pattern in the data, but the likely reason was that asthmatic patients were more likely to receive in-patient care.26 So it’s not valid to use the prediction to decide whether or not to admit a patient. We’ll discuss causality in Chapter 5.

Another way to view this example is that the prediction affects the outcome (because of the actions taken on the basis of the prediction), and thus invalidates itself. The same principle is also seen in the use of machine learning for predicting traffic congestion: if sufficiently many people choose their routes based on the prediction, then the route predicted to be clear will in fact be congested. The effect can also work in the opposite direction: the prediction might reinforce the outcome, resulting in feedback loops. To better understand how, let’s talk about the final stage in our loop: feedback.

Feedback and feedback loops

Many systems receive feedback when they make predictions. When a search engine serves results, it typically records the links that the user clicks on and how long the user spends on those pages, and treats these as implicit signals about which results were found to be most relevant. When a video sharing website recommends a video, it uses the thumbs up/down feedback as an explicit signal. Such feedback is used to refine the model.

But feedback is tricky to interpret correctly. If a user clicked on the first link on a page of search results, is that simply because it was first, or because it was in fact the most relevant? This is again a case of the action (the ordering of search results) affecting the outcome (the link(s) the user clicks on). This is an active area of research; there are techniques that aim to learn accurately from this kind of biased feedback.27

Bias in feedback might also reflect cultural prejudices, which is of course much harder to characterize than the effects of the ordering of search results. For example, the clicks on the targeted ads that appear alongside search results might reflect gender and racial stereotypes.
There’s a well-known study by Latanya Sweeney that hints at this: Google searches for Black-sounding names such as “Latanya Farrell” were much more likely to result in ads for arrest records (“Latanya Farrell, Arrested?”) than searches for White-sounding names (“Kristen Haring”).28 One potential explanation is that users are more likely to click on ads that conform to stereotypes, and the advertising system is optimized for maximizing clicks.

In other words, even feedback that’s designed into systems can lead to unexpected or undesirable biases. But on top of that, there are many unintended ways in which feedback might arise, and these are more pernicious and harder to control. Let’s look at three.

Self-fulfilling predictions. Suppose a predictive policing system determines certain areas of a city to be at high risk for crime. More police officers might be deployed to such areas. Alternatively, officers in areas predicted to be high risk might be subtly lowering their threshold for stopping, searching, or arresting people—perhaps even unconsciously. Either way, the prediction will appear to be validated, even if it had been made purely based on data biases.

Here’s another example of how acting on a prediction can change the outcome. In the United States, some criminal defendants are released prior to trial, whereas for others, a bail amount is set as a precondition of release. Many defendants are unable to post bail. Does the release or detention affect the outcome of the case? Perhaps defendants who are detained face greater pressure to plead guilty. At any rate, how could one possibly test the causal impact of detention without doing an experiment? Intriguingly, we can take advantage of a pseudo-experiment, namely that defendants are assigned bail judges quasi-randomly, and some judges are stricter than others. Thus, pre-trial detention is partially random, in a quantifiable way. Studies using this technique have confirmed that detention indeed causes an increase in the likelihood of a conviction.29 If bail were set based on risk predictions, whether human or algorithmic, and we evaluated its efficacy by examining case outcomes, we would see a self-fulfilling effect.

Predictions that affect the training set. Continuing this example, predictive policing activity will lead to arrests, records of which might be added to the algorithm’s training set. These areas might then continue to appear to be at high risk of crime, and perhaps also other areas with a similar demographic composition, depending on the feature set used for predictions. The disparities might even compound over time.

A 2016 paper by Lum and Isaac analyzed a predictive policing algorithm by PredPol. This is one of the few predictive policing algorithms to be published in a peer-reviewed journal, for which the company deserves praise. By applying the algorithm to data derived from Oakland police records, the authors found that Black people would be targeted for predictive policing of drug crimes at roughly twice the rate of White people, even though the two groups have roughly equal rates of drug use.30 Their simulation showed that this initial bias would be amplified by a feedback loop, with policing increasingly concentrated on targeted areas. This is despite the fact that the PredPol algorithm does not explicitly take demographics into account.
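To get a feel for how such a loop can concentrate attention on an area regardless of the underlying crime rate, here is a deliberately crude toy simulation (not the PredPol model; all numbers are invented): two neighborhoods have identical true incident rates, but one starts out over-represented in the records, and each day patrols are sent wherever the records point.

```python
import random

random.seed(0)

true_rate = {"A": 0.3, "B": 0.3}   # identical underlying incident rates
recorded = {"A": 30, "B": 20}      # neighborhood A starts over-represented in the data

for day in range(365):
    # Send today's patrols to the neighborhood with more recorded incidents.
    target = max(recorded, key=recorded.get)
    # Incidents are only recorded where police are present to observe them.
    if random.random() < true_rate[target]:
        recorded[target] += 1

# A's recorded count keeps growing; B is never patrolled again, so its count stays frozen.
print(recorded)
```

Real systems allocate attention more smoothly than this winner-take-all rule, but the reinforcing dynamic is similar: records generated by past deployment decisions shape future deployment.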
A follow-up paper built on this idea and showed mathematically how feedback loops occur when data discovered on the basis of predictions are used to update the model.31 The paper also shows how to tweak the model to avoid feedback loops in a simulated setting: by quantifying how surprising an observation of crime is given the predictions, and only updating the model in response to surprising events.

Predictions that affect the phenomenon and society at large. Prejudicial policing on a large scale, algorithmic or not, will affect society over time, contributing to the cycle of poverty and crime. This is a well-trodden thesis, and we’ll briefly review the sociological literature on durable inequality and the persistence of stereotypes in Chapter 8.

Let us remind ourselves that we deploy machine learning so that we can act on its predictions. It is hard to even conceptually eliminate the effects of predictions on outcomes, future training sets, the phenomena themselves, or society at large. The more central machine learning becomes in our lives, the stronger this effect. Returning to the example of a search engine, in the short term it might be possible to extract an unbiased signal from user clicks, but in the long run, results that are returned more often will be linked to and thus rank more highly. As a side effect of fulfilling its purpose of retrieving relevant information, a search engine will necessarily change the very thing that it aims to measure, sort, and rank. Similarly, most machine learning systems will affect the phenomena that they predict. This is why we’ve depicted the machine learning process as a loop.

Throughout this book we’ll learn methods for mitigating societal biases in machine learning, but we should keep in mind that there are fundamental limits to what we can achieve, especially when we consider machine learning as a socio-technical system instead of a mathematical abstraction. The textbook model of training and test data being independent and identically distributed is a simplification, and might be unachievable in practice.

Getting concrete with a toy example

Now let’s look at a concrete setting, albeit a toy problem, to illustrate many of the ideas discussed so far, and some new ones. Let’s say you’re on a hiring committee, making decisions based on just two attributes of each applicant: their college GPA and their interview score (we did say it’s a toy problem!). We formulate this as a machine-learning problem: the task is to use these two variables to predict some measure of the “quality” of an applicant. For example, it could be based on the average performance review score after two years at the company. We’ll assume we have data from past candidates that allows us to train a model to predict performance scores based on GPA and interview score.

Figure 4: Toy example: a hiring classifier that predicts job performance (not shown) based on GPA and interview score, and then applies a cutoff.

Obviously, this is a reductive formulation—we’re assuming that an applicant’s worth can be reduced to a single number, and that we know how to measure that number. This is a valid criticism, and applies to most applications of data-driven decision-making today. But it has one big advantage: once we do formulate the decision as a prediction problem, statistical methods tend to do better than humans, even domain experts with years of training, in making decisions based on noisy predictors.

Given this formulation, the simplest thing we can do is to use linear regression to predict the average job performance rating from the two observed variables, and then use a cutoff based on the number of candidates we want to hire. The figure above shows what this might look like. In reality, the variables under consideration need not satisfy a linear relationship, thus suggesting the use of a non-linear model, which we avoid for simplicity.
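A minimal sketch of this setup (synthetic data and invented coefficients, not the book’s actual example) might look like the following:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500

# Synthetic historical data: GPA, interview score, and an observed performance rating.
gpa = rng.uniform(2.0, 4.0, size=n)
interview = rng.uniform(0.0, 10.0, size=n)
performance = 0.8 * gpa + 0.3 * interview + rng.normal(0, 0.5, size=n)

model = LinearRegression().fit(np.column_stack([gpa, interview]), performance)

# Score this year's applicants and hire the top k by predicted performance.
applicants = np.column_stack([rng.uniform(2.0, 4.0, 200), rng.uniform(0.0, 10.0, 200)])
predicted = model.predict(applicants)
k = 20
cutoff = np.sort(predicted)[-k]        # score threshold implied by hiring k people
hired = predicted >= cutoff
print("cutoff:", round(float(cutoff), 2), "number hired:", int(hired.sum()))
```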
As you can see in the figure, our candidates fall into two demographic groups, represented by triangles and squares. This binary categorization is a simplification for the purposes of our thought experiment. But when building real systems, enforcing rigid categories of people can be ethically questionable.

Note that the classifier didn’t take into account which group a candidate belonged to. Does this mean that the classifier is fair? We might hope that it is, based on the fairness-as-blindness idea, symbolized by the icon of Lady Justice wearing a blindfold. In this view, an impartial model—one that doesn’t use the group membership in the regression—is fair; a model that gives different scores to otherwise-identical members of different groups is discriminatory.

We’ll defer a richer understanding of what fairness means to later chapters, so let’s ask a simpler question: are candidates from the two groups equally likely to be positively classified? The answer is no: the triangles are more likely to be selected than the squares. That’s because data is a social mirror; the “ground truth” labels that we’re predicting—job performance ratings—are systematically lower for the squares than the triangles.

There are many possible reasons for this disparity. First, the managers who score the employees’ performance might discriminate against one group. Or the overall workplace might be less welcoming to one group, preventing them from reaching their potential and leading to lower performance. Alternately, the disparity might originate before the candidates were hired. For example, it might arise from disparities in educational institutions attended by the two groups. Or there might be intrinsic differences between them. Of course, it might be a combination of these factors. We can’t tell from our data how much of the disparity is attributable to these different factors. In general, such a determination is methodologically hard, and requires causal reasoning.32

For now, let’s assume that we have evidence that the level of demographic disparity produced by our selection procedure is unjustified, and we’re interested in intervening to decrease it. How could we do it? We observe that GPA is correlated with the demographic attribute—it’s a proxy. Perhaps we could simply omit that variable as a predictor? Unfortunately, we’d also hobble the accuracy of our model. In real datasets, most attributes tend to be proxies for demographic variables, and dropping them may not be a reasonable option.

Another crude approach is to pick different cutoffs so that candidates from both groups have the same probability of being hired. Or we could mitigate the demographic disparity instead of eliminating it, by decreasing the difference in the cutoffs. Given the available data, there is no mathematically principled way to know which cutoffs to pick.
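Whatever cutoffs we choose, we can at least quantify the resulting disparity by comparing selection rates across groups. A small sketch (synthetic scores and an invented group attribute, continuing the toy example):

```python
import numpy as np

rng = np.random.default_rng(7)

# Predicted scores and a group label for 200 hypothetical applicants.
group = rng.integers(0, 2, size=200)                       # 0 = squares, 1 = triangles
predicted = rng.normal(3.0 + 0.4 * group, 0.6, size=200)   # triangles score higher on average

cutoff = np.quantile(predicted, 0.9)                       # hire roughly the top 10%
hired = predicted >= cutoff

for g, name in [(1, "triangles"), (0, "squares")]:
    print(f"selection rate for {name}: {hired[group == g].mean():.2%}")

# The ratio (or difference) of these selection rates is one simple measure of
# demographic disparity in the selection procedure.
ratio = hired[group == 0].mean() / hired[group == 1].mean()
print(f"ratio of selection rates (squares / triangles): {ratio:.2f}")
```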
In some situations there is a legal baseline: for example, guidelines from the U.S. Equal Employment Opportunity Commission state that if the probability of selection for two groups differs by more than 20%, it might constitute a sufficient disparate impact to initiate a lawsuit. But a disparate impact alone is not illegal; the disparity needs to be unjustified or avoidable for courts to find liability. Even these quantitative guidelines do not provide easy answers or bright lines.

At any rate, the pick-different-thresholds approach to mitigating disparities seems unsatisfying, because it is crude and uses the group attribute as the sole criterion for redistribution. It does not account for the underlying reasons why two candidates with the same observable attributes (except for group membership) may be deserving of different treatment.

But there are other possible interventions, and we’ll discuss one. To motivate it, let’s take a step back and ask why the company wants to decrease the demographic disparity in hiring. One answer is rooted in justice to individuals and the specific social groups to which they belong. But a different answer comes from the firm’s selfish interests: diverse teams work better.33 From this perspective, increasing the diversity of the cohort that is hired would benefit the firm and everyone in the cohort. As an analogy, picking 11 goalkeepers, even if individually excellent, would make for a poor soccer team.

How do we operationalize diversity in a selection task? If we had a distance function between pairs of candidates, we could measure the average distance between selected candidates. As a strawman, let’s say we use the Euclidean distance based on the GPA and interview score. If we incorporated such a diversity criterion into the objective function, it would result in a model where the GPA is weighted less. This technique doesn’t explicitly consider the group membership. Rather, as a side-effect of insisting on diversity of the other observable attributes, it also improves demographic diversity. However, a careless application of such an intervention can easily go wrong: for example, the model might give weight to attributes that are completely irrelevant to the task.

More generally, there are many possible algorithmic interventions beyond picking different thresholds for different groups. In particular, the idea of a similarity function between pairs of individuals is a powerful one, and we’ll see other interventions that make use of it. But coming up with a suitable similarity function in practice isn’t easy: it may not be clear which attributes are relevant, how to weight them, and how to deal with correlations between attributes.

Justice beyond fair decision making

The core concern of this book is group disparities in decision making. But ethical obligations don’t end with addressing those disparities. Fairly rendered decisions under unfair circumstances may do little to improve people’s lives. In many cases, we cannot achieve any reasonable notion of fairness through changes to decision-making alone; we need to change the conditions under which these decisions are made. In other cases, the very purpose of the system might be oppressive, and we should ask whether it should be deployed at all. Further, decision making systems aren’t the only places where machine learning is used that can harm people: for example, online search and recommendation algorithms are also of concern, even though they don’t make decisions about people. Let’s briefly discuss these broader questions.
Interventions that target underlying inequities

Let’s return to the hiring example above. When using machine learning to make predictions about how someone might fare in a specific workplace or occupation, we tend to treat the environment that people will confront in these roles as a constant and ask how people’s performance will vary according to their observable characteristics. In other words, we treat the current state of the world as a given, leaving us to select the person who will do best under these circumstances. This approach risks overlooking more fundamental changes that we could make to the workplace (culture, family friendly policies, on-the-job training) that might make it a more welcoming and productive environment for people that have not flourished under previous conditions.34

The tendency with work on fairness in machine learning is to ask whether an employer is using a fair selection process, even though we might have the opportunity to intervene in the workplace dynamics that actually account for differences in predicted outcomes along the lines of race, gender, disability, and other characteristics.35 We can learn a lot from the so-called social model of disability, which views a predicted difference in a disabled person’s ability to excel on the job as the result of a lack of appropriate accommodations (an accessible workplace, necessary equipment, flexible working arrangements) rather than any inherent capacity of the person. A person is only disabled in the sense that we have not built physical environments or adopted appropriate policies to ensure their equal participation. The same might be true of people with other characteristics, and changes to the selection process alone will not help us address the fundamental injustice of conditions that keep certain people from contributing as effectively as others. We examine these questions in Chapter 8.

It may not be ethical to deploy an automated decision-making system at all if the underlying conditions are unjust and the automated system would only serve to reify them. Or a system may be ill-conceived, and its intended purpose may be unjust, even if it were to work flawlessly and perform equally well for everyone. The question of which automated systems should be deployed shouldn’t be left to the logic (and whims) of the marketplace. For example, we may want to regulate the police’s access to facial recognition. Our civil rights—freedom of movement and association—are threatened by these technologies both when they fail and when they work well. These are concerns about the legitimacy of an automated decision making system, and we explore them in Chapter 2.

The harms of information systems

When a defendant is unjustly detained pre-trial, the harm is clear. But beyond algorithmic decision making, information systems such as search and recommendation algorithms can also have negative effects, and here the harm is indirect and harder to define. Here’s one example. Image search results for occupation terms such as CEO or software developer reflect (and arguably exaggerate) the prevailing gender composition and stereotypes about those occupations.36 Another example that we encountered earlier is the gender stereotyping in online translation.
These and other examples that are disturbing to varying degrees—such as Google’s app labeling photos of Black Americans as “gorillas”, or offensive results in autocomplete—seem to fall into a different moral category than, say, a discriminatory system used in criminal justice, which has immediate and tangible consequences. A talk by Kate Crawford lays out the differences.37 When decision-making systems in criminal justice, health care, etc. are discriminatory, they create allocative harms, which are caused when a system withholds an opportunity or a resource from certain groups. In contrast, the other examples—stereotype perpetuation and cultural denigration—are examples of representational harms, which occur when systems reinforce the subordination of some groups along the lines of identity—race, class, gender, etc.

Allocative harms have received much attention both because their effects are immediate, and because they are easier to formalize and study in computer science and in economics. Representational harms have long-term effects, and resist formal characterization. But as machine learning has become a part of how we make sense of the world—through technologies such as search, translation, voice assistants, and image labeling—representational harms will leave an imprint on our culture, and influence identity formation and stereotype perpetuation. Thus, these are critical concerns for the fields of natural language processing and computer vision. Although this book is primarily about allocative harms, we will briefly discuss representational harms in Chapters 7 and 9.

The majority of content consumed online is mediated by recommendation algorithms that influence which users see which content. Thus, these algorithms influence which messages are amplified. Social media algorithms have been blamed for a litany of ills: echo chambers in which users are exposed to content that conforms to their prior beliefs; exacerbating political polarization; radicalization of some users into fringe beliefs; stoking ethnic resentment and violence; a deterioration of mental health; and so on. Research on these questions is nascent and establishing causality is hard, and it remains unclear how much of these effects are due to the design of the algorithm versus user behavior. But there is little doubt that algorithms have some role. Twitter experimentally compared a non-algorithmic (reverse chronological) content feed to an algorithmic feed, and found that content from the mainstream political right was consistently favored in the algorithmic setting over content from the mainstream political left.38 While important, this topic is out of scope for us. However, we briefly touch on discrimination in ad targeting and in online marketplaces in Chapter 7.

Our outlook: limitations and opportunities

We’ve seen how machine learning propagates inequalities in the state of the world through the stages of measurement, learning, action, and feedback. Machine learning systems that affect people are best thought of as closed loops, since the actions we take based on predictions in turn affect the state of the world. One major goal of fair machine learning is to develop an understanding of when these disparities are harmful, unjustified, or otherwise unacceptable, and to develop interventions to mitigate such disparities.

There are fundamental challenges and limitations to this goal. Unbiased measurement might be infeasible even in principle, such as when the construct itself (e.g. race) is unstable.
There are additional practical limitations arising from the fact that the decision maker is typically not involved in the measurement stage. Further, observational data can be insufficient to identify the causes of disparities, which is needed both to design meaningful interventions and to understand their effects. Most attempts to "debias" machine learning in the current research literature assume simplistic mathematical systems, often ignoring the effect of algorithmic interventions on individuals and on the long-term state of society.

Despite these important limitations, there are reasons to be cautiously optimistic about fairness and machine learning.

First, data-driven decision-making has the potential to be more transparent compared to human decision-making. It forces us to articulate our decision-making objectives and enables us to clearly understand the tradeoffs between desiderata. However, there are challenges to overcome to achieve this potential for transparency. One challenge is improving the interpretability and explainability of modern machine learning methods, which is a topic of vigorous ongoing research. Another challenge is the proprietary nature of datasets and systems that are crucial to an informed public debate on this topic. Many commentators have called for a change in the status quo.39

Second, effective interventions do exist in many machine learning applications, especially in natural language processing and computer vision. Tasks in these domains (say, transcribing speech) are subject to less inherent uncertainty than traditional decision-making (say, predicting if a loan applicant will repay), removing some of the statistical constraints that we'll study in Chapter 3.

Our final and most important reason for optimism is that the turn to automated decision-making and machine learning offers an opportunity to reconnect with the moral foundations of fairness. Algorithms force us to be explicit about what we want to achieve with decision-making. And it's far more difficult to paper over our poorly specified goals or our true intentions when we have to state these objectives formally. In this way, machine learning has the potential to help us debate the fairness of different policies and decision-making procedures more effectively.

We should not expect work on fairness in machine learning to deliver easy answers. And we should be suspicious of efforts that treat fairness as something that can be reduced to an algorithmic stamp of approval. We must try to confront, not avoid, the hard questions when it comes to debating and defining fairness. We may even need to reevaluate the meaningfulness and enforceability of existing approaches to discrimination in law and policy,9 expanding the tools at our disposal to reason about fairness and seek out justice. We hope that this book can play a small role in stimulating this interdisciplinary inquiry.
Bibliographic notes and further reading

This chapter draws from several taxonomies of biases in machine learning and data-driven decision-making: a blog post by Moritz Hardt,25 a paper by Barocas and Selbst,9 and a 2016 report by the White House Office of Science and Technology Policy.40 For a broad survey of challenges raised by AI, machine learning, and algorithmic systems, see the AI Now report.41

An early work that investigated fairness in algorithmic systems is by Friedman and Nissenbaum in 1996.42 Papers studying demographic disparities in classification began appearing regularly starting in 2008;43 the locus of this research was in Europe, and in the data mining research community. With the establishment of the FAT/ML workshop in 2014, a new community emerged, and the topic has since grown in popularity.

Several popular-audience books have delivered critiques of algorithmic systems in modern society: The Black Box Society by Frank Pasquale,44 Weapons of Math Destruction by Cathy O'Neil,45 Automating Inequality by Virginia Eubanks,46 and Algorithms of Oppression by Safiya Noble.47

2 When is automated decision making legitimate?

These three scenarios have something in common:

A student is proud of the creative essay she wrote for a standardized test. She receives a perfect score, but is disappointed to learn that the test had in fact been graded by a computer.

A defendant finds that a criminal risk prediction system categorized him as high risk for failure to appear in court, based on the behavior of others like him, despite his having every intention of appearing in court on the appointed date.

An automated system locked out a social media user for violating the platform's policy on acceptable behavior. The user insists that they did nothing wrong, but the platform won't provide further details or any appeal process.

All of these are automated decision-making or decision support systems that likely feel unfair or unjust. Yet this is a sense of unfairness that is distinct from what we talked about in the first chapter (and which we will return to in the next chapter). It is not about the relative treatment of different groups. Instead, what these questions are about is legitimacy — whether it is fair to deploy such a system at all in a given scenario. That question, in turn, affects the legitimacy of the organization deploying it.

Most institutions need legitimacy to be able to function effectively. People have to believe that the institution is broadly aligned with social values. The reason for this is relatively clear in the case of public institutions such as the government, or schools, which are directly or indirectly accountable to the public. It is less clear why private firms need legitimacy. One answer is that the more power a firm has over individuals, the more the exercise of that power needs to be perceived as legitimate. And decision making about people involves exercising power over them, so it is important to ensure legitimacy. Otherwise, people will find various ways to resist, notably through law. A loss of legitimacy might also hurt a firm's ability to compete in the market.

Questions about firms' legitimacy have repeatedly come up in the digital technology industry. For example, ride sharing firms have faced such questions, leading to activism, litigation, and regulation. Firms whose business models rely on personal data, especially covertly collected data, have also undergone crises of perception.
In addition to legal responses, such firms have seen competitors capitalize on their lax privacy practices. For instance, Apple made it harder for Facebook to track users on iOS, putting a dent in its revenue.48 This move enjoyed public support despite Facebook's vociferous protests, arguably because the underlying business model had lost legitimacy.

For these reasons, a book on fairness is incomplete without a discussion of legitimacy. Moreover, the legitimacy question should precede other fairness questions. Debating distributive justice in the context of a fundamentally unjust institution is at best a waste of time; at worst it helps prop up the institution's legitimacy and is thus counterproductive. For example, improving facial analysis technology to decrease the disparity in error rates between racial groups is not a useful response to concerns about the use of such technologies for oppressive purposes.49

Discussions of legitimacy have been largely overshadowed by discussions of bias and discrimination in the fairness discourse. Often, advocates have chosen to focus on distributional considerations as a way of attacking legitimacy, since it tends to be an easier argument to make. But this can backfire, as many firms have co-opted fairness discourse, and find it relatively easy to ensure parity in the decisions between demographic groups without addressing the legitimacy concerns.50

This chapter is all about legitimacy: whether it is morally justifiable to use machine learning or automated methods at all in a given scenario.1

1 Although we have stressed the overriding importance of legitimacy, readers interested in distributive questions may skip to Chapter 3 for a technical treatment or to Chapter 4 for a normative account; those chapters, Chapter 3 in particular, do not directly build on this one.

Machine learning is not a replacement for human decision making

Machine learning plays an important role in decisions that allocate resources and opportunities that are critical to people's life chances. The stakes are clearly high. But people have been making high stakes decisions about each other for a long time, and those decisions seem to be subject to far less critical examination.

Here's a strawman view: decisions based on machine learning are analogous to decision making by humans, and so machine learning doesn't warrant special concern. While it's true that machine learning models might be difficult for people to understand, humans are black boxes, too. And while there can be systematic bias in machine learning models, they are often demonstrably less biased than humans.

We reject this analogy of machine learning to human decision making. By understanding why it fails and which analogies are more appropriate, we'll develop a better appreciation for what makes machine learning uniquely dangerous as a way of making high-stakes decisions.

While machine learning is sometimes used to automate the tasks performed inside a human's head, many of the high-stakes decisions that are the focus of the work on fairness and machine learning are those that have been traditionally performed by bureaucracies. For example, hiring, credit, and admissions decisions are rarely left to one person to make on their own as they see fit. Instead, these decisions are guided by formal rules and procedures, involving many actors with prescribed roles and responsibilities.
Bureaucracy arose in part as a response to the subjectivity, arbitrariness, and inconsistency of human decision making; its institutionalized rules and procedures aim to minimize the effects of humans' frailties as individual decision makers.51 Of course, bureaucracies aren't perfect. The very term bureaucracy tends to have a negative connotation — a needlessly convoluted process that is difficult or impossible to navigate. And despite their overly formalistic (one might say cold) approach to decision making, bureaucracies rarely succeed in fully disciplining the individual decision makers that occupy their ranks. Bureaucracies risk being just as capricious and inscrutable as humans, but far more dehumanizing.51 That's why bureaucracies often incorporate procedural protections: mechanisms that ensure that decisions are made transparently, on the basis of the right and relevant information, and with the opportunity for challenge and correction.

Once we realize that machine learning is being used to automate bureaucratic rather than individual decisions, asserting that humans don't need to — or simply cannot — account for their everyday decisions does not excuse machine learning from these expectations. As Katherine Strandburg has argued, "[r]eason giving is a core requirement in conventional decision systems precisely because human decision makers are inscrutable and prone to bias and error, not because of any expectation that they will, or even can, provide accurate and detailed descriptions of their thought processes".52

In analogizing machine learning to bureaucratic — rather than individual — decision making, we can better appreciate the source of some of the concerns about machine learning. When it is used in high-stakes domains, it undermines the kinds of protections that we often put in place to ensure that bureaucracies are engaged in well-executed and well-justified decision making.

Bureaucracy as a bulwark against arbitrary decision making

The kind of problematic decision making that bureaucracies protect against can be called arbitrary decision making. Kathleen Creel and Deborah Hellman have usefully distinguished between two flavors of arbitrariness.53 First, arbitrariness might refer to decisions made on an inconsistent or ad hoc basis. Second, arbitrariness might refer to the basis for decision making lacking reasoning, even if the decisions are made consistently on that basis.

This first view of arbitrariness is principally concerned with procedural regularity:54 whether a decision-making scheme is executed consistently and correctly. Worries about arbitrariness, in this case, are really worries about whether the rules governing important decisions are fixed in advance and applied appropriately, with the goal of reducing decision makers' capacity to make decisions in a haphazard manner. When decision making is arbitrary in this sense of the term, individuals may find that they are subject to different decision-making schemes and receive different decisions simply because they happen to go through the decision-making process at different times. Not only might the decision-making scheme change over time; human decision makers might be inconsistent in how they apply these schemes as they make their way through different cases.
The latter could be true of one individual decision maker whose behavior is inconsistent over time, but it could also be true if the decision-making process allocates cases to different individuals who are individually consistent, but differ from one another. Thus, even two people who are identical when it comes to the decision criteria may receive different decisions, violating the expectation that similar people should be treated similarly when it comes to high-stakes decisions. This principle is premised on the belief that people are entitled to similar decisions unless there are reasons to treat them differently (we'll soon address what determines if these are good reasons). For especially consequential decisions, people may have good reason to wonder why someone who resembled them received the desired outcome from the decision-making process while they did not.

Inconsistency is also problematic when it prevents people from developing effective life plans based on expectations about the decision-making systems they must navigate in order to obtain desirable resources and opportunities.53 Thus, inconsistent decision making is unjust both because it might result in unjustified differential treatment of similar individuals and also because it is a threat to individual autonomy, preventing people from making effective decisions about how best to pursue their life goals.

The second view of arbitrariness gets at a deeper concern: are there good reasons — or any reasons — why the decision-making scheme looks the way that it does? For example, if a coach picks a track team based on the color of runners' sneakers, but does so consistently, it is still arbitrary because the criterion lacks a valid basis. It does not help advance the decision maker's goals (e.g., assembling a team of runners that will win the upcoming meet).

Arbitrariness, from this perspective, is problematic because it undermines a bedrock justification for the chosen decision-making scheme: that it actually helps to advance the goals of the decision maker. If the decision-making scheme does nothing to serve these goals, then there is no justified reason to have settled on that decision-making scheme — and to treat people accordingly. When desirable resources and opportunities are allocated arbitrarily, it needlessly subjects individuals to different decisions, despite the fact that all individuals may have equal interest in these resources and opportunities.

In the context of government decision making, there is often a legal requirement that there be a rational basis for decision making — that is, that there be good reasons for making decisions in the way that they are made.53 Rules that do not help the government achieve its stated policy goals run afoul of the principles of due process. This could be either because the rules were chosen arbitrarily or because of some evident fault with the reasoning behind these rules. These requirements stem from the fact that the government has a monopoly over certain highly consequential decisions, leaving people with no opportunity to seek recourse by trying their case with another decision maker. There is no corresponding legal obligation when the decision makers are private actors, as Creel and Hellman point out. Companies are often free to make poorly reasoned — even completely arbitrary — decisions.
In theory, decision-making schemes that seem to do nothing to advance private actors' goals should be pushed out of the market by competing schemes that are more effective.53 Despite this, we often expect that decisions of major consequence, even when they are made by private actors, are made for good reasons. We are not likely to tolerate employers, lenders, or admission officers that make decisions about applicants by flipping a coin or according to the color of applicants' sneakers. Why might this be?

Arbitrary decision making fails to respect the gravity of these decisions and shows a lack of respect for the people subject to them. Even if we accept that we cannot dictate the goals of institutions, we still object to giving them complete freedom to treat people however they like. When the stakes are sufficiently high, decision makers bear some burden for justifying their decision-making schemes out of respect for the interests of people affected by these decisions. The fact that people might try their luck with other decision makers in the same domain (e.g., another employer, lender, or admission officer) may do little to modulate these expectations.

Three Forms of Automation

To recap our earlier discussion, automation might undermine important procedural protections in bureaucratic decision making. But what, exactly, does machine learning help to automate? It turns out that there are three different types of automation.

The first kind of automation involves taking decision-making rules that have been set down by hand (e.g., worked out through a traditional policy-making process) and translating these into software, with the goal of automating their application to particular cases.55 For example, many government agencies follow this approach when they adopt software to automate benefits eligibility determinations in accordance with pre-existing policies. Likewise, employers follow this approach when they identify certain minimum qualifications for a job and develop software to automatically reject applicants who do not possess them. In both of these cases, the rules are still set by humans, but their application is automated by a computer; machine learning has no obvious role here.

But what about cases where human decision makers have primarily relied on informal judgment rather than formally specified rules? This is where the second kind of automation comes in. It uses machine learning to figure out how to replicate the informal judgments of humans. Having automatically discovered a decision-making scheme that produces the same decisions as humans have made in the past, it then implements this scheme in software to replace the humans who had been making these decisions. The student whose creative essay was subject to computerized assessment, described in the opening of this chapter, is an example of just such an approach: the software in this case seeks to replicate the subjective evaluations of human graders.

The final kind of automation is quite different from the first two. It does not rely on an existing bureaucratic decision-making scheme or human judgment. Instead, it involves learning decision-making rules from data. It uses a computer to uncover patterns in a dataset that predict an outcome or property of policy interest — and then bases decisions on those predictions. Note that such rules could be applied either manually (by humans) or automatically (through software).
The relevant point of automation, in this case, is in the process of developing the rules, not necessarily applying them. For example, these could be rules that instruct police to patrol certain areas, given predictions about the likely incidence of crime based on past observations of crime. Or they could be rules that suggest that lenders grant credit to certain applicants, given the repayment histories of past recipients like them. Machine learning — and other statistical techniques — are crucial to this form of automation.

As we'll see over the next three sections, each type of automation raises its own unique concerns.

Automating Pre-Existing Decision-Making Rules

In many respects, the first form of automation — translating pre-existing rules into software so that decisions can be executed automatically — is a direct response to arbitrariness as inconsistency. Automation helps ensure consistency in decision making because it requires that the scheme for making decisions be fixed. It also means that the scheme is applied the same way every time.

And yet, many things can go wrong. Danielle Citron offers a compelling account of the dangers of automating decision-making rules established via a deliberative policy-making or rule-making process.55 Automating the execution of a pre-existing decision-making scheme requires translating such a scheme into code. Programmers might make errors in that process, leading to automated decisions that diverge from the policy that the software is meant to execute. Another problem is that the policy that programmers are tasked with automating may be insufficiently explicit or precise; in the face of such ambiguity, programmers might take it upon themselves to make their own judgment calls, effectively usurping the authority to define policy. And at the most basic level, software may be buggy. For example, hundreds of British postmasters were convicted of theft or fraud over a twenty-year period based on flawed software in what has been called the biggest miscarriage of justice in British history.56

Automating decision making can also be problematic when it completely stamps out any room for discretion. While human discretion presents its own issues, as described above, it can be useful when it is difficult or impossible to fully specify how decisions should be made in accordance with the goals and principles of the institution.57 Automation requires that an institution determine in advance all of the criteria that a decision-making scheme will take into account; there is no room to consider the relevance of additional details that might not have been considered or anticipated at the time that the software was developed.

Automated decision-making is thus likely to be much more brittle than decision-making that involves manual review because it limits the opportunity for decision subjects to introduce information into the decision-making process. People are confined to providing evidence that corresponds to a pre-established field in the software. Such constraints can result in absurd situations in which the strict application of decision-making rules leads to outcomes that are directly counter to the goals behind these rules. New evidence that would immediately reverse the assessment of a human decision maker may have no place in automated decision making.58 For example, in an automated system to assess people with illnesses to determine eligibility for a state-provided caregiver, one field asked if there were any foot problems.
An assessor visited a certain person and filled out the field to indicate that they didn't have any problems — because they were an amputee.59 Discretion is valuable in these cases because humans are often able to reflect on the relevance of additional information to the decision at hand and the underlying goal that such decisions are meant to serve. In effect, human review leaves room to expand the criteria under consideration and to reflect on when the mechanical application of the rules fails to serve their intended purpose.60, 58

These same constraints can also restrict people's ability to point out errors or to challenge the ultimate decision.61 When interacting with a loan officer, a person could point out that their credit file contains erroneous information. When applying for a loan via an automated process, they might have no equivalent opportunity. Or perhaps a person recognizes that the rules dictating their eligibility for government benefits have been applied incorrectly. When caseworkers are replaced by software, people subject to these decisions may have no means to raise justified objections.62

Finally, automation runs the serious risk of limiting accountability and exacerbating the dehumanizing effects of dealing with bureaucracies. Automation can make it difficult to identify the agent responsible for a decision; software often has the effect of dispersing the locus of accountability because the decision seems to be made by no one.63 People may have more effective means of disputing decisions and contesting the decision-making scheme when decision-making is vested in identifiable people. Likewise, automation's ability to remove humans from the decision-making process may contribute to people's sense that an institution does not view them as worthy of the respect that would grant them an opportunity to make legitimate corrections, introduce additional relevant information, or describe mitigating circumstances.64 This is precisely the problem highlighted by the opening example of a social media user who had been kicked off a platform without explanation or opportunity for appeal.

We've highlighted many normative concerns that arise from simply automating the application of a pre-existing decision-making scheme. While many of these issues are commonly attributed to the adoption of machine learning, none of them originate from the use of machine learning specifically. Long-standing efforts to automate decision-making with traditional software pose many dangers of their own. The fact that machine learning is not the exclusive cause of these types of problems is no reason to take them any less seriously, but effective responses to these problems require that we be clear about their origins.

Learning Decision-Making Rules from Data on Past Decisions in order to Automate Them

Decision makers might have a pre-existing but informal process for making decisions which they might like to automate. In this case, machine learning (or other statistical techniques) might be employed to "predict" how a human would make a decision, given certain criteria. The goal isn't necessarily to perfectly recover the specific weight that past decision makers had implicitly assigned to different criteria, but rather to ensure that the model produces a similar set of decisions as humans, as the sketch below illustrates.
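To make this pattern concrete, here is a minimal, purely hypothetical sketch of fitting a model to reproduce past human judgments. The hiring framing, feature names, data, and use of scikit-learn are our own illustrative assumptions, not a description of any system discussed in this chapter.

```python
# Hypothetical sketch: automating an informal human decision process by
# learning to reproduce past human judgments. All values are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row describes a past case (e.g., years of experience, skills score,
# referral flag); each label is the decision a human actually made.
past_cases = np.array([
    [3.1, 12, 0],
    [1.0,  4, 1],
    [4.5, 15, 0],
    [0.5,  3, 0],
])
human_decisions = np.array([1, 0, 1, 0])  # 1 = human said "hire", 0 = "reject"

# The model's only objective is to agree with the humans' past decisions,
# including whatever inconsistencies or biases those decisions contain.
model = LogisticRegression().fit(past_cases, human_decisions)

new_case = np.array([[2.8, 10, 1]])
print(model.predict(new_case))        # the predicted human decision
print(model.predict_proba(new_case))  # how confident the mimicry is
```

Note that the model's sole objective in this setup is agreement with the humans it imitates; nothing in the procedure supplies an independent notion of what a good decision is.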
To return to one of our recurring examples, an educational institution might want to automate the process of grading essays, and it might attempt to do that by relying on machine learning to learn to mimic the grades teachers have assigned to similar work in the past.

This form of automation might help to address concerns with arbitrariness in human decision making by formalizing and fixing a decision-making scheme similar to what humans might have been employing in the past. In this respect, machine learning might be desirable because it can help to smooth out any inconsistencies in the human decisions from which it has induced some decision-making rule. For example, the essay grading model described above might reduce some of the variance observed in the grading of teachers whose subjective evaluations the model is learning to replicate. Automation can once again help to address concerns with arbitrariness understood as inconsistency, even when it is subjective judgments that are being automated.

A few decades ago, there was a popular approach to automation that relied on explicitly encoding the reasoning that humans use to make decisions.65 This approach, called expert systems, failed for many reasons, including the fact that people aren't always able to explain their own reasoning.66 Expert systems eventually gave way to the approach of simply asking people to label examples and having learning algorithms discover how to best predict the label that humans would assign.

While this approach has proved powerful, it has its dangers. First, it may give the veneer of objective assessment to decision-making schemes that simply automate the subjective judgment of humans. As a result, people may be more likely to view its decisions as less worthy of critical investigation. This is particularly worrisome because learning decision-making rules from the previous decisions made by humans runs the obvious risk of replicating and exaggerating any objectionable qualities of human decision making by learning from the bad examples set by humans. (In fact, many attempts to learn a rule to predict some seemingly objective target of interest — the form of automation that we'll discuss in the next section — are really just a version of replicating human judgment in disguise. If we can't obtain objective ground truth for the chosen target of prediction, there is no way to escape human judgment. As David Hand points out, humans will often need to exercise discretion in specifying and identifying what counts as an example of the target.67)

Second, such decision-making schemes may be regarded as equivalent to those employed by humans and thus likely to operate in the same way, even though the model might reach its decisions differently and produce quite different error patterns.68 Even when the model is able to predict the decisions that humans would make given any particular input with a high degree of accuracy, there is no guarantee that the model will have inherited all of the nuance and considerations that go into human decision-making.
Worse, models might also learn to rely on criteria in ways that humans would find worrisome or objectionable, even if doing so still produces a similar set of decisions as humans would make.69 For example, a model that automates essay grading by assigning higher scores to papers that employ sophisticated vocabulary may do a reasonably good job replicating the judgments of human graders (likely because higher quality writing tends to rely on more sophisticated vocabulary), but checking for the presence of certain words is unlikely to be a reliable substitute for assessing an essay for logical coherence and factual correctness.70

In short, the use of machine learning to automate decisions previously performed by humans can be problematic because it can end up being both too much like human decision makers and too different from them.

Deriving Decision-Making Rules by Learning to Predict a Target

The final form of automation is one in which decision makers rely on machine learning to learn a decision-making rule or policy from data. This form of automation, which we'll call predictive optimization, speaks directly to concerns with reasoned decision making. Note that neither of the first two forms of automation does so. Consistently executing a pre-existing policy via automation does not ensure that the policy itself is a reasoned one. Nor does relying on past human decisions to induce a decision-making rule guarantee that the basis for automated decision making will reflect reasoned judgments. In both cases, the decision-making scheme will only be as reasoned as the formal policy or informal judgments whose execution is being automated.

In contrast, predictive optimization tries to provide a more rigorous foundation for decision making by relying on criteria only to the extent that they demonstrably predict the outcome or quality of interest. When employed in this manner, machine learning seems to ensure reasoned decisions because the criteria that have been incorporated into the decision-making scheme — and their particular weighting — are dictated by how well they predict the target. And so long as the chosen target is a good proxy for decision makers' goals, relying on criteria that predict this target to make decisions would seem well reasoned because doing so will help to achieve decision makers' goals.

Unlike the first two forms of automation, predictive optimization is a radical departure from the traditional approach to decision making. In the traditional approach, a set of decision makers has some goal — even if this goal is amorphous and hard to specify — and would like to develop an explicit decision-making scheme to help realize their goal. They engage in discussion and deliberation to try to come to some agreement about the criteria that are relevant to the decision and the weight to assign to each criterion in the decision-making scheme. Relying on intuition, prior evidence, and normative reasoning, decision makers will choose and combine features in ways that are thought to help realize their goals.

The statistical or machine learning approach works differently. First, the decision makers try to identify an explicit target for prediction which they view as synonymous with their goal — or a reasonable proxy for it. In a college admissions scenario, one goal might be scholastic achievement in college, and college GPA might be a proxy for it. A minimal sketch of this workflow appears below.
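The following is a hypothetical sketch of predictive optimization in the admissions setting, assuming scikit-learn and entirely invented feature names and numbers: a target (first-year college GPA) stands in for the institution's goals, the weights on each criterion are learned from historical data rather than chosen through deliberation, and applicants are then ranked by their predicted target value.

```python
# Hypothetical sketch of predictive optimization: choose a target, learn
# weights from historical data, then admit the highest-scoring applicants.
# All names and numbers are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

# Historical records of admitted students: high-school GPA, test score,
# number of advanced courses -- and the outcome chosen as the target.
past_features = np.array([
    [3.9, 1450, 6],
    [3.2, 1200, 2],
    [3.6, 1340, 4],
    [2.9, 1100, 1],
])
college_gpa = np.array([3.7, 2.9, 3.4, 2.6])  # the chosen target (proxy for the goal)

model = LinearRegression().fit(past_features, college_gpa)

# Score this year's applicants and admit the top two predicted performers.
applicants = np.array([
    [3.8, 1400, 5],
    [3.0, 1150, 2],
    [3.5, 1300, 3],
])
predicted_gpa = model.predict(applicants)
admitted = np.argsort(predicted_gpa)[::-1][:2]
print(predicted_gpa)
print(admitted)
```

The normative work in this sketch is concentrated in a single line: the choice of `college_gpa` as the target. Everything downstream is dictated by predictive accuracy alone.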
Once this is settled, the decision makers use data to discover which criteria to use and how to weight them in order to best predict the target. While they might exercise discretion in choosing the criteria to use, the weighting of these criteria would be dictated entirely by the goal of maximizing the accuracy of the resulting prediction of the chosen target. In other words, the decision-making rule would, in large part, be learned from data, rather than set down according to decision makers' subjective intuitions, expectations, and normative commitments. The two approaches are contrasted below.

Example: college admissions
Traditional approach: Holistic approach that takes into account achievements, character, special circumstances, and other factors.
Predictive optimization approach: Train a model based on past students' data to predict applicants' GPA if admitted; admit highest scoring applicants.

Goal and target
Traditional approach: No explicit target; goal is implicit (and there are usually multiple goals).
Predictive optimization approach: Define an explicit target; assume it is a good proxy for the goal.

Focus of deliberation
Traditional approach: Debate is about how the criteria should affect the decision.
Predictive optimization approach: Debate is largely about the choice of target.

Effectiveness
Traditional approach: May fail to produce rules that meet their putative objectives.
Predictive optimization approach: Predictive accuracy can be quantified.

Range of normative considerations
Traditional approach: Easier to incorporate multiple normative principles such as need.
Predictive optimization approach: Harder to incorporate multiple normative principles.

Justification
Traditional approach: Can be difficult to divine rule makers' reasons for choosing a certain decision making scheme.
Predictive optimization approach: Reasons for the chosen decision making scheme are made explicit in choice of target.

Each approach has pluses and minuses from a normative perspective. The traditional approach makes it possible to express multiple goals and normative values through the choice of criteria and the weight assigned to them. In the machine learning approach, multiple goals and normative considerations need to be packed into the choice of target. In college admissions, those goals and considerations might include — in addition to scholastic potential — athletic and leadership potential, the extent to which the applicant would contribute to campus life, whether the applicant brings unusual life experiences, their degree of need, and many others. The most common approach is to define a composite target variable that linearly combines multiple components, but this quickly becomes unwieldy and is rarely subject to robust debate. There is also some room to exercise normative judgment about the choice to include or exclude certain decision criteria, but this is a far cry from deliberative policy-making.

On the other hand, if we believe that a target does, in fact, capture the full range of goals that decision makers have in mind, machine learning models might be able to serve these goals more effectively. For example, in a paper that compares the two approaches to policy making, Rebecca Johnson and Simone Zhang show that the traditional approach (i.e., manually crafting rules via a process of deliberation and debate) often fails to produce rules that meet their putative objectives.71 In examining rules for allocating housing assistance, they find that housing authorities prioritize veterans above particularly rent-burdened households, despite the fact that supporting such households would seem to be more in line with the policy's most basic goal.
Johnson and Zhang assert that while this prioritization might be the actual intent of the policymakers setting the rules, the reasons for this prioritization are rarely made explicit in the process of deliberation and are especially difficult to discern after the fact. Were these rules developed instead using machine learning, policymakers would need to agree on an explicit target of prediction, which would leave much less room for confusion about policymakers' intent. And it would ensure that the resulting rules are only designed to predict that target.71 As Rediet Abebe, Solon Barocas, Jon Kleinberg, and colleagues have argued, "[t]he nature of computing is such that it requires explicit choices about inputs, objectives, constraints, and assumptions in a system"72 — and this may be a good thing if it forces certain policy considerations and normative judgments into the open.

The machine learning approach nevertheless runs the serious risk of focusing narrowly on the accuracy of predictions. In other words, "good" decisions are those that accurately predict the target. But decision making might be "good" for other reasons: focusing on the right qualities or outcomes (in other words, the target is a good proxy for the goal), considering only relevant factors, considering the full set of relevant factors, incorporating other normative principles (e.g., need, desert, etc.), or allowing people to understand and potentially contest the policy. Even a decision making process that is not terribly accurate might be seen as good if it has some