An Introduction to Data Ethics
Shannon Vallor, Ph.D., William J. Rewak, S.J. Professor of Philosophy, Santa Clara University
Summary
This document introduces the concept of data ethics, exploring its implications in diverse contexts. The author discusses how technology shapes our understanding of a good life and stresses the importance of ethical considerations in technological design and implementation. The document also examines the interplay between technical advancements and societal norms.
Full Transcript
An Introduction to Data Ethics
MODULE AUTHOR: Shannon Vallor, Ph.D., William J. Rewak, S.J. Professor of Philosophy, Santa Clara University

1. What do we mean when we talk about ‘ethics’?

Ethics in the broadest sense refers to the concern that humans have always had for figuring out how best to live. The philosopher Socrates is quoted as saying in 399 B.C. that “the most important thing is not life, but the good life.”2 We would all like to avoid a bad life, one that is shameful and sad, fundamentally lacking in worthy achievements, unredeemed by love, kindness, beauty, friendship, courage, honor, joy, or grace. Yet what is the best way to obtain the opposite of this – a life that is not only acceptable, but even excellent and worthy of admiration? How do we identify a good life, one worth choosing from among all the different ways of living that lie open to us? This is the question that the study of ethics attempts to answer.

Today, the study of ethics can be found in many different places. As an academic field of study, it belongs primarily to the discipline of philosophy, where it is studied either on a theoretical level (‘what is the best theory of the good life?’) or on a practical, applied level as will be our focus (‘how should we act in this or that situation, based upon our best theories of ethics?’). In community life, ethics is pursued through diverse cultural, religious, or regional/local ideals and practices, through which particular groups give their members guidance about how best to live. This political aspect of ethics introduces questions about power, justice, and responsibility. On a personal level, ethics can be found in an individual’s moral reflection and continual strivings to become a better person. In work life, ethics is often formulated in formal codes or standards to which all members of a profession are held, such as those of medical or legal ethics. Professional ethics is also taught in dedicated courses, such as business ethics. It is important to recognize that the political, personal, and professional dimensions of ethics are not separate—they are interwoven and mutually influencing ways of seeking a good life with others.

2. What does ethics have to do with technology?

There is a growing international consensus that ethics is of increasing importance to education in technical fields, and that it must become part of the language that technologists are comfortable using. Today, the world’s largest technical professional organization, IEEE (the Institute of Electrical and Electronics Engineers), has an entire division devoted just to technology ethics.3 In 2014 IEEE began holding its own international conferences on ethics in engineering, science, and technology practice. To supplement its overarching professional code of ethics, IEEE is also working on new ethical standards in emerging areas such as AI, robotics, and data management.

What is driving this growing focus on technology ethics? What is the reasoning behind it? The basic rationale is really quite simple. Technology increasingly shapes how human beings seek the good life, and with what degree of success. Well-designed and well-used technologies can make it easier for people to live well (for example, by allowing more efficient use and distribution of essential resources for a good life, such as food, water, energy, or medical care). Poorly designed or misused technologies can make it harder to live well (for example, by toxifying our environment, or by reinforcing unsafe, unhealthy or antisocial habits).
Technologies are not ethically ‘neutral’, for they reflect the values that we ‘bake in’ to them with our design choices, as well as the values which guide our distribution and use of them. Technologies both reveal and shape what humans value, what we think is ‘good’ in life and worth seeking. Of course, this has always been true; technology has never been separate from our ideas about the good life. We don’t build or invest in a technology hoping it will make no one’s life better, or hoping that it makes all our lives worse.

So what is new, then? Why is ethics now such an important topic in technical contexts, more so than ever? The answer has partly to do with the unprecedented speeds, scales and pervasiveness with which technical advances are transforming the social fabric of our lives, and the inability of regulators and lawmakers to keep up with these changes. Laws and regulations have historically been important instruments of preserving the good life within a society, but today they are being outpaced by the speed, scale, and complexity of new technological developments and their increasingly pervasive and hard-to-predict social impacts. Additionally, many lawmakers lack the technical expertise needed to guide effective technology policy. This means that technical experts are increasingly called upon to help anticipate those social impacts and to think proactively about how their technical choices are likely to impact human lives. This means making ethical design and implementation choices in a dynamic, complex environment where the few legal ‘handrails’ that exist to guide those choices are often outdated and inadequate to safeguard public well-being.

For example: face- and voice-recognition algorithms can now be used to track and create a lasting digital record of your movements and actions in public, even in places where previously you would have felt more or less anonymous. There is no consistent legal framework governing this kind of data collection, even though such data could potentially be used to expose a person’s medical history (by recording which medical and mental health facilities they visit), their religiosity (by recording how frequently they attend services and where), their status as a victim of violence (by recording visits to a victims’ services agency) or other sensitive information, up to and including the content of their personal conversations in the street. What does a person given access to all that data, or tasked with analyzing it, need to understand about its ethical significance and power to affect a person’s life?

Another factor driving the recent explosion of interest in technology ethics is the way in which 21st century technologies are reshaping the global distribution of power, justice, and responsibility. Companies such as Facebook, Google, Amazon, Apple, and Microsoft are now seen as having levels of global political influence comparable to, or in some cases greater than, that of states and nations. In the wake of revelations about the unexpected impact of social media and private data analytics on 2017 elections around the globe, the idea that technology companies can safely focus on profits alone, leaving the job of protecting the public interest wholly to government, is increasingly seen as naïve and potentially destructive to social flourishing. Not only does technology greatly impact our opportunities for living a good life, but its positive and negative impacts are often distributed unevenly among individuals and groups.
Technologies can create widely disparate impacts, creating ‘winners’ and ‘losers’ in the social lottery or magnifying existing inequalities, as when the life-enhancing benefits of a new technology are enjoyed only by citizens of wealthy nations while the life-degrading burdens of environmental contamination produced by its manufacture fall upon citizens of poorer nations. In other cases, technologies can help to create fairer and more just social arrangements, or create new access to means of living well, as when cheap, portable solar power is used to allow children in rural villages without electric power to learn to read and study after dark. How do we ensure that access to the enormous benefits promised by new technologies, and exposure to their risks, are distributed in the right way? This is a question about technology justice. Justice is not only a matter of law; it is also, even more fundamentally, a matter of ethics.

3. What does ethics have to do with data?

‘Data’ refers to any form of recorded information, but today most of the data we use is recorded, stored, and accessed in digital form, whether as text, audio, video, still images, or other media. Networked societies generate an unending torrent of such data, through our interactions with our digital devices and a physical environment increasingly configured to read and record data about us. Big Data is a widely used label for the many new computing practices that depend upon this century’s rapid expansion in the volume and scope of digitally recorded data that can be collected, stored, and analyzed. Thus ‘big data’ refers to more than just the existence and explosive growth of large digital datasets; it also refers to the new techniques, organizations, and processes that are necessary to transform large datasets into valuable human knowledge. The big data phenomenon has been enabled by a wide range of computing innovations in data generation, mining, scraping, and sampling; artificial intelligence and machine learning; natural language and image processing; computer modeling and simulation; cloud computing and storage, and many others. Thanks to our increasingly sophisticated tools for turning large datasets into useful insights, new industries have sprung up around the production of various forms of data analytics, including predictive analytics and user analytics.

Ethical issues are everywhere in the world of data, because data’s collection, analysis, transmission and use can and often does profoundly impact the ability of individuals and groups to live well. For example, which of these life-impacting events, both positive and negative, might be the direct result of data practices?

A. Rosalina, a promising and hard-working law intern with a mountain of student debt and a young child to feed, is denied a promotion at work that would have given her a livable salary and a stable career path, even though her work record made her the objectively best candidate for the promotion.

B. John, a middle-aged father of four, is diagnosed with an inoperable, aggressive, and advanced brain tumor. Though a few decades ago his tumor would probably have been judged untreatable and he would have been sent home to die, today he receives a customized treatment that, in people with his very rare tumor gene variant, has a 75% chance of leading to full remission.
C. The Patels, a family of five living in an urban floodplain in India, receive several days’ advance warning of an imminent, epic storm that is almost certain to bring life-threatening floodwaters to their neighborhood. They and their neighbors now have sufficient time to gather their belongings and safely evacuate to higher ground.

D. By purchasing personal information from multiple data brokers operating in a largely unregulated commercial environment, Peter, a violent convict who was just paroled, is able to obtain a large volume of data about the movements of his ex-wife and stepchildren, whom he was jailed for physically assaulting, and whom a restraining order prevents him from contacting. Although his ex-wife and her children have changed their names, have no public social media accounts, and have made every effort to conceal their location from him, he is able to infer from his data purchases their new names, their likely home address, and the names of the schools his ex-wife’s children now attend. They are never notified that he has purchased this information.

Which of these hypothetical cases raise ethical issues concerning data? The answer, as you probably have guessed, is ‘All of them.’

Rosalina’s deserved promotion might have been denied because her law firm ranks employees using a poorly-designed predictive HR software package trained on data that reflects previous industry hiring and promotion biases against even the best-qualified women and minorities, thus perpetuating the unjust bias. As a result, especially if other employers in her field use similarly trained software, Rosalina might never achieve the economic security she needs to give her child the best chance for a good life, and her employer and its clients lose out on the promise of the company’s best intern.

John’s promising treatment plan might be the result of his doctors’ use of an AI-driven diagnostic support system that can identify rare, hard-to-find patterns in a massive sea of cancer patient treatment data gathered from around the world, data that no human being could process or analyze in this way even if given an entire lifetime. As a result, instead of dying in his 40’s, John has a great chance of living long enough to walk his daughters down the aisle at their weddings, enjoy retirement with his wife, and even survive to see the birth of his grandchildren.

The Patels might owe their family’s survival to advanced meteorological data analytics software that allows for much more accurate and precise disaster forecasting than was ever possible before; local governments in their state are now able to predict with much greater confidence which cities and villages a storm is likely to hit and which neighborhoods are most likely to flood, and to what degree. Because it is often logistically impossible or dangerous to evacuate an entire city or region in advance of a flood, a decade ago the Patels and their neighbors would have had to watch and wait to see where the flooding would hit, and perhaps learn too late of their need to evacuate. But now, because these new data analytics allow officials to identify and evacuate only those neighborhoods that will be most severely affected, the Patels’ lives are saved from destruction.

Peter’s ex-wife and her children might have their lives endangered by the absence of regulations on who can purchase and analyze personal data about them that they have not consented to make public.
Because the data brokers Peter sought out had no internal policy against the sale of personal information to violent felons, and because no law prevented them from making such a sale, Peter was able to get around every effort of his victims to evade his detection. And because there is no system in place allowing his ex-wife to be notified when someone purchases personal information about her or her children, or even a way for her to learn what data about her is available for sale and by whom, she and her children get no warning of the imminent threat that Peter now poses to their lives, and no chance to escape.

The combination of increasingly powerful but also potentially misleading or misused data analytics, a data-saturated and poorly regulated commercial environment, and the absence of widespread, well-designed standards for data practice in industry, university, non-profit, and government sectors has created a ‘perfect storm’ of ethical risks. Managing those risks wisely requires understanding the vast potential for data to generate ethical benefits as well. But this doesn’t mean that we can just ‘call it a wash’ and go home, hoping that everything will somehow magically ‘balance out.’ Often, ethical choices do require accepting difficult trade-offs. But some risks are too great to ignore, and in any event, we don’t want the result of our data practices to be a ‘wash.’ We don’t actually want the good and bad effects to balance! Remember, the whole point of scientific and technical innovation is to make lives better, to maximize the human family’s chances of living well and minimize the harms that can obstruct our access to good lives. Developing a broader and better understanding of data ethics, especially among those who design and implement data tools and practices, is increasingly recognized as essential to meeting this goal of beneficial data innovation and practice.

This module provides an introduction to some key issues in data ethics, with working examples and questions for students that prompt active ethical reflection on the issues. Instructors and students using the module do not need to have any prior exposure to data ethics or ethical theory to use the module. However, this is only an introduction; thinking about data ethics can begin here, but it should not stop here. One big challenge for teaching data ethics is the immense territory the subject covers, given the ever-expanding variety of contexts in which data practices are used. Thus no single set of ethical rules or guidelines will fit all data circumstances; ethical insights in data practice must be adapted to the needs of many kinds of data practitioners operating in different contexts. This is why many companies, universities, non-profit agencies, and professional societies whose members develop or rely upon data practices are funding an increasing number of their own data ethics-related programs and training tools.

PART ONE
What ethically significant harms and benefits can data present?

1. What makes a harm or benefit ‘ethically significant’?

In the Introduction we saw that the ‘good life’ is what ethical action seeks to protect and promote. We’ll say more later about the ‘good life’ and why we are ethically obligated to care about the lives of others beyond ourselves.
But for now, we can define a harm or a benefit as ‘ethically significant’ when it has a substantial possibility of making a difference to certain individuals’ chances of having a good life, or the chances of a group to live well: that is, to flourish in society together. Some harms and benefits are not ethically significant. Say I prefer Coke to Pepsi. If I ask for a Coke and you hand me a Pepsi, even if I am disappointed, you haven’t impacted my life in any ethically significant way. Some harms and benefits are too trivial to make a meaningful difference to how our life goes. Also, ethics implies human choice; a harm that is done to me by a wild tiger or a bolt of lightning might be very significant, but won’t be ethically significant, for it’s unreasonable to expect a tiger or a bolt of lightning to take my life or welfare into account. Ethics also requires more than ‘good intentions’: many unethical choices have been made by persons who meant no harm, but caused great harm anyway, by acting with recklessness, negligence, bias, or blameworthy ignorance of relevant facts.4

In many technical contexts, such as the engineering, manufacture, and use of aeronautics, nuclear power containment structures, surgical devices, buildings, and bridges, it is very easy to see the ethically significant harms that can come from poor technical choices, and very easy to see the ethically significant benefits of choosing to follow the best technical practices known to us. All of these contexts present obvious issues of ‘life or death’ in practice; innocent people will die if we disregard public welfare and act negligently or irresponsibly, and people will generally enjoy better lives if we do things right. Because ‘doing things right’ in these contexts preserves or even enhances the opportunities that other people have to enjoy a good life, good technical practice in such contexts is also ethical practice. A civil engineer who willfully or recklessly ignores a bridge design specification, resulting in the later collapse of said bridge and the deaths of a dozen people, is not just bad at his or her job. Such an engineer is also guilty of an ethical failure—and this would be true even if they just so happened to be shielded from legal, professional, or community punishment for the collapse.

In the context of data practice, the potential harms and benefits are no less real or ethically significant, up to and including matters of life and death. But due to the more complex, abstract, and often widely distributed nature of data practices, as well as the interplay of technical, social, and individual forces in data contexts, the harms and benefits of data can be harder to see and anticipate. This part of the module will help make them more recognizable, and hopefully, easier to anticipate as they relate to our choices.

2. What significant ethical benefits and harms are linked to data?

One way of thinking about benefits and harms is to understand what our life interests are; like all animals, humans have significant vital interests in food, water, air, shelter, and bodily integrity.
But we also have strong life interests in our health, happiness, family, friendship, social reputation, liberty, autonomy, knowledge, privacy, economic security, respectful and fair treatment by others, education, meaningful work, and opportunities for leisure, play, entertainment, and creative and political expression, among other things.5 What is so powerful about data practice is that it has the potential to significantly impact all of these fundamental interests of human beings. In this respect, then, data has a broader ethical sweep than some of the stark examples of technical practice given earlier, such as the engineering of bridges and airplanes. Unethical design choices in building bridges and airplanes can destroy bodily integrity and health, and through such damage make it harder for people to flourish, but unethical choices in the use of data can cause many more different kinds of harm. While selling my personal data to the wrong person could in certain scenarios cost me my life, as we noted in the Introduction, mishandling my data could also leave my body physically intact but my reputation, savings, or liberty destroyed. Ethical uses of data can also generate a vast range of benefits for society, from better educational outcomes and improved health to expanded economic security and fairer institutional decisions. Because of the massive scope of social systems that data touches, and the difficulty of anticipating what might be done by or to others with the data we handle, data practitioners must confront a far more complex ethical landscape than many other kinds of technical professionals, such as civil and mechanical engineers, who might limit their attention to a narrow range of goods such as public safety and efficiency.

ETHICALLY SIGNIFICANT BENEFITS OF DATA PRACTICES

The most common benefits of data are typically easier to understand and anticipate than the potential harms, so we will go through these fairly quickly:

1. HUMAN UNDERSTANDING: Because data and its associated practices can uncover previously unrecognized correlations and patterns in the world, data can greatly enrich our understanding of ethically significant relationships—in nature, society, and our personal lives. Understanding the world is good in itself, but also, the more we understand about the world and how it works, the more intelligently we can act in it. Data can help us to better understand how complex systems interact at a variety of scales: from large systems such as weather, climate, markets, transportation, and communication networks, to smaller systems such as those of the human body, a particular ecological niche, or a specific political community, down to the systems that govern matter and energy at subatomic levels. Data practice can also shed new light on previously unseen or unattended harms, needs, and risks. For example, big data practices can reveal that a minority or marginalized group is being harmed by a drug or an educational technique that was originally designed for and tested only on a majority/dominant group, allowing us to innovate in safer and more effective ways that bring more benefit to a wider range of people.

2. SOCIAL, INSTITUTIONAL, AND ECONOMIC EFFICIENCY: Once we have a more accurate picture of how the world works, we can design or intervene in its systems to improve their functioning. This reduces wasted effort and resources and improves the alignment between a social system or institution’s policies/processes and our goals.
For example, big data can help us create better models of systems such as regional traffic flows, and with such models we can more easily identify the specific changes that are most likely to ease traffic congestion and reduce pollution and fuel use—ethically significant gains that can improve our happiness and the environment. Data used to better model voting behavior in a given community could allow us to identify the distribution of polling station locations and hours that would best encourage voter turnout, promoting ethically significant values such as citizen engagement. Data analytics can search for complex patterns indicating fraud or abuse of social systems. The potential efficiencies of big data go well beyond these examples, enabling social action that streamlines access to a wide range of ethically significant goods such as health, happiness, safety, security, education, and justice.

3. PREDICTIVE ACCURACY AND PERSONALIZATION: Not only can good data practices help to make social systems work more efficiently, as we saw above, but they can also be used to more precisely tailor actions to be effective in achieving good outcomes for specific individuals, groups, and circumstances, and to be more responsive to user input in (approximately) real time. Of course, perhaps the most well-known examples of this advantage of data involve personalized search and serving of advertisements. Designers of search engines, online advertising platforms, and related tools want the content they deliver to you to be the most relevant to you, now. Data analytics allow them to predict your interests and needs with greater accuracy. But it is important to recognize that the predictive potential of data goes well beyond this familiar use, enabling personalized and targeted interactions that can deliver many kinds of ethically significant goods. From targeted disease therapies in medicine that are tailored specifically to a patient’s genetic fingerprint, to customized homework assignments that build upon an individual student’s existing skills and focus on practice in areas of weakness, to predictive policing strategies that send officers to the specific locations where crimes are most likely to occur, to timely predictions of mechanical failure or natural disaster, a key goal of data practice is to more accurately fit our actions to specific needs and circumstances, rather than relying on more sweeping and less reliable generalizations. In this way the choices we make in seeking the good life for ourselves and others can be more effective more often, and for more people.

ETHICALLY SIGNIFICANT HARMS OF DATA PRACTICES

Alongside the ethically significant benefits of data are ways in which data practice can be harmful to our chances of living well. Here are some key ones:

1. HARMS TO PRIVACY & SECURITY: Thanks to the ocean of personal data that humans are generating today (or, to use a better metaphor, the many different lakes, springs, and rivers of personal data that are pooling and flowing across the digital landscape), most of us do not realize how exposed our lives are, or can be, by common data practices. Even anonymized datasets can, when linked or merged with other datasets, reveal intimate facts (or in many cases, falsehoods) about us.
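To make this re-identification risk concrete, here is a minimal sketch (not taken from the module; every name, column, and record below is invented) of how an ‘anonymized’ dataset can be linked with a second, public dataset that shares quasi-identifiers such as zip code, birth date, and sex:

import pandas as pd

# "Anonymized" health records: direct identifiers removed, quasi-identifiers kept.
health = pd.DataFrame({
    "zip": ["95050", "95050", "95112"],
    "birth_date": ["1984-03-02", "1991-07-15", "1984-03-02"],
    "sex": ["F", "F", "M"],
    "diagnosis": ["depression", "diabetes", "hypertension"],
})

# A separate, public dataset (say, a voter roll) that still carries names.
voters = pd.DataFrame({
    "name": ["R. Lopez", "T. Nguyen"],
    "zip": ["95050", "95112"],
    "birth_date": ["1984-03-02", "1984-03-02"],
    "sex": ["F", "M"],
})

# Merging on the shared quasi-identifiers re-attaches names to 'anonymous' rows.
linked = health.merge(voters, on=["zip", "birth_date", "sex"])
print(linked[["name", "diagnosis"]])
# Where a (zip, birth_date, sex) combination is unique, the diagnosis is no
# longer anonymous at all.

The point of the sketch is simply that removing names is not, by itself, a privacy guarantee; whether a record can be re-identified depends on what other datasets exist and can be joined with it.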
As a result of your multitude of data-generating activities (and of those you interact with), your sexual history and preferences, medical and mental health history, private conversations at work and at home, genetic makeup and predispositions, reading and Internet search habits, political and religious views, may all be part of data profiles that have been constructed and stored somewhere unknown to you, often without your knowledge or informed consent. Such profiles exist within a chaotic data ecosystem that gives individuals little to no ability to personally curate, delete, correct, or control the release of that information. Only thin, regionally inconsistent, and weakly enforced sets of data regulations and policies protect us from the reputational, economic, and emotional harms that release of such intimate data into the wrong hands could cause. In some cases, as with data identifying victims of domestic violence, or political protestors or sexual minorities living under oppressive regimes, the potential harms can even be fatal. And of course, this level of exposure does not just affect you but virtually everyone in a networked society. Even those who choose to live ‘off the digital grid’ cannot prevent intimate data about them from being generated and shared by their friends, family, employers, clients, and service providers.

Moreover, much of this data does not stay confined to the digital context in which it was originally shared. For example, information about an online purchase you made in college of a politically controversial novel might, without your knowledge, be sold to third parties (and then sold again), or hacked from an insecure cloud storage system, and eventually included in a digital profile of you that years later, a prospective employer or investigative journalist could purchase. Should you, and others, be able to protect your employability or reputation from being irreparably harmed by such data flows? Data privacy isn’t just about our online activities, either. Facial, gait, and voice-recognition algorithms, as well as geocoded mobile data, can now identify and gather information about us as we move and act in many public and private spaces. Unethical or ethically negligent data privacy practices, from poor data security and data hygiene, to unjustifiably intrusive data collection and data mining, to reckless selling of user data to third parties, can expose others to profound and unnecessary harms.

2. HARMS TO FAIRNESS AND JUSTICE: We all have a significant life interest in being judged and treated fairly, whether it involves how we are treated by law enforcement and the criminal and civil court systems, how we are evaluated by our employers and teachers, the quality of health care and other services we receive, or how financial institutions and insurers treat us. All of these systems are being radically transformed by new data practices and analytics, and the preliminary evidence suggests that the values of fairness and justice are too often endangered by poor design and use of such practices. The most common causes of such harms are: arbitrariness; avoidable errors and inaccuracies; and unjust and often hidden biases in datasets and data practices. For example, investigative journalists have found compelling evidence of hidden racial bias in data-driven predictive algorithms used by parole judges to assess convicts’ risk of reoffending.6 Of course, bias is not always harmful, unfair, or unjust.
A bias against, for example, convicted bank robbers when reviewing job applications for an armored-car driver is entirely reasonable! But biases that rest on falsehoods, sampling errors, and unjustifiable discriminatory practices are all too common in data practice. Typically, such biases are not explicit, but implicit in the data or data practice, and thus harder to see. For example, in the case involving racial bias in criminal risk-predictive algorithms cited above, the race of the offender was not in fact a label or coded variable in the system used to assign the risk score. The racial bias in the outcomes was not intentionally placed there, but rather ‘absorbed’ from the racially-biased data the system was trained on. We use the term ‘proxies’ to describe how data that are not explicitly labeled by race, gender, location, age, etc. can still function as indirect but powerful indicators of those properties, especially when combined with other pieces of data. A very simple example is the function of a zip code as a strong proxy, in many neighborhoods, for race or income. So, a risk-predicting algorithm could generate a racially-biased prediction about you even if it is never ‘told’ your race. This makes the bias no less harmful or unjust; a criminal risk algorithm that inflates the actual risk presented by black defendants relative to otherwise similar white defendants leads to judicial decisions that are wrong, both factually and morally, and profoundly harmful to those who are misclassified as high-risk. If anything, implicit data bias is more dangerous and harmful than explicit bias, since it can be more challenging to expose and purge from the dataset or data practice. (A brief illustrative sketch of such a proxy effect, and of how group-wise error rates can be audited, appears at the end of this discussion of data harms.)

In other data practices the harms are driven not by bias, but by poor quality, mislabeled, or error-riddled data (i.e., ‘garbage in, garbage out’); inadequate design and testing of data analytics; or a lack of careful training and auditing to ensure the correct implementation and use of the data system. For example, such flawed data practices by a state Medicaid agency in Idaho led it to make large, arbitrary, and very possibly unconstitutional cuts in disability benefit payments to over 4,000 of its most vulnerable citizens.7 In Michigan, flawed data practices led another agency to levy false fraud accusations and heavy fines against at least 44,000 of its innocent, unemployed citizens for two years. It was later learned that its data-driven decision-support system had been operating at a shockingly high false-positive error rate of 93 percent.8 While not all such cases will involve datasets on the scale typically associated with ‘big data’, they all involve ethically negligent failures to adequately design, implement and audit data practices to promote fair and just results. Such failures of ethical data practice, whether in the use of small datasets or the power of ‘big data’ analytics, can and do result in economic devastation, psychological, reputational, and health damage, and for some victims, even the loss of their physical freedom.

3. HARMS TO TRANSPARENCY AND AUTONOMY: In this context, transparency is the ability to see how a given social system or institution works, and to be able to inquire about the basis of life-affecting decisions made within that system or institution. So, for example, if your bank denies your application for a home loan, transparency will be served by you having access to information about exactly why you were denied the loan, and by whom.
Autonomy is a distinct but related concept; autonomy refers to one’s ability to govern or steer the course of one’s own life. If you lack autonomy altogether, then you have no ability to control the outcome of your life and are reliant on sheer luck. The more autonomy you have, the more your chances for a good life depend on your own choices. The two concepts are related in this way: to be effective at steering the course of my own life (to be autonomous), I must have a certain amount of accurate information about the other forces acting upon me in my social environment (that is, I need some transparency in the workings of my society). Consider the example given above: if I know why I was denied the loan (for example, a high debt-to-asset ratio), I can figure out what I need to change to be successful in a new application, or in an application to another bank. The fate of my aspiration to home ownership remains at least somewhat in my control. But if I have no information to go on, then I am blind to the social forces blocking my aspiration, and have no clear way to navigate around them.

Data practices have the potential to create or diminish social transparency, but diminished transparency is currently the greater risk because of two factors. The first risk factor has to do with the sheer volume and complexity of today’s data, and of the algorithmic techniques driving big data practices. For example, machine learning algorithms trained on large datasets can be used to make new assessments based on fresh data; that is why they are so useful. The problem is that especially with ‘deep learning’ algorithms, it can be difficult or impossible to reconstruct the machine’s ‘reasoning’ behind any particular judgment.9 This means that if my loan was denied on the basis of this algorithm, the loan officer and even the system’s programmers might be unable to tell me why—even if they wanted to. And it is unclear how I would appeal such an opaque machine judgment, since I lack the information needed to challenge its basis. In this way my autonomy is restricted. Because of the lack of transparency, my choices in responding to a life-affecting social judgment about me have been severely limited.

The second risk factor is that often, data practices are cloaked behind trade secrets and proprietary technology, including proprietary software. While laws protecting intellectual property are necessary, they can also impede social transparency when the protected property (the technique or invention) is a key part of the mechanisms of social functioning. These competing interests in intellectual property rights and social transparency need to be appropriately balanced. In some cases the courts will decide, as they did in the aforementioned Idaho case. In that case, K.W. v. Armstrong, a federal court ruled that citizens’ due process was violated when, upon requesting the reason for the cuts to their disability benefits, the citizens were told that trade secrets prevented releasing that information.10 Among the remedies ordered by the court was a testing regime to ensure the reliability and accuracy of the automated decision-support systems used by the state. However, not every obstacle to data transparency can or should be litigated in the courts. Securing an ethically appropriate measure of social transparency in data practices will require considerable public discussion and negotiation, as well as good faith efforts by data practitioners to respect the ethically significant interest in transparency.
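The proxy problem and the error-rate disparities described in the fairness discussion above can be made concrete with a small, entirely hypothetical sketch (the zip codes, groups, outcomes, and rates below are invented for illustration; they are not data from the cases cited). Here a ‘risk model’ that is never given anyone’s race, but that leans on a zip-code proxy, is audited by comparing false-positive rates across groups:

from collections import defaultdict

# Hypothetical records: (zip_code, group, actually_reoffended, flagged_high_risk).
# The 'model' here is just a stand-in rule that flags everyone from zip 10001 as
# high risk, imitating a trained system that absorbed a zip-code proxy from
# biased historical data.
records = [
    ("10001", "black", False, True),
    ("10001", "black", False, True),
    ("10001", "black", True,  True),
    ("10002", "white", False, False),
    ("10002", "white", False, False),
    ("10002", "white", True,  False),
]

false_positives = defaultdict(int)
non_reoffenders = defaultdict(int)
for zip_code, group, reoffended, flagged in records:
    if not reoffended:                  # people who did not in fact reoffend...
        non_reoffenders[group] += 1
        if flagged:                     # ...but were labeled high risk anyway
            false_positives[group] += 1

for group in non_reoffenders:
    rate = false_positives[group] / non_reoffenders[group]
    print(f"{group}: false-positive rate = {rate:.0%}")

# Race never appears as an input to the 'model', yet the zip-code proxy
# reproduces a stark disparity (here 100% vs. 0%): exactly the kind of implicit
# bias that an audit of group-wise error rates is meant to surface.

In real practice such an audit would be run on a system’s actual predictions and validated outcomes, and any disparity found would then be traced back to the proxies, data quality problems, or training data responsible for it.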
You now have an overview of many common and significant ethical issues raised by data practices. But the scope of these issues is by no means limited to those in Part One. Data practitioners need to be attentive to the many ways in which data practices can significantly impact the quality of people’s lives, and must learn to better anticipate their potential harms and benefits so that they can be effectively addressed.

Case Study 1

Fred and Tamara, a married couple in their 30’s, are applying for a business loan to help them realize their long-held dream of owning and operating their own restaurant. Fred is a highly promising graduate of a prestigious culinary school, and Tamara is an accomplished accountant. They share a strong entrepreneurial desire to be ‘their own bosses’ and to bring something new and wonderful to their local culinary scene; outside consultants have reviewed their business plan and assured them that they have a very promising and creative restaurant concept and the skills needed to implement it successfully. The consultants tell them they should have no problem getting a loan to get the business off the ground.

For evaluating loan applications, Fred and Tamara’s local bank loan officer relies on an off-the-shelf software package that synthesizes a wide range of data profiles purchased from hundreds of private data brokers. As a result, it has access to information about Fred and Tamara’s lives that goes well beyond what they were asked to disclose on their loan application. Some of this information is clearly relevant to the application, such as their on-time bill payment history. But a lot of the data used by the system’s algorithms is of the sort that no human loan officer would normally think to look at, or have access to—including inferences from their drugstore purchases about their likely medical histories, information from online genetic registries about health risk factors in their extended families, data about the books they read and the movies they watch, and inferences about their racial background. Much of the information is accurate, but some of it is not.

A few days after they apply, Fred and Tamara get a call from the loan officer saying their loan was not approved. When they ask why, they are told simply that the loan system rated them as ‘moderate-to-high risk.’ When they ask for more information, the loan officer says he doesn’t have any, and that the software company that built their loan system will not reveal any specifics about the proprietary algorithm or the data sources it draws from, or whether that data was even validated. In fact, they are told, not even the system’s designers know what data led it to reach any particular result; all they can say is that statistically speaking, the system is ‘generally’ reliable. Fred and Tamara ask if they can appeal the decision, but they are told that there is no means of appeal, since the system will simply process their application again using the same algorithm and data, and will reach the same result.

With this case study, think about:

1.1 What ethically significant harms, as defined in Part One, might Fred and Tamara have suffered as a result of their loan denial?

1.2 What sort of ethically significant benefits, as defined in Part One, could come from banks using a big-data driven system to evaluate loan applications?

1.3 Beyond the impacts on Fred and Tamara’s lives, what broader harms to society could result from the widespread use of this particular loan evaluation process?
1.4 Could the harms you listed in 1.1 and 1.3 have been anticipated by the loan officer, the bank’s managers, and/or the software system’s designers and marketers? Should they have been anticipated, and why or why not?

2 Plato, Crito 48b.
3 https://techethics.ieee.org
4 Even acts performed without any direct intent, such as driving through a busy crosswalk while drunk, or unwittingly exposing sensitive user data to hackers, can involve ethical choice (e.g., the reckless choice to drink and get behind the wheel, or the negligent choice to use subpar data security tools).
5 See Robeyns (2016) https://plato.stanford.edu/entries/capability-approach/ for a helpful overview of the highly influential capabilities approach to identifying these fundamental interests in human life.
6 See the ProPublica series on ‘Machine Bias’ published by Angwin et al. (2016). https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
7 See Stanley (2017) https://www.aclu.org/blog/privacy-technology/pitfalls-artificial-intelligence-decisionmaking-highlighted-idaho-aclu-case
8 See Egan (2017) http://www.freep.com/story/news/local/michigan/2017/07/30/fraud-charges-unemployment-jobless-claimants/516332001/ and Levin (2016) https://levin.house.gov/press-release/state%E2%80%99s-automated-fraud-system-wrong-93-reviewed-unemployment-cases-2013-2105. For discussion of the broader issues presented by these cases of bias in institutional data practice see Cassel (2017) https://thenewstack.io/when-ai-is-biased/
9 See Knight (2017) https://www.technologyreview.com/s/604087/the-dark-secret-at-the-heart-of-ai/ for a discussion of this problem and its social and ethical implications.

What are ethical best practices for data practitioners?

The phrase ‘best practices’ refers to known techniques for doing something that tend to work well, better than the alternative ways of doing something. It’s not a phrase unique to ethics; in fact, it’s used in a range of corporate and government settings. But it’s often used in contexts where it is very important that the thing be done well, and where there are significant costs or risks to doing it in a less than optimal way. For data practitioners, we describe two types of best practices. The first set focuses on best practices for functioning ethically in data practice; they are adapted specifically to the ethical challenges that we studied in Part Two of this module. The second set identifies best practices for living and acting ethically in general; these practices can be adopted by anyone, regardless of their career or professional interests. Data practitioners can benefit from drawing upon both sets of practices in creative ways to manage ethical challenges wisely and well.

1. BEST PRACTICES FOR DATA ETHICS

As noted in the Introduction, no single, detailed code of data ethics can be fitted to all data contexts and practitioners; organizations and data-related professions should therefore be encouraged to develop explicit internal policies, procedures, guidelines and best practices for data ethics that are specifically adapted to their own activities (e.g., data science, machine learning, data security and storage, data privacy protection, medical and scientific research, etc.). However, those specific codes of practice can be well shaped by reflecting on these 14 general norms and guidelines for ethical data practice:
I. Keep Data Ethics in the Spotlight—and Out of the Compliance Box: As earlier modules and examples have shown, data ethics is a pervasive aspect of data practice. Because of the immense social power of data, ethical issues are virtually always actively in play when we handle data. Even when our work is highly technical and not directly client-facing, ethical issues are never simply absent from the context of our work. However, the ‘compliance mindset’ found in many organizations, especially concerning legal matters, can, when applied to data ethics, encourage a dangerous tendency to ‘sideline’ ethics as an external constraint rather than see it as an integral part of our daily work. If we fall victim to that mindset, we are more likely to view our ethical obligations as a box to ‘check off’ and then happily forget about, once we feel we have done the minimum needed to ‘comply’ with our ethical obligations. Unfortunately, this often leads to disastrous consequences, for individuals and organizations alike. Because data practice involves ethical considerations that are ubiquitous and central, not intermittent and marginal, our individual and organizational efforts need to strive to keep ethics in the spotlight. II. Consider the Human Lives and Interests Behind the Data: Especially in technical contexts, it’s easy to lose sight of what most of the data we work with are: namely, reflections of human lives and interests. Even when the data we handle are generated by non-human entities (for example, recordings of ocean temperatures), these data are being collected for important human purposes and interests. And much of the data under the ‘big data’ umbrella concern the most sensitive aspects of human lives: the condition of people’s bodies, their finances, their social likes and dislikes, or their emotional and mental states. A decent human would never handle another person’s body, money, or mental condition without due care; but it can be easy to forget that this is often what we are doing when we handle data. III. Focus on Downstream Risks and Uses of Data: As noted above, often we focus too narrowly on whether we have complied with ethical guidelines and we forget that ethical issues concerning data don’t just ‘go away’ once we have performed a particular task diligently. Thus it is essential to think about what happens to or with the data later on, even after it leaves our hands. Even if, for example, we obtained explicit and informed consent to collect certain data from a subject, we cannot ignore how that data might impact the subject, or others, down the road. If the data poses clear risks of harm if inappropriately used or disclosed, then I should be asking myself where that data might be five or ten years from now, in whose hands, for what purposes, and with what safeguards. I should also consider how long that data will remain accurate and relevant, or how its sensitivity and vulnerability to abuse might increase in time. If I can’t answer any of those questions—or have not even asked them—then I have not fully appreciated the ethical stakes of my current data practice. IV. Don’t Miss the Forest for the Trees: Envision the Data Ecosystem: This is related to the former item, but broader in scope. Not only is it important to keep in view where the data I handle today is going tomorrow, and for what purpose, I also need to keep in mind the full context in which it exists now.
For example, if I am a university genetics researcher handling a large dataset of medical records, I might be inclined to focus narrowly on how I will collect and use the genetic data responsibly. But I also have to think about who else might have an interest in obtaining such data, and for different purposes than mine (for example, employers and insurance companies). I may have to think about the cultural and media context in which I’m collecting the data, which might embody expectations, values, and priorities concerning the collection and use of personal genetic data that conflict with those of my academic research community. I may need to think about where the server or cloud storage company I’m currently using to store the data is located, and what laws and standards for data security exist there. The point here is that my data practices are never isolated from a broader data ecosystem that includes powerful social forces and instabilities not under my control; it is essential that I consider my ethical practices and obligations in light of that bigger social picture. V. Mind the Gap Between Expectations and Reality: When collecting or handling personal or otherwise sensitive data, it’s essential that I keep in mind how the expectations of data subjects or other stakeholders may vary from reality. For example, do my data subjects know as much about the risks of data disclosure (from hacking, phishing, etc.) as I do? Might my data disclosure and use policy lead to inflated expectations about how safe users’ data are from such threats? Do I intend to use this data for additional purposes beyond what the consenting subjects would know about or reasonably anticipate? Can I keep all the promises I have made to my data subjects, or do I know that there is a good chance that their expectations will not be met? For example, might I one day sell my product and/or its associated data to a third-party who may not honor those promises? Often we make the mistake of regarding parties we contract with as information equals, when we may in fact operate from a position of epistemic advantage— we know a lot more than they do. Agreements with data subjects who are ‘in the dark’ or subject to illusions about the nature of the data agreement are not, in general, ethically legitimate. VI. Treat Data as a Conditional Good: Some of the most dangerous data practices involve treating data as unconditionally good. One such practice is to follow the policy of ‘collect and store it all now, and figure out what we actually need later.’ Data (at least good data) is incredibly useful, but its power also makes it capable of doing damage. Think about personal data like guns: only some of us should be licensed to handle guns, and even those of us who are licensed should keep only as many guns as we can reasonably think we actually need, since they are so often stolen or misused in harmful ways. The same is often true for sensitive data. We should collect only as much of it as we need, when we need it, store it carefully for only as long as we need it, and purge it when we no longer need it. The second dangerous practice that treats data as an unconditional good is the flawed policy that more data is always better, regardless of data quality or the reliability of the source. 
The motto ‘garbage in, garbage out’ is of critical importance to remember, and just because our algorithms and systems are incredibly thirsty for data doesn’t mean that we should open the firehose and send them all the data we can get our hands on—especially if that data is dirty, incomplete, or unreliably sourced. Data are a conditional good—only as beneficial and useful as we take the care to make them. VII. Avoid Dangerous Hype and Myths around ‘Big Data’: Data is powerful, but it isn’t magic, and it isn’t a silver bullet for complex social problems. There are, however, significant industry and media incentives to portray ‘big data’ as exactly that. This can lead to many harms, including unrealized hopes and expectations that can easily lead to consumer, client, and media backlash. The saying ‘to a man with a hammer, everything looks like a nail’ is also instructive here. Not all problems have a big data solution, and we may overlook more economical and practical solutions if we believe otherwise. We should also remember the joke about the drunk man who, when asked why he’s looking for his lost car keys under the street lamp, says ‘because that’s where the light is.’ For some problems we have abundant sources of high-quality, relevant data and powerful analytics that can use them to produce new insights and solutions. For others, we don’t. But we shouldn’t ignore problems that might require other kinds of solutions, or employ inappropriate solutions, just because we are in the thrall of ‘big data’ hype. VIII. Establish Chains of Ethical Responsibility and Accountability: In organizational settings, the ‘problem of many hands’ is a constant challenge to responsible practice and accountability. To avoid a diffusion of responsibility in which no one on a team may feel empowered or obligated to take the steps necessary to ensure ethical data practice, it is important that clear chains of responsibility are established and made explicit to everyone involved in the work, at the earliest possible stages of a project. It should be clear who is responsible for each aspect of ethical risk management and prevention of harm, in each of the relevant areas of risk-laden activity (data collection, use, security, analysis, disclosure, etc.). It should also be clear who is ultimately accountable for ensuring an ethically executed project or practice. Who will be expected to provide answers, explanations, and remedies if there is a failure of ethics or significant harm caused by the team’s work? The essential function of chains of responsibility and accountability is to assure that members of a data-driven project or organization take explicit ownership of the work’s ethical significance. IX. Practice Data Disaster Planning and Crisis Response: Most people don’t want to anticipate failure, disaster, or crisis; they want to focus on the positive potential of a project. While this is understandable, the dangers of this attitude are well known, and have often caused failure, disaster, or crisis that could easily have been avoided. This attitude also often prevents effective crisis response since there is no planning for a worst-case scenario. This is why engineering fields whose designs can impact public safety have long had a culture of encouraging thinking about failure.
Understanding how a product will function in non-ideal conditions, at the boundaries of intended use, or even outside those boundaries, is essential to building in appropriate margins of safety and developing a plan for product failures or other unwelcome scenarios. Thinking about failure makes engineers’ work better, not worse. Data practitioners must begin to develop the same cultural habit in their work. Known failures should be carefully analyzed and discussed (‘post-mortems’) and results projected into the future. ‘Pre-mortems’ (imagining together how a current project could fail or produce a crisis, so that we can design to prevent that outcome) can be a great data practice. It’s also essential to develop crisis plans that go beyond deflecting blame or denying harm (often the first mistake of a PR team when the harm is evident). Crisis plans should be intelligent, responsive to public input, and most of all, able to effectively mitigate or remedy harm being done. This is much easier to plan before a crisis has actually happened. X. Promote Values of Transparency, Autonomy, and Trustworthiness: The most important thing for preserving a healthy relationship between data practitioners and the public is for data practitioners to understand the importance of transparency, autonomy, and trustworthiness to that relationship. Hiding a risk or a problem behind legal language, disempowering users or data subjects, and betraying public trust are almost never good strategies in the long run. Clear and understandable data collection, use, and privacy policies, when those policies give users and data subjects actionable information and encourage them to use it, help to promote these values. Favoring ‘opt-in’ rather than ‘opt-out’ options and offering other clear avenues of choice for data participants can enhance autonomy and transparency, and promote greater trust. Of course, we can’t always be completely transparent about everything we do with data: company interests, intellectual property rights, and privacy concerns of other parties often require that we balance transparency with other legitimate goods and interests. Likewise, sometimes the autonomy of users will be in tension with our obligations to prevent harmful misuse of data. But balancing transparency and autonomy with other important rights and ethical values is not the same as sacrificing these values or ignoring their critical role in sustaining public trust in data-driven practices and organizations. XI. Consider Disparate Interests, Resources, and Impacts: It is important to understand the profound risk in many data practices of producing or magnifying disparate impacts; that is, of making some people better off and others worse off, whether this is in terms of their social share of economic well-being, political power, health, justice, or other important goods. Not all disparate impacts are unjustifiable or wrong. For example, an app that flags businesses with a high number of consumer complaints and lawsuits will make those businesses worse off relative to others in the same area—but if the app and its data are sufficiently reliable, then there’s an argument that this disparate impact is a good thing. But imagine another app, created for the same purpose, that sources its data from consumer complaints in a way that reflects and magnifies existing biases in a given region against women business owners, business owners of color, and business owners from certain religious backgrounds.
The fact that more complaints per capita are registered against those businesses might be an artifact of those harmful biases in the region, which my app then just blindly replicates and reinforces. This is why there ought to be a presumption in data practice of ethical risk from disparate impacts; they must be anticipated, actively audited for, and carefully examined for their ethical acceptability. Likewise, we must investigate the extent to which different populations affected by our practice have different interests and resources, that give them a differential ability to benefit from our product or project. If a data-driven product produces immense health benefits but is inaccessible to people who are blind, deaf, or non-native English speakers, or to people who cannot afford the latest high-end mobile devices, then there are disparate impacts of this work that at a minimum must be reflected upon and evaluated. XII. Invite Diverse Stakeholder Input: One way to avoid ‘groupthink’ in ethical risk assessment and design is to invite input from diverse stakeholders outside of the team and organization. It is important that stakeholder input not simply reflect the same perspectives one already has within the organization. Often, data practitioners work in fields with unusually high levels of educational achievement and economic status, and in many technical fields, there may be skewed representation of the population in terms of gender, ethnicity, age, disability, and other characteristics. Also, the nature of the work may attract people who have common interests and values, for example, a shared optimism about the potential of science and technology to promote social good, and comparatively less faith in other social mechanisms. All of these factors can lead to organizational monocultures, which magnify the dangers of groupthink, blind spots, and insularity of interests. For example, many of the best practices above can’t be carried out successfully if members of a team struggle to imagine how a data practice would be perceived by, or how it might affect, people unlike themselves. Actively recognizing the limitations of a team perspective is essential. Fostering more diverse data organizations and teams is one obvious way to mitigate those limitations, but soliciting external input from a more truly representative body of those likely to be impacted by our data practice is another. XIII. Design for Privacy and Security: This might seem like an obvious one, but nevertheless its importance can’t be overemphasized. ‘Design’ here means not only technical design (of databases, algorithms, or apps), but also social and organizational design (of groups, policies, procedures, incentives, resource allocations, and techniques) that promote data privacy and data security objectives. How this is best done in each context will vary, but the essential thing is that along with other project goals, the values of data privacy and security remain at the forefront of project design, planning, execution, and oversight, and are never treated as marginal, external, or ‘after-the-fact’ concerns. XIV. Make Ethical Reflection & Practice Standard, Pervasive, Iterative, and Rewarding: Ethical reflection and practice, as we have already said, is an essential and central part of professional excellence in data-driven applications and fields. Yet it is still in the process of being fully integrated into every data environment. 
The work of making ethical reflection and practice standard and pervasive, that is, accepted as a necessary, constant, and central component of every data practice, must continue to be carried out through active measures taken by individual data practitioners and organizations alike. Ethical reflection and practice in data environments must also, to be effective, be instituted in iterative ways. That is, because data practice is increasingly complex in its interactions with society, we must treat data ethics as an active and unending learning cycle in which we continually observe the outcomes of our data practice, learn from our mistakes, gather more information, acquire further ethical expertise, and then update and improve our ethical practice accordingly. Most of all, ethical practice in data environments must be made rewarding: team, project, and institutional/company incentives must be well aligned with the ethical best practices described above, so that those practices are reinforced and so that data practitioners are empowered and given the necessary resources to carry them out.