8 AI Use Cases and Impacts Articles (WSJ & HBR) PDF
Document Details
Uploaded by Deleted User
null
2024
null
null
Tags
Summary
This document contains multiple articles about the use of AI in various fields, like healthcare diagnosis. The articles discuss the use of AI for cough diagnosis in particular, exploring its potential to revolutionise how doctors diagnose respiratory diseases. The analysis also highlights the practical application of red teaming and its importance in assessing the risks of generative AI.
Full Transcript
This copy is for your personal, non-commercial use only. To order presentation-ready copies for distribution to your colleagues, clients or customers visit https://www.djreprints.com. https://www.wsj.com/articles/diagnose-respiratory-illness-smartphone-11631041761...
This copy is for your personal, non-commercial use only. To order presentation-ready copies for distribution to your colleagues, clients or customers visit https://www.djreprints.com. https://www.wsj.com/articles/diagnose-respiratory-illness-smartphone-11631041761 | HEALTH Coughs Say a Lot About Your Health, if Your Smartphone Is Listening With apps and artificial intelligence, researchers want to use the cough to better diagnose asthma, Covid-19 and other respiratory illnesses ILLUSTRATION: BRIAN STAUFFER AUTHOR PUBLISHED BETSY MCKAY Follow SEPT. 8, 2021 10:05 AM ET READING TIME 7 MINUTE READ The Future of Everything covers the innovation and technology transforming the way we live, work and play, with monthly issues on education, money, cities and more. This month is Health, online starting Sept. 3 and in the paper on Sept. 10. Few things are as annoying—or in the pandemic era as frightening—as the sound of a cough. But that same sound could help save lives. Researchers around the world are trying to turn the humble cough into an inexpensive tool to diagnose and stop respiratory-disease killers like tuberculosis and Covid-19. They’re collecting recordings of millions of the explosive sounds from patients and consumers on smartphones and other devices. And they’re training artificial intelligence to find patterns to try to identify the type and severity of disease from the cough itself. “We call it acoustic epidemiology,” says Peter Small, a tuberculosis expert and chief medical officer of Hyfe Inc., a Delaware-based company with two free smartphone apps—one for consumers, another for researchers—that use AI to detect and track how often someone coughs. The sound and frequency of coughs are rich with medical information, he says. Different diseases have some audible differences: crackling in parts of the lung for pneumonia, a wheezing sound for asthma. Makers of these apps say there are sounds and patterns that AI can detect, but the human ear can’t hear. Coughs are one of the most common signs of potential illness, the body’s attempt to protect itself from irritation or unwanted matter in the airways. One of the top reasons people go to a doctor is for a cough. Yet doctors often can’t learn much about a patient’s cough during an office visit or on hospital rounds, Dr. Small says. “Patients will come in and say, ‘I’ve got a bad cough,’ but are they coughing 10 times a day or 400 times a day?” he says. “Pulmonologists will tell you they’re like cardiologists without a blood pressure cuff.” It’s hard for patients to recall how much they’re coughing, particularly at night, says Kaiser Lim, a pulmonologist at Mayo Clinic in Rochester, Minn. Monitors would help doctors quantify their coughs, leaving more time in a visit to address the problem and the psychological effects, he says. Covid-19, which has taken more than 4.5 million lives globally so far, has only added to longstanding challenges. Cough-related illnesses including lung cancer, tuberculosis, chronic obstructive pulmonary disease and pneumonia make up one-quarter of all deaths world-wide annually. Millions of people suffer from chronic coughs and respiratory conditions like asthma —including Dr. Small, who has had an unexplained chronic cough for a decade. Other common causes include allergies and acid reflux. Quiz: Guess the Cough Can you identify which of the following coughs is from someone with Covid-19? Bonus points if you can name the illnesses for the other coughs as well. See the answers at the end of this article. COUGH 1 00:00 / 00:03 1x COUGH 2 00:00 / 00:03 1x COUGH 3 00:00 / 00:03 1x COUGH 4 00:00 / 00:03 1x COUGH 5 00:00 / 00:03 1x Source: Dr. Paul Porter/Curtin University and ResApp Health Ltd. It will take some time before these tests are deployed widely; their makers are still building data sets and training AI to recognize the coughs. To have smartphones listening, they will also have to address concerns about privacy and data usage costs. “About one of every two people that dies of TB in South Africa never visited a health facility for TB,” says Grant Theron, a professor at Stellenbosch University in Cape Town, who is collecting coughs to develop a mobile phone-based triage test for tuberculosis. “Catching those people out in the community from an epidemiological perspective is so important.” With an audio-based test, “you can screen hundreds of thousands of people,” at a very low cost per patient, he says. People who test positive on the app would then be given a lab test to see if they have TB, he says. A TB cough has distinctive acoustic patterns that set it apart from other diseases, though they are difficult or sometimes impossible for the human ear to hear, according to Thomas Niesler, a professor of electrical and electronic engineering at Stellenbosch who is involved in the project. The devices don’t have to diagnose perfectly, says Adithya Cattamanchi, a professor of medicine and epidemiology at the University of California San Francisco who is using the Hyfe app to collect coughs in several countries for a large database of TB coughs. They could act more like a mammogram, alerting a doctor to a potential case and a need for more tests, he says. ResApp Health Ltd. is using an explosion in telehealth services, particularly during the pandemic, to expand use of an app-based test for cough sounds that helps doctors diagnose diseases including COPD, pneumonia, asthma and bronchitis, says Tony Keating, the Brisbane, Australia-based company’s chief executive. A telehealth provider asks a patient to hold a smartphone at arm’s length and record five coughs. The app analyzes the coughs, then sends the results to the doctor. Using technology developed by researchers at the University of Queensland, the company built the diagnostic tool by training an algorithm on recordings of 6,000 coughs, along with clinical data from those patients in the U.S. and Australia, Mr. Keating says. The company is also collecting coughs from Covid-19 patients whose illness is confirmed with a gold-standard PCR test and others in the U.S. and India to develop a screening test for the disease. The goal is both to diagnose the disease and to try to predict whether a patient will develop a severe form, Mr. Keating says. X-rays reveal unusual patterns in the lungs of Covid-19 patients, suggesting that the disease produces a unique cough sound, he says. “It’s not clear today if there are patterns in Covid-19 cough sounds like there are patterns in asthma or pneumonia cough sounds,” he says. Still, he says he is confident those patterns will be found. Research at the Massachusetts Institute of Technology, University of California San Diego and elsewhere suggests that the Covid-19 cough has detectable features. Cough tracking apps may alert patients and providers to an illness early, because people often start coughing before they feel sick, or tell them whether an illness is improving or worsening, according to Dr. Small of Hyfe. The apps can also passively monitor an office or nursing home, potentially detecting an outbreak or increase in respiratory illness if lots of people are coughing, he says. Hyfe’s two apps run continuously on a smartphone and record in half-second clips when the AI hears an explosive sound. The AI isn’t recording or interpreting any other sounds, Dr. Small says. The apps have recorded about 64 million sounds, about two million of which were identified by an algorithm as coughs. To teach the AI how to better recognize coughs, the company had humans—including an unemployed bartender in rural Spain—listen to one million of the sounds. Researchers are using one of the apps in more than a dozen studies around the world, including one to create the database of TB coughs and another to monitor the coughing patterns of a community in Navarra, Spain. The study in Spain, which is ongoing, has shown that people can start coughing more frequently without realizing it, potentially signifying that they have a disease. One 35-year-old woman’s hourly coughs tripled the night before she developed other Covid-19 symptoms and was diagnosed with the disease. Only two days after that did she perceive that she had a cough, the researchers found. Joe Brew, Hyfe’s chief executive, put his company’s app to work when his 3-year-old son, Galileo, was hospitalized with pneumonia in June. When doctors made their rounds, “they would invariably say, ‘how’s his cough?’” Mr. Brew recalls. He showed them his data from the app, which revealed that Galileo’s coughs had gone down over a few days from as many as 300 per hour to as many as 30 per hour, he says. “They loved it,” he says. Quiz answers: 1. A 5-year-old with asthma. 2. A 34-year-old with asthma. 3. A 55-year-old with chronic obstructive pulmonary disease. 4. A 45-year-old with Covid-19. 5. A 37-year-old with pneumonia. Appeared in the September 10, 2021, print edition as 'Get A Diagnosis Off Your Chest.' Copyright © 2022 Dow Jones & Company, Inc. All Rights Reserved This copy is for your personal, non-commercial use only. To order presentation-ready copies for distribution to your colleagues, clients or customers visit https://www.djreprints.com. AI And Machine Learning When Should — and Explain Why — How You Your AI Works by Reid Blackman and Beena Ammanath August 31, 2022 HBR Staff/barleyman/Getty Images Summary. AI adds value by identifying patterns so complex that they can defy human understanding. That can create a problem: AI can be a black box, which often renders us unable to answer crucial questions about its operations. That matters more in some cases than others. Companies need to understand what it means for AI to be “explainable” and when it’s important to be able to explain how an AI produced its outputs. In general, companies need explainability in AI when: 1) regulation requires it, 2) it’s important for understanding how to use the tool, 3) it could improve the system, and 4) it can help determine fairness. close “With the amount of data today, we know there is no way we as human beings can process it all…The only technique we know that can harvest insight from the data, is artificial intelligence,” IBM CEO Arvind Krishna recently told the Wall Street Journal. The insights to which Krishna is referring are patterns in the data that can help companies make predictions, whether that’s the likelihood of someone defaulting on a mortgage, the probability of developing diabetes within the next two years, or whether a job candidate is a good fit. More specifically, AI identifies mathematical patterns found in thousands of variables and the relations among those variables. These patterns can be so complex that they can defy human understanding. This can create a problem: While we understand the variables we put into the AI (mortgage applications, medical histories, resumes) and understand the outputs (approved for the loan, has diabetes, worthy of an interview), we might not understand what’s going on between the inputs and the outputs. The AI can be a “black box,” which often renders us unable to answer crucial questions about the operations of the “machine”: Is it making reliable predictions? Is it making those predictions on solid or justified grounds? Will we know how to fix it if it breaks? Or more generally: can we trust a tool whose operations we don’t understand, particularly when the stakes are high? To the minds of many, the need to answer these questions leads to the demand for explainable AI: in short, AI whose predictions we can explain. What Makes an Explanation Good? A good explanation should be intelligible to its intended audience, and it should be useful, in the sense that it helps that audience achieve their goals. When it comes to explainable AI, there are a variety of stakeholders that might need to understand how an AI made a decision: regulators, end-users, data scientists, executives charged with protecting the organization’s brand, and impacted consumers, to name a few. All of these groups have different skill sets, knowledge, and goals — an average citizen wouldn’t likely understand a report intended for data scientists. So, what counts as a good explanation depends on which stakeholders it’s aimed at. Different audiences often require different explanations. For instance, a consumer turned down by a bank for a mortgage would likely want to understand why they were denied so they can make changes in their lives in order to get a better decision next time. A doctor would want to understand why the prediction about the patient’s illness was generated so they can determine whether the AI notices a pattern they do not or if the AI might be mistaken. Executives would want explanations that put them in a position to understand the ethical and reputational risks associated with the AI so they can create appropriate risk mitigation strategies or decide to make changes to their go to market strategy. Tailoring an explanation to the audience and case at hand is easier said than done, however. It typically involves hard tradeoffs between accuracy and explainability. In general, reducing the complexity of the patterns an AI identifies makes it easier to understand how it produces the outputs it does. But, all else being equal, turning down the complexity can also mean turning down the accuracy — and thus the utility — of the AI. While data scientists have tools that offer insights into how different variables may be shaping outputs, these only offer a best guess as to what’s going on inside the model, and are generally too technical for consumers, citizens, regulators, and executives to use them in making decisions. Organizations should resolve this tension, or at least address it, in their approach to AI, including in their policies, design, and development of models they design in-hour or procure from third-party vendors. To do this, they should pay close attention to when explainability is a need to have versus a nice to have versus completely unnecessary. When We Need Explainability Attempting to explain how an AI creates its outputs takes time and resources; it isn’t free. This means it’s worthwhile to assess whether explainable outputs are needed in the first place for any particular use case. For instance, image recognition AI may be used to help clients tag photos of their dogs when they upload their photos to the cloud. In that case, accuracy may matter a great deal, but exactly how the model does it may not matter so much. Or take an AI that predicts when the shipment of screws will arrive at the toy factory; there may be no great need for explainability there. More generally, a good rule of thumb is that explainability is probably not a need-to-have when low risk predictions are made about entities that aren’t people. (There are exceptions, however, as when optimizing routes for the subway leads to giving greater access to that resource to some subpopulations than others). The corollary is that explainability may matter a great deal, especially when the outputs directly bear on how people are treated. There are at least four kinds of cases to consider in this regard. When regulatory compliance calls for it. Someone denied a loan or a mortgage deserves an explanation as to why they were denied. Not only do they deserve that explanation as a matter of respect — simply saying “no” to an applicant and then ignoring requests for an explanation is disrespectful — but it’s also required by regulations. Financial services companies, which already require explanations for their non-AI models, will plausibly have to extend that requirement to AI models, as current and pending regulations, particularly out of the European Union, indicate. When explainability is important so that end users can see how best to use the tool. We don’t need to know how the engine of a car works in order to drive it. But in some cases, knowing how a model works is imperative for its effective use. For instance, an AI that flags potential cases of fraud may be used by a fraud detection agent. If they do not know why the AI flagged the transaction, they won’t know where to begin their investigation, resulting in a highly inefficient process. On the other hand, if the AI not only flags transactions as warranting further investigation but also comes with an explanation as to why the transaction was flagged, then the agent can do their work more efficiently and effectively. When explainability could improve the system. In some cases, data scientists can improve the accuracy of their models against relevant benchmarks by making tweaks to how it’s trained or how it operates without having a deep understanding of how it works. This is the case with image recognition AI, for example. In other cases, knowing how the system works can help in debugging AI software and making other kinds of improvements. In those cases, devoting resources to explainability can be essential for the long-term business value of the model. When explainability can help assess fairness. Explainability comes, broadly, in two forms: global and local. Local explanations articulate why this particular input led to this particular output, for instance, why this particular person was denied a job interview. Global explanations articulate more generally how the model transforms inputs to outputs. Put differently, they articulate the rules of the model or the rules of the game. For example, people who have this kind of medical history with these kinds of blood test results get this kind of diagnosis. In a wide variety of cases, we need to ask whether the outputs are fair: should this person really have been denied an interview or did we unfairly assess the candidate? Even more importantly, when we’re asking someone to play by the rules of the hiring/mortgage lending/ad-receiving game, we need to assess whether the rules of the game are fair, reasonable, and generally ethically acceptable. Explanations, especially of the global variety, are thus important when we want or need to ethically assess the rules of the game; explanations enable us to see whether the rules are justified. Building an Explainability Framework Explainability matters in some cases and not in others, and when it does matter, it may matter for a variety of reasons. What’s more, operational sensitivity to such matters can be crucial for the efficient, effective, and ethical design and deployment of AI. Organizations should thus create a framework that addresses the risks of black boxes to their industry and their organizations in particular, enabling them to properly prioritize explainability in each of their AI projects. That framework would not only enable data scientists to build models that work well, but also empower executives to make wise decisions about what should be designed and when systems are sufficiently trustworthy to deploy. RB Reid Blackman is the author of Ethical Machines: Your Concise Guide to Totally Unbiased, Transparent, and Respectful AI (Harvard Business Review Press, July 2022) and founder and CEO of Virtue, an ethical risk consultancy. He is also a senior adviser to the Deloitte AI Institute, previously served on Ernst & Young’s AI Advisory Board, and volunteers as the chief ethics officer to the nonprofit Government Blockchain Association. Previously, Reid was a professor of philosophy at Colgate University and the University of North Carolina, Chapel Hill. BA Beena Ammanath is the Executive Director of the global Deloitte AI Institute, author of the book “Trustworthy AI,” founder of the non- profit Humans For AI, and also leads Trustworthy and Ethical Tech for Deloitte. She is an award-winning senior executive with extensive global experience in AI and digital transformation, spanning across e-commerce, finance, marketing, telecom, retail, software products, services and industrial domains with companies such as HPE, GE, Thomson Reuters, British Telecom, Bank of America, and e*trade. AI And Machine Learning ARight Framework Generative for Picking AI the Project by Marc Zao-Sanders and Marc Ramos March 29, 2023 Illustration by Mark Harris Summary. Generative AI has captured the public’s imagination. It is able to produce first drafts and generate ideas virtually instantaneously, but it can also struggle with accuracy and other ethical problems. How should companies navigate the risks in their pursuit of its rewards? In picking use cases, they need to balance risk (how likely and how damaging is the possibility of untruths and inaccuracies being generated and disseminated?) and demand (what is the real and sustainable need for this kind of output, beyond the current buzz?). The authors suggest using a 2×2 matrix to identify the use cases with the lowest risk and highest demand. close Over the past few months, there has been a huge amount of hype and speculation about the implications of large language models (LLMs) such as OpenAI’s ChatGPT, Google’s Bard, Anthropic’s Claude, Meta’s LLaMA, and, most recently, GPT4. ChatGPT, in particular, reached 100 million users in two months, making it the fastest growing consumer application of all time. It isn’t clear yet just what kind of impact LLMs will have, and opinions vary hugely. Many experts argue that LLMs will have little impact at all (early academic research suggests that the capability of LLMs is restricted to formal linguistic competence) or that even a near-infinite volume of text-based training data is still severely limiting. Others, such as Ethan Mollick, argue the opposite: “The businesses that understand the significance of this change — and act on it first — will be at a considerable advantage.” What we do know now is that generative AI has captured the imagination of the wider public and that it is able to produce first drafts and generate ideas virtually instantaneously. We also know that it can struggle with accuracy. Despite the open questions about this new technology, companies are searching for ways to apply it — now. Is there a way to cut through the polarizing arguments, hype and hyperbole and think clearly about where the technology will hit home first? We believe there is. Risk and Demand On risk, how likely and how damaging is the possibility of untruths and inaccuracies being generated and disseminated? On demand, what is the real and sustainable need for this kind of output, beyond the current buzz? It’s useful to consider these variables together. Thinking of them in a 2×2 matrix provides a more nuanced, one-size-doesn’t-fit all analysis of what may be coming. Indeed, risks and demands do differ from across different industries and business activities. We have placed some common cross-industry use cases in the table below. Think about where your business function or industry might sit. For your use case, how much is the risk reduced by introducing a step for human validation? How much might that slow down the process and reduce the demand? See more HBR charts in Data & Visuals The top-left box — where the consequence of errors is relatively low and market demand is high — will inevitably develop faster and further. For these use cases, there is a ready-made incentive for companies to find solutions, and there are fewer hurdles for their success. We should expect to see a combination of raw, immediate utilization of the technology as well as third-party tools which leverage generative AI and its APIs for their particular domain. This is happening already in marketing, where several start-ups have found innovative ways to apply LLMs to generate content marketing copy and ideas, and achieved unicorn status. Marketing requires a lot of idea generation and iteration, messaging tailored to specific audiences, and the production of text-rich messages that can engage and influence audiences. In other words, there are clear uses and demonstrated demand. Importantly, there’s also a wealth of examples that can be used to guide an AI to match style and content. On the other hand, most marketing copy isn’t fact-heavy, and the facts that are important can be corrected in editing. Looking at the matrix, you can find that there are other opportunities that have received less attention. For instance, learning. Like marketing, creating content for learning — for our purposes, let’s use the example of internal corporate learning tools — requires a clear understanding of its audience’s interests, and engaging and effective text. There’s also likely content that can be used to guide a generative AI tool. Priming it with existing documentation, you can ask it to rewrite, synthesize, and update the materials you have to better speak to different audiences, or to make learning material more adaptable to different contexts. Generative AI’s capabilities could also allow learning materials to be delivered differently — woven into the flow of everyday work or replacing clunky FAQs, bulging knowledge centers and ticketing systems. (Microsoft, a 49% shareholder in OpenAI, is already working on this, with a series of announcements planned for this year.) The other uses in the high demand/low risk box above follow similar logic: They’re for tasks where people are often involved, and the risk of AI playing fast and loose with facts are low. Take the examples of asking the AI to review text: You can feed it a draft, give it some instructions (you want a more detailed version, a softer tone, a five-point summary, or suggestions of how to make the text more concise) and review its suggestions. As a second pair of eyes, the technology is ready to use right now. If you want ideas to feed a brainstorm — steps to take when hiring a modern, multi-media designer, or what to buy a four-year-old who likes trains for her birthday — generative AI will be a quick, reliable and safe bet, as those ideas are likely not in the final product. Filling in the 2×2 matrix above with tasks that are part of your company’s or team’s work can help draw similar parallels. By assessing risk and demand, and considering the shared elements of particular tasks, it can give you a useful starting point and help draw connections and see opportunities. It can also help you see where it doesn’t make sense to invest time and resources. The other three quadrants aren’t places where you should rush to find uses for generative AI tools. When demand is low, there’s little motivation for people to utilize or develop the technology. Producing haikus in the style of a Shakespearian pirate may make us laugh and drop our jaws today, but such party tricks will not keep our attention for very much longer. And in cases where there is demand but high risk, general trepidation and regulation will slow the pace of progress. Considering your own 2×2 matrix, you can put the uses listed there aside for the time being. Low Risk is Still Risk A mild cautionary note: Even in corporate learning where, as we have argued, the risk is low, there is risk. Generative AI is still vulnerable to bias and errors, just as humans are. If you assume the outputs of a generative AI system are good to go and immediately distribute them to your entire workforce, there is plenty of risk. Your ability to strike the right balance between speed and quality will be tested. So take the initial output as a first iteration. Improve on it with a more detailed prompt or two. And then tweak that output yourself, adding the real-world knowledge, nuance, even artistry and humor that, for a little while longer, only a human has. Marc Zao-Sanders is CEO and co-founder of filtered.com, which develops algorithmic technology to make sense of corporate skills and learning content. Find Marc on LinkedIn here. Marc Ramos is the Chief Learning Officer of Cornerstone, a leader in learning and talent management technologies. Marc’s career as a learning leader extends 25 years’ experience with Google, Microsoft, Accenture, Oracle and more insights can be found here. Recommended For You What Roles Could Generative AI Play on Your Team? PODCAST How Generative AI Changes Strategy How to Train Generative AI Using Your Company's Data Managing the Risks of Generative AI This copy is for your personal, non-commercial use only. Distribution and use of this material are governed by our Subscriber Agreement and by copyright law. For non-personal use or to order multiple copies, please contact Dow Jones Reprints at 1-800-843-0008 or visit www.djreprints.com. https://www.wsj.com/tech/ai/the-new-jobs-for-humans-in-the-ai-era-db7d8acd The New Jobs for Humans in the AI Era Artificial intelligence threatens some careers, but these opportunities are on the rise By Robert McMillan Follow , Bob Henderson Follow and Steven Rosenbush Follow Oct. 5, 2023 11:00 am ET When OpenAI unleashed its humanlike ChatGPT software on the world last year, one thing was clear: These AI systems are coming for our jobs. But don’t write off the humans just yet. More than a century ago, the advent of the automobile was bad news for stable hands, but good for mechanics. And AI is already creating new opportunities. Here are a few of them. In-House Large Language Model Developer Large language models such as OpenAI’s GPT and Google’s LaMDA are trained on massive amounts of data scraped from the internet to recognize, generate and predict language in sequences. For the finance industry, that makes them a bit like new college graduates: Not much use without more specialized instruction. In-house developers will change that by introducing the models to new word patterns that will equip them to better carry out functions such as summarizing a company’s 10-K annual report filing or guiding a client through a loan-application process. “What we currently are heading toward is some small number of companies developing these humongous models and then customers—financial institutions—taking those models and then training them better for their own purposes in-house,” says Eric Ghysels, a professor of economics and finance at the University of North Carolina, Chapel Hill. In-house developers will design the curriculum for this training by choosing new and often proprietary data to run through the models. They will habituate models to legalese and to the financial meaning of words such as “interest” and “derivative” by querying them and responding with constructive feedback on their answers. Finally, they will deliver the models in a user-friendly form to employees and clients. The challenge for financial institutions, says Ghysels, will be finding people qualified to do all this. PHOTO: BRIAN STAUFFER Reskiller As AI becomes capable of taking on more work that is now done by humans, people will need to more aggressively upgrade their skills to stay productive and employable. “Reskillers,” a new type of teacher, will help people stay one step ahead of the machines. As AI evolves, companies will put growing value on specialists who can guide such critical human development. “Teachers had it bad under the industrial revolution. Look at what they are paid,” says Stephen Messer, co-founder and chairperson of Collective[i], which has developed a foundation model that produces insights around revenue forecasting and growth. “Now, I think teachers are about to go through a revolution because of AI.” Reskillers will need to understand the talents that organizations require as technology marches ahead. “This puts an onus on employees and companies to stay relevant,” says Keith Peiris, co- founder and chief executive of Tome, a startup with a generative AI-native storytelling and presentation platform. “In the ‘old world,’ pre generative AI, maybe you needed 100 people to build a company…With AI, maybe you could build that company with 30 people.” New career-development arcs are already taking shape because of generative AI, according to Peiris, who is trained in nanotechnology engineering and is a veteran of Facebook and Instagram. “Sales professionals are learning web development and copywriting. Marketers are becoming more steeped in graphic design,” he says. “HR professionals are taking on “legal work” and becoming paralegals by using AI legal tools.” PHOTO: BRIAN STAUFFER AI Psychotherapist Financial firms relying on AI for prediction and decision-making will need people to divine the drivers of a model’s thinking. Unlike conventional software, the logic behind the output of applications such as OpenAI’s ChatGPT is typically opaque. That may be fine when they’re used to generate things like recipes and poems, says Dinesh Nirmal, senior vice president, products, IBM Software, but not if they’re relied upon for things such as assigning credit scores, optimizing investment portfolios and predicting liquidity balances. Business or enterprise AI, which serves firms and organizations, is all about “explainability,” says Nirmal. Customers will want to know why their loan application was rejected. Bank regulators will require some decisions to be explained. AI psychotherapists will evaluate a model’s upbringing, by scrutinizing its training data for errors and sources of bias. They may put AI models on the couch, by probing them with test questions. Companies such as IBM, Google and Microsoft are racing to release new tools that quantify and chart an AI’s thought processes, but like Rorschach tests they require people to interpret their outputs. Understanding an AI’s reasoning will only be half the job, says Alexey Surkov, partner and global head of model risk management at Deloitte & Touche. The other half will be signing off on a model’s mental fitness for the task at hand. “No matter how sophisticated the models and systems get,” says Surkov, “we as humans are ultimately responsible for the outcomes of the use of those systems.” “Psychotherapist” might be a stretch for a job title in some financial firms. Surkov suggests AI Risk Manager or Controller as alternatives. PHOTO: BRIAN STAUFFER Prompt Engineer How do you program an AI system like ChatGPT that can converse with you, much like a human? You talk to it. Or, more precisely, you hire a prompt engineer to do this. Prompt engineering is an emerging class of job that is nestled somewhere between programming and management. Instead of using complicated computer programming languages like Python or Java, prompt engineers will spell out their instructions to AI systems in plain English, creating new ways of harnessing the power of the underlying AI systems. This is what legal software maker Casetext’s new class of engineers do with its AI-based legal assistant called CoCounsel. Jake Heller, the company’s chief executive, says he’s hiring prompt engineers to build out CoCounsel’s abilities by instructing the AI chatbot, in maybe a thousand words or so, how to do various legal tasks. The language the prompt engineers use is precise and to the point, explaining how to review documents, summarize research or review and edit a contract. For example, a prompt engineer may start to outline instructions for a CoCounsel memo by indicating the level of expertise needed: “Your goal is that the memo will display the level of perception, nuance, and attention to detail one would expect from a federal appellate judge drafting a legal opinion.” The best prompt engineers are people who can give very clear instructions, but who also understand the principles of coding, Heller says. In other words, they’re often great technical managers. Except with prompt engineers, it’s not an employee that they’re managing, he says. “It’s an AI.” Write to Robert McMillan at [email protected], Bob Henderson at [email protected] and Steven Rosenbush at [email protected] More in The Future of Everything AI And Machine Learning How Model to Red Team a Gen AI by Andrew Burt January 04, 2024 Jakub and Jedrzej Krzyszkowski/Stocksy Summary. Red teaming, a structured testing effort to find flaws and vulnerabilities in an AI system, is an important means of discovering and managing the risks posed by generative AI. The core concept is trusted actors simulate how adversaries would attack any given system. The term was popularized during the Cold War when the U.S. Defense Department tasked “red teams” with acting as the Soviet adversary, while blue teams were tasked with acting as the United States or its allies. In this article, the author shares what his specialty law firm has discovered what works and what doesn’t in red teaming generative AI. close In recent months governments around the world have begun to converge around one solution to managing the risks of generative AI: red teaming. At the end of October, the Biden administration released its sweeping executive order on AI. Among its most important requirements is that certain high-risk generative AI models undergo “red teaming,” which it loosely defines as “a structured testing effort to find flaws and vulnerabilities in an AI system.” This came a few months after the administration hosted a formal AI red-teaming event that drew thousands of hackers. The focus on red teaming is a positive development. Red teaming is one of the most effective ways to discover and manage generative AI’s risks. There are, however, a number of major barriers to implementing red teaming in practice, including clarifying what actually constitutes a red team, standardizing what that team does while testing the model, and specifying how the findings are codified and disseminated once testing ends. Each model has a different attack surface, vulnerabilities, and deployment environments, meaning that no two red teaming efforts will be exactly alike. For that reason, consistent and transparent red teaming has become a central challenge in deploying generative AI, both for the vendors developing foundational models and the companies fine-tuning and putting those models to use. This article aims to address these barriers and to sum up my experience in red teaming a number of different generative AI systems. My law firm, Luminos.Law, which is jointly made up of lawyers and data scientists, is focused exclusively on managing AI risks. After being retained to red team some of the highest profile and widely adopted generative AI models, we’ve discovered what works and what doesn’t when red teaming generative AI. Here’s what we’ve learned. What is red teaming generative AI? Despite the growing enthusiasm over the activity, there is no clear consensus on what red teaming generative AI means in practice. This is despite the fact that some of the largest technology companies have begun to publicly embrace the method as a core component of creating trustworthy generative AI. The term itself was popularized during the Cold War and began to be formally integrated into war-planning efforts by the U.S. Defense Department. In simulation exercises, so-called red teams were tasked with acting as the Soviet adversary (hence the term “red”), while blue teams were tasked with acting as the United States or its allies. As information security efforts matured over the years, the cybersecurity community adopted the same language, applying the concept of red teaming to security testing for traditional software systems. Red teaming generative AI is much different from red teaming other software systems, including other kinds of AI. Unlike other AI systems, which are typically used to render a decision — such as whom to hire or what credit rating someone should have — generative AI systems produce content for their users. Any given user’s interaction with a generative AI system can create a huge volume of text, images, or audio. The harms generative AI systems create are, in many cases, different from other forms of AI in both scope and scale. Red teaming generative AI is specifically designed to generate harmful content that has no clear analogue in traditional software systems — from generating demeaning stereotypes and graphic images to flat out lying. Indeed, the harms red teams try to generate are more commonly associated with humans than with software. In practice, this means that the ways that red teams interact with generative AI systems itself are unique: They must focus on generating malicious prompts, or inputs into the model, in addition to tests using more traditional code in order to test the system’s ability to produce harmful or inappropriate behavior. There are all sorts of ways to generate these types of malicious prompts — from subtly changing the prompts to simply pressuring the model into generating problematic outputs. The list of ways to effectively attack generative AI is long and growing longer every day. Who should red team the AI? Just like the definition of red teaming itself, there is no clear consensus on how each red team should be constructed. For that reason, one of the first questions companies must address is whether the red team should be internal to the company or external. Companies, including Google, that have stood up their own AI red teams now advocate for internal red teams, in which employees with various types of expertise simulate attacks on the AI model. Others, like OpenAI, have embraced the concept of external red teaming, even going so far as to create an outside network to encourage external members to join. Determining how AI red teams should be constituted is one of the tasks the Biden administration has given to the heads of federal agencies, who are on the hook to answer the question next year in a forthcoming report. So what do we tell our clients? For starters, there is no one-size- fits-all approach to creating red teams for generative AI. Here are some general guidelines. Due to the sheer scale of the AI systems many companies are adopting, fully red teaming each one would be impossible. For that reason, the key to effective red teaming lies in triaging each system for risk. We tell our clients to assign different risk levels to different models — based, for example, on the likelihood of the harm occurring, the severity of the harm if it does occur, or the ability to rectify the harm once it is detected. (These are commonly accepted metrics of defining risk.) Different risk levels can then be used to guide the intensity of each red teaming effort: the size of the red team, for example, or the degree to which the system is tested, or even if it’s tested at all. Using this approach, lower-risk models should be subject to less- thorough testing. Other models might require internal testing but no review from outside experts, while the highest-risk systems typically require external red teams. External parties focused on AI red teaming generative AI are likely to have higher levels of red teaming expertise and therefore will be able to unearth more vulnerabilities. External reviews can demonstrate a reasonable standard of care and reduce liability as well by documenting that outside parties have signed off on the generative AI system. Degradation Objectives Understanding what harms red teams should target is extremely important. We select what we call “degradation objectives” to guide our efforts, and we start our red teaming by assessing which types of harmful model behavior will generate the greatest liability. Degradation objectives are so critical because unless they are clearly defined and mapped to the most significant liabilities each system poses, red teaming is almost always unsuccessful or at best incomplete. Indeed, without proper organization, red teaming is all too often conducted without a coordinated plan to generate specific harms, which leads to attacks on the system and no clear and actionable strategic takeaways. While this type of red teaming might create the appearance of comprehensive testing, disorganized probing of this kind can be counterproductive, creating the impression that the system has been fully tested when major gaps remain. Along with a clear assessment of risks and liabilities, it is also best practice to align degradation objectives with known incidents from similar generative AI systems. While there are a number of different ways to track and compare past incidents, the AI Incident Database is a great resource (and one that we rely heavily on). Here are a few common degradation objectives from our past red teaming efforts: Helping users engage in illicit activities Users can take advantage of generative AI systems to help conduct a range of harmful activities and, in many cases, generate significant liability for the companies deploying the AI system in the process. If sufficient safeguards against this type of model behavior are not in place, companies may end up sharing responsibility for the ultimate harm. In the past, we’ve tested for harms ranging from instructions for weapons and drug manufacturing to performance of fraudulent accounting to the model carrying out automated hacking campaigns. Bias in the model AI in general can generate or perpetuate all sorts of bias, as I’ve written about here before, which, in turn, can lead to many different types of liabilities under anti-discrimination law. The U.S. Federal Trade Commission has devoted a lot of attention to the issue of unfairness in AI over the past few years, as have lawmakers, signaling that more liability is coming in this area. Biases can arise in model output, such as unfairly representing different demographic groups in content generated by the AI, as well as in model performance itself, such as performing differently for members of different groups (native English speakers vs. non-native speakers, for example). Toxicity Toxicity in generative AI arises with the creation of offensive or inappropriate content. This issue has a long history in generative AI, such as when the Tay chatbot infamously began to publicly generate racist and sexist output. Because generative AI models are shaped by vast amounts of data scraped from the internet — a place not known for its decorum — toxic content plagues many generative AI systems. Indeed, toxicity is such an issue that it has given rise to a whole new field of study in AI research known as “detoxification.” Privacy harms There are a host of ways that generative AI models can create privacy harms. Sometimes personally identifying information is contained in the training data itself, which can be hacked by adversarial users. Other times, sensitive information from other users might be leaked by the model unintentionally, as occurred with the South Korean chatbot Lee Luda. Generative AI models might even directly violate company privacy policies, such as falsely telling users they have limited access to their data and thereby engaging in fraud. The list of degradation objectives is often long, ranging from the objectives outlined above to harms like intellectual property infringement, contractual violations, and much more. As generative AI systems are deployed in a growing number of environments, from health care to legal and finance, that list is likely to grow longer. Attacks on Generative AI Once we’ve determined the composition of the red team, the liabilities and associated degradation objectives to guide testing, the fun part begins: attacking the model. There are a wide variety of methods red teams can use. At Luminos.Law, we break our attack plans into two categories: manual and automated. We’ll largely focus on manual attacks here, but it’s worth noting that a large body of research and emerging tools make automated attacks an increasingly important part of red teaming. There are also many different open source datasets that can be used to test these systems. (Here is one paper that provides a general overview of many such datasets.) An effective attack strategy involves mapping each objective to the attacks we think are most likely to be successful, as well as the attack vectors through which we plan to test the system. Attack vectors may be “direct,” consisting of relatively short, direct interactions with the model, while others involve more complex attacks referred to as indirect prompt injection, in which malicious code or instructions might be contained in websites or other files the system may have access to. While the following list doesn’t include all the techniques we use, it does give a sample of how we like to approach attacks during red teaming: Code injection. We use computer code, or input prompts that resemble computer code, to get the model to generate harmful outputs. This method is one of our favorites precisely because it has a strikingly high success rate, as one group of researchers recently demonstrated. Content exhaustion. We use large volumes of information to overwhelm the model. Hypotheticals. We instruct the model to create output based on hypothetical instructions that would otherwise trigger content controls. Pros and cons. We ask about the pros and cons of controversial topics to generate harmful responses. Role-playing. We direct the model to assume the role of an entity typically associated with negative or controversial statements and then goad the model into creating harmful content. There are, of course, dozens of attack strategies for generative AI systems — many of which, in fact, have been around for years. Crowdsourcing attack methodology, where possible, is also a best practice when red teaming, and there are a number of different online resources red teamers can use for inspiration, such as specific Github repositories where testers refine and share successful attacks. The key to effective testing lies in mapping each strategy to the degradation objective, attack vector, and, of course, taking copious notes so that successful attacks can be captured and studied later. Putting It All Together Red teaming generative AI is complicated — usually involving different teams, competing timelines, and lots of different types of expertise. But the difficulties companies encounter are not just related to putting together the red team, aligning on key liabilities, coming up with clear degradation objectives, and implementing the right attack strategies. We see a handful of other issues that often trip companies up. Documentation Successful red teaming oftentimes involves testing hundreds of attack strategies. If automated attacks are used, that number can be in the thousands. With so many variables, testing strategies, red team members, and more, it can be difficult to keep track of the information that is generated and to ensure testing results are digestible. Having clear guidance not just on how to test but also on how to document each test is a critical if often-overlooked part of the red teaming process. While every organization and red team is different, we solved this issue for our law firm by creating our own custom templates to guide our testing and to present our final analysis to our clients. Knowing that the final documentation aligns with the information captured during real time testing makes the red teaming process significantly more efficient. Legal privilege With so much sensitive information being generated across testers and teams, understanding where and when to assert legal privilege is another often overlooked but a major consideration. We often see potential liabilities being discussed openly in places like Slack, which makes that information discoverable to adversarial parties if external oversight occurs, such as a regulatory investigation or lawsuit. The last thing companies want is to increase their risks because they were red teaming their models. Getting lawyers involved and thoughtfully determining where information about testing results can be communicated and how is a key consideration. What to do about vulnerabilities Having clear plans for addressing the vulnerabilities that red teaming efforts discover is another central but often-overlooked part of the red teaming process. Who from the product or data science teams is responsible for taking action? Do they meet with the red team directly or through an intermediary? Do they attempt to patch the vulnerabilities as red teaming is occurring or should they wait until the end of the process? These questions, and many more, need to be addressed before red teaming occurs; otherwise the detection of vulnerabilities in the model is likely to create lots of confusion. This article provides only a high-level overview of all of the considerations that go into making red teaming generative AI successful. It is one of the most effective ways to manage the technology’s complex risks, and for that reason, governments are just beginning to realize red teaming’s benefits. Companies betting big on generative AI should be equally committed to red teaming. Andrew Burt is the managing partner of Luminos.Law, a boutique law firm focused on AI and analytics, and a visiting fellow at Yale Law School’s Information Society Project.