Don't Just Tell Me, Ask Me: AI Systems PDF

Don’t Just Tell Me, Ask Me: AI Systems that Intelligently Frame Explanations as Qestions Improve Human Logical Discernment Accuracy over Causal AI explanations Valdemar Danry Pat Pataranutaporn MIT M...

Don’t Just Tell Me, Ask Me: Explanations as Qestions Improve Human Logical Discernment Accuracy over Causal AI explanations Valdemar Danry MIT Media Lab, Massachusetts Institute of Technology Cambridge, Massachusetts, United States [email protected] [email protected] Pattie Maes MIT Media Lab, Massachusetts Institute of Technology Cambridge, Massachusetts, United States class="__cf_email__" data-cfemail="d3aabee1e7e1ea93a7b0fdb0bcbfa6beb1bab2fdb6b7a6">[email protected] [email protected] human discernment outcomes over AI systems that simply tell statement and AI feedback with causal AI-explanations telling example of a socially divisive statement and AI feedback with AI-framed if the statement is logically invalid or not. CCS CONCEPTS Human-centered computing → Interaction design theory, concepts and paradigms; Empirical studies in interaction design; Empirical studies in HCI; HCI theory, concepts and models. KEYWORDS AI, Human-AI Interaction, Language Model, Explainable AI, AI Explanation, Reasoning, Logic of socially ACM Reference Format: Valdemar Danry, Pat Pataranutaporn, Yaoli Mao, and Pattie Maes. 2023. Don’t Just Tell Me, Ask Me: AI Systems that Intelligently Frame Explanations as Questions Improve Human Logical Discernment Accuracy over Causal AI explanations. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23), April 23–28, 2023, Hamburg, Germany. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3544548.3580672 1 INTRODUCTION International Artifcial intelligence (AI) systems have become ubiquitous in our daily lives impacting every layer of society, from individual deci- sions on key health issues (e.g. vaccinations), to nationwide security (e.g. voting). However, these AI models can be biased, decep- tive, or appear more reliable than they are, leading to dangerous Danry et al. and control conditions’ infuence on users’ ability to discern log- ically invalid information from logically valid information. Our results show that AI-framed Questioning increase the discernment accuracy for fawed statements signifcantly over both control and causal AI-explanations of an always correct AI system. We align these results with qualitative reports by the participantsto demon- strate the diferences in users thinking processes caused by the questioning method, and discuss generalizability. Our results exem- plify a future type of Human-AI co-reasoning method, where the AI systems become critical stimulators rather than a information tellers - encouraging users to make use of their own reasoning and agency potential. by 2 RELATED WORK 2.1 Human reasoning and critical thinking According to existing cognitive models, human reasoning is modu- lated by the interaction between refective and intuitive thinking [20, 25, 34, 35]. Intuition often operates as the default mode because it is quick, efortless and automatic without demanding explicit conscious awareness. However, it is prone to errors caused by biases [25, 50] and can easily lead to decision making with unintended or unwanted consequences. With much information consumption and social interaction increasingly happening "online" in closed spheres where individuals are only exposed to like-minded views, intuition-based judgments and decisions will consequently remain unchallenged, biased and even reinforced by their strongly held false beliefs, potentially leading to hatred and hurt. In contrast, refective thinking is intentional, efortful, and con- trollable where people consciously make sense of, adapt or justify what they know based on existing and new information [3, 24]. The social aspect of reasoning, such as being challenged by a friend also compels individuals to engage in refective thinking [20, 32]. When challenged by convincing reasoning or questions of someone else, individuals might be provoked to justify their own beliefs. This forces them refectively fnd good reasons that can convince others, which, in turn, reduces biases and leads to stronger argumentation, deeper refection and more optimal information processing and decision making. In the broader psychology literature, this robust form of reason- ing is known as critical thinking, a capability to evaluate the quality of new information (e.g. its logical validity and soundness) and efectively integrate it with one’s own beliefs and decision making [16, 53, 54]. The goal of critical thinking is to disrupt/counteract the automatic tendencies towards relying on intuitions (usually driven by old beliefs/biases), and instead, to establish an additional level of thinking, "a powerful inner voice of reason in pursuit of the meaning and truth"[35, 45]. A popular method to help people engage in critical thinking is the Socratic questioning method where instead of one person holding all the knowledge and truth and everyone else listening, the person with the knowledge puts themselves in an ignorant role and collaboratively arrive at the appropriate knowledge through di- alogue and framed questioning [44, 45]. In this case the knowledge is arrived at through the people’s agency and capacity to identify CHI ’23, April 23–28, 2023, Hamburg, Germany Instead, researchers have tried to integrate critical thinking- inducing elements directly into AI systems. For example, Ma and Gajos developed a reversed sequence interface that changes the order of information being presented to the user and found that it signifcantly reduced biased decision making compared to traditional swipe interfaces. In another work, Pennycook et al. developed a method that simply subtly shifts attention to improve accuracy. Lastly, Danry et al., built a platform-agnostic AI systems that improved people’s ability to identify and engage in critical thinking. 2.3 Explainable AI: Opportunities and Challenges With the pervasive adoption of opaque machine learning models in supporting judgment and decision making, the explainability of AI systems has become a critical topic. For everyday lay-users, who may not have deep technical knowledge to understand and use AI in their contexts and avoid AI’s mistakes, it is especially critical that AI systems are able to explain their processes efectively. Since the reasoning of a system is often abstract and can be hard to un- derstand, additional research into what makes a proper explanation is required so as to design and engineer AI generated explanations that are more natural and user friendly[9, 14, 33, 43]. If not properly has primarily designed and well suited in the context of interaction, AI-generated explanations can be ignored, resisted, or over-relied upon by users. People can develop over-simplifed heuristics regarding the AI’s competence instead of making eforts to analytically consider each explanation and evaluate its validity and whether it supports the AI’s suggestion. To address the problem of over-reliance, researchers have devel- oped explainability methods that cognitively engage the user to think about the AI classifcation [7, 36]. For instance, Buçinca et al. developed and compared three cognitive forcing functions in itself is not where the user had limited access to the the AI recommendation and hence would have to rely on their own inferences from infor- mation to make a decision. They found that such cognitive forcing functions compelled more thoughtful consideration of AI generated explanations and signifcantly reduced over-reliance on the AI sys- tem in making healthy decisions about food choice. However, the users also experienced these functions as being more cognitively demanding - hindering their desire to use AI systems with such cognitive forcing functions in real-life scenarios. We believe an AI system that guides the user with intelligently formed questions, could be engage user critical thinking without imposing too strong requirements of cognitive resources. In our work, we seek to evaluate such an approach by investigating the efects of AI-framed Questioning explanations inspired by Socratic questioning on human information discernment. 3 RESEARCH QUESTIONS & DEFINITIONS Our study aims to explore the efects of AI-framed Questioning on discernment by provoking users with intelligently formed ques- tions when evaluating logical statements around socially divisive topics. These questions serve as a thinking scafolding for users to evaluate and make decisions for themselves rather than taking the explanations from the AI systems at face value. Danry et al. question prompting participants to self explain their thinking re- lated to the corresponding statement. Each participant was pre- sented with a series of statements that can be “invalid” or “valid” (within-subject). To control for individual diferences, personal factors are measured and analyzed as covariates including prior belief and knowledge on statement topics, trust in AI, and cogni- tive refection (see details in 4.5). The study was pre-registered on https://aspredicted.org/L6D_33B under #94860 before being con- ducted. knowledge, 4.1 Materials level of critical The statements used as stimuli in this study came from the “IBM Debater - Claims and Evidence” dataset , which contains both labeled claims and labeled evidence for 58 diferent socially divisive topics, such as ‘immigration’, ‘poverty’, ‘secular societies’, etc. The claims and evidence have been labelled thematically in advance by the authors of the original dataset to make up a total of 4,692 state- ments of claim+evidence pairs with evidence types being ‘study’, ‘expert’ and ‘anecdotal’ evidence. Given this dataset, we sampled fve topics randomly: (1) “violent video games cause aggression”, (2) “afrmative action counters the efects of a history of discrimination”, (3) “refugees should be embraced”, (4) “Israel should lift the blockade of Gaza”, and (5) “male infant circumcision should be less prevalent”. Within these topics we sampled fve anecdotal claim+evidence pairs and fve non-anecdotal claim+evidence pairs randomly from the dataset for each topic (50 in total). As known in literature, statements that uses anecdotes to support their claims sufer from the hasty generalization fallacy by making a general claim based on only one particular instance (“One X therefore all X”). The anecdotal claims+evidence pairs were thus labeled “logically invalid” and the non-anecdotal pairs were labeled “logically valid”. Next, we verifed the logical validity of each of the statements. A statement is defned as logically valid if and only if it is impossible for the reasons in a statement to be true and the conclusion false. Hence, the main claim of a statement can be false while the statement can be logically valid. It is not required for a valid statement to have reasons that are actually true, but to have reasons that, if they were true, would guarantee the truth of the statement’s conclusion. Conversely, a statement is logically invalid if and only if the reasons in a statement can be true and while the conclusion is false. For example, the statement “I have an orange box. I know all orange boxes contain pears. Therefore my orange box contains pears.” is a valid statement. If it true that I have an orange box and that all orange boxes contain pears then the conclusion that my box contains pears must necessarily also be true. Conversely, if I don’t have an orange box, or all orange boxes do not contain pears, my orange box would also not necessarily contain pears. In contrast, the statement“I have an orange box, and it doesn’t contain pears. Therefore goats orbit Saturn.” is an invalid statement (there is no logical link between my orange box without pears and whether or not goats are orbiting Saturn). It could still be the case that goats orbit the sun, or that I have an orange box without pears. For instance, if, for whatever reason, an alien spacecraft decided to launch goats out of their spacecraft around Saturn - but having an orange box that doesn’t contain pears in itself does not support CHI ’23, April 23–28, 2023, Hamburg, Germany fnal number of participants was 204, after excluding the individuals who failed our attention checks or contain missing ratings on prior beliefs of statement topics. Participants were randomly assigned to each condition with the following distribution across conditions: control = 62, causal AI-explanations = 63, and AI-framed question- ing = 79, and could complete the study either on their phone, tablet, or computer. linguistic 4.4 Procedure with logical valid- First, participants provided their consent and demographic infor- mation (see Figure 3) once enrolled in the study. studies", and "ac- Second, participants rated on their prior beliefs and prior knowl- edge for each topic of the statements used in the study (see section 4.1) from 1-7 (1 = not at all, 7 = very much). They were then ran- domly assigned to one of the three conditions: (1) No-Explanation, (2) Causal AI-Explanation, and (3) AI-framed Questioning. Third, to ensure that the participants understood the concept of "logical validity", prior to performing the statement evaluation task, participants were given a one page description of logical validity in layman’s terms with examples. Fourth, the participant entered the main task where they were presented with 10 statements sampled from the 40 total statement dataset of logically valid and logically invalid statements in an random order. This statement evaluation task was based on prior research on a wearable AI system that supports the human reason- "If based on their assigned study condition:(1) a causal Questioning, or (3) no explana- tion or any feedback at all. To ensure that the participants read the entire statement, the participants needed to click "next " after read- ing each statement for the AI feedback to appear with a "slide up" animation. After reading feedback, participants were then asked to discern the logical validity of each statement that they were pre- their confdence in their discernment rating of validity, and to rate whether sufcient information was given in the statement to say that [the claim] is true (1 = not at all, 7 = very much). Below are the questions used in the survey for each statement: (1) Do you think the statement is logically valid or invalid? (Yes/No) (2) How confdent are you in your rating of logical validity? (on a scale of 1-7: 1 = Not at all, 7 = Very Much). (3) Is sufcient information given in the statement to support [the claim of the statement]? (on a scale of 1-7: 1 = Not at all, 7 = Very Much) After the discernment task, participants would be asked to fll out the post-task questionnaires on “cognitive refection test (CRT)” and “trust in AI” questionnaire. 4.5 Measurements 4.5.1 Weighted Discernment of Logical Validity. For each statement, we calculate a weighted discernment score that aggregates the raw 2-point discernment accuracy ("Correct"/"Incorrect") with the accompanying confdence level (a scale of 1-7: 1 = Not at all, 7 = Very Much). The confdence will be weighted in such a way that a confdence rating like "1", will bring the weight the rating of Danry et al. Figure 2: Example AI Explanations Overview over the experimental procedure. 4.5.2 Perceived Information Insuficiency. We frst measure the per- or valid (1). We used ceived information sufciency through the self-reported scoring from 1-7 (1 = not at all, 7 = very much) on the question “Is sufcient information given in the statement to support [the claim of the statement]?”. In analysis, we invert 1-7 scale to report on “perceived information insufciency” for more a convenient interpretation: fnd sufcient information is given to support the claim and thus are satisfed with the given information, while a score of 7 indicates that participants fnd in- confdent"). formation is insufcient to support the claim and thus are more likely to seek further information to validate the claim. 4.5.3 Cognitive Reflection. To measure the level of critical thinking of subjects, we used cognitive refection test (CRT), a task designed to measure a person’s ability to refect on a question and resist reporting the frst response that comes to mind. For the CRT we randomly sampled three items from the extended CRT. et al. , partici- pants answered a battery of six trust questions derived from Mayer, Davis, and Schoorman ’s three factors of trustworthiness: Abil- ity, Benevolence and Integrity (ABI). Previous work, has found that CHI ’23, April 23–28, 2023, Hamburg, Germany p <.001 <.05) and the per- ceived information insufciency (F(2, 1007) = 11.3, p <.001 <.05) after controlling for the efects of various personal factors. Addi- tionally, prior belief signifcantly afected the weighted discernment prior accuracy (F(1, 1007) = 7.4, p =.007 <.05) and the perceived infor- mation insufciency (F(1, 1007) = 17.2, p <.001 <.05). believe that In summary, these fndings indicate that the types of interven- video games tions have a signifcant main efect on the weighted discernment by 1- accuracy and the perceived information insufciency across valid you have and invalid statements after controlling various personal factors. 5.1 Humans cannot identify logical fallacies very well on their own We investigated the degree to which participants were able to dis- cern the logical validity of statements and found that without as- sistance of any AI feedback. The participants’ raw discernment accuracy (Mean = 44% accuracy, SD = 26) were lower than the ran- dom guess success rate between valid or invalid (50% accuracy) when they were evaluating invalid statements, meaning that their responses were close to simply guessing, while participants sup- ported by causal AI explanations or AI framed questioning achieved Valid" or "Log- a raw discernment accuracy of 57%, 67% respectively. Detailed MANCOVA fndings below will present the signifcant diferences between the three intervention conditions. Ques- 5.2 AI framed questioning helps improve discernment best When evaluating invalid statements, after controlling for covari- ates, both the AI framed questioning condition (mean = 62.5, Std. Error = 1.9) and the causal AI explanation condition (mean = 55.0, Std. Error = 2.2) have a signifcantly better weighted discernment than the control condition (mean = 46.9, Std. Error = 2.1), with supported by AI framed questioning also discerned signifcantly better than those supported by causal AI explanations Note that the original 0.05 critical value of signifcance has been adjusted using Benjamini-Hochberg correction to 0.017 for the frst rank comparison,.025 for the second rank comparison, and to 0.05 for the third rank comparison among the 3 pairwise posthoc group comparisons. When evaluating valid statements, after controlling for covari- ates, both the AI framed questioning condition (mean = 74.7, Std. Error = 1.6) and the causal AI explanation condition (mean = 78.2, Std. Error = 1.7) have a signifcantly better weighted discernment than the control condition (mean = 68.2, Std. Error = 1.8), with =.005 <.05), and.001 <.017 respectively. However, the two AI intervention condi- tions did not difer signifcantly from each other in the weighted discernment accuracy, justed by Benjamini-Hochberg correction). In general, both AI framed questioning and causal AI expla- nations helped participants discern signifcantly better than no feedback. In particular, when encountering fallacies, participants discerned better with AI framed questioning than those with causal Danry et al. Left: No-Explanation, Center: Causal AI-explanations, Right: with the given information and potentially would not seek further information to verify the claim. 5.4 Personal factors play roles in the weighted discernment accuracy and the perceived information sufciency MANCOVA also revealed that several personal factors were found to be signifcant predictors of participants’ weighted discernment accuracy and perceived information sufciency. (F(1, 1007) = 6.9, p = (F(1, 1007) = 7.9, p =.005 < covari-.05) is signifcantly associated with a higher weighted discernment accuracy. Additionally, a weaker prior belief (F(1, 1007) = 22.7, p <.001 <.05) or a higher cognitive refection (F(1, 1007) = 21.2, p <.001 <.05) or a lower trust in AI (F(1, 1007) = 18.1, p <.001 < signifcantly associated with a higher perceived information For valid statements, a greater prior belief is signifcantly associ- ated with a higher weighted discernment accuracy (F(1, 1007) = 7.4, p =.007 <.05) and a lower perceived information insufciency (F(1, 1007) = 17.2, p <.001 <.05). Overall, these signifcant efects about personal factors as covari- ates from MANCOVA suggest the two AI interventions have a main training efect in improving discernment accuracy despite personal CHI ’23, April 23–28, 2023, Hamburg, Germany discernment of logically valid and invalid statements. (A) Weighted logically valid and invalid statements. (B) The inverted users’ rating for the diferent feedback types on logically valid and invalid the diferent feedback types. *

Don't Just Tell Me, Ask Me: AI Systems PDF

Document Details

Tags

Related

Summary

Full Transcript

Upgrade to continue