Don't Just Tell Me, Ask Me: AI Systems that Intelligently Frame Explanations as Questions Improve Human Logical Discernment Accuracy over Causal AI explanations
Valdemar Danry, Pat Pataranutaporn, Yaoli Mao, Pattie Maes
Summary
This paper explores the use of AI-framed questioning to improve human discernment of logical validity, focusing on socially divisive statements. The researchers implemented a study with 204 participants comparing this method, causal AI explanations, and a control group. The results show that AI-framed questioning significantly increases human discernment accuracy compared to the other conditions.
Don't Just Tell Me, Ask Me: AI Systems that Intelligently Frame Explanations as Questions Improve Human Logical Discernment Accuracy over Causal AI explanations

Valdemar Danry, MIT Media Lab, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States, [email protected]
Pat Pataranutaporn, MIT Media Lab, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States, [email protected]
Yaoli Mao, Columbia University, New York City, New York, United States, [email protected]
Pattie Maes, MIT Media Lab, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States, [email protected]

Figure 1: AI systems that ask a user questions can improve human discernment outcomes over AI systems that simply tell people what and why. Left: An example of a socially divisive statement and AI feedback with causal AI-explanations telling users why the statement is logically invalid. Right: An example of a socially divisive statement and AI feedback with AI-framed Questioning asking the user a question that helps them assess if the statement is logically invalid or not.

ABSTRACT
Critical thinking is an essential human skill. Despite the importance of critical thinking, research reveals that our reasoning ability suffers from personal biases and cognitive resource limitations, leading to potentially dangerous outcomes. This paper presents the novel idea of AI-framed Questioning that turns information relevant to the AI classification into questions to actively engage users' thinking and scaffold their reasoning process. We conducted a study with 204 participants comparing the effects of AI-framed Questioning on a critical thinking task; discernment of logical validity of socially divisive statements. Our results show that compared to no feedback and even causal AI explanations of an always correct system, AI-framed Questioning significantly increases human discernment of logically flawed statements. Our experiment exemplifies a future style of Human-AI co-reasoning system, where the AI becomes a critical thinking stimulator rather than an information teller.

CCS CONCEPTS
Human-centered computing → Interaction design theory, concepts and paradigms; Empirical studies in interaction design; Empirical studies in HCI; HCI theory, concepts and models.

KEYWORDS
AI, Human-AI Interaction, Language Model, Explainable AI, AI Explanation, Reasoning, Logic

ACM Reference Format:
Valdemar Danry, Pat Pataranutaporn, Yaoli Mao, and Pattie Maes. 2023. Don't Just Tell Me, Ask Me: AI Systems that Intelligently Frame Explanations as Questions Improve Human Logical Discernment Accuracy over Causal AI explanations. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23), April 23–28, 2023, Hamburg, Germany. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3544548.3580672

This work is licensed under a Creative Commons Attribution International 4.0 License. CHI '23, April 23–28, 2023, Hamburg, Germany. © 2023 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-9421-5/23/04. https://doi.org/10.1145/3544548.3580672

1 INTRODUCTION
Artificial intelligence (AI) systems have become ubiquitous in our daily lives impacting every layer of society, from individual decisions on key health issues (e.g. vaccinations), to nationwide security (e.g. voting). However, these AI models can be biased, deceptive, or appear more reliable than they are, leading to dangerous decision-making outcomes.
For example, an AI system might and control conditions’ infuence on users’ ability to discern log- misinterpret a medical diagnosis due to bias in the data used to ically invalid information from logically valid information. Our train it leading to unnecessary and harmful treatments, a predic- results show that AI-framed Questioning increase the discernment tive policing system might use faulty data to erroneously target accuracy for fawed statements signifcantly over both control and a member of a minority group which can lead to over-policing of causal AI-explanations of an always correct AI system. We align marginalized communities, or a social media information recom- these results with qualitative reports by the participantsto demon- mendation system might falsely claim incorrect or misleading news strate the diferences in users thinking processes caused by the stories to be true which can lead to political instability and social questioning method, and discuss generalizability. Our results exem- unrest [17, 18, 46]. plify a future type of Human-AI co-reasoning method, where the This is especially concerning when AI systems are used in con- AI systems become critical stimulators rather than a information junction with humans, as people have a tendency to blindly follow tellers - encouraging users to make use of their own reasoning and the AI decisions and stop using their own cognitive resources to agency potential. think critically.Prior research shows that people when assisted by AI systems that give them the answers become compliant and stop using cognitive resources to engage in thinking critically about AI-mediated information themselves and take the feedback of the 2 RELATED WORK AI at face value [7, 11, 17]. Critical thinking is the ability to logically assess claims, reasons, 2.1 Human reasoning and critical thinking and beliefs and hold back from making a decision when information According to existing cognitive models, human reasoning is modu- is not sufcient or misleading. In the literature, critical thinking lated by the interaction between refective and intuitive thinking has been positively correlated with an increased ability to discern [20, 25, 34, 35]. Intuition often operates as the default mode because false information from true information [35, 45, 48] and override it is quick, efortless and automatic without demanding explicit otherwise intuitive thoughts[13, 25, 34, 35]. As we interact more conscious awareness. However, it is prone to errors caused and more with AI systems in everyday life, we are increasingly by biases [25, 50] and can easily lead to decision making with exposed to massive amounts of AI-mediated information that can unintended or unwanted consequences. With much information be potentially deceptive, misleading, or strictly false. In such cases consumption and social interaction increasingly happening "online" critical thinking becomes an important skill to reliably process the in closed spheres where individuals are only exposed to like-minded information we encounter and integrate it with our existing beliefs views, intuition-based judgments and decisions will consequently and behaviors. However, despite the importance of critical thinking, remain unchallenged, biased and even reinforced by their strongly it is a skill not everyone masters or have the cognitive resources to held false beliefs, potentially leading to hatred and hurt. engage in. 
Researchers have shown that people frequently use “cog- In contrast, refective thinking is intentional, efortful, and con- nitive shortcuts” that do not involve critical, logical and cautious trollable where people consciously make sense of, adapt or justify thinking [15, 24, 30, 47, 51, 56]. what they know based on existing and new information [3, 24]. The While previous work has shown that local explanations with social aspect of reasoning, such as being challenged by a friend also the causal explanations strategy (i.e. "X classifcation because of compels individuals to engage in refective thinking [20, 32]. When Y reason") can help people determine the veracity of information, challenged by convincing reasoning or questions of someone else, change people’s beliefs and improve their decision making out- individuals might be provoked to justify their own beliefs. This comes [11, 28], using causal AI-explanations does not necessarily forces them refectively fnd good reasons that can convince others, improve the human reasoning process, as users might just rely on which, in turn, reduces biases and leads to stronger argumentation, the answers of the AI systems without thinking about the problem deeper refection and more optimal information processing and for themselves [7, 11, 17]. Complete over-reliance on AI-systems is decision making. problematic as (1) it makes users vulnerable to mistakes made by In the broader psychology literature, this robust form of reason- the AI system, and (2) users do not learn how to internalize the skill. ing is known as critical thinking, a capability to evaluate the quality Going beyond this challenge and building AI systems that engage of new information (e.g. its logical validity and soundness) and the user more deeply to reason for themselves requires development efectively integrate it with one’s own beliefs and decision making of new human-AI interaction and explanation methods. [16, 53, 54]. The goal of critical thinking is to disrupt/counteract This paper presents the novel idea of AI-framed Questioning the automatic tendencies towards relying on intuitions (usually inspired by the ancient method of Socratic questioning that uses in- driven by old beliefs/biases), and instead, to establish an additional telligently formed questions to provoke human reasoning, allowing level of thinking, "a powerful inner voice of reason in pursuit of the user to correctly discern the logical validity of the informa- the meaning and truth"[35, 45]. tion for themselves. In contrast to causal AI-explanations that are A popular method to help people engage in critical thinking declarative and have users passively receiving feedback from AI is the Socratic questioning method where instead of one person systems, our AI-framed Questioning method provides users with a holding all the knowledge and truth and everyone else listening, more neutral scafolding that leads users to actively think critically the person with the knowledge puts themselves in an ignorant role about information. We report on an experiment with 210 partici- and collaboratively arrive at the appropriate knowledge through di- pants comparing causal AI-explanations, AI-framed Questioning, alogue and framed questioning [44, 45]. 
In this case the knowledge is arrived at through the people’s agency and capacity to identify Don’t Just Tell Me, Ask Me CHI ’23, April 23–28, 2023, Hamburg, Germany contradictions, correct incomplete or inaccurate ideas and eventu- Instead, researchers have tried to integrate critical thinking- ally discover the fullest possible knowledge, rather than passively inducing elements directly into AI systems. For example, Ma and relying on the person with knowledge. Gajos developed a reversed sequence interface that changes Inspired by this method, AI systems could promote users’ inter- the order of information being presented to the user and found nal critical thinking by asking them framed questions that assist that it signifcantly reduced biased decision making compared to them at arriving at appropriate knowledge instead of requiring the traditional swipe interfaces. In another work, Pennycook et al. user to rely on the AI system. developed a method that simply subtly shifts attention to improve accuracy. Lastly, Danry et al., built a platform-agnostic AI systems that improved people’s ability to identify and engage in critical 2.2 Technologies for supporting critical thinking. thinking Conversational systems like Alexa, Siri and Google Assistant are 2.3 Explainable AI: Opportunities and playing an increasingly important role in everyday learning and Challenges information processing. But while these conversational AI systems are good at extracting relevant information, they have no With the pervasive adoption of opaque machine learning models in methods for helping the user evaluate or ask about the information supporting judgment and decision making, the explainability of AI they extract. For example, you cannot ask Alexa about the quality of systems has become a critical topic. For everyday lay-users, who the extracted information, or why or how it found the information may not have deep technical knowledge to understand and use AI that it did, when the information extracted is potentially vague or in their contexts and avoid AI’s mistakes, it is especially critical that AI systems are able to explain their processes efectively. Since misleading. Instead, such conversational systems only provide us with information that can be true or false but they do not necessarily the reasoning of a system is often abstract and can be hard to un- help us to criticize them or boost our ability to think critically. In derstand, additional research into what makes a proper explanation order to support the general population to be more critical and is required so as to design and engineer AI generated explanations that are more natural and user friendly[9, 14, 33, 43]. If not properly resilient to information of various quality, recent work has primarily designed and well suited in the context of interaction, AI-generated focused on the development of fact-checking algorithms or fake- news detectors [1, 42, 52, 63, 64]. explanations can be ignored, resisted, or over-relied upon by users. Using fact-checking systems, the truth information can be veri- People can develop over-simplifed heuristics regarding the AI’s fed for the users e.g. by checking relevant sources, and they can competence instead of making eforts to analytically consider each be provided with additional facts, and opposite opinions in their explanation and evaluate its validity and whether it supports the judgment and decision making. However, research has shown that AI’s suggestion. 
people often do not bother to check the accuracy of information To address the problem of over-reliance, researchers have devel- before sharing it online. Furthermore, even when people are oped explainability methods that cognitively engage the user to think about the AI classifcation [7, 36]. For instance, Buçinca et presented with corrective information, they often do not change al. developed and compared three cognitive forcing functions their beliefs [4, 58]. This suggests that fact-checking in itself is not where the user had limited access to the the AI recommendation an efective solution to the problem of misinformation. An alter- native approach is to improve human reasoning so that people are and hence would have to rely on their own inferences from infor- better able to discern between true and false information. This has mation to make a decision. They found that such cognitive forcing several advantages over fact-checking. First, it has been shown to be functions compelled more thoughtful consideration of AI generated more efective in changing people’s beliefs than if they are merely explanations and signifcantly reduced over-reliance on the AI sys- presented with corrective information. Second, it does not nec- tem in making healthy decisions about food choice. However, the essarily rely on people actually checking the factual accuracy of users also experienced these functions as being more cognitively information before sharing it (there is no need for fact-checking the demanding - hindering their desire to use AI systems with such premise of a logically invalid argument). Finally, improved human cognitive forcing functions in real-life scenarios. reasoning should also lead to better decision making in general, We believe an AI system that guides the user with intelligently which has a range of benefts beyond reducing susceptibility to formed questions, could be engage user critical thinking without imposing too strong requirements of cognitive resources. In our misinformation. work, we seek to evaluate such an approach by investigating the To this efect, researchers have explored ways in which tech- nologies can teach humans to think more critically such as the efects of AI-framed Questioning explanations inspired by Socratic construction of chatbots that teach reasoning skills such as fal- questioning on human information discernment. lacy identifcation[19, 31], probability and uncertainty.These approaches have had limited success for three reasons: (1) they 3 RESEARCH QUESTIONS & DEFINITIONS require a lot of cognitive efort from the users, (2) the learning Our study aims to explore the efects of AI-framed Questioning on examples are often so abstract that there is no guarantee that users discernment by provoking users with intelligently formed ques- will actually be able to use the skills in real-world situations, and tions when evaluating logical statements around socially divisive (3) the approaches do not work in-situ as the user comes across topics. These questions serve as a thinking scafolding for users to information in real-life but rather relies on them having learnt them evaluate and make decisions for themselves rather than taking the in advance. explanations from the AI systems at face value. CHI ’23, April 23–28, 2023, Hamburg, Germany Danry et al. In particular, our research questions are: question prompting participants to self explain their thinking re- (1) Do humans perform better at discerning the logical validity lated to the corresponding statement. 
Each participant was pre- of socially divisive statements when they receive feedback sented with a series of statements that can be “invalid” or “valid” from AI systems compared to when they work alone? (within-subject). To control for individual diferences, personal factors are measured and analyzed as covariates including prior (2) How do AI-framed Questioning and causal AI-explanations belief and knowledge on statement topics, trust in AI, and cogni- afect participants’ discernment of logical validity, conf- tive refection (see details in 4.5). The study was pre-registered on dence of their discernment, perceived information sufciency https://aspredicted.org/L6D_33B under #94860 before being con- by controlling personal factors (i.e. prior belief, trust in AI, ducted. cognitive refection) as covariates? (3) Do personal factors, such as prior belief, prior knowledge, 4.1 Materials trust in AI, cognitive refection (indicating the level of critical thinking)impact discernment? The statements used as stimuli in this study came from the “IBM Debater - Claims and Evidence” dataset , which contains both From these questions we derive the following hypotheses: (H1) labeled claims and labeled evidence for 58 diferent socially divisive AI and humans together work better than humans alone. (H2) AI topics, such as ‘immigration’, ‘poverty’, ‘secular societies’, etc. The framed questioning is more efective than causal explainability and claims and evidence have been labelled thematically in advance by control, and (H3) Personal factors (prior belief, prior knowldge, the authors of the original dataset to make up a total of 4,692 state- trust in AI, and cognitive refection) afects logical discernment ments of claim+evidence pairs with evidence types being ‘study’, accuracy. ‘expert’ and ‘anecdotal’ evidence. In literature, an AI explanation is defned as a description of how Given this dataset, we sampled fve topics randomly: (1) “violent an AI system arrives at answers which may vary in domain, strat- video games cause aggression”, (2) “afrmative action counters egy, content and form. The explanation domain can be either the efects of a history of discrimination”, (3) “refugees should be global or local: Local AI explanations are defned as “meaningful embraced”, (4) “Israel should lift the blockade of Gaza”, and (5) information about the calculations or logical processes involved “male infant circumcision should be less prevalent”. Within these in the processing of a particular case” — i.e. information on topics we sampled fve anecdotal claim+evidence pairs and fve why the system arrived at a particular classifcation—, while global non-anecdotal claim+evidence pairs randomly from the dataset explanations describe how the AI system arrives at classifcations for each topic (50 in total). As known in literature, statements more generally (e.g. through a decision tree architecture). The ex- that uses anecdotes to support their claims sufer from the hasty planation strategy, on the other hand, is about the method with generalization fallacy by making a general claim based on only one which the content of the domain is being delivered to the user. Ex- particular instance (“One X therefore all X”). The anecdotal amples AI explanation forms include causal explanations [11, 21], claims+evidence pairs were thus labeled “logically invalid” and analogy-based explanations , counter-factual explanations , the non-anecdotal pairs were labeled “logically valid”. 
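To make the sampling procedure above concrete, the following is a minimal Python sketch of how claim+evidence pairs could be drawn and labelled per topic. The record fields (topic, claim, evidence, evidence_type) and the function name are hypothetical stand-ins for illustration, not the actual schema of the "IBM Debater - Claims and Evidence" dataset.

import random

# Hypothetical record structure for illustration only; the real dataset
# ships with its own schema and labels.
# Each record: {"topic": str, "claim": str, "evidence": str, "evidence_type": str}

def sample_stimuli(records, n_topics=5, n_per_type=5, seed=0):
    """Sample anecdotal and non-anecdotal claim+evidence pairs per topic and
    label them logically invalid / valid, mirroring the procedure above."""
    rng = random.Random(seed)
    topics = rng.sample(sorted({r["topic"] for r in records}), n_topics)
    stimuli = []
    for topic in topics:
        in_topic = [r for r in records if r["topic"] == topic]
        anecdotal = [r for r in in_topic if r["evidence_type"] == "anecdotal"]
        other = [r for r in in_topic if r["evidence_type"] != "anecdotal"]
        # Anecdotal support commits a hasty generalization, so it is labelled invalid.
        for r in rng.sample(anecdotal, n_per_type):
            stimuli.append({**r, "label": "logically invalid"})
        # Study/expert support is labelled logically valid.
        for r in rng.sample(other, n_per_type):
            stimuli.append({**r, "label": "logically valid"})
    return stimuli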
Next, we dialogue-based explanations [38, 62], or self-explanations (although verifed the logical validity of each of the statements. A statement never tested with AI-systems). The explanation content is then is defned as logically valid if and only if it is impossible for the the thing that explains a model or specifc classifcation, which reasons in a statement to be true and the conclusion false. Hence, can take explanation form as text, statistical graphs, decision trees, the main claim of a statement can be false while the statement feature histograms, color gradients, feature matrices, rule sets. can be logically valid. It is not required for a valid statement to In literature trust in AI systems is often investigated under vari- have reasons that are actually true, but to have reasons that, if they ous defnitions. This paper deploys the defnition of trust used in were true, would guarantee the truth of the statement’s conclusion. as "the willingness of a [user] to be vulnerable to the actions of Conversely, a statement is logically invalid if and only if the reasons [an AI system] based on the expectation that the [AI system] will in a statement can be true and while the conclusion is false. For perform a particular action important to the [user], irrespective of example, the statement “I have an orange box. I know all orange the ability to monitor or control [the AI system]". The defnition boxes contain pears. Therefore my orange box contains pears.” is is assumed in the work of Epstein et al to denote perceived a valid statement. If it true that I have an orange box and that trustworthiness of AI systems. all orange boxes contain pears then the conclusion that my box contains pears must necessarily also be true. Conversely, if I don’t 4 EXPERIMENT have an orange box, or all orange boxes do not contain pears, my To evaluate the efects of AI-framed Questioning on human dis- orange box would also not necessarily contain pears. In contrast, cernment, we conducted a 3-by-2 factorial experimental design the statement“I have an orange box, and it doesn’t contain pears. that asked participants to evaluate the logical validity of socially Therefore goats orbit Saturn.” is an invalid statement (there is no divisive statements. Participants were randomly assigned into three logical link between my orange box without pears and whether intervention conditions (between-subjects) including: (1) control or not goats are orbiting Saturn). It could still be the case that condition - no explanation is presented with the statement, (2) goats orbit the sun, or that I have an orange box without pears. “causal AI-explanation” condition - AI provides a intelligently gen- For instance, if, for whatever reason, an alien spacecraft decided erated causal explanation related to the corresponding statement, to launch goats out of their spacecraft around Saturn - but having (3) “AI-framed Questioning” - AI provides an intelligently adapted an orange box that doesn’t contain pears in itself does not support Don’t Just Tell Me, Ask Me CHI ’23, April 23–28, 2023, Hamburg, Germany goats orbiting saturn. You could add something to connect the dots fnal number of participants was 204, after excluding the individuals but in itself it is not sufcient. who failed our attention checks or contain missing ratings on prior Using the defnition of hasty generalization fallacies and logical beliefs of statement topics. 
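As a worked illustration of this definition of logical validity (our own encoding, not part of the study materials), the two orange-box arguments can be checked mechanically by enumerating truth assignments: an argument is valid exactly when no assignment makes every premise true and the conclusion false.

from itertools import product

def is_valid(premises, conclusion, n_vars):
    """Brute-force validity check: search for an assignment where all
    premises are true and the conclusion is false (a counterexample)."""
    for assignment in product([False, True], repeat=n_vars):
        if all(p(*assignment) for p in premises) and not conclusion(*assignment):
            return False
    return True

# h = "I have an orange box", c = "my orange box contains pears"
# Premises: h; all orange boxes contain pears (restricted to my box: h implies c).
valid_arg = is_valid(
    premises=[lambda h, c: h, lambda h, c: (not h) or c],
    conclusion=lambda h, c: c,
    n_vars=2,
)

# b = "I have an orange box without pears", g = "goats orbit Saturn"
invalid_arg = is_valid(
    premises=[lambda b, g: b],
    conclusion=lambda b, g: g,
    n_vars=2,
)

print(valid_arg, invalid_arg)  # True False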
Participants were randomly assigned to validity, we corrected each of the statements in our stimulus set to each condition with the following distribution across conditions: make sure that they were either logically invalid hasty generaliza- control = 62, causal AI-explanations = 63, and AI-framed question- tion fallacies or logically valid ending up with four logically invalid ing = 79, and could complete the study either on their phone, tablet, statements and four logically valid statements for each topic (40 or computer. total) (see Table 2 for examples). In order to eliminate linguistic markers that might give the probability of logical validity away 4.4 Procedure (e.g. "Studies show.." is more positively correlated with logical valid- First, participants provided their consent and demographic infor- ity than "French gamer Julien Barreaux located..."), we eliminated mation (see Figure 3) once enrolled in the study. names and words like "researchers show", "most studies", and "ac- Second, participants rated on their prior beliefs and prior knowl- cording to published articles". The resulting statements did not edge for each topic of the statements used in the study (see section have any signifcant linguistic diferences in terms of Word Count, 4.1) from 1-7 (1 = not at all, 7 = very much). They were then ran- Flesch-Kincaid Grade Level, and sentiment. domly assigned to one of the three conditions: (1) No-Explanation, (2) Causal AI-Explanation, and (3) AI-framed Questioning. 4.2 Explanation Feedback Third, to ensure that the participants understood the concept of Since logical validity is determined by whether a statement con- "logical validity", prior to performing the statement evaluation task, clusion follows from its premises, the AI explanation feedback participants were given a one page description of logical validity templates for causal AI explanations and AI-framed Questioning in layman’s terms with examples. explanations were shaped in a way that it identifes and highlights Fourth, the participant entered the main task where they were the link between premises and conclusion (for examples see table presented with 10 statements sampled from the 40 total statement 2). The explanation feedback conditions and shape are defned as dataset of logically valid and logically invalid statements in an follows: random order. This statement evaluation task was based on prior (1) Causal AI-Explanation: The AI system gives a reason for research on a wearable AI system that supports the human reason- why the label is logically valid or logically invalid. "If ing process. For each statement, participants were presented then it follows that " for the logically valid statement with feedback based on their assigned study condition:(1) a causal and "If then it does not follow that " for the AI explanation, or (2) an AI-framed Questioning, or (3) no explana- logically invalid statement. tion or any feedback at all. To ensure that the participants read the (2) AI-Framed Questioning: AI system asks participants about entire statement, the participants needed to click "next " after read- the causal link between a reason and the system label. It takes ing each statement for the AI feedback to appear with a "slide up" a similar form as the causal AI explanation but does not make animation. After reading feedback, participants were then asked to it clear whether the label actually follows from the reason. discern the logical validity of each statement that they were pre- "If does it follow that ?" 
for both the logically sented with, to report their confdence in their discernment rating valid and invalid statements. of validity, and to rate whether sufcient information was given (3) No-Explanation: The AI system does not provide any ex- in the statement to say that [the claim] is true (1 = not at all, 7 = planations or feedback of any forms at all. very much). Below are the questions used in the survey for each statement: To generate the causal AI explanations in our study, we used the large language model "GPT-3". Here, we frst gave it a few (1) Do you think the statement is logically valid or invalid? examples of arguments with hand-crafted causal AI-explanations (Yes/No) following the template structure above. We then had it generate (2) How confdent are you in your rating of logical validity? (on causal AI explanations for each argument and manually checked a scale of 1-7: 1 = Not at all, 7 = Very Much). them for accuracy and consistency. We then did the exact same (3) Is sufcient information given in the statement to support procedure for the AI-framed Questioning explanations and man- [the claim of the statement]? (on a scale of 1-7: 1 = Not at ually checking that there were no linguistic diferences between all, 7 = Very Much) causal AI-explanations and the AI-framed Questioning explana- After the discernment task, participants would be asked to fll tions other than the argument specifc reason and label. While we out the post-task questionnaires on “cognitive refection test (CRT)” used GPT-3 for this task, we believed that it could easily be done and “trust in AI” questionnaire. using a rule-based approach when reason and label is known. 4.5 Measurements 4.3 Participants 4.5.1 Weighted Discernment of Logical Validity. For each statement, Participants were recruited from Prolifc, an online research par- we calculate a weighted discernment score that aggregates the ticipant pool. The total number of participants that enrolled in our raw 2-point discernment accuracy ("Correct"/"Incorrect") with the study was 234 people. All participants were from the United States accompanying confdence level (a scale of 1-7: 1 = Not at all, 7 = and fuent in English with a balanced sex distribution (50% female Very Much). The confdence will be weighted in such a way that and 50% male), a mean age of 35.3 years and being 71.2% white. The a confdence rating like "1", will bring the weight the rating of CHI ’23, April 23–28, 2023, Hamburg, Germany Danry et al. Figure 2: Example AI Explanations Figure 3: Overview over the experimental procedure. logical validity to 0.5 (the neutral middle), while a confdence of 4.5.2 Perceived Information Insuficiency. We frst measure the per- "7" will keep the rating at either invalid (0) or valid (1). We used ceived information sufciency through the self-reported scoring the following formuli to calculate weighted discernment accuracy. from 1-7 (1 = not at all, 7 = very much) on the question “Is sufcient First we calculate the discernment accuracy: information given in the statement to support [the claim of the statement]?”. 
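The feedback templates and the generation step described above can be sketched as follows. The placeholder names (reason, conclusion), the function names, and the example strings are ours for illustration, and, as noted above, a rule-based rendering like this can stand in for the GPT-3 few-shot generation once a statement's reason and label are known.

def causal_explanation(reason: str, conclusion: str, logically_valid: bool) -> str:
    # The AI asserts whether the conclusion follows from the reason.
    if logically_valid:
        return f"If {reason} then it follows that {conclusion}."
    return f"If {reason} then it does not follow that {conclusion}."

def ai_framed_question(reason: str, conclusion: str) -> str:
    # Same surface form for valid and invalid statements: the AI asks about
    # the causal link instead of telling the user whether it holds.
    return f"If {reason}, does it follow that {conclusion}?"

if __name__ == "__main__":
    reason = "one gamer tracked down and assaulted a player who had killed his character"
    conclusion = "violent video games cause aggression"
    print(causal_explanation(reason, conclusion, logically_valid=False))
    print(ai_framed_question(reason, conclusion))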
The discernment accuracy is calculated as

discernment = 1 - |rating - truth|

Next, we calculate the weighted factor of confidence from the confidence rating (0.5 - "No confidence" to 1 - "Fully confident"):

weight_confidence = 0.5 * (1 - (confidence - 1) / 6)

Finally, we subtract the weighted factor of confidence from the discernment accuracy, taking the absolute difference so that the score stays in range:

weighted_discernment = |discernment - weight_confidence|

The weighted discernment score becomes a continuous variable and has a range of 0-100.

In analysis, we invert the 1-7 scale to report on "perceived information insufficiency" for a more convenient interpretation: a score of 1 indicates that participants find sufficient information is given to support the claim and thus are satisfied with the given information, while a score of 7 indicates that participants find information is insufficient to support the claim and thus are more likely to seek further information to validate the claim.
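A minimal Python sketch of the weighted discernment score defined above; the function and argument names are ours, and the absolute difference is an assumption consistent with the stated 0-100 range and the neutral middle at the lowest confidence.

def weighted_discernment(rating: int, truth: int, confidence: int) -> float:
    """Weighted discernment score (0-100): rating and truth are 0 (invalid)
    or 1 (valid); confidence runs from 1 (not at all) to 7 (very much)."""
    discernment = 1 - abs(rating - truth)        # 1 if correct, 0 if incorrect
    weight = 0.5 * (1 - (confidence - 1) / 6)    # 0.5 at confidence 1, 0 at 7
    return abs(discernment - weight) * 100       # low confidence pulls toward 50

# A fully confident correct answer scores 100, a fully confident error 0,
# and any answer given at the lowest confidence lands at the neutral 50.
assert weighted_discernment(rating=1, truth=1, confidence=7) == 100
assert weighted_discernment(rating=0, truth=1, confidence=7) == 0
assert weighted_discernment(rating=0, truth=1, confidence=1) == 50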
4.5.3 Cognitive Reflection. To measure the level of critical thinking of subjects, we used the cognitive reflection test (CRT), a task designed to measure a person's ability to reflect on a question and resist reporting the first response that comes to mind. For the CRT we randomly sampled three items from the extended CRT.

4.5.4 Trust in AI. Finally, following Epstein et al., participants answered a battery of six trust questions derived from Mayer, Davis, and Schoorman's three factors of trustworthiness: Ability, Benevolence and Integrity (ABI). Previous work has found that the six ABI questions are highly correlated with trust (0.821), allowing for a single measure of trust that explains 65.3% of the overall variance.

4.5.5 Prior Belief and Knowledge. We measured the subject's prior belief about a topic through a self-report scoring from 1-7 (1 = not at all, 7 = very much) on the question "Do you believe that [topic]?". For example, "Do you believe that [violent video games cause aggression]?" Similarly, prior knowledge is measured by 1-7 (1 = not at all, 7 = very much) on the question "Do you have knowledge that [topic]?"

4.6 Approvals
This research has been reviewed and approved by the MIT Committee on the Use of Humans as Experimental Subjects, protocol number E-4115. The research questions and methodology have been pre-registered as "Human-AI Self-explainability" with protocol number #94860 via https://aspredicted.org/.

4.7 Analysis
The purpose of this experimental study is to examine the effects of causal AI-explanations and AI-framed Questioning in supporting human discernment of logical validity. For "Logically Valid" or "Logically Invalid" statements (based on the pre-defined logical validity of the statement stimuli by design), a multivariate analysis of covariance (MANCOVA) was conducted to examine the main effects of intervention conditions (Causal AI-explanation, AI-framed Questioning, No-explanation) on participants' weighted discernment accuracy (range: 0-100) and perceived information insufficiency (range: 1-7), while controlling personal factors (i.e. prior belief and prior knowledge for any statement topic, trust in AI, cognitive reflection) as covariates. Further post hoc tests with Benjamini-Hochberg correction were conducted to identify how intervention conditions differ from each other.
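One way to set up the MANCOVA described in Section 4.7 is sketched below with statsmodels on synthetic stand-in data; the column names, the synthetic values, and the model call are illustrative assumptions rather than the authors' actual analysis code.

import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Synthetic stand-in data: one row per statement evaluation.
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "condition": rng.choice(["control", "causal", "questioning"], size=n),
    "prior_belief": rng.integers(1, 8, size=n),
    "prior_knowledge": rng.integers(1, 8, size=n),
    "trust_in_ai": rng.integers(1, 8, size=n),
    "cognitive_reflection": rng.integers(0, 4, size=n),
})
df["weighted_discernment"] = rng.normal(60, 20, size=n).clip(0, 100)
df["info_insufficiency"] = rng.integers(1, 8, size=n)

# Both dependent variables are modelled on the intervention condition while
# the personal factors enter as covariates.
mancova = MANOVA.from_formula(
    "weighted_discernment + info_insufficiency ~ C(condition)"
    " + prior_belief + prior_knowledge + trust_in_ai + cognitive_reflection",
    data=df,
)
print(mancova.mv_test())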
5 RESULTS
Findings of valid and invalid statements are reported separately in the following sections.

For invalid statements, MANCOVA results revealed an overall significant main effect of the intervention conditions on the weighted discernment accuracy (F(2, 1007) = 15.3, p < .001 < .05) and the perceived information insufficiency (F(2, 1007) = 5.0, p = .007 < .05) after controlling for the effects of personal factors (i.e. prior belief and knowledge, trust in AI, cognitive reflection). Furthermore, several covariates were found to be significant predictors of our two dependent variables, meaning they significantly adjusted the relationship between interventions and the two dependent variables. For example, the weighted discernment accuracy was significantly affected by prior belief (F(1, 1007) = 6.9, p = .009 < .05) and cognitive reflection (F(1, 1007) = 7.9, p = .005 < .05), and the perceived information insufficiency was significantly affected by prior belief (F(1, 1007) = 22.7, p < .001 < .05), cognitive reflection (F(1, 1007) = 21.2, p < .001 < .05) and trust in AI (F(1, 1007) = 18.1, p < .001 < .05). However, prior knowledge as a covariate was not found significant.

For valid statements, MANCOVA results revealed an overall significant main effect of the intervention conditions on the weighted discernment accuracy (F(2, 1007) = 8.4, p < .001 < .05) and the perceived information insufficiency (F(2, 1007) = 11.3, p < .001 < .05) after controlling for the effects of various personal factors. Additionally, prior belief significantly affected the weighted discernment accuracy (F(1, 1007) = 7.4, p = .007 < .05) and the perceived information insufficiency (F(1, 1007) = 17.2, p < .001 < .05).

In summary, these findings indicate that the types of interventions have a significant main effect on the weighted discernment accuracy and the perceived information insufficiency across valid and invalid statements after controlling various personal factors.

5.1 Humans cannot identify logical fallacies very well on their own
We investigated the degree to which participants were able to discern the logical validity of statements and found that, without assistance of any AI feedback, the participants' raw discernment accuracy (Mean = 44%, SD = 26) was lower than the random guess success rate between valid or invalid (50% accuracy) when they were evaluating invalid statements, meaning that their responses were close to simply guessing, while participants supported by causal AI explanations or AI framed questioning achieved a raw discernment accuracy of 57% and 67%, respectively. Detailed MANCOVA findings below will present the significant differences between the three intervention conditions.

5.2 AI framed questioning helps improve discernment best
When evaluating invalid statements, after controlling for covariates, both the AI framed questioning condition (mean = 62.5, Std. Error = 1.9) and the causal AI explanation condition (mean = 55.0, Std. Error = 2.2) have a significantly better weighted discernment than the control condition (mean = 46.9, Std. Error = 2.1), with p(questioning - control) < .001 < .017 and p(explanation - control) = .007 < .025 respectively. Moreover, those supported by AI framed questioning also discerned significantly better than those supported by causal AI explanations, p(questioning - explanation) = .009 < .05. Note that the original 0.05 critical value of significance has been adjusted using Benjamini-Hochberg correction to 0.017 for the first rank comparison, .025 for the second rank comparison, and to 0.05 for the third rank comparison among the 3 pairwise posthoc group comparisons.

When evaluating valid statements, after controlling for covariates, both the AI framed questioning condition (mean = 74.7, Std. Error = 1.6) and the causal AI explanation condition (mean = 78.2, Std. Error = 1.7) have a significantly better weighted discernment than the control condition (mean = 68.2, Std. Error = 1.8), with p(questioning - control) = .007 < .025 and p(explanation - control) < .001 < .017 respectively. However, the two AI intervention conditions did not differ significantly from each other in the weighted discernment accuracy, p(questioning - explanation) = .126 > .05 (adjusted by Benjamini-Hochberg correction).

In general, both AI framed questioning and causal AI explanations helped participants discern significantly better than no feedback. In particular, when encountering fallacies, participants discerned better with AI framed questioning than those with causal AI explanations. In other words, AI framed questioning helps individuals discern best regardless of personal factors.

Figure 4: The interface for displaying feedback to participants. Left: No-Explanation, Center: Causal AI-explanations, Right: AI-framed Questioning
5.3 Getting causal AI explanation feedback lowers the perceived information insufficiency
When evaluating invalid statements, after controlling for covariates, only the causal AI explanation condition (mean = 4.4, Std. Error = 0.1) has a significantly lower perceived information insufficiency than the control condition (mean = 4.9, Std. Error = 0.1), p(explanation - control) = .002 < .017.

When evaluating valid statements, after controlling for covariates, the causal AI explanation condition (mean = 4.4, Std. Error = 0.1) has a significantly lower perceived information insufficiency than the control condition (mean = 4.9, Std. Error = 0.1), p(explanation - control) < .001 < .017, and the AI framed questioning condition, p(explanation - questioning) < .001 < .025 (adjusted by Benjamini-Hochberg correction).

Such a finding suggests that individuals tend to find the given information is sufficient enough to support the claim (as measured by a significantly lower perceived information insufficiency) when their judgement is corroborated by a second opinion from AI in the causal explanation form. In other words, when supported by causal AI explanations, individuals are more likely to be satisfied with the given information and potentially would not seek further information to verify the claim.

5.4 Personal factors play roles in the weighted discernment accuracy and the perceived information sufficiency
MANCOVA also revealed that several personal factors were found to be significant predictors of participants' weighted discernment accuracy and perceived information sufficiency.

For invalid statements, a weaker prior belief (F(1, 1007) = 6.9, p = .009 < .05) or a higher cognitive reflection (F(1, 1007) = 7.9, p = .005 < .05) is significantly associated with a higher weighted discernment accuracy. Additionally, a weaker prior belief (F(1, 1007) = 22.7, p < .001 < .05) or a higher cognitive reflection (F(1, 1007) = 21.2, p < .001 < .05) or a lower trust in AI (F(1, 1007) = 18.1, p < .001 < .05) is significantly associated with a higher perceived information insufficiency.

For valid statements, a greater prior belief is significantly associated with a higher weighted discernment accuracy (F(1, 1007) = 7.4, p = .007 < .05) and a lower perceived information insufficiency (F(1, 1007) = 17.2, p < .001 < .05).

Overall, these significant effects about personal factors as covariates from MANCOVA suggest the two AI interventions have a main training effect in improving discernment accuracy despite personal

Figure 5: Overview of the effects of AI systems on human discernment of logically valid and invalid statements. (A) Weighted discernment accuracy for the different feedback types on logically valid and invalid statements. (B) The inverted users' rating of information being insufficient to rate the claim as true for the different feedback types on logically valid and invalid statements. (C) Time for users to complete the study for the different feedback types.