Detecting Hallucinations in Large Language Models Using Semantic Entropy
Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, Yarin Gal
Summary
This paper presents a new method for detecting hallucinations in large language models (LLMs). The method uses semantic entropy to estimate the uncertainty of an LLM's output, focusing on meaning rather than on specific word sequences, and it works across datasets and tasks without task-specific prior knowledge.
Full Transcript
Article
Detecting hallucinations in large language models using semantic entropy
https://doi.org/10.1038/s41586-024-07421-0
Sebastian Farquhar1,2 ✉, Jannik Kossen1,2, Lorenz Kuhn1,2 & Yarin Gal1
Received: 17 July 2023; Accepted: 12 April 2024; Published online: 19 June 2024; Open access
1OATML, Department of Computer Science, University of Oxford, Oxford, UK. 2These authors contributed equally: Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn. ✉e-mail: [email protected]

Large language model (LLM) systems, such as ChatGPT1 or Gemini2, can show impressive reasoning and question-answering capabilities but often 'hallucinate' false outputs and unsubstantiated answers3,4. Answering unreliably or without the necessary information prevents adoption in diverse fields, with problems including fabrication of legal precedents5 or untrue facts in news articles6 and even posing a risk to human life in medical domains such as radiology7. Encouraging truthfulness through supervision or reinforcement has been only partially successful8. Researchers need a general method for detecting hallucinations in LLMs that works even with new and unseen questions to which humans might not know the answer. Here we develop new methods grounded in statistics, proposing entropy-based uncertainty estimators for LLMs to detect a subset of hallucinations—confabulations—which are arbitrary and incorrect generations. Our method addresses the fact that one idea can be expressed in many ways by computing uncertainty at the level of meaning rather than specific sequences of words. Our method works across datasets and tasks without a priori knowledge of the task, requires no task-specific data and robustly generalizes to new tasks not seen before. By detecting when a prompt is likely to produce a confabulation, our method helps users understand when they must take extra care with LLMs and opens up new possibilities for using LLMs that are otherwise prevented by their unreliability.

'Hallucinations' are a critical problem9 for natural language generation systems using large language models (LLMs), such as ChatGPT1 or Gemini2, because users cannot trust that any given output is correct.

Hallucinations are often defined as LLMs generating "content that is nonsensical or unfaithful to the provided source content"9–11, but they have come to include a vast array of failures of faithfulness and factuality. We focus on a subset of hallucinations which we call 'confabulations'12, for which LLMs fluently make claims that are both wrong and arbitrary—by which we mean that the answer is sensitive to irrelevant details such as random seed. For example, when asked a medical question "What is the target of Sotorasib?" an LLM confabulates by sometimes answering KRASG12C (correct) and other times KRASG12D (incorrect) despite identical instructions.
We distinguish this from cases in which a similar 'symptom' is caused by the following different mechanisms: when LLMs are consistently wrong as a result of being trained on erroneous data such as common misconceptions13; when the LLM 'lies' in pursuit of a reward14; or systematic failures of reasoning or generalization. We believe that combining these distinct mechanisms in the broad category hallucination is unhelpful. Our method makes progress on a portion of the problem of providing scalable oversight15 by detecting confabulations that people might otherwise find plausible. However, it does not guarantee factuality because it does not help when LLM outputs are systematically bad. Nevertheless, we significantly improve question-answering accuracy for state-of-the-art LLMs, revealing that confabulations are a great source of error at present.

We show how to detect confabulations by developing a quantitative measure of when an input is likely to cause an LLM to generate arbitrary and ungrounded answers. Detecting confabulations allows systems built on LLMs to avoid answering questions likely to cause confabulations, to make users aware of the unreliability of answers to a question or to supplement the LLM with more grounded search or retrieval. This is essential for the critical emerging field of free-form generation, in which naive approaches, suited to closed vocabulary and multiple choice, fail. Past work on uncertainty for LLMs has focused on simpler settings, such as classifiers16,17 and regressors18,19, whereas the most exciting applications of LLMs relate to free-form generations.

The term hallucination in the context of machine learning originally comes from filling in ungrounded details, either as a deliberate strategy20 or as a reliability problem4. The appropriateness of the metaphor has been questioned as promoting undue anthropomorphism21. Although we agree that metaphor must be used carefully with LLMs22, the widespread adoption of the term hallucination reflects the fact that it points to an important phenomenon. This work represents a step towards making that phenomenon more precise.

To detect confabulations, we use probabilistic tools to define and then measure the 'semantic' entropy of the generations of an LLM—an entropy that is computed over meanings of sentences. High entropy corresponds to high uncertainty23–25, so semantic entropy is one way to estimate semantic uncertainties. Semantic uncertainty, the broader category of measures we introduce, could be operationalized with other measures of uncertainty, such as mutual information, instead.

Entropy in free-form generation is normally hard to measure because answers might mean the same thing (be semantically equivalent) despite being expressed differently (being syntactically or lexically distinct). This causes naive estimates of entropy or other lexical variation scores26 to be misleadingly high when the same correct answer might be written in many ways without changing its meaning.

By contrast, our semantic entropy moves towards estimating the entropy of the distribution of meanings of free-form answers to questions, insofar as that is possible, rather than the distribution over the 'tokens' (words or word-pieces) which LLMs natively represent. This can be seen as a kind of semantic consistency check27 for random-seed variation. An overview of our approach is provided in Fig. 1 and a worked example in Supplementary Table 1.

[Figure 1 omitted: panel a contrasts misleadingly high naive entropy with low semantic entropy for answers to "Where is the Eiffel Tower?" ('Paris', 'It's Paris', 'France's capital Paris', 'Rome', ...); panel b shows the application to FactualBio paragraphs, with a generated biography of Freddie Frith decomposed into factoids.] Fig. 1 | Overview of semantic entropy and confabulation detection. a, Naive entropy-based uncertainty measures variation in the exact answers, treating 'Paris', 'It's Paris' and 'France's capital Paris' as different. But this is unsuitable for language tasks, for which different answers sometimes mean the same thing. Our semantic entropy clusters answers which share meanings before computing the entropy. A low semantic entropy shows that the LLM is confident about the meaning. b, Semantic entropy can also detect confabulations in longer passages. We automatically decompose a long generated answer into factoids. For each factoid, an LLM generates questions to which that factoid might have been the answer. The original LLM then samples M possible answers to these questions. Finally, we compute the semantic entropy over the answers to each specific question, including the original factoid. Confabulations are indicated by high average semantic entropy for questions associated with that factoid. Here, semantic entropy classifies Fact 1 as probably not a confabulation because generations often mean the same thing, despite very different wordings, which a naive entropy would have missed.

Intuitively, our method works by sampling several possible answers to each question and clustering them algorithmically into answers that have similar meanings, which we determine on the basis of whether answers in the same cluster entail each other bidirectionally28. That is, if sentence A entails that sentence B is true and vice versa, then we consider them to be in the same semantic cluster. We measure entailment using both general-purpose LLMs and natural language inference (NLI) tools developed specifically for detecting entailment, for which we show direct evaluations in Supplementary Tables 2 and 3 and Supplementary Fig. 1. Textual entailment has previously been shown to correlate with faithfulness10 in the context of factual consistency29, as well as being used to measure factuality in abstractive summarization30, especially when applied at the right granularity31.
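The bidirectional entailment test can be sketched with an off-the-shelf NLI model. This is a minimal illustration rather than the authors' released implementation: the checkpoint is our choice for the example, and the paper additionally conditions entailment checks on the question, which we omit here for brevity.

```python
# Minimal sketch of bidirectional-entailment clustering, assuming the Hugging
# Face `transformers` library and an MNLI-style classifier. The specific model
# is an illustrative choice, not necessarily the one used in the paper.
from transformers import pipeline

nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def entails(premise: str, hypothesis: str) -> bool:
    """True if the NLI model predicts ENTAILMENT for premise -> hypothesis."""
    result = nli({"text": premise, "text_pair": hypothesis})[0]
    return result["label"] == "ENTAILMENT"

def semantically_equivalent(a: str, b: str) -> bool:
    """Bidirectional entailment: a entails b and b entails a."""
    return entails(a, b) and entails(b, a)

def cluster_by_meaning(answers: list[str]) -> list[list[str]]:
    """Greedy clustering: join the first cluster whose representative is
    semantically equivalent to the answer, else start a new cluster."""
    clusters: list[list[str]] = []
    for ans in answers:
        for cluster in clusters:
            if semantically_equivalent(cluster[0], ans):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    return clusters

# Example: the first two answers should share a cluster, the third should not.
print(cluster_by_meaning(["Paris", "It's Paris", "Rome"]))
```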
Semantic entropy detects confabulations in free-form text generation across a range of language models and domains, without previous domain knowledge. Our evaluations cover question answering in trivia knowledge (TriviaQA32), general knowledge (SQuAD 1.1; ref. 33), life sciences (BioASQ34) and open-domain natural questions (NQ-Open35) derived from actual queries to Google Search36. In addition, semantic entropy detects confabulations in mathematical word problems (SVAMP37) and in a biography-generation dataset, FactualBio, accompanying this paper.

Our results for TriviaQA, SQuAD, BioASQ, NQ-Open and SVAMP are all evaluated context-free, involve sentence-length answers (96 ± 70 characters, mean ± s.d.) and use LLaMA 2 Chat (7B, 13B and 70B parameters)38, Falcon Instruct (7B and 40B)39 and Mistral Instruct (7B)40. In the Supplementary Information, we further consider short-phrase-length answers. Results for FactualBio (442 ± 122 characters) use GPT-4 (ref. 1). At the time of writing, GPT-4 (ref. 1) did not expose output probabilities41 or hidden states, although it does now. As a result, we propose a discrete approximation of our estimator for semantic entropy which allows us to run experiments without access to output probabilities. We use it for all GPT-4 results in this paper, and it performs similarly well.

[Figure 2 omitted: grouped bar charts of AUROC and AURAC for semantic entropy, discrete semantic entropy, naive entropy, embedding regression (in-distribution and out-of-distribution) and P(True) (ref. 24), across LLaMA 2 Chat 7B/13B/70B, Falcon 7B/40B Instruct and Mistral 7B Instruct; y-axes span roughly 0.3 to 0.8.] Fig. 2 | Detecting confabulations in sentence-length generations. Semantic entropy outperforms leading baselines and naive entropy. AUROC (scored on the y-axes) measures how well methods predict LLM mistakes, which correlate with confabulations. AURAC (likewise scored on the y-axes) measures the performance improvement of a system that refuses to answer questions which are judged likely to cause confabulations. Results are an average over five datasets, with individual metrics provided in the Supplementary Information.

Our confabulation detection with semantic entropy is more robust to user inputs from previously unseen domains than methods which aim to 'learn' how to detect confabulations from a set of example demonstrations. Our method is unsupervised, meaning that we do not need labelled examples of confabulations. By contrast, supervised methods detect confabulations by learning patterns behind examples of confabulations, assuming that future questions preserve these patterns. But this assumption is often untrue in new situations or with confabulations that human overseers are unable to identify (compare Fig. 17 of ref. 24). As a strong supervised baseline, we compare to an embedding regression method inspired by ref. 24, which trains a logistic regression classifier to predict whether the model correctly answered a question on the basis of the final 'embedding' (hidden state) of the LLM. We also use the P(True) method24, which looks at the probability with which an LLM predicts that the next token is 'True' when few-shot prompted to compare a main answer with 'brainstormed' alternatives.
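In outline, the embedding regression baseline is a logistic regression on the final hidden state. A minimal sketch with placeholder arrays standing in for features and labels; collecting the real ones requires white-box access to the LLM, and the shapes here are purely illustrative:

```python
# Sketch of the supervised embedding-regression baseline: logistic regression
# on the LLM's final hidden state, predicting whether an answer was correct.
# `hidden_states` and `is_correct` are placeholder stand-ins for real data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1000, 4096))  # one row per question-answer pair
is_correct = rng.integers(0, 2, size=1000)     # 1 = correct answer, 0 = incorrect

train, test = slice(0, 800), slice(800, 1000)
clf = LogisticRegression(max_iter=1000).fit(hidden_states[train], is_correct[train])

# Score the binary event "answer is incorrect", matching the AUROC setup
# described in the next section.
p_incorrect = clf.predict_proba(hidden_states[test])[:, 0]
print("AUROC:", roc_auc_score(1 - is_correct[test], p_incorrect))
```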
Detecting confabulations in QA and math

Confabulations contribute substantially to incorrect answers given by language models. We show that semantic entropy can be used to predict many incorrect model answers and to improve question-answering accuracy by refusing to answer those questions the model is uncertain about. Corresponding to these two uses, we evaluate two main metrics. First, the widely used area under the receiver operating characteristic (AUROC) curve for the binary event that a given answer is incorrect. This measure captures both precision and recall and ranges from 0 to 1, with 1 representing a perfect classifier and 0.5 representing an uninformative classifier. We also show a new measure, the area under the 'rejection accuracy' curve (AURAC). This studies the case in which the confabulation detection score is used to refuse to answer the questions judged most likely to cause confabulations. Rejection accuracy is the accuracy of the answers of the model on the remaining questions, and the area under this curve is a summary statistic over many thresholds (representative threshold accuracies are provided in the Supplementary Material). The AURAC captures the accuracy improvement which users would experience if semantic entropy were used to filter out the questions causing the highest entropy.
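Because AURAC is introduced by this paper rather than available in standard libraries, a sketch may help. This is our reading of the definition above, with hypothetical `uncertainty` scores (higher means more likely confabulation) and binary correctness labels:

```python
# Sketch of the area under the rejection-accuracy curve (AURAC): answer only
# the fraction of questions with the lowest uncertainty, track accuracy on
# what remains, and integrate over the answered fraction.
import numpy as np

def rejection_accuracy_curve(uncertainty, correct, n_points=100):
    """Accuracy on the X% most confident questions, X swept from 1% to 100%."""
    order = np.argsort(uncertainty)                  # most confident first
    correct = np.asarray(correct, dtype=float)[order]
    fractions = np.linspace(0.01, 1.0, n_points)
    accuracies = np.array(
        [correct[: max(1, int(round(f * len(correct))))].mean() for f in fractions]
    )
    return fractions, accuracies

def aurac(uncertainty, correct):
    fractions, accuracies = rejection_accuracy_curve(uncertainty, correct)
    return np.trapz(accuracies, fractions)           # area under the curve

# Toy check: when high uncertainty lines up with wrong answers, AURAC is high.
u = np.array([0.1, 0.2, 0.9, 0.8, 0.3])
c = np.array([1, 1, 0, 0, 1])
print(aurac(u, c))
```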
In Fig. 2, we show that both semantic entropy and its discrete approximation outperform our best baselines for sentence-length generations. These results are averaged across datasets and provide the actual scores on the held-out evaluation dataset. We report the raw average score across held-out evaluation datasets without standard error because the distributional characteristics are more a property of the models and datasets selected than of the method. Consistency of relative results across different datasets is a stronger indicator of variation in this case.

Semantic entropy greatly outperforms the naive estimation of uncertainty using entropy: computing the entropy of the length-normalized joint probability of the token sequences. Naive entropy estimation ignores the fact that token probabilities also express the uncertainty of the model over phrasings that do not change the meaning of an output.

Our methods also outperform the supervised embedding regression method both in- and out-of-distribution. In pale-yellow bars we show that embedding regression performance deteriorates when its training data do not match the deployment distribution—which mirrors the common real-world case in which there is a distribution shift between training and deployment42—the plotted value is the average metric for embedding regression trained on one of the four 'off-distribution' datasets for that evaluation. This is critical because reliable uncertainty is most important when the data distribution shifts. Semantic entropy also outperforms P(True), which is supervised 'in-context'; that is, it is adapted to the deployment task with a few training examples provided in the LLM prompt itself. The discrete variant of semantic entropy performs similarly to our standard estimator, despite not requiring exact output probabilities.

Averaged across the 30 combinations of tasks and models we study, semantic entropy achieves the best AUROC value of 0.790, whereas naive entropy (0.691), P(True) (0.698) and the embedding regression baseline (0.687) lag behind it. Semantic entropy performs well consistently, with stable performance (between 0.78 and 0.81 AUROC) across the different model families (LLaMA, Falcon and Mistral) and scales (from 7B to 70B parameters) which we study (we report summary statistics for each dataset and model as before). Although semantic entropy outperforms the baselines across all model sizes, P(True) seems to improve with model size, suggesting that it might become more competitive for very capable honest models in settings that the model understands well (which are, however, not the most important cases in which to have good uncertainty). We use ten generations to compute entropy, selected using the analysis in Supplementary Fig. 2. Further results for short-phrase generations are described in Supplementary Figs. 7–10.

The results in Fig. 2 offer a lower bound on the effectiveness of semantic entropy at detecting confabulations. These evaluations determine whether semantic entropy and baseline methods can detect when the answers of the model are incorrect (which we validate against human correctness evaluations in Supplementary Table 4). In addition to errors from confabulations (arbitrary incorrectness), this also includes other types of mistakes for which semantic entropy is not suited, such as consistent errors learned from the training data. The fact that methods such as embedding regression are able to spot other kinds of errors, not just confabulations, but are still outperformed by semantic entropy suggests that confabulations are a principal category of errors for actual generations.

Examples of questions and answers from TriviaQA, SQuAD and BioASQ, for LLaMA 2 Chat 70B, are shown in Table 1. These illustrate how only semantic entropy detects when the meaning is constant but the form varies (the first row of the table), whereas semantic entropy and naive entropy both correctly predict the presence of confabulations when the form and meaning vary together (second row) and predict the absence of confabulations when the form and meaning are both constant across several resampled generations (third row). In the final row, we give an example in which semantic entropy is erroneously high as a result of overly sensitive semantic clustering relative to the reference answer. Our clustering method distinguishes the answers which provide a precise date from those which only provide a year. For some contexts that would have been correct, but in this context the distinction between the specific day and the year is probably irrelevant. This highlights the importance of context and judgement in clustering, especially in subtle cases, as well as the shortcomings of evaluating against fixed reference answers, which do not capture the open-ended flexibility of conversational deployments of LLMs.

Table 1 | Semantic entropy applied to examples

Example 1:
- Question: The refineries, process chemical, power generation, mills and manufacturing plants are under what sector of construction?
- Model answer: Refineries, process chemical, power generation, mills and manufacturing plants are under the industrial sector of construction.
- Reference answer: Industrial or heavy civil or heavy engineering.
- Generations for entropy: "All the above are under the industrial sector of construction."; "Refineries, process chemical, power generation, mills and manufacturing plants are under the industrial sector of construction."; "These are all under the heavy industrial sector of construction."
- Prediction: Not confabulation (semantic entropy); Confabulation (naive entropy)

Example 2:
- Question: What is the function of the stard10 protein?
- Model answer: The stard10 protein functions as a negative regulator of the mTOR pathway by inhibiting the activity of the mTORC1 and mTORC2 complexes.
- Reference answer: STARD10 is a lipid transfer protein.
- Generations for entropy: "The stard10 protein plays a crucial role in regulating the dynamics of the actin cytoskeleton and cell adhesion during cell migration and cell division."; "The stard10 protein functions as a negative regulator of the meiotic recombination process, specifically by inhibiting the activity of the Dmc1 recombinase enzyme."; "The stard10 protein functions as a regulator of lipid metabolism, promoting the breakdown of lipids in the liver and inhibiting their synthesis in the liver and adipose tissue."
- Prediction: Confabulation (semantic entropy); Confabulation (naive entropy)

Example 3:
- Question: Of which European country is 'fado' the national music?
- Model answer: Portugal.
- Reference answer: Portugal.
- Generations for entropy: "'Fado' is the national music of Portugal."; "'Fado' is the national music of Portugal."; "Fado is the national music of Portugal."
- Prediction: Not confabulation (semantic entropy); Not confabulation (naive entropy)

Example 4:
- Question: When was BSkyB's digital service launched?
- Model answer: 1 October 1998.
- Reference answer: BSkyB's digital service was launched in 1998.
- Generations for entropy: "BSkyB's digital service was launched in 1998."; "BSkyB's digital service was launched on 1 October 1998."; "BSkyB's digital service was launched on 1 October 1998."
- Prediction: Confabulation (semantic entropy); Not confabulation (naive entropy)

The first row of Table 1 demonstrates a case in which semantic entropy correctly predicts that an answer is not a confabulation where naive entropy would incorrectly predict one: all of the generations from the model mean the same thing as each other, so they are clustered together despite using different phrasings. The second row provides an example in which semantic entropy and naive entropy would both correctly predict a confabulation: each generation is both lexically distinct and means something different. The third row is an example in which semantic entropy and naive entropy would both correctly predict no confabulation because the multiple generations are almost lexically identical. The fourth row gives an example in which semantic entropy might fail but naive entropy might succeed. In our experiment, semantic entropy clustered the answers into those which provided a specific date and those which gave only a year and so treated the model as 'uncertain'. This highlights the importance of context in semantic clustering. The examples come from LLaMA 2 Chat 70B generations for SQuAD, BioASQ and TriviaQA.
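The fourth row's failure mode is easy to reproduce with the clustering sketch from earlier; whether the year-only and date-plus-year answers share a cluster turns on a borderline entailment call, so the outcome depends on the NLI model used:

```python
# Usage example: the Table 1 fourth-row generations, run through the
# bidirectional-entailment sketch above. The date-specific answers entail the
# year-only answer but typically not vice versa, so they land in separate
# clusters and semantic entropy comes out high.
generations = [
    "BSkyB's digital service was launched in 1998.",
    "BSkyB's digital service was launched on 1 October 1998.",
    "BSkyB's digital service was launched on 1 October 1998.",
]
for cluster in cluster_by_meaning(generations):
    print(cluster)
```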
Detecting confabulations in biographies

Semantic entropy is most natural for sentences that express a single proposition, but the idea of semantic equivalence is trickier to apply to longer passages which express many propositions that might only agree partially43. Nevertheless, we can use semantic entropy to detect confabulations in longer generations, such as entire paragraphs of text. To show this, we develop a dataset of biographical generations from GPT-4 (v.0613) for 21 individuals notable enough to have their own Wikipedia page but without extensive online biographies. From each biography generated by GPT-4, we automatically extract propositional factual claims about the individual (150 factual claims in total), which we manually label as true or false.
In addition, extensions to alternative input 0.5 variations such as rephrasing or counterfactual scenarios would allow a similar method to act as a form of cross-examination44 for scalable 0.4 oversight through debate45. AUROC AURAC 80% 90% 95% 100% The success of semantic entropy at detecting errors suggests that Rejection accuracy LLMs are even better at “knowing what they don’t know” than was argued by ref. 24—they just don’t know they know what they don’t Fig. 3 | Detecting GPT-4 confabulations in paragraph-length biographies. know. Our method explicitly does not directly address situations in The discrete variant of our semantic entropy estimator outperforms baselines both when measured by AUROC and AURAC metrics (scored on the y-axis). The which LLMs are confidently wrong because they have been trained AUROC and AURAC are substantially higher than for both baselines. At above with objectives that systematically produce dangerous behaviour, 80% of questions being answered, semantic entropy has the highest accuracy. cause systematic reasoning errors or are systematically mislead- Only when the top 20% of answers judged most likely to be confabulations are ing the user. We believe that these represent different underlying rejected does the answer accuracy on the remainder for the P(True) baseline mechanisms—despite similar ‘symptoms’—and need to be handled exceed semantic entropy. separately. One exciting aspect of our approach is the way it makes use of clas- sical probabilistic machine learning methods and adapts them to the factual claims about the individual (150 factual claims in total), which unique properties of modern LLMs and free-form language generation. we manually label as true or false. We hope to inspire a fruitful exchange of well-studied methods and Applying semantic entropy to this problem is challenging. Naively, emerging new problems by highlighting the importance of meaning one might simply regenerate each sentence (conditioned on the text when addressing language-based machine learning problems. so far) and then compute semantic entropy over these regenerations. However, the resampled sentences often target different aspects of the biography: for example, one time describing family and the next Online content time profession. This is analogous to the original problem semantic Any methods, additional references, Nature Portfolio reporting summa- entropy was designed to resolve: the model is uncertain about the right ries, source data, extended data, supplementary information, acknowl- ordering of facts, not about the facts themselves. To address this, we edgements, peer review information; details of author contributions break down the entire paragraph into factual claims and reconstruct and competing interests; and statements of data and code availability questions which might have been answered by those claims. Only then are available at https://doi.org/10.1038/s41586-024-07421-0. do we apply semantic entropy (Fig. 1) by generating three new answers to each question (selected with analysis in Supplementary Figs. 3 and 4) 1. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023). and computing the semantic entropy over those generations plus the 2. Gemini: a family of highly capable multimodal models. Preprint at https://arxiv.org/abs/ original factual claim. We aggregate these by averaging the semantic 2312.11805 (2023). 3. Xiao, Y. & Wang, W. Y. 
Discussion

Our probabilistic approach, accounting for semantic equivalence, detects an important class of hallucinations: those that are caused by a lack of LLM knowledge. These are a substantial portion of the failures at present and will continue even as models grow in capabilities, because situations and cases that humans cannot reliably supervise will persist. Confabulations are a particularly noteworthy failure mode for question answering but appear in other domains too. Semantic entropy needs no previous domain knowledge and we expect that algorithmic adaptations to other problems will allow similar advances in, for example, abstractive summarization. In addition, extensions to alternative input variations, such as rephrasing or counterfactual scenarios, would allow a similar method to act as a form of cross-examination44 for scalable oversight through debate45.

The success of semantic entropy at detecting errors suggests that LLMs are even better at "knowing what they don't know" than was argued by ref. 24—they just don't know that they know what they don't know. Our method explicitly does not directly address situations in which LLMs are confidently wrong because they have been trained with objectives that systematically produce dangerous behaviour, cause systematic reasoning errors or systematically mislead the user. We believe that these represent different underlying mechanisms—despite similar 'symptoms'—and need to be handled separately.

One exciting aspect of our approach is the way it makes use of classical probabilistic machine learning methods and adapts them to the unique properties of modern LLMs and free-form language generation. We hope to inspire a fruitful exchange of well-studied methods and emerging new problems by highlighting the importance of meaning when addressing language-based machine learning problems.

Online content

Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-024-07421-0.

1. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).
2. Gemini: a family of highly capable multimodal models. Preprint at https://arxiv.org/abs/2312.11805 (2023).
3. Xiao, Y. & Wang, W. Y. On hallucination and predictive uncertainty in conditional language generation. In Proc. 16th Conference of the European Chapter of the Association for Computational Linguistics 2734–2744 (Association for Computational Linguistics, 2021).
4. Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T. & Saenko, K. Object hallucination in image captioning. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing (eds Riloff, E., Chiang, D., Hockenmaier, J. & Tsujii, J.) 4035–4045 (Association for Computational Linguistics, 2018).
5. Weiser, B. Lawyer who used ChatGPT faces penalty for made up citations. The New York Times (8 June 2023).
6. Opdahl, A. L. et al. Trustworthy journalism through AI. Data Knowl. Eng. 146, 102182 (2023).
7. Shen, Y. et al. ChatGPT and other large language models are double-edged swords. Radiology 307, e230163 (2023).
8. Schulman, J. Reinforcement learning from human feedback: progress and challenges. Presented at the Berkeley EECS Colloquium. YouTube www.youtube.com/watch?v=hhiLw5Q_UFg (2023).
9. Ji, Z. et al. Survey of hallucination in natural language generation. ACM Comput. Surv. 55, 248 (2023).
10. Maynez, J., Narayan, S., Bohnet, B. & McDonald, R. On faithfulness and factuality in abstractive summarization. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds Jurafsky, D., Chai, J., Schluter, N. & Tetreault, J.) 1906–1919 (Association for Computational Linguistics, 2020).
11. Filippova, K. Controlled hallucinations: learning to generate faithfully from noisy data. In Findings of the Association for Computational Linguistics: EMNLP 2020 (eds Webber, B., Cohn, T., He, Y. & Liu, Y.) 864–870 (Association for Computational Linguistics, 2020).
12. Berrios, G. Confabulations: a conceptual history. J. Hist. Neurosci. 7, 225–241 (1998).
13. Lin, S., Hilton, J. & Evans, O. Teaching models to express their uncertainty in words. Transact. Mach. Learn. Res. (2022).
14. Evans, O. et al. Truthful AI: developing and governing AI that does not lie. Preprint at https://arxiv.org/abs/2110.06674 (2021).
15. Amodei, D. et al. Concrete problems in AI safety. Preprint at https://arxiv.org/abs/1606.06565 (2016).
16. Jiang, Z., Araki, J., Ding, H. & Neubig, G. How can we know when language models know? On the calibration of language models for question answering. Transact. Assoc. Comput. Linguist. 9, 962–977 (2021).
17. Desai, S. & Durrett, G. Calibration of pre-trained transformers. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B., Cohn, T., He, Y. & Liu, Y.) 295–302 (Association for Computational Linguistics, 2020).
18. Glushkova, T., Zerva, C., Rei, R. & Martins, A. F. Uncertainty-aware machine translation evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2021 (eds Moens, M.-F., Huang, X., Specia, L. & Yih, S.) 3920–3938 (Association for Computational Linguistics, 2021).
19. Wang, Y., Beck, D., Baldwin, T. & Verspoor, K. Uncertainty estimation and reduction of pre-trained models for text regression. Transact. Assoc. Comput. Linguist. 10, 680–696 (2022).
20. Baker, S. & Kanade, T. Hallucinating faces. In Proc. Fourth IEEE International Conference on Automatic Face and Gesture Recognition 83–88 (IEEE, Catalogue no. PR00580, 2002).
21. Eliot, L. AI ethics lucidly questioning this whole hallucinating AI popularized trend that has got to stop. Forbes Magazine (24 August 2022).
22. Shanahan, M. Talking about large language models. Commun. Assoc. Comp. Machinery 67, 68–79 (2024).
23. MacKay, D. J. C. Information-based objective functions for active data selection. Neural Comput. 4, 590–604 (1992).
24. Kadavath, S. et al. Language models (mostly) know what they know. Preprint at https://arxiv.org/abs/2207.05221 (2022).
25. Lindley, D. V. On a measure of the information provided by an experiment. Ann. Math. Stat. 27, 986–1005 (1956).
26. Xiao, T. Z., Gomez, A. N. & Gal, Y. Wat zei je? Detecting out-of-distribution translations with variational transformers. In Workshop on Bayesian Deep Learning at the Conference on Neural Information Processing Systems (NeurIPS, Vancouver, 2019).
27. Christiano, P., Cotra, A. & Xu, M. Eliciting Latent Knowledge (Alignment Research Center, 2021); https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit.
28. Negri, M., Bentivogli, L., Mehdad, Y., Giampiccolo, D. & Marchetti, A. Divide and conquer: crowdsourcing the creation of cross-lingual textual entailment corpora. In Proc. 2011 Conference on Empirical Methods in Natural Language Processing 670–679 (Association for Computational Linguistics, 2011).
29. Honovich, O. et al. TRUE: re-evaluating factual consistency evaluation. In Proc. Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering 161–175 (Association for Computational Linguistics, 2022).
30. Falke, T., Ribeiro, L. F. R., Utama, P. A., Dagan, I. & Gurevych, I. Ranking generated summaries by correctness: an interesting but challenging application for natural language inference. In Proc. 57th Annual Meeting of the Association for Computational Linguistics 2214–2220 (Association for Computational Linguistics, 2019).
31. Laban, P., Schnabel, T., Bennett, P. N. & Hearst, M. A. SummaC: re-visiting NLI-based models for inconsistency detection in summarization. Trans. Assoc. Comput. Linguist. 10, 163–177 (2022).
32. Joshi, M., Choi, E., Weld, D. S. & Zettlemoyer, L. TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In Proc. 55th Annual Meeting of the Association for Computational Linguistics 1601–1611 (Association for Computational Linguistics, 2017).
33. Rajpurkar, P., Zhang, J., Lopyrev, K. & Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Proc. 2016 Conference on Empirical Methods in Natural Language Processing (eds Su, J., Duh, K. & Carreras, X.) 2383–2392 (Association for Computational Linguistics, 2016).
34. Tsatsaronis, G. et al. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 16, 138 (2015).
35. Lee, K., Chang, M.-W. & Toutanova, K. Latent retrieval for weakly supervised open domain question answering. In Proc. 57th Annual Meeting of the Association for Computational Linguistics 6086–6096 (Association for Computational Linguistics, 2019).
36. Kwiatkowski, T. et al. Natural questions: a benchmark for question answering research. Transact. Assoc. Comput. Linguist. 7, 452–466 (2019).
37. Patel, A., Bhattamishra, S. & Goyal, N. Are NLP models really able to solve simple math word problems? In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Toutanova, K. et al.) 2080–2094 (Association for Computational Linguistics, 2021).
38. Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288 (2023).
39. Penedo, G. et al. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. In Proc. 36th Conference on Neural Information Processing Systems (eds Oh, A. et al.) 79155–79172 (Curran Associates, 2023).
40. Jiang, A. Q. et al. Mistral 7B. Preprint at https://arxiv.org/abs/2310.06825 (2023).
41. Manakul, P., Liusie, A. & Gales, M. J. F. SelfCheckGPT: zero-resource black-box hallucination detection for generative large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023 (eds Bouamor, H., Pino, J. & Bali, K.) 9004–9017 (Association for Computational Linguistics, 2023).
42. Mukhoti, J., Kirsch, A., van Amersfoort, J., Torr, P. H. & Gal, Y. Deep deterministic uncertainty: a new simple baseline. In IEEE/CVF Conference on Computer Vision and Pattern Recognition 24384–24394 (Computer Vision Foundation, 2023).
43. Schuster, T., Chen, S., Buthpitiya, S., Fabrikant, A. & Metzler, D. Stretching sentence-pair NLI models to reason over long documents and clusters. In Findings of the Association for Computational Linguistics: EMNLP 2022 (eds Goldberg, Y. et al.) 394–412 (Association for Computational Linguistics, 2022).
44. Barnes, B. & Christiano, P. Progress on AI safety via debate. AI Alignment Forum www.alignmentforum.org/posts/Br4xDbYu4Frwrb64a/writeup-progress-on-ai-safety-via-debate-1 (2020).
45. Irving, G., Christiano, P. & Amodei, D. AI safety via debate. Preprint at https://arxiv.org/abs/1805.00899 (2018).

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2024

Methods

Semantic entropy as a strategy for overcoming confabulation builds on probabilistic tools for uncertainty estimation. It can be applied directly to any LLM or similar foundation model without requiring any modifications to the architecture. Our 'discrete' variant of semantic uncertainty can be applied even when the predicted probabilities for the generations are not available, for example, because access to the internals of the model is limited.

In this section we introduce background on probabilistic methods and uncertainty in machine learning, discuss how it applies to language models and then discuss our contribution, semantic entropy, in detail.

Background

Uncertainty and machine learning. We aim to detect confabulations in LLMs, using the principle that the model will be uncertain about generations for which its output is going to be arbitrary.

One measure of uncertainty is the predictive entropy of the output distribution, which measures the information one has about the output given the input25. The predictive entropy (PE) for an input sentence x is the conditional entropy (H) of the output random variable Y with realization y given x,

$$\mathrm{PE}(x) = H(Y \mid x) = -\sum_{y} P(y \mid x)\, \ln P(y \mid x). \qquad (1)$$

A low predictive entropy indicates an output distribution which is heavily concentrated, whereas a high predictive entropy indicates that many possible outputs are similarly likely.

Aleatoric and epistemic uncertainty. We do not distinguish between aleatoric and epistemic uncertainty in our analysis. Researchers sometimes separate aleatoric uncertainty (uncertainty in the underlying data distribution) from epistemic uncertainty (caused by having only limited information)46. Further advances in uncertainty estimation which separate these kinds of uncertainty would enhance the potential of our semantic uncertainty approach by allowing extensions beyond entropy.

Joint probabilities of sequences of tokens. Generative LLMs produce strings of text by selecting tokens in sequence. Each token is a wordpiece that often represents three or four characters (though especially common sequences and important words such as numbers typically get their own token). To compute entropies, we need access to the probabilities the LLM assigns to the generated sequence of tokens. The probability of the entire sequence, s, conditioned on the context, x, is the product of the conditional probabilities of new tokens given past tokens, whose resulting log-probability is

$$\log P(s \mid x) = \sum_{i} \log P(s_i \mid s_{<i}, x),$$

where $s_i$ is the ith output token and $s_{<i}$ denotes the preceding tokens.
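The factorized log-probability is straightforward to compute with a white-box model. A minimal sketch using GPT-2 via Hugging Face transformers purely as an illustration (the paper's experiments use LLaMA 2, Falcon, Mistral and GPT-4); it also shows the length-normalized 'naive entropy' described in the main text, under our reading of that description:

```python
# Sketch: sequence log-probability under a causal LM,
#   log P(s|x) = sum_i log P(s_i | s_<i, x),
# summed over the answer tokens only. Model choice is illustrative, and the
# boundary tokenization of context + answer is approximate in this sketch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sequence_logprob(context: str, answer: str) -> float:
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    ids = tok(context + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logp = torch.log_softmax(model(ids).logits, dim=-1)
    # The token at position i is predicted by the logits at position i - 1.
    return sum(logp[0, i - 1, ids[0, i]].item() for i in range(ctx_len, ids.shape[1]))

def naive_entropy(context: str, answers: list[str]) -> float:
    """Monte Carlo naive entropy over sampled answers, with the length
    normalization described in the main text."""
    norm = [sequence_logprob(context, a) / len(tok(a).input_ids) for a in answers]
    return -sum(norm) / len(norm)
```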
Principles of semantic uncertainty

If we naively calculate the predictive entropy directly from the probabilities of the generated sequence of tokens, we conflate the uncertainty of the model over the meaning of its answer with the uncertainty over the exact tokens used to express that meaning. For example, even if the model is confident in the meaning of a generation, there are still usually many different ways of phrasing that generation without changing its meaning. For the purposes of detecting confabulations, the uncertainty of the LLM over meanings is more important than the uncertainty over the exact tokens used to express those meanings.

Our semantic uncertainty method therefore seeks to estimate only the uncertainty the LLM has over the meaning of its generation, not the choice of words. To do this, we introduce an algorithm that clusters model generations by meaning and subsequently calculates semantic uncertainty. At a high level this involves three steps (steps 2 and 3 are sketched in code at the end of this section):
1. Generation: sample output sequences of tokens from the predictive distribution of an LLM given a context x.
2. Clustering: cluster sequences by their meaning using our clustering algorithm based on bidirectional entailment.
3. Entropy estimation: estimate semantic entropy by summing the probabilities of sequences that share a meaning following equation (2) and computing their entropy.

Generating a set of answers from the model. Given some context x as input to the LLM, we sample M sequences, {s(1), …, s(M)}, and record their token probabilities, {P(s(1)|x), …, P(s(M)|x)}. We sample all our generations from a single model, varying only the random seed used for sampling from the token probabilities. We do not observe the method to be particularly sensitive to details of the sampling scheme. In our implementation, we sample at temperature 1 using nucleus sampling (P = 0.9) (ref. 49) and top-K sampling (K = 50) (ref. 50). We also sample a single generation at low temperature (0.1) as an estimate of the 'best generation' of the model for the context, which we use to assess the accuracy of the model. (A lower sampling temperature increases the probability of sampling the most likely tokens.)

Clustering by semantic equivalence. To estimate semantic entropy we need to cluster generated outputs from the model into groups of outputs that mean the same thing as each other.

This can be described using 'semantic equivalence', which is the relation that holds between two sentences when they mean the same thing. We can formalize semantic equivalence mathematically. Let the space of tokens in a language be T. The space of all possible sequences of tokens of length N is then $S_N \equiv T^N$. Note that N can be made arbitrarily large to accommodate whatever size of sentence one can imagine, and one of the tokens can be a 'padding' token which occurs with certainty for each token after the end-of-sequence token. For some sentence $s \in S_N$, composed of a sequence of tokens $s_i \in T$, there is an associated meaning.
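Steps 2 and 3 can be made concrete as below. Equation (2) itself is not reproduced in this excerpt, so the cluster-probability form here is our reading of the three-step description; `cluster_by_meaning` and `discrete_semantic_entropy` are the sketches from earlier, and `sequence_logprob` supplies the token probabilities:

```python
# Sketch of semantic entropy estimation: sum sequence probabilities within
# each meaning cluster, renormalize over the M samples, then compute the
# entropy over clusters. Identical sampled strings share one probability,
# so each occurrence contributes its own mass.
import math

def semantic_entropy(sequences: list[str], logprobs: list[float]) -> float:
    clusters = cluster_by_meaning(sequences)
    prob = {s: math.exp(lp) for s, lp in zip(sequences, logprobs)}
    cluster_mass = [sum(prob[s] for s in c) for c in clusters]
    z = sum(cluster_mass)                      # renormalize over the samples
    return -sum((m / z) * math.log(m / z) for m in cluster_mass)

# Black-box models without token probabilities fall back to the discrete
# variant, which uses cluster sizes instead (see discrete_semantic_entropy).
```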