Chain of Thought Prompting - Large Language Models PDF
Summary
This paper explores how generating a chain of thought significantly improves the ability of large language models to perform complex reasoning. The method, called chain-of-thought prompting, involves providing a few chain of thought demonstrations as exemplars in prompting. Experiments on large language models show a striking improvement on arithmetic, commonsense, and symbolic reasoning tasks.
Full Transcript
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, Denny Zhou
Google Research, Brain Team
{jasonwei,dennyzhou}@google.com
arXiv:2201.11903v6 [cs.CL] 10 Jan 2023

Abstract

We explore how generating a chain of thought—a series of intermediate reasoning steps—significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain-of-thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a PaLM 540B with just eight chain-of-thought exemplars achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.

Standard Prompting (model input):
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
Model output: A: The answer is 27. (incorrect)

Chain-of-Thought Prompting (model input):
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
Model output: A: The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer is 9. (correct)

Figure 1: Chain-of-thought prompting enables large language models to tackle complex arithmetic, commonsense, and symbolic reasoning tasks. Chain-of-thought reasoning processes are highlighted.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
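As a concrete illustration of the two prompt formats above, here is a minimal Python sketch that assembles the standard and chain-of-thought prompts as strings. The exemplar text is taken from Figure 1; the query_model stub is a placeholder of my own, not an API from the paper.

QUESTION = (
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?\nA:"
)

# Standard prompting: the exemplar maps the question directly to an answer.
standard_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: The answer is 11.\n\n" + QUESTION
)

# Chain-of-thought prompting: the exemplar answer spells out the
# intermediate reasoning steps before stating the final answer.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
    "tennis balls. 5 + 6 = 11. The answer is 11.\n\n" + QUESTION
)

def query_model(prompt: str) -> str:
    """Placeholder for a call to a large language model."""
    raise NotImplementedError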
1 Introduction

The NLP landscape has recently been revolutionized by language models (Peters et al., 2018; Devlin et al., 2019; Brown et al., 2020, inter alia). Scaling up the size of language models has been shown to confer a range of benefits, such as improved performance and sample efficiency (Kaplan et al., 2020; Brown et al., 2020, inter alia). However, scaling up model size alone has not proved sufficient for achieving high performance on challenging tasks such as arithmetic, commonsense, and symbolic reasoning (Rae et al., 2021).

Figure 2: PaLM 540B uses chain-of-thought prompting to achieve new state-of-the-art performance on the GSM8K benchmark of math word problems (solve rates shown in the figure: 57% for PaLM 540B with chain-of-thought prompting vs. 18% with standard prompting; finetuned GPT-3 175B reaches 33% and the prior best is 55%). Finetuned GPT-3 and prior best are from Cobbe et al. (2021).

This work explores how the reasoning ability of large language models can be unlocked by a simple method motivated by two ideas. First, techniques for arithmetic reasoning can benefit from generating natural language rationales that lead to the final answer. Prior work has given models the ability to generate natural language intermediate steps by training from scratch (Ling et al., 2017) or finetuning a pretrained model (Cobbe et al., 2021), in addition to neuro-symbolic methods that use formal languages instead of natural language (Roy and Roth, 2015; Chiang and Chen, 2019; Amini et al., 2019; Chen et al., 2019). Second, large language models offer the exciting prospect of in-context few-shot learning via prompting. That is, instead of finetuning a separate language model checkpoint for each new task, one can simply "prompt" the model with a few input–output exemplars demonstrating the task. Remarkably, this has been successful for a range of simple question-answering tasks (Brown et al., 2020).

Both of the above ideas, however, have key limitations. For rationale-augmented training and finetuning methods, it is costly to create a large set of high-quality rationales, which is much more complicated than the simple input–output pairs used in normal machine learning. The traditional few-shot prompting method of Brown et al. (2020) works poorly on tasks that require reasoning abilities, and often does not improve substantially with increasing language model scale (Rae et al., 2021). In this paper, we combine the strengths of these two ideas in a way that avoids their limitations. Specifically, we explore the ability of language models to perform few-shot prompting for reasoning tasks, given a prompt that consists of triples: ⟨input, chain of thought, output⟩. A chain of thought is a series of intermediate natural language reasoning steps that lead to the final output, and we refer to this approach as chain-of-thought prompting. An example prompt is shown in Figure 1.

We present empirical evaluations on arithmetic, commonsense, and symbolic reasoning benchmarks, showing that chain-of-thought prompting outperforms standard prompting, sometimes to a striking degree. Figure 2 illustrates one such result—on the GSM8K benchmark of math word problems (Cobbe et al., 2021), chain-of-thought prompting with PaLM 540B outperforms standard prompting by a large margin and achieves new state-of-the-art performance. A prompting-only approach is important because it does not require a large training dataset and because a single model checkpoint can perform many tasks without loss of generality. This work underscores how large language models can learn via a few examples with natural language data about the task (cf. automatically learning the patterns underlying inputs and outputs via a large training dataset).

2 Chain-of-Thought Prompting

Consider one's own thought process when solving a complicated reasoning task such as a multi-step math word problem. It is typical to decompose the problem into intermediate steps and solve each before giving the final answer: "After Jane gives 2 flowers to her mom she has 10... then after she gives 3 to her dad she will have 7... so the answer is 7." The goal of this paper is to endow language models with the ability to generate a similar chain of thought—a coherent series of intermediate reasoning steps that lead to the final answer for a problem.
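To make the ⟨input, chain of thought, output⟩ triples introduced above concrete, here is a minimal sketch of how such exemplars might be represented and rendered into a few-shot prompt. The class and function names are my own, not the paper's.

from dataclasses import dataclass

@dataclass
class Exemplar:
    """One ⟨input, chain of thought, output⟩ triple for few-shot prompting."""
    question: str
    chain_of_thought: str
    answer: str

    def render(self) -> str:
        # The chain of thought precedes the final answer, so the model
        # learns to emit its reasoning steps before committing to an answer.
        return (f"Q: {self.question}\n"
                f"A: {self.chain_of_thought} The answer is {self.answer}.")

def build_prompt(exemplars: list[Exemplar], test_question: str) -> str:
    """Concatenate rendered exemplars and append the unanswered test question."""
    shots = "\n\n".join(e.render() for e in exemplars)
    return f"{shots}\n\nQ: {test_question}\nA:"

# Example using the exemplar from Figure 1.
roger = Exemplar(
    question=("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
              "Each can has 3 tennis balls. How many tennis balls does he have now?"),
    chain_of_thought=("Roger started with 5 balls. 2 cans of 3 tennis balls "
                      "each is 6 tennis balls. 5 + 6 = 11."),
    answer="11",
)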
We will show that sufficiently large language models can generate chains of thought if demonstrations of chain-of-thought reasoning are provided in the exemplars for few-shot prompting. Figure 1 shows an example of a model producing a chain of thought to solve a math word problem that it would otherwise have gotten incorrect. The chain of thought in this case resembles a solution and can be interpreted as one, but we still opt to call it a chain of thought to better capture the idea that it mimics a step-by-step thought process for arriving at the answer (and also, solutions/explanations typically come after the final answer (Narang et al., 2020; Wiegreffe et al., 2022; Lampinen et al., 2022, inter alia)).

Chain-of-thought prompting has several attractive properties as an approach for facilitating reasoning in language models.

1. First, chain of thought, in principle, allows models to decompose multi-step problems into intermediate steps, which means that additional computation can be allocated to problems that require more reasoning steps.
2. Second, a chain of thought provides an interpretable window into the behavior of the model, suggesting how it might have arrived at a particular answer and providing opportunities to debug where the reasoning path went wrong (although fully characterizing a model's computations that support an answer remains an open question).
3. Third, chain-of-thought reasoning can be used for tasks such as math word problems, commonsense reasoning, and symbolic manipulation, and is potentially applicable (at least in principle) to any task that humans can solve via language.
4. Finally, chain-of-thought reasoning can be readily elicited in sufficiently large off-the-shelf language models simply by including examples of chain-of-thought sequences in the exemplars of few-shot prompting.

In empirical experiments, we will observe the utility of chain-of-thought prompting for arithmetic reasoning (Section 3), commonsense reasoning (Section 4), and symbolic reasoning (Section 5).

3 Arithmetic Reasoning

We begin by considering math word problems of the form in Figure 1, which measure the arithmetic reasoning ability of language models. Though simple for humans, arithmetic reasoning is a task where language models often struggle (Hendrycks et al., 2021; Patel et al., 2021, inter alia). Strikingly, chain-of-thought prompting with the 540B-parameter language model performs comparably with task-specific finetuned models on several tasks, even achieving new state of the art on the challenging GSM8K benchmark (Cobbe et al., 2021).

3.1 Experimental Setup

We explore chain-of-thought prompting for various language models on multiple benchmarks.

Benchmarks. We consider the following five math word problem benchmarks: (1) the GSM8K benchmark of math word problems (Cobbe et al., 2021), (2) the SVAMP dataset of math word problems with varying structures (Patel et al., 2021), (3) the ASDiv dataset of diverse math word problems (Miao et al., 2020), (4) the AQuA dataset of algebraic word problems, and (5) the MAWPS benchmark (Koncel-Kedziorski et al., 2016). Example problems are given in Appendix Table 12.

Standard prompting. For the baseline, we consider standard few-shot prompting, popularized by Brown et al. (2020), in which a language model is given in-context exemplars of input–output pairs before outputting a prediction for a test-time example. Exemplars are formatted as questions and answers. The model gives the answer directly, as shown in Figure 1 (left).
Chain-of-thought prompting. Our proposed approach is to augment each exemplar in few-shot prompting with a chain of thought for an associated answer, as illustrated in Figure 1 (right). As most of the datasets only have an evaluation split, we manually composed a set of eight few-shot exemplars with chains of thought for prompting—Figure 1 (right) shows one chain-of-thought exemplar, and the full set of exemplars is given in Appendix Table 20. (These particular exemplars did not undergo prompt engineering; robustness is studied in Section 3.4 and Appendix A.2.)

Math Word Problems (free response)
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Math Word Problems (multiple choice)
Q: How many keystrokes are needed to type the numbers from 1 to 500? Answer Choices: (a) 1156 (b) 1392 (c) 1480 (d) 1562 (e) 1788
A: There are 9 one-digit numbers from 1 to 9. There are 90 two-digit numbers from 10 to 99. There are 401 three-digit numbers from 100 to 500. 9 + 90(2) + 401(3) = 1392. The answer is (b).

CSQA (commonsense)
Q: Sammy wanted to go to where the people were. Where might he go? Options: (a) race track (b) populated areas (c) desert (d) apartment (e) roadblock
A: The answer must be a place with a lot of people. Race tracks, desert, apartments, and roadblocks don't have a lot of people, but populated areas do. So the answer is (b).

StrategyQA
Q: Yes or no: Would a pear sink in water?
A: The density of a pear is about 0.6 g/cm^3, which is less than water. Thus, a pear would float. So the answer is no.

Date Understanding
Q: The concert was scheduled to be on 06/01/1943, but was delayed by one day to today. What is the date 10 days ago in MM/DD/YYYY?
A: One day after 06/01/1943 is 06/02/1943, so today is 06/02/1943. 10 days before today is 05/23/1943. So the answer is 05/23/1943.

Sports Understanding
Q: Is the following sentence plausible? "Joao Moutinho caught the screen pass in the NFC championship."
A: Joao Moutinho is a soccer player. The NFC championship is part of American football, not soccer. So the answer is no.

SayCan (instructing a robot)
Human: How would you bring me something that isn't a fruit?
Explanation: the user wants something to eat that isn't a fruit. An energy bar is not a fruit, so I will bring the user an energy bar.
Plan: 1. find(energy bar) 2. pick(energy bar) 3. find(user) 4. put(energy bar) 5. done().

Last Letter Concatenation
Q: Take the last letters of the words in "Lady Gaga" and concatenate them.
A: The last letter of "Lady" is "y". The last letter of "Gaga" is "a". Concatenating them is "ya". So the answer is ya.

Coin Flip (state tracking)
Q: A coin is heads up. Maybelle flips the coin. Shalonda does not flip the coin. Is the coin still heads up?
A: The coin was flipped by Maybelle. So the coin was flipped 1 time, which is an odd number. The coin started heads up, so after an odd number of flips, it will be tails up. So the answer is no.

Figure 3: Examples of ⟨input, chain of thought, output⟩ triples for arithmetic, commonsense, and symbolic reasoning benchmarks. Chains of thought are highlighted. Full prompts in Appendix G.
To investigate whether chain-of-thought prompting in this form can successfully elicit reasoning across a range of math word problems, we used this single set of eight chain-of-thought exemplars for all benchmarks except AQuA, which is multiple choice instead of free response. For AQuA, we used four exemplars and solutions from the training set, as given in Appendix Table 21.

Language models. We evaluate five large language models. The first is GPT-3 (Brown et al., 2020), for which we use text-ada-001, text-babbage-001, text-curie-001, and text-davinci-002, which presumably correspond to InstructGPT models of 350M, 1.3B, 6.7B, and 175B parameters (Ouyang et al., 2022). The second is LaMDA (Thoppilan et al., 2022), which has models of 422M, 2B, 8B, 68B, and 137B parameters. The third is PaLM, which has models of 8B, 62B, and 540B parameters. The fourth is UL2 20B (Tay et al., 2022), and the fifth is Codex (Chen et al., 2021; code-davinci-002 in the OpenAI API). We sample from the models via greedy decoding (though follow-up work shows chain-of-thought prompting can be improved by taking the majority final answer over many sampled generations (Wang et al., 2022a)). For LaMDA, we report averaged results over five random seeds, where each seed had a different randomly shuffled order of exemplars. As LaMDA experiments did not show large variance among different seeds, to save compute we report results for a single exemplar order for all other models.

3.2 Results

The strongest results of chain-of-thought prompting are summarized in Figure 4, with all experimental outputs for each model collection, model size, and benchmark shown in Table 2 in the Appendix. There are three key takeaways.

First, Figure 4 shows that chain-of-thought prompting is an emergent ability of model scale (Wei et al., 2022b). That is, chain-of-thought prompting does not positively impact performance for small models, and only yields performance gains when used with models of ∼100B parameters. We qualitatively found that models of smaller scale produced fluent but illogical chains of thought, leading to lower performance than standard prompting.

Second, chain-of-thought prompting has larger performance gains for more complicated problems. For instance, for GSM8K (the dataset with the lowest baseline performance), performance more than doubled for the largest GPT and PaLM models. On the other hand, for SingleOp, the easiest subset of MAWPS, which only requires a single step to solve, performance improvements were either negative or very small (see Appendix Table 3).

Third, chain-of-thought prompting via GPT-3 175B and PaLM 540B compares favorably to prior state of the art, which typically finetunes a task-specific model on a labeled training dataset. Figure 4 shows how PaLM 540B uses chain-of-thought prompting to achieve new state of the art on GSM8K, SVAMP, and MAWPS (though note that standard prompting already passed the prior best for SVAMP). On the other two datasets, AQuA and ASDiv, PaLM with chain-of-thought prompting reaches within 2% of the state of the art (Appendix Table 2).

Figure 4: Chain-of-thought prompting enables large language models to solve challenging math problems. Notably, chain-of-thought reasoning is an emergent ability of increasing model scale. (Panels: GSM8K, SVAMP, and MAWPS solve rates for LaMDA, GPT, and PaLM at increasing model scale, comparing standard prompting, chain-of-thought prompting, and the prior supervised best.) Prior best numbers are from Cobbe et al. (2021) for GSM8K, Jie et al. (2022) for SVAMP, and Lan et al. (2021) for MAWPS.
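As noted in Section 3.1, decoding is greedy; the follow-up self-consistency work cited there (Wang et al., 2022a) instead samples several chains of thought and takes the majority final answer. The following is a minimal sketch of that idea; sample_completion is a placeholder stub, not a real API, and the answer-parsing regex is an assumption of mine.

import re
from collections import Counter
from typing import Optional

def sample_completion(prompt: str, temperature: float = 0.7) -> str:
    """Placeholder for one sampled (non-greedy) completion from a model."""
    raise NotImplementedError

def extract_answer(completion: str) -> Optional[str]:
    """Pull the final answer out of a chain-of-thought completion."""
    match = re.search(r"The answer is (.+?)\.", completion)
    return match.group(1) if match else None

def self_consistent_answer(prompt: str, n_samples: int = 40) -> Optional[str]:
    """Sample several chains of thought and return the majority final answer."""
    answers = [extract_answer(sample_completion(prompt)) for _ in range(n_samples)]
    answers = [a for a in answers if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None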
To better understand why chain-of-thought prompting works, we manually examined model-generated chains of thought by LaMDA 137B for GSM8K. Of 50 random examples where the model returned the correct final answer, all of the generated chains of thought were also logically and mathematically correct, except two that coincidentally arrived at the correct answer (see Appendix D.1, and Table 8 for examples of correct model-generated chains of thought). We also examined 50 random samples for which the model gave the wrong answer. The summary of this analysis is that 46% of the chains of thought were almost correct, barring minor mistakes (a calculator error, a symbol mapping error, or one missing reasoning step), and that the other 54% of the chains of thought had major errors in semantic understanding or coherence (see Appendix D.2).

To provide a small insight into why scaling improves chain-of-thought reasoning ability, we performed a similar analysis of errors made by PaLM 62B and whether those errors were fixed by scaling to PaLM 540B. The summary is that scaling PaLM to 540B fixes a large portion of one-step-missing and semantic-understanding errors in the 62B model (see Appendix A.1).

3.3 Ablation Study

The observed benefits of chain-of-thought prompting raise the natural question of whether the same performance improvements can be conferred via other types of prompting. Figure 5 shows an ablation study with three variations of chain of thought, described below.

Figure 5: Ablation study for different variations of prompting (standard prompting, equation only, variable compute only, reasoning after answer, and chain-of-thought prompting) using LaMDA 137B and PaLM 540B. Results for other datasets are given in Appendix Table 6 and Table 7.

Equation only. One reason chain-of-thought prompting might help is that it produces the mathematical equation to be evaluated, so we test a variation where the model is prompted to output only a mathematical equation before giving the answer. Figure 5 shows that equation-only prompting does not help much for GSM8K, which implies that the semantics of the questions in GSM8K are too challenging to translate directly into an equation without the natural language reasoning steps of a chain of thought. For datasets of one-step or two-step problems, however, we find that equation-only prompting does improve performance, since the equation can be easily derived from the question (see Appendix Table 6).

Variable compute only. Another intuition is that chain of thought allows the model to spend more computation (i.e., intermediate tokens) on harder problems. To isolate the effect of variable computation from chain-of-thought reasoning, we test a configuration where the model is prompted to output only a sequence of dots (...) equal to the number of characters in the equation needed to solve the problem. This variant performs about the same as the baseline, which suggests that variable computation by itself is not the reason for the success of chain-of-thought prompting, and that there appears to be utility in expressing intermediate steps via natural language.

Chain of thought after answer. Another potential benefit of chain-of-thought prompting could simply be that such prompts allow the model to better access relevant knowledge acquired during pretraining.
Therefore, we test an alternative configuration where the chain-of-thought prompt is only given after the answer, isolating whether the model actually depends on the produced chain of thought to give the final answer. This variant performs about the same as the baseline, which suggests that the sequential reasoning embodied in the chain of thought is useful for reasons beyond just activating knowledge.

3.4 Robustness of Chain of Thought

Sensitivity to exemplars is a key consideration of prompting approaches—for instance, varying the permutation of few-shot exemplars can cause the accuracy of GPT-3 on SST-2 to range from near chance (54.3%) to near state of the art (93.4%) (Zhao et al., 2021). In this final subsection, we evaluate robustness to chains of thought written by different annotators. In addition to the results above, which used chains of thought written by Annotator A, two other co-authors of this paper (Annotators B and C) independently wrote chains of thought for the same few-shot exemplars (shown in Appendix H). Annotator A also wrote another chain of thought that was more concise than the original, following the style of solutions given in Cobbe et al. (2021).¹

Figure 6 shows these results for LaMDA 137B on GSM8K and MAWPS (ablation results for other datasets are given in Appendix Table 6 / Table 7). Although there is variance among different chain-of-thought annotations, as would be expected when using exemplar-based prompting (Le Scao and Rush, 2021; Reynolds and McDonell, 2021; Zhao et al., 2021), all sets of chain-of-thought prompts outperform the standard baseline by a large margin. This result implies that successful use of chain of thought does not depend on a particular linguistic style.

Figure 6: Chain-of-thought prompting has variance for different prompt examples (as expected) but outperforms standard prompting for various annotators as well as for different exemplars (variants compared: different annotators B and C, an intentionally concise style, and three sets of exemplars sampled from GSM8K).

To confirm that successful chain-of-thought prompting works for other sets of exemplars, we also run experiments with three sets of eight exemplars randomly sampled from the GSM8K training set, an independent source (examples in this dataset already included reasoning steps like a chain of thought).² Figure 6 shows that these prompts performed comparably with our manually written exemplars, also substantially outperforming standard prompting.

In addition to robustness to annotators, independently written chains of thought, different exemplars, and various language models, we also find that chain-of-thought prompting for arithmetic reasoning is robust to different exemplar orders and varying numbers of exemplars (see Appendix A.2).

¹ For instance, whereas the original chain of thought uses several short sentences ("There were originally 9 computers. For each of 4 days, 5 more computers were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29."), the concise chain of thought reads "5 * 4 = 20 new computers were added. So there are 9 + 20 = 29 new computers in the server room now."
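The exemplar-order robustness check described in Section 3.4 is straightforward to script. Here is a hedged sketch under my own assumptions: query_model is a placeholder for a real model API, and the final-answer parser is naive.

import random

def query_model(prompt: str) -> str:
    """Placeholder for a call to a large language model."""
    raise NotImplementedError

def extract_answer(completion: str) -> str:
    """Naive parser for completions ending with 'The answer is X.'"""
    return completion.rsplit("The answer is", 1)[-1].strip(" .")

def accuracy_with_order(exemplars: list[str], test_set: list[tuple[str, str]],
                        seed: int) -> float:
    """Evaluate one random permutation of the few-shot exemplars."""
    rng = random.Random(seed)
    shuffled = exemplars[:]
    rng.shuffle(shuffled)
    prefix = "\n\n".join(shuffled)
    correct = sum(
        extract_answer(query_model(f"{prefix}\n\nQ: {q}\nA:")) == gold
        for q, gold in test_set
    )
    return correct / len(test_set)

# Averaging over five seeds, as done for LaMDA in Section 3.1:
# scores = [accuracy_with_order(exemplars, test_set, seed=s) for s in range(5)]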
4 Commonsense Reasoning

Although chain of thought is particularly suited to math word problems, the language-based nature of chain of thought in fact makes it applicable to a broad class of commonsense reasoning problems, which involve reasoning about physical and human interactions under the presumption of general background knowledge. Commonsense reasoning is key for interacting with the world and is still beyond the reach of current natural language understanding systems (Talmor et al., 2021).

Benchmarks. We consider five datasets covering a diverse range of commonsense reasoning types. The popular CSQA (Talmor et al., 2019) asks commonsense questions about the world involving complex semantics that often require prior knowledge. StrategyQA (Geva et al., 2021) requires models to infer a multi-hop strategy to answer questions. We choose two specialized evaluation sets from the BIG-bench effort (BIG-bench collaboration, 2021): Date Understanding, which involves inferring a date from a given context, and Sports Understanding, which involves determining whether a sentence relating to sports is plausible or implausible. Finally, the SayCan dataset (Ahn et al., 2022) involves mapping a natural language instruction to a sequence of robot actions from a discrete set. Figure 3 shows examples with chain-of-thought annotations for all datasets.

Prompts. We follow the same experimental setup as in the prior section. For CSQA and StrategyQA, we randomly selected examples from the training set and manually composed chains of thought for them to use as few-shot exemplars. The two BIG-bench tasks do not have training sets, so we selected the first ten examples in the evaluation set as few-shot exemplars and report numbers on the rest of the evaluation set. For SayCan, we use six examples from the training set used in Ahn et al. (2022), also with manually composed chains of thought.

Results. Figure 7 highlights these results for PaLM (full results for LaMDA, GPT-3, and different model scales are shown in Table 4). For all tasks, scaling up model size improved the performance of standard prompting; chain-of-thought prompting led to further gains, with improvements appearing to be largest for PaLM 540B. With chain-of-thought prompting, PaLM 540B achieved strong performance relative to baselines, outperforming the prior state of the art on StrategyQA (75.6% vs 69.4%) and outperforming an unaided sports enthusiast on sports understanding (95.4% vs 84%). These results demonstrate that chain-of-thought prompting can also improve performance on tasks requiring a range of commonsense reasoning abilities (though note that the gain was minimal on CSQA).

Figure 7: Chain-of-thought prompting also improves the commonsense reasoning abilities of language models (panels: CSQA, StrategyQA, Date Understanding, Sports Understanding, and SayCan solve rates across model scale, comparing standard prompting, chain of thought, the prior supervised best, and human performance). The language model shown here is PaLM. Prior best numbers are from the leaderboards of CSQA (Talmor et al., 2019) and StrategyQA (Geva et al., 2021) (single-model only, as of May 5, 2022). Additional results using various sizes of LaMDA, GPT-3, and PaLM are shown in Table 4.

² We sample examples of ≤ 60 tokens to fit into our input context window, and also limit examples to ≤ 2 steps to solve for a fair comparison with the eight exemplars that we composed.
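Footnote 2's filtering step (≤ 60 tokens and ≤ 2 solution steps) can be sketched as below. Two assumptions of mine, not from the paper: whitespace tokenization approximates the model tokenizer, and one non-empty solution line counts as one reasoning step.

def is_usable_exemplar(question: str, solution: str) -> bool:
    """Filter GSM8K-style training examples for use as few-shot exemplars."""
    n_tokens = len(question.split()) + len(solution.split())
    n_steps = len([line for line in solution.splitlines() if line.strip()])
    return n_tokens <= 60 and n_steps <= 2

candidates = [
    ("Tom has 3 apples and buys 2 more. How many apples does he have?",
     "Tom starts with 3 apples.\n3 + 2 = 5. The answer is 5."),
]
usable = [(q, s) for q, s in candidates if is_usable_exemplar(q, s)]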
5 Symbolic Reasoning

Our final experimental evaluation considers symbolic reasoning, which is simple for humans but potentially challenging for language models. We show that chain-of-thought prompting not only enables language models to perform symbolic reasoning tasks that are challenging in the standard prompting setting, but also facilitates length generalization to inference-time inputs longer than those seen in the few-shot exemplars.

Tasks. We use the following two toy tasks.

Last letter concatenation. This task asks the model to concatenate the last letters of words in a name (e.g., "Amy Brown" → "yn"). It is a more challenging version of first-letter concatenation, which language models can already perform without chain of thought.³ We generate full names by randomly concatenating names from the top one thousand first and last names from name census data (https://namecensus.com/).

Coin flip. This task asks the model to answer whether a coin is still heads up after people either flip or don't flip it (e.g., "A coin is heads up. Phoebe flips the coin. Osvaldo does not flip the coin. Is the coin still heads up?" → "no").

As the construction of these symbolic reasoning tasks is well defined, for each task we consider an in-domain test set, for which examples had the same number of steps as the training/few-shot exemplars, as well as an out-of-domain (OOD) test set, for which evaluation examples had more steps than those in the exemplars. For last letter concatenation, the model only sees exemplars of names with two words, and then performs last letter concatenation on names with 3 and 4 words.⁴ We do the same for the number of potential flips in the coin flip task. Our experimental setup uses the same methods and models as in the prior two sections. We again manually compose chains of thought for the few-shot exemplars for each task, which are given in Figure 3.

Results. The results of these in-domain and OOD evaluations are shown in Figure 8 for PaLM, with results for LaMDA shown in Appendix Table 5. With PaLM 540B, chain-of-thought prompting leads to almost 100% solve rates (note that standard prompting already solves coin flip with PaLM 540B, though not with LaMDA 137B). Note that these in-domain evaluations are "toy tasks" in the sense that perfect solution structures are already provided by the chains of thought in the few-shot exemplars; all the model has to do is repeat the same steps with the new symbols in the test-time example. And yet, small models still fail—the ability to perform abstract manipulations on unseen symbols for these tasks only arises at the scale of 100B model parameters. As for the OOD evaluations, standard prompting fails for both tasks. With chain-of-thought prompting, language models achieve upward scaling curves (though performance is lower than in the in-domain setting). Hence, chain-of-thought prompting facilitates length generalization beyond the chains of thought seen in the exemplars, for language models of sufficient scale.

Figure 8: Using chain-of-thought prompting facilitates generalization to longer sequences in two symbolic reasoning tasks (panels: letter concatenation with 2 words in domain and 4 words OOD; coin flip with 2 flips in domain and 4 flips OOD).

³ We tested 10 common names using GPT-3 davinci and it got all but one correct.
⁴ For names of length longer than 2 words, we concatenate multiple first and last names together.
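Because both toy tasks are programmatically defined, ground-truth labels can be generated exactly. Here is a minimal sketch; the name lists are abbreviated stand-ins for the top-1000 census name lists the paper uses, and the helper names are my own.

import random

# Abbreviated stand-ins for the top one-thousand census first/last names.
FIRST_NAMES = ["Amy", "Phoebe", "Osvaldo", "Maybelle"]
LAST_NAMES = ["Brown", "Garcia", "Nguyen", "Smith"]

def last_letter_concat_example(n_words: int, rng: random.Random):
    """Build a name of n_words words and its last-letter concatenation."""
    words = [rng.choice(FIRST_NAMES if i % 2 == 0 else LAST_NAMES)
             for i in range(n_words)]
    name = " ".join(words)
    answer = "".join(w[-1] for w in words)  # e.g. "Amy Brown" -> "yn"
    return name, answer

def coin_flip_example(n_people: int, rng: random.Random):
    """Build a coin-flip question; the gold answer is just flip-count parity."""
    flips = [rng.random() < 0.5 for _ in range(n_people)]
    parts = ["A coin is heads up."]
    for flipped in flips:
        person = rng.choice(FIRST_NAMES)
        parts.append(f"{person} {'flips' if flipped else 'does not flip'} the coin.")
    parts.append("Is the coin still heads up?")
    answer = "no" if sum(flips) % 2 == 1 else "yes"
    return " ".join(parts), answer

rng = random.Random(0)
print(last_letter_concat_example(2, rng))
print(coin_flip_example(2, rng))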
6 Discussion

We have explored chain-of-thought prompting as a simple mechanism for eliciting multi-step reasoning behavior in large language models. We first saw that chain-of-thought prompting improves performance by a large margin on arithmetic reasoning, yielding improvements that are much stronger than those of the ablations and robust to different annotators, exemplars, and language models (Section 3). Next, experiments on commonsense reasoning underscored how the linguistic nature of chain-of-thought reasoning makes it generally applicable (Section 4). Finally, we showed that for symbolic reasoning, chain-of-thought prompting facilitates OOD generalization to longer sequence lengths (Section 5). In all experiments, chain-of-thought reasoning is elicited simply by prompting an off-the-shelf language model; no language models were finetuned in the process of writing this paper.

The emergence of chain-of-thought reasoning as a result of model scale has been a prevailing theme (Wei et al., 2022b). For many reasoning tasks where standard prompting has a flat scaling curve, chain-of-thought prompting leads to dramatically increasing scaling curves. Chain-of-thought prompting appears to expand the set of tasks that large language models can perform successfully—in other words, our work underscores that standard prompting only provides a lower bound on the capabilities of large language models. This observation likely raises more questions than it answers—for instance, how much more can we expect reasoning ability to improve with a further increase in model scale? What other prompting methods might expand the range of tasks that language models can solve?

As for limitations, we first qualify that although chain of thought emulates the thought processes of human reasoners, this does not answer whether the neural network is actually "reasoning," which we leave as an open question. Second, although the cost of manually augmenting exemplars with chains of thought is minimal in the few-shot setting, such annotation costs could be prohibitive for finetuning (though this could potentially be surmounted with synthetic data generation or zero-shot generalization). Third, there is no guarantee of correct reasoning paths, which can lead to both correct and incorrect answers; improving factual generations of language models is an open direction for future work (Rashkin et al., 2021; Ye and Durrett, 2022; Wiegreffe et al., 2022, inter alia). Finally, the emergence of chain-of-thought reasoning only at large model scales makes it costly to serve in real-world applications; further research could explore how to induce reasoning in smaller models.

7 Related Work

This work is inspired by many research areas, which we detail in an extended related work section (Appendix C). Here we describe two directions and associated papers that are perhaps most relevant.

The first relevant direction is using intermediate steps to solve reasoning problems. Ling et al. (2017) pioneered the idea of using natural language rationales to solve math word problems through a series of intermediate steps. Their work is a remarkable contrast to the literature using formal languages to reason (Roy et al., 2015; Chiang and Chen, 2019; Amini et al., 2019; Chen et al., 2019). Cobbe et al. (2021) extend Ling et al. (2017) by creating a larger dataset and using it to finetune a pretrained language model rather than training a model from scratch.
In the domain of program synthesis, Nye et al. (2021) leverage language models to predict the final outputs of Python programs by first predicting, line by line, the intermediate computational results, and show that their step-by-step prediction method performs better than directly predicting the final outputs.

Naturally, this paper also relates closely to the large body of recent work on prompting. Since the popularization of few-shot prompting by Brown et al. (2020), several general approaches have improved the prompting abilities of models, such as automatically learning prompts (Lester et al., 2021) or giving models instructions describing a task (Wei et al., 2022a; Sanh et al., 2022; Ouyang et al., 2022). Whereas these approaches improve or augment the input part of the prompt (e.g., instructions prepended to inputs), our work takes the orthogonal direction of augmenting the outputs of language models with a chain of thought.

8 Conclusions

We have explored chain-of-thought prompting as a simple and broadly applicable method for enhancing reasoning in language models. Through experiments on arithmetic, symbolic, and commonsense reasoning, we find that chain-of-thought reasoning is an emergent property of model scale that allows sufficiently large language models to perform reasoning tasks that otherwise have flat scaling curves. Broadening the range of reasoning tasks that language models can perform will hopefully inspire further work on language-based approaches to reasoning.

9 Acknowledgements

We thank Jacob Devlin, Claire Cui, Andrew Dai, and Ellie Pavlick for providing feedback on the paper. We thank Jacob Austin, Yuhuai Wu, Henryk Michalewski, Aitor Lewkowycz, Charles Sutton, and Aakanksha Chowdhery for helpful discussions. We thank Sid Maxwell for notifying us about a mistake in the manual error analysis in the original manuscript.

References

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. 2022. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.

Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards interpretable math word problem solving with operation-based formalisms. NAACL.

Daniel Andor, Luheng He, Kenton Lee, and Emily Pitler. 2019. Giving BERT a calculator: Finding operations and arguments with reading comprehension. EMNLP.

Jacob Andreas, Dan Klein, and Sergey Levine. 2018. Learning with latent language. NAACL.

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.

BIG-bench collaboration. 2021. Beyond the imitation game: Measuring and extrapolating the capabilities of language models. In preparation.

Kaj Bostrom, Xinyu Zhao, Swarat Chaudhuri, and Greg Durrett. 2021. Flexible generation of natural language deductions. EMNLP.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. NeurIPS.

Jonathon Cai, Richard Shin, and Dawn Song. 2017. Making neural programming architectures generalize via recursion. ICLR.

Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. NeurIPS.

Howard Chen, Jacqueline He, Karthik Narasimhan, and Danqi Chen. 2022. Can rationalization improve robustness? NAACL.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

Xinyun Chen, Chen Liang, Adams Wei Yu, Denny Zhou, Dawn Song, and Quoc V. Le. 2019. Neural symbolic reader: Scalable integration of distributed and symbolic representations for reading comprehension. ICLR.

Ting-Rui Chiang and Yun-Nung Chen. 2019. Semantically-aligned equation generation for solving and reasoning math word problems. NAACL.

Peter Clark, Oyvind Tafjord, and Kyle Richardson. 2020. Transformers as soft reasoners over language. IJCAI.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL.

Honghua Dong, Jiayuan Mao, Tian Lin, Chong Wang, Lihong Li, and Denny Zhou. 2019. Neural logic machines. ICLR.

Dheeru Dua, Sameer Singh, and Matt Gardner. 2020. Benefits of intermediate annotations in reading comprehension. ACL.

Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. TACL.

Yuling Gu, Bhavana Dalvi Mishra, and Peter Clark. 2022. DREAM: Uncovering mental models behind language models. NAACL.

Braden Hancock, Paroma Varma, Stephanie Wang, Martin Bringmann, Percy Liang, and Christopher Ré. 2018. Training classifiers with natural language explanations. ACL.

Peter Hase and Mohit Bansal. 2022. When can models learn from explanations? A formal framework for understanding the roles of explanation data. ACL.

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.

Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. EMNLP.

Zhanming Jie, Jierui Li, and Wei Lu. 2022. Learning to reason deductively: Math word problem solving as complex relation extraction. arXiv preprint arXiv:2203.10316.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. 2016. MAWPS: A math word problem repository. NAACL.

Andrew K. Lampinen, Ishita Dasgupta, Stephanie C. Y. Chan, Kory Matthewson, Michael Henry Tessler, Antonia Creswell, James L. McClelland, Jane X. Wang, and Felix Hill. 2022. Can language models learn from explanations in context? arXiv preprint arXiv:2204.02329.

Yihuai Lan, Lei Wang, Qiyuan Zhang, Yunshi Lan, Bing Tian Dai, Yan Wang, Dongxiang Zhang, and Ee-Peng Lim. 2021. MWPToolkit: An open-source framework for deep learning-based math word problem solvers. arXiv preprint arXiv:2109.00799.

Teven Le Scao and Alexander Rush. 2021. How many data points is a prompt worth? NAACL.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. EMNLP.

Iddo Lev, Bill MacCartney, Christopher Manning, and Roger Levy. 2004. Solving logic puzzles: From robust processing to precise semantics. Proceedings of the 2nd Workshop on Text Meaning and Interpretation.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. ACL.

Zhengzhong Liang, Steven Bethard, and Mihai Surdeanu. 2021. Explainable multi-hop verbal reasoning through internal monologue. NAACL.

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. ACL.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.

Bodhisattwa Prasad Majumder, Oana-Maria Camburu, Thomas Lukasiewicz, and Julian McAuley. 2021. Rationale-inspired natural language explanations with commonsense. arXiv preprint arXiv:2106.13876.

Ana Marasović, Iz Beltagy, Doug Downey, and Matthew E. Peters. 2022. Few-shot self-rationalization with natural language prompts. NAACL Findings.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. ACL.

Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and developing English math word problem solvers. ACL.

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837.

Sharan Narang, Colin Raffel, Katherine Lee, Adam Roberts, Noah Fiedel, and Karishma Malkan. 2020. WT5?! Training text-to-text models to explain their predictions. arXiv preprint arXiv:2004.14546.

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. 2021. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? NAACL.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. NAACL.

Xinyu Pi, Qian Liu, Bei Chen, Morteza Ziyadi, Zeqi Lin, Yan Gao, Qiang Fu, Jian-Guang Lou, and Weizhu Chen. 2022. Reasoning like program executors. arXiv preprint arXiv:2201.11473.

Piotr Piękos, Mateusz Malinowski, and Henryk Michalewski. 2021. Measuring and improving BERT's mathematical abilities by predicting the order of reasoning. ACL.

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67.

Dheeraj Rajagopal, Vidhisha Balachandran, Eduard H. Hovy, and Yulia Tsvetkov. 2021. SelfExplain: A self-explaining architecture for neural text classifiers. EMNLP.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! Leveraging language models for commonsense reasoning. ACL.

Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, and Zhiyuan Liu. 2019. NumNet: Machine reading comprehension with numerical reasoning. EMNLP.

Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Michael Collins, Dipanjan Das, Slav Petrov, Gaurav Singh Tomar, Iulia Turc, and David Reitter. 2021. Measuring attribution in natural language generation models. arXiv preprint arXiv:2112.12870.

Gabriel Recchia. 2021. Teaching autoregressive language models complex tasks by demonstration. arXiv preprint arXiv:2109.02102.

Emily Reif, Daphne Ippolito, Ann Yuan, Andy Coenen, Chris Callison-Burch, and Jason Wei. 2022. A recipe for arbitrary text style transfer with large language models. ACL.

Laria Reynolds and Kyle McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm. Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems.

Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. EMNLP.

Subhro Roy, Tim Vieira, and Dan Roth. 2015. Reasoning about quantities in natural language. TACL.

Mohammed Saeed, Naser Ahmadi, Preslav Nakov, and Paolo Papotti. 2021. RuleBERT: Teaching soft rules to pre-trained language models. EMNLP.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2022. Multitask prompted training enables zero-shot task generalization. ICLR.

Jianhao Shen, Yichun Yin, Lin Li, Lifeng Shang, Xin Jiang, Ming Zhang, and Qun Liu. 2021. Generate & rank: A multi-task framework for math word problems. Findings of EMNLP.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. NAACL.

Alon Talmor, Oyvind Tafjord, Peter Clark, Yoav Goldberg, and Jonathan Berant. 2020. Leap-of-thought: Teaching pre-trained models to systematically reason over implicit knowledge. NeurIPS.
Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bhagavatula, Yoav Goldberg, Yejin Choi, and Jonathan Berant. 2021. CommonsenseQA 2.0: Exposing the limits of AI through gamification. NeurIPS Track on Datasets and Benchmarks.

Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, and Donald Metzler. 2022. Unifying language learning paradigms. arXiv preprint arXiv:2205.05131.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2022a. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. 2022b. Benchmarking generalization via in-context instructions on 1,600+ language tasks. arXiv preprint arXiv:2204.07705.

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022a. Finetuned language models are zero-shot learners. ICLR.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022b. Emergent abilities of large language models. Transactions on Machine Learning Research.

Sarah Wiegreffe, Jack Hessel, Swabha Swayamdipta, Mark Riedl, and Yejin Choi. 2022. Reframing human-AI collaboration for generating free-text explanations. NAACL.

Sarah Wiegreffe and Ana Marasović. 2021. Teach me to explain: A review of datasets for explainable NLP. NeurIPS.

Sarah Wiegreffe, Ana Marasović, and Noah A. Smith. 2021. Measuring association between labels and free-text rationales. EMNLP.

Tongshuang Wu, Ellen Jiang, Aaron Donsbach, Jeff Gray, Alejandra Molina, Michael Terry, and Carrie J. Cai. 2022a. PromptChainer: Chaining large language model prompts through visual programming. CHI Extended Abstracts.

Tongshuang Wu, Michael Terry, and Carrie Jun Cai. 2022b. AI chains: Transparent and controllable human-AI interaction by chaining large language model prompts. CHI.

Yujun Yan, Kevin Swersky, Danai Koutra, Parthasarathy Ranganathan, and Milad Hashemi. 2020. Neural execution engines: Learning to execute subroutines. NeurIPS.

Huihan Yao, Ying Chen, Qinyuan Ye, Xisen Jin, and Xiang Ren. 2021. Refining language models with compositional explanations. NeurIPS.

Xi Ye and Greg Durrett. 2022. The unreliability of explanations in few-shot in-context learning. arXiv preprint arXiv:2205.03401.

Yordan Yordanov, Vid Kocijan, Thomas Lukasiewicz, and Oana-Maria Camburu. 2021. Few-shot out-of-domain transfer learning of natural language explanations. arXiv preprint arXiv:2112.06204.

Omar Zaidan, Jason Eisner, and Christine Piatko. 2007. Using "annotator rationales" to improve machine learning for text categorization. NAACL.

Wojciech Zaremba and Ilya Sutskever. 2014. Learning to execute. arXiv preprint arXiv:1410.4615.

Eric Zelikman, Yuhuai Wu, and Noah D. Goodman. 2022. STaR: Bootstrapping reasoning with reasoning. arXiv preprint arXiv:2203.14465.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. ICML.
Wangchunshu Zhou, Jinyi Hu, Hanlin Zhang, Xiaodan Liang, Maosong Sun, Chenyan Xiong, and Jian Tang. 2020. Towards interpretable natural language understanding with explanations as latent variables. NeurIPS.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] See Section 6 and Appendix A.2.
(c) Did you discuss any potential negative societal impacts of your work? [Yes] We do not expect negative societal impacts as a direct result of the contributions in our paper. One consideration, however, is that generated chains of thought are not always factual, which is noted as a limitation in Appendix D.1 (and note that we do not suggest using such chains of thought in a factual manner or in any real-world scenario).
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]

3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We included inputs, outputs, and targets for LaMDA and GPT-3 in the supplementary material. Although we use proprietary models, the GPT-3 results are fully reproducible. Reproducibility is further discussed in Appendix E.1.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] Data splits were specified; N/A for hyperparameters.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] Standard deviation for multiple seeds using LaMDA 137B, where each seed is a different random order of exemplars, is given in Table 6 and Table 7.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] The types of resources are described in Appendix E.2, though we did not estimate the total amount of compute.

4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes] We used two models that we anonymized based on the recommendation of the NeurIPS chairs. These models will be cited in the camera-ready version of the paper.
(b) Did you mention the license of the assets? [Yes] See Appendix E.3.
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes] The coin flip and last letter concatenation datasets are the only new assets, and they are given in the Supplementary Materials.
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A] No human data collected.
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] No human data collected.

5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable?
[N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

A Frequently Asked Questions

A.1 Why does increasing model scale improve chain-of-thought prompting?

The finding that successful chain-of-thought reasoning predictably emerges only at certain model scales is intriguing. Scaling up language models has been shown to confer benefits such as improved performance and sample efficiency (Kaplan et al., 2020), but chain-of-thought reasoning is emergent in the sense that its success cannot be predicted only by extrapolating the performance of small-scale models, as chain of thought actually hurts performance for most models smaller than 10B parameters.

The question of why model scale improves chain-of-thought prompting is certainly multi-faceted, and we made a preliminary attempt to shed insight into it via error analysis. This small analysis involved manually reading 45 errors made by PaLM 62B and categorizing them into semantic understanding (20 errors), one step missing (18 errors), and other errors (7 errors). The "other" category included hallucinations, repetitive outputs, and symbol mapping errors. This categorization is a coarse one borrowed from the initial error analysis done on LaMDA in Appendix D.2, for which categories were conceived based on what improvements were needed to make the chain of thought correct.

As shown in Figure 9, scaling PaLM to 540B parameters fixed a substantial portion of errors in all three categories. Examples of semantic understanding and one-step-missing errors that were fixed by scaling PaLM to 540B are given in Figure 10. This result appears consistent with a hypothesis that language models acquire a range of semantic understanding and logical reasoning skills as a function of model scale (though note that model scale is often conflated with other factors, such as the amount of training compute).

Figure 9: Error analysis of 45 problems that PaLM 62B got incorrect, categorized as semantic understanding (20 errors, of which 540B fixes 6), one step missing (18 errors, of which 540B fixes 12), and other (7 errors, of which 540B fixes 4; this category includes hallucinations, repetitive outputs, and symbol mapping errors). Scaling PaLM to 540B fixed a substantial portion of errors in all categories.

There are also three notable points regarding why small language models fail. The first observation is that small language models fail at even relatively easy symbol mapping tasks. As demonstrated in Section 5, even for symbolic reasoning tasks that only require generalization to new examples using the same chain-of-thought logical structure given in the few-shot exemplars, small language models still failed. The second observation is that small language models seem to have inherently weaker arithmetic abilities: as shown by Brown et al. (2020), the ability to do simple arithmetic operations (without semantic understanding) requires sufficient model scale.
Finally, we noticed qualitatively that small language models often did not generate a final answer that could be parsed, due either to repetitions or to logic that never arrived at a final answer.
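As context for this failure mode, evaluation depends on parsing a final answer out of the generation. The paper does not include its extraction code, so the following is only a minimal sketch of the kind of parser assumed here; the function name and regular expression are illustrative assumptions, not the paper's implementation.

```python
import re

def extract_final_answer(generation: str):
    """Return the number after the last 'The answer is' span, or None.

    Small models often repeat themselves or never conclude, in which
    case no answer can be parsed and the example is scored as wrong.
    """
    matches = re.findall(r"The answer is\s*\$?(-?[\d,]*\.?\d+)", generation)
    if not matches:
        return None  # unparseable: repetition or reasoning that never concluded
    return float(matches[-1].replace(",", ""))
```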
In summary, the success of chain-of-thought reasoning as a result of model scale is a complicated phenomenon that likely involves a variety of emergent abilities (semantic understanding, symbol mapping, staying on topic, arithmetic ability, faithfulness, etc.). Future work could more thoroughly investigate what properties of pretraining data, model architecture, and optimization objective causally enable such reasoning capabilities.

Figure 10 examples, shown as question, 62B model output, and 540B model output:

Question: Tracy used a piece of wire 4 feet long to support tomato plants in the garden. The wire was cut into pieces 6 inches long. How many pieces did she obtain?
62B model output: She cut the wire into 6 inch pieces. This means she got 4 * 6 = 24 pieces. The answer is 24. (Incorrect: semantic understanding error)
540B model output: The wire was 4 feet long. This means it was 4 * 12 = 48 inches long. It was cut into pieces 6 inches long. This means she obtained 48 / 6 = 8 pieces. The answer is 8. (Correct)

Question: Tom's ship can travel at 10 miles per hour. He is sailing from 1 to 4 PM. He then travels back at a rate of 6 mph. How long does it take him to get back?
62B model output: He travels from 1 to 4 PM. That means he travels for 3 hours. He then travels back at 6 mph. That means he travels for 3 * 6 = 18 miles. The answer is 18. (Incorrect: semantic understanding error)
540B model output: He travels at 10 mph from 1 to 4 PM. This means he travels 3 hours. 3 hours at 10 mph means he travels 3 * 10 = 30 miles. He then travels back at 6 mph. This means he travels 6 miles per hour. He has to travel 30 miles, so it takes him 30 / 6 = 5 hours. The answer is 5. (Correct)

Question: Stephen placed an online order for groceries. His final bill came to $40.00. Because this was through a delivery vendor, they tacked on a 25% fee to his final total and charged him $3.00 in delivery fees. Stephen also added a $4.00 tip. After the extra fees, what was the final price of Stephen's groceries?
62B model output: The final price of Stephen's groceries was $40.00. The delivery vendor added 25% to the final total, so the final total was 125% of $40.00. This means the final total was 125% * $40.00 = $50.00. The delivery vendor also charged $3.00 in delivery fees. So the final price of Stephen's groceries was $50.00 + $3.00 = $53.00. The answer is $53.00. (Incorrect: one step missing error)
540B model output: Stephen's bill was $40.00. The delivery vendor added 25% to the bill. This means the bill was increased by 25% of $40.00. This is 0.25 * $40.00 = $10.00. The delivery fee was $3.00. The tip was $4.00. So the total bill was $40.00 + $10.00 + $3.00 + $4.00 = $57.00. The answer is 57. (Correct)

Question: There are four schools competing at a basketball tournament. Each school has sent a girls' basketball team and a boys' basketball team and each team has 5 players each. Each school has also sent a coach for each team. In total, how many people have all of the schools sent?
62B model output: There are 4 schools. Each school has sent a girls' basketball team and a boys' basketball team. Each team has 5 players each. That means there are 4 * 2 * 5 = 40 players. Each team has a coach. That means there are 40 + 4 = 44 coaches. The answer is 44. (Incorrect: one step missing error)
540B model output: Each school has sent 2 teams, each with 5 players. This means each school has sent 2 * 5 = 10 players. Each school has also sent 2 coaches. This means each school has sent 10 + 2 = 12 people. There are 4 schools, so in total all of the schools have sent 4 * 12 = 48 people. The answer is 48. (Correct)

Figure 10: Examples of semantic understanding and one-step-missing errors that were fixed by scaling PaLM from 62B to 540B.

A.2 What is the role of prompt engineering?

One of the key considerations of prompting is sensitivity to the exact prompt. There is no shortage of work showing that prompts affect language models in unexpected ways (Min et al., 2022). The general way that we created chain-of-thought annotations was by taking eight exemplars from the training set and decomposing the reasoning process into multiple steps leading to the final answer. Examples of chain-of-thought annotations are provided in Figure 3, with full prompts given in Appendix G. To analyze how sensitive chain of thought is to prompt engineering, we performed robustness experiments with respect to various factors.

Different annotators. We first analyze robustness to three different annotators (Section 3.4 and Figure 6). Although there is notable variance in performance (which we discuss later), chain of thought performed better than the baseline by a large margin for all three annotators on eight datasets in arithmetic, commonsense, and symbolic reasoning (Table 6 and Table 7). Similar to the annotation process in Cobbe et al. (2021), annotators were not given specific instructions about how to write the chain-of-thought annotations other than to simply write the step-by-step reasoning process that led to the final answer. Thus, the annotations were written in each annotator's own linguistic "chain of thought" writing style.

Annotators without machine learning background. The GSM8K dataset (Cobbe et al., 2021) conveniently provides a training set with reasoning chains written by crowd compute workers, which enables us to investigate whether chain of thought still works with reasoning chains from an independent source without a background in machine learning. We therefore randomly sampled three sets of eight exemplars with chains of thought from GSM8K. These chain-of-thought annotations also outperformed the baseline by a large margin for all four arithmetic datasets (Table 6), indicating that chain of thought does not depend on a particular set of annotators.

Different exemplars. The GSM8K exemplars experiment above (Table 6) also shows that chain-of-thought prompting works for different sets of exemplars. Notably, we test every set of exemplars on all four arithmetic datasets (instead of picking exemplars from the training set for each dataset), which suggests that the exemplars do not necessarily have to come from the same dataset distribution as the test examples.

Different order of exemplars. Prior work has shown that in some cases (e.g., classification) even the order of exemplars matters: varying the permutation of few-shot exemplars can cause the accuracy of GPT-3 on SST-2 to range from near chance (54.3%) to near SOTA (93.4%) (Zhao et al., 2021). We show the standard deviation of performance over different exemplar orders in Table 6 and Table 7. Standard deviations with respect to prompt order are relatively minimal in almost all cases. The one exception is the coin flip task, for which exemplar orders have a high standard deviation, likely for the reason cited in Zhao et al. (2021): in classification, many exemplars of the same category in a row bias the model outputs. A sketch of this prompt-assembly and order-shuffling setup is given below.
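This sketch is not the paper's code: the exemplar text, function name, and Q/A formatting below are illustrative assumptions, though the format mirrors the eight-exemplar prompts described above.

```python
import random

# Hypothetical exemplar in the paper's Q/A format; the real prompts use
# eight manually composed exemplars (see Appendix G).
EXEMPLARS = [
    ("Ava has 3 pencils. She buys 2 packs of 4 pencils each. "
     "How many pencils does she have now?",
     "Ava started with 3 pencils. 2 packs of 4 pencils each is 8 pencils. "
     "3 + 8 = 11. The answer is 11."),
    # ... seven more (question, chain of thought) pairs ...
]

def build_prompt(exemplars, test_question, seed=None):
    """Concatenate few-shot exemplars (optionally shuffled) before the test question."""
    if seed is not None:  # vary exemplar order to measure sensitivity across seeds
        exemplars = random.Random(seed).sample(exemplars, len(exemplars))
    blocks = [f"Q: {q}\nA: {cot}" for q, cot in exemplars]
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)

# Five seeds give five exemplar orders, matching how the standard deviations
# over exemplar order in Table 6 and Table 7 are obtained.
prompts = [build_prompt(EXEMPLARS, "A test question goes here.", seed=s)
           for s in range(5)]
```

Sampling without replacement over the full list is simply a seeded permutation, so each seed corresponds to one exemplar ordering of the same eight exemplars.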
Different number of exemplars. We also found that gains from chain-of-thought prompting generally still held with a varying number of few-shot exemplars. This is shown for five datasets in Figure 11 (we did not have the compute to run this for all datasets). We also found in preliminary experiments that further increasing the number of exemplars in standard prompting did not lead to significant gains (e.g., increasing from 8 to 16 exemplars did not improve the performance of standard prompting enough to catch up with chain-of-thought prompting).

Different language models. Another interesting question is whether prompts that work better for one model also work better for other large language models. With the same prompts, chain-of-thought prompting improves performance across all three models (LaMDA, GPT-3, and PaLM) for all datasets except CSQA and StrategyQA for GPT-3 (Table 1, Table 4, Table 5). The fact that gains from chain of thought did not transfer perfectly among models is a limitation; further work could investigate how different pretraining datasets and model architectures affect the performance gains from chain-of-thought prompting.

Prompt engineering still matters, though. Although the results are relatively robust to the prompt for arithmetic reasoning, we want to be clear that prompt engineering still does matter and can improve performance significantly in many cases. Though most chain-of-thought annotations outperform standard prompting, there is large variation in many cases. For instance, for the coin flip task, performance varied from 99.6% for Annotator A to 71.4% for Annotator C, though both were well above the standard prompting result of 50.0% (see Table 7). There are even tasks where prompt engineering is a requirement for good performance. In preliminary experiments, we tried using chain of thought to enable language models to reverse the order of a list of 5 items. While two co-authors were not able to write chain-of-thought prompts that solved the task despite their best attempts, a third co-author was able to write a chain of thought that solved the task perfectly. How to generate chain-of-thought annotations in a robust fashion could be an interesting direction for future work. For instance, one idea could be to use a large language model to automatically generate chains of thought via prompting (and potentially optimize this over a validation set).

A.3 Will chain-of-thought prompting improve performance for my task of interest?

While chain-of-thought prompting is in principle applicable to any text-to-text task, it is more helpful for some tasks than others. Based on the experiments in this paper, our intuition is that chain of thought helps the most when three conditions are met: (1) the task is challenging and requires multi-step reasoning, (2) a large language model is used, and (3) the scaling curve is relatively flat. Conversely, the benefits are smaller when one or more of these conditions are not met. These intuitions are perhaps supported by the arithmetic reasoning results. The performance gain from chain-of-thought prompting is largest for PaLM 540B on GSM8K (challenging multi-step problems, flat scaling curve), which meets these conditions.
The performance gain is small for the subsets of MAWPS that only require one or two steps (SingleOp, SingleEq, and AddSub), for which PaLM 540B already achieves accuracy of 90% or higher (and it is also generally true that there is less headroom for improvement when performance is already strong).

Although in this paper we focused on multi-step reasoning tasks (arithmetic, commonsense, and symbolic), chain-of-thought prompting can potentially be applied to any task that humans use a "chain of thought" to solve (at least in principle). We leave the empirical evaluation of chain-of-thought prompting on such diverse tasks (e.g., machine translation) to future work.

A.4 Why is prompting with the equation only not enough for some arithmetic reasoning datasets?

Prompting with the equation only as an intermediate step does help on many datasets, especially when the datasets require only a few reasoning steps (SVAMP, ASDiv, MAWPS). For GSM8K, however, using the equation only did not improve performance substantially. Based on qualitative analysis, we believe that these questions are too semantically challenging for the model to translate them directly into a math equation. Consider this example from LaMDA 137B:

Question: Mike plays ping pong for 40 minutes. In the first 20 minutes, he scores 4 points. In the second 20 minutes, he scores 25% more points. How many total points did he score?

Equation only (wrong answer): (4 + 20 * 0.25) = 6. The answer is 6.

Chain of thought (correct): Mike played ping pong for 40 minutes. In the first 20 minutes, he scored 4 points. In the second 20 minutes, he scored 25% more points. So he scored 25% more in the second 20 minutes. 4 x 1.25 = 5. So he scored 5 points in the second 20 minutes. So he scored 9 points in total. The answer is 9.

It is hard for the model to directly translate all of the semantics into a single equation, but chain of thought allows it to better reason about each part of the question via intermediate steps in natural language.

B All Experimental Results

This section contains tables of experimental results for varying models and model sizes, on all benchmarks, for standard prompting versus chain-of-thought prompting.

For the arithmetic reasoning benchmarks, some chains of thought (along with the equations produced) were correct, except that the model performed an arithmetic operation incorrectly. A similar observation was made in Cobbe et al. (2021). Hence, we can further add a Python program as an external calculator (using the Python eval function) to all the equations in the generated chain of thought. When there are multiple equations in a chain of thought, we propagate the external calculator results from one equation to the following equations via string matching. As shown in Table 1, adding a calculator significantly boosts the performance of chain-of-thought prompting on most tasks.
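The paper describes the calculator only at this level of detail, so the following is a minimal sketch, under our own assumptions about the equation format, of how such a post-hoc calculator could work: find each "lhs = result" span, recompute the left-hand side with eval, and propagate corrected values to later equations via string matching. The regular expression and function name are illustrative, not the paper's code.

```python
import re

# An equation is an arithmetic left-hand side followed by '=' and a stated
# numeric result; dollar signs, percentages, and 'x' multiplication are not
# handled in this sketch.
EQUATION = re.compile(r"([\d.+\-*/() ]+?)=\s*(-?\d+(?:\.\d+)?)")

def recompute_equations(chain_of_thought: str) -> str:
    """Re-evaluate each equation with eval() and propagate corrected results
    to later equations via (approximate) string matching."""
    text = chain_of_thought
    for _ in range(20):  # bounded number of correction passes
        for match in EQUATION.finditer(text):
            lhs, stated = match.group(1), match.group(2)
            try:
                # The regex restricts lhs to arithmetic characters, so eval
                # only ever sees expressions like ' 23 - 20 '.
                correct = f"{eval(lhs):g}"
            except (SyntaxError, ZeroDivisionError):
                continue
            if correct != stated:
                # Patch this result; later mentions of the wrong value are
                # replaced too, which is what propagates the correction.
                text = (text[:match.start(2)]
                        + text[match.start(2):].replace(stated, correct))
                break  # re-scan: the fix may change later equations
        else:
            return text  # every equation checked out
    return text
```

For example, a generation containing "23 - 20 = 4. ... 4 + 6 = 10. The answer is 10." would be patched to "23 - 20 = 3. ... 3 + 6 = 9. The answer is 9." over successive passes: the first pass corrects the subtraction and propagates the 3, and the next pass recomputes the now-changed addition.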
Table 1: Chain-of-thought prompting outperforms standard prompting for various large language models on five arithmetic reasoning benchmarks. All metrics are accuracy (%). Ext. calc.: post-hoc external calculator for arithmetic computations only. Prior best numbers are from the following. a: Cobbe et al. (2021); b and e: Pi et al. (2022); c: Lan et al. (2021); d: Piękos et al. (2021).

                                 GSM8K         SVAMP         ASDiv         AQuA          MAWPS
Prior best (finetuning)          55 (a)        57.4 (b)      75.3 (c)      37.9 (d)      88.4 (e)
UL2 20B
  Standard                       4.1           10.1          16.0          20.5          16.6
  Chain of thought               4.4 (+0.3)    12.5 (+2.4)   16.9 (+0.9)   23.6 (+3.1)   19.1 (+2.5)
  + ext. calc                    6.9           28.3          34.3          23.6          42.7
LaMDA 137B
  Standard                       6.5           29.5          40.1          25.5          43.2
  Chain of thought               14.3 (+7.8)   37.5 (+8.0)   46.6 (+6.5)   20.6 (-4.9)   57.9 (+14.7)
  + ext. calc                    17.8          42.1          53.4          20.6          69.3
GPT-3 175B (text-davinci-002)
  Standard                       15.6          65.7          70.3          24.8          72.7
  Chain of thought               46.9 (+31.3)  68.9 (+3.2)   71.3 (+1.0)   35.8 (+11.0)  87.1 (+14.4)
  + ext. calc                    49.6          70.3          71.1          35.8          87.5
Codex (code-davinci-002)
  Standard                       19.7          69.9          74.0          29.5          78.7
  Chain of thought               63.1 (+43.4)  76.4 (+6.5)   80.4 (+6.4)   45.3 (+15.8)  92.6 (+13.9)
  + ext. calc                    65.4          77.0          80.0          45.3          93.3
PaLM 540B
  Standard                       17.9          69.4          72.1          25.2          79.2
  Chain of thought               56.9 (+39.0)  79.0 (+9.6)   73.9 (+1.8)   35.8 (+10.6)  93.3 (+14.2)
  + ext. calc                    58.6          79.8          72.6          35.8          93.5

Table 2: Standard prompting versus chain-of-thought prompting on five arithmetic reasoning benchmarks. Each cell shows standard / chain-of-thought accuracy. Note that chain-of-thought prompting is an emergent ability of model scale: it does not positively impact performance until used with a model of sufficient scale.

Model         GSM8K        SVAMP        ASDiv        AQuA         MAWPS
UL2 20B       4.1 / 4.4    10.1 / 12.5  16.0 / 16.9  20.5 / 23.6  16.6 / 19.1
LaMDA 420M    2.6 / 0.4    2.5 / 1.6    3.2 / 0.8    23.5 / 8.3   3.2 / 0.9
LaMDA 2B      3.6 / 1.9    3.3 / 2.4    4.1 / 3.8    22.9 / 17.7  3.9 / 3.1
LaMDA 8B      3.2 / 1.6    4.3 / 3.4    5.9 / 5.0    22.8 / 18.6  5.3 / 4.8
LaMDA 68B     5.7 / 8.2    13.6 / 18.8  21.8 / 23.1  22.3 / 20.2  21.6 / 30.6
LaMDA 137B    6.5 / 14.3   29.5 / 37.5  40.1 / 46.6  25.5 / 20.6  43.2 / 57.9
GPT 350M      2.2 / 0.5    1.4 / 0.8    2.1 / 0.8    18.1 / 8.7   2.4 / 1.1
GPT 1.3B      2.4 / 0.5    1.5 / 1.7    2.6 / 1.4    12.6 / 4.3   3.1 / 1.7
GPT 6.7B      4.0 / 2.4    6.1 / 3.1    8.6 / 3.6    15.4 / 13.4  8.8 / 3.5
GPT 175B      15.6 / 46.9  65.7 / 68.9  70.3 / 71.3  24.8 / 35.8  72.7 / 87.1
Codex         19.7 / 63.1  69.9 / 76.4  74.0 / 80.4  29.5 / 45.3  78.7 / 92.6
PaLM 8B       4.9 / 4.1    15.1 / 16.8  23.7 / 25.2  19.3 / 21.7  26.2 / 30.5
PaLM 62B      9.6 / 29.9   48.2 / 46.7  58.7 / 61.9  25.6 / 22.4  61.8 / 80.3
PaLM 540B     17.9 / 56.9  69.4 / 79.0  72.1 / 73.9  25.2 / 35.8  79.2 / 93.3

Table 3: Standard prompting versus chain-of-thought prompting on the four subsets of the MAWPS benchmark. Each cell shows standard / chain-of-thought accuracy. The point of stratifying the MAWPS benchmark is to show that performance gains are minimal on easy one-step or two-step problems where large language models already achieve high performance (e.g., SingleOp, SingleEq, and AddSub).

Model         SingleOp     SingleEq     AddSub       MultiArith
UL2 20B       24.9 / 27.2  18.0 / 20.2  18.5 / 18.2  5.0 / 10.7
LaMDA 420M    2.8 / 1.0    2.4 / 0.4    1.9 / 0.7    5.8 / 1.5
LaMDA 2B      4.6 / 4.1    2.4 / 3.3    2.7 / 3.2    5.8 / 1.8
LaMDA 8B      8.0 / 7.0    4.5 / 4.4    3.4 / 5.2    5.2 / 2.4
LaMDA 68B     36.5 / 40.8  23.9 / 26.0  17.3 / 23.2  8.7 / 32.4
LaMDA 137B    73.2 / 76.2  48.8 / 58.7  43.0 / 51.9  7.6 / 44.9
GPT 350M      3.2 / 1.8    2.0 / 0.2    2.0 / 1.5    2.3 / 0.8
GPT 1.3B      5.3 / 3.0    2.4 / 1.6    2.3 / 1.5    2.2 / 0.5
GPT 6.7B      13.5 / 3.9   8.7 / 4.9    8.6 / 2.5    4.5 / 2.8
GPT 175B      90.9 / 88.8  82.7 / 86.6  83.3 / 81.3  33.8 / 91.7
Codex         93.1 / 91.8  86.8 / 93.1  90.9 / 89.1  44.0 / 96.2
PaLM 8B       41.8 / 46.6  29.5 / 28.2  29.4 / 31.4  4.2 / 15.8
PaLM 62B      87.9 / 85.6  77.2 / 83.5  74.7 / 78.2  7.3 / 73.7
PaLM 540B     94.1 / 94.1  86.5 / 92.3  93.9 / 91.9  42.2 / 94.7

Table 4: Standard prompting versus chain-of-thought prompting on five commonsense reasoning benchmarks. Each cell shows standard / chain-of-thought accuracy. Chain-of-thought prompting is an emergent ability of model scale: it does not positively impact performance until used with a model of sufficient scale.
Model         CSQA         StrategyQA   Date         Sports       SayCan
UL2 20B       34.2 / 51.4  59.0 / 53.3  13.5 / 14.0  57.9 / 65.3  20.0 / 41.7
LaMDA 420M    20.1 / 19.2  46.4 / 24.9  1.9 / 1.6    50.0 / 49.7  7.5 / 7.5
LaMDA 2B      20.2 / 19.6  52.6 / 45.2  8.0 / 6.8    49.3 / 57.5  8.3 / 8.3
LaMDA 8B      19.0 / 20.3  54.1 / 46.8  9.5 / 5.4    50.0 / 52.1  28.3 / 33.3
LaMDA 68B     37.0 / 44.1  59.6 / 62.2  15.5 / 18.6  55.2 / 77.5  35.0 / 42.5
LaMDA 137B    53.6 / 57.9  62.4 / 65.4  21.5 / 26.8  59.5 / 85.8  43.3 / 46.6
GPT 350M      14.7 / 15.2  20.6 / 0.9   4.3 / 0.9    33.8 / 41.6  12.5 / 0.8
GPT 1.3B      12.0 / 19.2  45.8 / 35.7  4.0 / 1.4    0.0 / 26.9   20.8 / 9.2
GPT 6.7B      19.0 / 24.0  53.6 / 50.0  8.9 / 4.9    0.0 / 4.4    17.5 / 35.0
GPT 175B      79.5 / 73.5  65.9 / 65.4  43.8 / 52.1  69.6 / 82.4  81.7 / 87.5
Codex         82.3 / 77.9  67.1 / 73.2  49.0 / 64.8  71.7 / 98.5  85.8 / 88.3
PaLM 8B       19.8 / 24.9  55.6 / 53.5  12.9 / 13.1  55.1 / 75.2  34.2 / 40.0
PaLM 62B      65.4 / 68.1  58.4 / 63.4  29.8 / 44.7  72.1 / 93.6  65.8 / 70.0
PaLM 540B     78.1 / 79.9  68.6 / 77.8  49.0 / 65.3  80.5 / 95.4  80.8 / 91.7

Table 5: Chain-of-thought prompting (versus standard prompting) enables length generalization to longer inference examples on two symbolic manipulation tasks. Each cell shows standard / chain-of-thought accuracy; "2" is the in-domain length and "OOD: 3" and "OOD: 4" are out-of-distribution lengths.

Last Letter Concatenation:
Model         2            OOD: 3       OOD: 4
UL2 20B       0.6 / 18.8   0.0 / 0.2    0.0 / 0.0
LaMDA 420M    0.3 / 1.6    0.0 / 0.0    0.0 / 0.0
LaMDA 2B      2.3 / 6.0    0.0 / 0.0    0.0 / 0.0
LaMDA 8B      1.5 / 11.5   0.0 / 0.0    0.0 / 0.0
LaMDA 68B     4.4 / 52.0   0.0 / 0.8    0.0 / 2.5
LaMDA 137B    5.8 / 77.5   0.0 / 34.4   0.0 / 13.5
PaLM 8B       2.6 / 18.8   0.0 / 0.0    0.0 / 0.2
PaLM 62B      6.8 / 85.0   0.0 / 59.6   0.0 / 13.4
PaLM 540B     7.6 / 99.4   0.2 / 94.8   0.0 / 63.0

Coin Flip (state tracking):
Model         2            OOD: 3       OOD: 4
UL2 20B       70.4 / 67.1  51.6 / 52.2  48.7 / 50.4
LaMDA 420M    52.9 / 49.6  50.0 / 50.5  49.5 / 49.1
LaMDA 2B      54.9 / 55.3  47.4 / 48.7  49.8 / 50.2
LaMDA 8B      52.9 / 55.5  48.2 / 49.6  51.2 / 50.6
LaMDA 68B     56.2 / 83.2  50.4 / 69.1  50.9 / 59.6
LaMDA 137B    49.0 / 99.6  50.7 / 91.0  49.1 / 74.5
PaLM 8B       60.0 / 74.4  47.3 / 57.1  50.9 / 51.8
PaLM 62B      91.4 / 96.8  43.9 / 91.0  38.3 / 72.4
PaLM 540B     98.1 / 100.0 49.3 / 98.6  54.8 / 90.2

Table 6: Ablation and robustness results for arithmetic reasoning datasets. Chain of thought generally outperforms the ablations by a large amount. "Equation only" performs in between standard prompting and chain-of-thought prompting, as it allows for intermediate reasoning steps via equations but does not leverage natural language. Chain-of-thought prompting has variance (as expected) when used with prompts written by different annotators or when using other exemplars, but still outperforms standard prompting by a large margin. The standard deviation shown is over different orders of few-shot prompting exemplars, with five different random seeds. Results here are shown for LaMDA 137B, as additional queries for GPT-3 and PaLM are both limited and expensive.

                                 GSM8K       SVAMP       ASDiv       MAWPS
Standard prompting               6.5 ±0.4    29.5 ±0.6   40.1 ±0.6   43.2 ±0.9
Chain-of-thought prompting       14.3 ±0.4   36.7 ±0.4   46.6 ±0.7   57.9 ±1.5
Ablations
  equation only                  5.4 ±0.2    35.1 ±0.4   45.9 ±0.6   50.1 ±1.0
  variable compute only          6.4 ±0.3    28.0 ±0.6   39.4 ±0.4   41.3 ±1.1
  reasoning after answer         6.1 ±0.4    30.7 ±0.9   38.6 ±0.6   43.6 ±1.0
Robustness
  different annotator (B)        15.5 ±0.6   35.2 ±0.4   46.5 ±0.4   58.2 ±1.0
  different annotator (C)        17.6 ±1.0   37.5 ±2.0   48.7 ±0.7   60.1 ±2.0
  intentionally concise style    11.1 ±0.3   38.7 ±0.8   48.0 ±0.3   59.6 ±0.7
  exemplars from GSM8K (α)       12.6 ±0.6   32.8 ±1.1   44.1 ±0.9   53.9 ±1.1
  exemplars from GSM8K (β)       12.7 ±0.5   34.8 ±1.1   46.9 ±0.6   60.9 ±0.8
  exemplars from GSM8K (γ)       12.6 ±0.7   35.6 ±0.5   44.4 ±2.6   54.2 ±4.7

Table 7: Ablation and robustness results for four datasets in commonsense and symbolic reasoning. Chain of thought generally outperforms the ablations by a large amount.
Chain-of-thought prompting has variance (as expected) when used with prompts written by different annotators or when using other exemplars, but still outperforms standard prompting by a large margin. The standard deviation shown is over different orders of few-shot prompting exemplars, with five different random seeds. Results here are shown for LaMDA 137B, as additional queries for GPT-3 and PaLM are both limited and expensive. The exception is that we run SayCan using PaLM here, as the SayCan evaluation set is only 120 examples and therefore less expensive to run multiple times.

                                 Commonsense                          Symbolic
                                 Date        Sports      SayCan       Concat      Coin
Standard prompting               21.5 ±0.6   59.5 ±3.0   80.8 ±1.8    5.8 ±0.6    49.0 ±2.1
Chain-of-thought prompting       26.8 ±2.1   85.8 ±1.8   91.7 ±1.4    77.5 ±3.8   99.6 ±0.3
Ablations
  variable compute only          21.3 ±0.7   61.6 ±2.2   74.2 ±2.3    7.2 ±1.6    50.7 ±0.7
  reasoning after answer         20.9 ±1.0   63.0 ±2.0   83.3 ±0.6    0.0 ±0.0    50.2 ±0.5
Robustness
  different annotator (B)        27.4 ±1.7   75.4 ±2.7   88.3 ±1.4    76.0 ±1.9   77.5 ±7.9
  different annotator (C)        25.5 ±2.5   81.1 ±3.6   85.0 ±1.8    68.1 ±2.2   71.4 ±11.1

C Extended Related Work

Chain-of-thought prompting is a general approach inspired by several prior directions: prompting, natural language explanations, program synthesis/execution, numeric and logical reasoning, and intermediate language steps.

C.1 Prompting

The recent success of large-scale language models has led to growing interest in improving their capability to perform tasks via prompting (Brown et al. (2020), and see Liu et al. (2021) for a survey). This paper falls in the category of general prompting approaches, whereby input prompts are optimized to allow a single large language model to better perform a variety of tasks (Li and Liang, 2021; Lester et al., 2021; Reif et al., 2022, inter alia).

One recent line of work aims to improve the ability of language models to perform a task by providing instructions that describe the task (Raffel et al., 2020; Wei et al., 2022a; Ouyang et al., 2022; Sanh et al., 2022; Wang et al., 2022b). This line of work is related because it also augments input–output pairs with metadata. But whereas an instruction augments the input to a task (instructions are typically prepended to the inputs), chain-of-thought prompting augments the outputs of language models. Another related direction is sequentially combining the outputs of language models: human–computer interaction (HCI) work (Wu et al., 2022a,b) has shown that combining sequential generations of language models improves task outcomes in a 20-person user study.

C.2 Natural language explanations

Another closely related direction uses natural language explanations (NLEs), often with the goal of improving model interpretability (Zhou et al., 2020; Wiegreffe and Marasović, 2021, inter alia). That line of work typically focuses on natural language inference (Camburu et al., 2018; Yordanov et al., 2021; Bostrom et al., 2021), and produces explanations either simultaneously with or after the final prediction.