Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks

Summary

This paper introduces Goat, a fine-tuned LLaMA model that significantly outperforms GPT-4 on a range of arithmetic tasks. Trained on a synthetically generated dataset, the model achieves state-of-the-art results on the BIG-bench arithmetic sub-task and is especially strong on large-number addition, subtraction, multiplication, and division. The paper describes the fine-tuning methodology, a decomposition scheme for tasks the model cannot learn directly, and a thorough evaluation of the model's performance.

Full Transcript


Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks

Tiedong Liu, Bryan Kian Hsiang Low
National University of Singapore
[email protected] [email protected]

arXiv:2305.14201v1 [cs.LG] 23 May 2023

Abstract

We introduce Goat, a fine-tuned LLaMA model that significantly outperforms GPT-4 on a range of arithmetic tasks. Fine-tuned on a synthetically generated dataset, Goat achieves state-of-the-art performance on BIG-bench arithmetic sub-task. In particular, the zero-shot Goat-7B matches or even surpasses the accuracy achieved by the few-shot PaLM-540B. Surprisingly, Goat can achieve near-perfect accuracy on large-number addition and subtraction through supervised fine-tuning only, which is almost impossible with previous pretrained language models, such as Bloom, OPT, GPT-NeoX, etc. We attribute Goat's exceptional performance to LLaMA's consistent tokenization of numbers. To tackle more challenging tasks like large-number multiplication and division, we propose an approach that classifies tasks based on their learnability, and subsequently decomposes unlearnable tasks, such as multi-digit multiplication and division, into a series of learnable tasks by leveraging basic arithmetic principles. We thoroughly examine the performance of our model, offering a comprehensive evaluation of the effectiveness of our proposed decomposition steps. Additionally, Goat-7B can be easily trained using LoRA on a 24GB VRAM GPU, facilitating reproducibility for other researchers. We release our model, dataset, and the Python script for dataset generation.1

1 https://github.com/liutiedong/goat

Goat-7B

3978640188 + 42886272 =
3978640188 + 42886272 = 4021526460

4523646 minus 67453156
4523646 − 67453156 = −62929510

Calculate 397 x 4429
397 × 4429
= 4429 × (300 + 90 + 7)
= 4429 × 300 + 4429 × 90 + 4429 × 7
= 1328700 + 398610 + 31003
= 1727310 + 31003
= 1758313

What is 8914/64?
8914 − 64 × 100 = 8914 − 6400 = 2514
2514 − 64 × 30 = 2514 − 1920 = 594
594 − 64 × 9 = 594 − 576 = 18
Therefore, 8914 ÷ 64 = 139 R 18

Figure 1: Example of Goat's response on addition, subtraction, multiplication, and division tasks. Prompts are marked in bold, followed by Goat-7B's response.

1 Introduction

Large language models (LLMs) have shown remarkable proficiency across a wide range of natural language processing (NLP) tasks (Brown et al., 2020; Chowdhery et al., 2022; Thoppilan et al., 2022). Notably, GPT-4 (OpenAI, 2023) has achieved state-of-the-art performances in such tasks. However, it is surprising that such powerful language models still struggle with elementary arithmetic tasks. The performance of GPT-4 in arithmetic tasks, particularly multiplication and division of large numbers, currently remains far from optimal, with accuracy levels trending toward zero.

In this paper, we present Goat, a fine-tuned language model that is GOod at Arithmetic Tasks. Goat achieves state-of-the-art performance in elementary arithmetic, including addition, subtraction, multiplication, and division of integers. We adopt an end-to-end supervised instruction-finetuning paradigm on LLaMA (Touvron et al., 2023), leveraging a synthetically generated dataset containing around 1 million samples. Unlike previous research
on arithmetic computation (Lee and Kim, 2023; Nogueira et al., 2021; Nye et al., 2021; Qian et al., to demonstrate the feasibility that supervised 2022; Zhou et al., 2022b), our study demonstrates fine-tuning alone can enable LLMs to gener- that through supervised fine-tuning alone and with- ate direct answers for certain elementary arith- out applying any special techniques, our model metic tasks, such as large-number addition is capable of generating direct answers for large- and subtraction, without applying any special number addition and subtraction with near-perfect techniques (Section 3.3). Previously effec- accuracy in a zero-shot setting. We attribute this ex- tive chain-of-thought (CoT) methods, such ceptional arithmetic ability to LLaMA’s consistent as those used for addition in sketchpad (Nye tokenization of numbers and show that this is al- et al., 2021) and LM Tutor (Qian et al., 2022), most impossible to achieve for previous LLMs such are no longer necessary. The impressive per- as Bloom (Scao et al., 2022), OPT (Zhang et al., formance is mainly attributed to LLaMA’s 2022), GPT-NeoX (Black et al., 2022), Pythia (Bi- consistent tokenization of numbers. derman et al., 2023), etc. However, the model encounters significant dif- To solve large-number multiplication and di- ficulties when generating direct answers for arith- vision, we propose a novel decomposition metic tasks like large-number multiplication and di- method based on the learnability of the task, vision. To overcome this challenge, we propose an leveraging basic arithmetic principles to en- approach that categorizes various arithmetic tasks sure human interpretability (Section 3.4). into learnable and unlearnable tasks, subsequently We systematically investigate the proposed decomposing the unlearnable tasks, such as multi- decomposition method and demonstrate its digit multiplication and division, into a series of effectiveness (Section 5). We conduct thor- learnable tasks by leveraging basic arithmetic prin- ough experiments on the decomposition steps ciples. Our approach ensures that the intermediate in a fully synthetic environment by mitigat- supervision which facilitates the model’s learning ing many hard-to-control aspects of natural is also easily understandable and interpretable by language. Our experimental setup offers an humans. We fine-tune our model to generate the ideal platform to study the impact of CoT and proposed CoT before generating the final answer, intermediate supervision. similar to sketchpad (Nye et al., 2021). Our method outperforms GPT-4’s long multiplication and long Our end-to-end instruction tuning pipeline can division methods by a large margin. We assess be easily integrated into existing instruction- the performance of our model using BIG-bench tuned language models (Chiang et al., 2023; (Srivastava et al., 2022) arithmetic sub-task, and Taori et al., 2023) and potentially enhance provide a comprehensive evaluation of the effec- their mathematical reasoning for math word tiveness of our proposed method. Our findings problems. We release the model, dataset, and suggest that the model can learn the pattern and script for generating the dataset. generalize to unseen data instead of purely mem- orizing the computation. Additionally, Goat-7B 2 Related Work can be conveniently trained using Low-Rank Adap- 2.1 Instruction Tuning tation (LoRA) (Hu et al., 2021) technique on a 24GB VRAM GPU, making it easily reproducible Instruction tuning (Chung et al., 2022; Ouyang for other researchers. 
et al., 2022; Sanh et al., 2021) is a technique used To summarize, our contributions include: to align pretrained language models with human in- structions. It enables targeted customization of Our model achieves state-of-the-art perfor- LLMs to specific tasks, enhancing their ability mance on various elementary arithmetic tasks, to generate more accurate and contextually rele- including addition, subtraction, multiplication, vant responses and improving the zero-shot perfor- and division of positive integers (Section 4). mance. The dataset used for instruction tuning can We show that an open-sourced model fine- be human-written (Ouyang et al., 2022), machine- tuned on a synthetically generated dataset has generated (Peng et al., 2023; Taori et al., 2023; the potential to achieve even higher accuracy Wang et al., 2022), or collected from web (Geng on arithmetic tasks compared to GPT-4. et al., 2023). Recently, there has been extensive To the best of our knowledge, we are the first research on fine-tuning LLaMA (Touvron et al., 2023) for various downstream tasks using instruc- strated effectiveness on 8-digit addition. However, tion tuning (Chiang et al., 2023; Geng et al., 2023; we show that previously effective CoT methods, Taori et al., 2023; Xu et al., 2023; Yunxiang et al., such as those used for addition in sketchpad (Nye 2023). Creating high-quality instruction tuning et al., 2021) and LM Tutor (Qian et al., 2022), are datasets can be expensive and time-consuming. In no longer necessary for certain arithmetic tasks this study, we utilize a simple Python program to like addition. By leveraging simple supervised fine- generate input-output pairs for arithmetic tasks. tuning alone, our model can perform addition and subtraction with sufficiently high accuracy. For 2.2 Arithmetic Reasoning challenging tasks like large-number multiplication Arithmetic reasoning has been a topic of interest in and division, previous studies (Muffo et al., 2022; NLP research for many years (Lu et al., 2022). Re- Lee and Kim, 2023) either fail to compute or are cently, the use of pretrained models (Brown et al., inefficient. Furthermore, our model is trained end- 2020; OpenAI, 2023) has shown great capabilities to-end such that it can follow human instructions. in solving math word problems. Particularly, chain of thought (CoT) (Kojima et al., 2022; Wei et al., 3 Method 2022; Zhou et al., 2022a) provides the model with 3.1 Language Model the intermediate steps to derive the final answer. However, studies have shown that LLMs struggle LLaMA (Touvron et al., 2023) is a collection of with basic arithmetic computation and often make open-source pretrained language models trained on arithmetic mistakes, even though the reasoning pro- trillions of tokens using publicly available datasets, cess is correct (Cobbe et al., 2021; Gao et al., 2022; and achieves state-of-the-art performance on many Schick et al., 2023). Consequently, one key chal- benchmarks. lenge of arithmetic reasoning, aside from mapping Previous studies (Kim et al., 2021; Nogueira natural language to arithmetic expressions, is how et al., 2021) have shown that tokenization is impor- to compute the generated arithmetic expressions tant for LLM’s arithmetic ability. Many commonly- with high accuracy. used subword tokenization techniques today are not ideal to represent numbers. 
However, LLaMA 2.3 Arithmetic Computation splits each digit into an individual token (Yuan Recent studies have explored using external tools et al., 2023), thereby ensuring consistent tokeniza- to evaluate arithmetic expressions. Toolformer tion of numbers, as shown in Appendix B. (Schick et al., 2023) and GSM8K (Cobbe et al., The selection of language models is crucial to 2021) invoke an external calculator to compute the our work. We believe the remarkable arithmetic generated arithmetic expression. PoT (Chen et al., ability demonstrated in this work is mainly at- 2022) and PAL (Gao et al., 2022) generate pro- tributed to LLaMA’s consistent tokenization of grams that can be executed to produce the final numbers. We experimentally verify that other answer. While arithmetic can be solved using cal- LLMs, such as Bloom, OPT, GPT-NeoX, and culators or programs easily, the ability to perform Pythia, finetuned on the same arithmetic dataset, arithmetic computation is a remarkable trait of hu- cannot match LLaMA’s arithmetic ability. man intelligence, and we anticipate LLMs should possess this ability as well. 3.2 Learnability of Arithmetic Tasks Previous studies have evaluated the arithmetic Wies et al. (2022) have provided a theoretical anal- abilities of LLMs. Nogueira et al. (2021) have ysis on the use of intermediate supervision for solv- evaluated addition and subtraction tasks. Muffo ing composite tasks. Specifically, they have shown et al. (2022) have further examined 2-digit multi- that for any family of tasks which on the one hand, plication. Yuan et al. (2023) have tested different are unlearnable, and on the other hand, can be de- types of arithmetic operations. CoT seems to be composed into a polynomial number of simple sub- a promising solution for arithmetic computation tasks, unlearnable composite problems can become as well. Similar to humans, autoregressive lan- learnable by using intermediate supervision or step- guage model may rely on intermediate supervision by-step CoT. to generate the final answer. Scratchpad (Nye et al., Building upon their analysis, we first experimen- 2021) finetunes the language models to produce tally categorize learnable and unlearnable tasks. In CoT before generating an answer, and has demon- the context of arithmetic computation, learnable Task Input Output Learnable Copying 59265395 59265395 Split 4536 4000 + 500 + 30 + 6 Comparison 8116449, 97863 8116449 > 97863 Ordering 3568, 9591, 8061 3568, 8061, 9591 Addition 1270769 + 264985867430 264987138199 Subtraction 40920 − 6173772696 −6173731776 Multiplication nD × 1D 591714761929184 × 4 2366859047716736 Division nD ÷ 1D 339229815457 ÷ 4 84807453864 R 1 Unlearnable Multiplication nD × mD 6983387 × 16919 118151924653 Division nD ÷ mD 64729486 ÷ 472 137138 R 350 Table 1: Summary and examples of learnable and unlearnable arithmetic tasks. For example, nD ÷ 1D means n-digit by 1-digit division, where n ≥ 1. Unlearnable tasks are mainly multi-digit multiplication and division where n, m > 1. There are some special cases mentioned in Appendix E. tasks generally refer to those for which the model working memory and performs poorly on compos- can be successfully trained to generate direct an- ite arithmetic tasks. This is particularly evident in swers, achieving sufficiently high accuracy within a the case of multiplication, which involves multiple predefined number of training epochs. Conversely, steps of addition. 
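As a concrete illustration of the tokenization property discussed in Section 3.1 (and Appendix B), the following is a minimal sketch, not taken from the paper's released code, of how one might verify that a tokenizer splits every number into individual digit tokens. The checkpoint name is only a placeholder assumption; any locally available tokenizer can be substituted.

```python
# Minimal sketch: check whether a tokenizer splits every number into
# single-digit tokens, the property Section 3.1 attributes to LLaMA.
# The model identifier below is a placeholder assumption.
from transformers import AutoTokenizer

def digits_tokenized_consistently(tokenizer, numbers) -> bool:
    """Return True if every digit of every number maps to its own token."""
    for n in numbers:
        s = str(n)
        # add_special_tokens=False keeps BOS/EOS out of the comparison
        ids = tokenizer.encode(s, add_special_tokens=False)
        pieces = [tokenizer.decode([i]).strip() for i in ids]
        pieces = [p for p in pieces if p]  # drop empty pieces (e.g. a leading space token)
        if pieces != list(s):
            return False
    return True

if __name__ == "__main__":
    samples = [7, 74, 748, 7481, 74815]  # mirrors the numbers compared in Table 5
    tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # assumed checkpoint name
    print(digits_tokenized_consistently(tok, samples))
```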
The inability of powerful mod- unlearnable tasks are those that the model strug- els like GPT-4 to directly solve unlearnable tasks gles to learn and generate direct answers correctly may suggest that generating direct answers for such even with extensive training. While the exact rea- tasks is extremely challenging, even with extensive son behind the varying learnability of tasks is not training. yet fully understood and requires further investiga- It is noteworthy that a task that is learnable for tion, we hypothesize that it is associated with the LLaMA may not necessarily be learnable for other complexity of the underlying pattern and the size LLMs, which is validated in our experiments in of working memory required for completing the Section 5.3. Furthermore, not all tasks classified as task (Bubeck et al., 2023). unlearnable are entirely impossible for the model We experimentally examine the learnability of to learn. For instance, 2-digit by 2-digit multi- these tasks by fine-tuning the model specifically for plication is considered an unlearnable task in our each task in a simplified synthetic environment (Ta- case. However, the model can still learn to generate ble 7). Our recognized learnable and unlearnable the direct answer by overfitting to the training set, tasks are listed in Table 1. which contains an exhaustive enumeration of all possible 2-digit multiplication. Nevertheless, the The categorization of tasks also aligns with hu- process takes nearly 10 epochs to achieve around man perception. With practice, humans can men- 90% accuracy. In contrast, by inserting our pro- tally calculate the addition and subtraction of two posed CoT before the final answer, the model can large numbers, writing down the final numerical achieve comparable accuracy in 2-digit multiplica- answer directly from the left (most significant fig- tion with only 1 epoch of training. These findings ure) to the right (least significant figure) without align with the claim (Wies et al., 2022) that the the need for sketchpad. However, mentally solving presence of intermediate supervision facilitates the large-number multiplication and division is undeni- learning process. ably a challenging task. We also observe that our classification of tasks 3.3 Addition and Subtraction is consistent with the performance of GPT-4. In particular, GPT-4 excels in generating direct an- Addition and subtraction tasks are learnable, as swers for large-number addition and subtraction. with supervised fine-tuning alone, the model ex- However, its accuracy significantly drops when it hibits a remarkable ability to accurately generate comes to multi-digit multiplication and division direct numerical answers. The model successfully tasks. Our observation aligns with the claim made captures the underlying patterns of the arithmetic by Bubeck et al. (2023) that GPT-4 has a short operations. This is evident from the model’s near- perfect accuracy on the unseen test set, despite 3.5 Division being trained on a very limited subset of the data. Similarly, we observe that n-digit by 1-digit divi- It is worth mentioning that addition and subtrac- sion is learnable. However, multi-digit division tion operations do not require the use of CoT. This is unlearnable. We design a novel CoT leverag- contrasts with previous studies that have employed ing a modified slow division method based on the CoT for addition and subtraction tasks (Lee and following recurrence equation Kim, 2023; Nye et al., 2021; Qian et al., 2022). 
R_j − D × (q_{n−(j+1)} × 10^j) = R_{j+1}

where R_j is the j-th partial remainder of the division, q_{n−(j+1)} is the digit of the quotient in position n−(j+1), numbered from least significant 0 to most significant n−1, n is the number of digits in the quotient, and D is the divisor. Specifically, the main idea is to subtract multiples of the divisor from the dividend until the remainder is less than the divisor.

Here is a detailed breakdown of the CoT used in Fig. 1. Consider the first iteration (first equation). The first step 8914 − 64 × 100 requires the model to copy the dividend and the divisor, and subsequently generate a number q_{n−(j+1)} × 10^j such that the product of q_{n−(j+1)} × 10^j and the divisor D is less than or equal to the partial remainder R_j. This inherently involves two learnable tasks: "n-digit by 1-digit multiplication" and "comparison". We experimentally show that this composite task is learnable. The second step 8914 − 64 × 100 = 8914 − 6400 mainly involves a "copying" task and an "n-digit by 1-digit multiplication" task. The third step 8914 − 6400 = 2514 leverages "subtraction". The process iterates until the leftover is less than the divisor, which implies the model has to implicitly learn comparison. Finally, the model generates the quotient by combining all q_{n−(j+1)}'s from previous iterations, which can be considered as the inverse of the "split" task, and finally copies the remainder if it is not zero.

A summary of prompts and expected outputs for various tasks is shown in Table 2.

3.4 Multiplication

We experimentally verify that n-digit by 1-digit multiplication is learnable. In contrast, multi-digit multiplication poses significant challenges for the model, suggesting it to be an unlearnable task. To overcome this issue, we adopt a similar strategy used in sketchpad (Nye et al., 2021), which fine-tunes the LLMs to generate CoT before generating the answer. Specifically, we propose a CoT that decomposes the multi-digit multiplication into a series of 5 learnable sub-tasks: (1) extraction: extract the arithmetic expression from the natural language instruction, (2) split: split the smaller number of the two into place values, (3) expansion: expand the sum based on the distributive property, (4) product: compute each product simultaneously, and (5) adding term by term: add the first two terms and copy the rest, until the final sum is obtained.

Consider the example in Fig. 1. Firstly, the arithmetic expression 397 × 4429 is extracted from the instruction, which can be considered as a "copying" task. Secondly, 397 × 4429 = 4429 × (300 + 90 + 7) involves two learnable tasks. The larger number of the two is placed in front and then the smaller one is split, which is similar to the "ordering" and "split" learnable tasks. The ordering ensures that there are fewer summation terms in the next step, thereby reducing the CoT length. Thirdly, the sum is expanded using the distributive law: 4429 × (300 + 90 + 7) = 4429 × 300 + 4429 × 90 + 4429 × 7, which is similar to a "copying" task. Next, 4429 × 300 + 4429 × 90 + 4429 × 7 = 1328700 + 398610 + 31003, where the products are computed at once by applying "n-digit by 1-digit multiplication" with zeros copied at the end of each product.

3.6 Settings

In this paper, we consider the addition and subtraction of two positive integers, each containing up to 16 digits. It is worth noting that the result of subtraction can be negative. To limit the maximum generated sequence length, we consider the multiplication of two positive integers whose product
Finally, we take the sum of the first tiplication of two positive integers whose product two terms at each step, and copy the rest terms, falls within 12 digits, and the division of two posi- leveraging “addition” and “copying”. Hence, a tive integers resulting in a quotient within 6 digits composite unlearnable task is broken down into where the dividend is less than 12 digits. Since simpler tasks that are all learnable. we focus on arithmetic tasks of integers, we aim Task Learnable Prompt CoT Target ADD ✓ 1463456 + 2107 ✗ 1463456 + 2107 = 1465563 SUB ✓ 2348233 minus 483579? ✗ 2348233 - 483579 = 1864654 MUL nD × 1D ✓ 593295 times 7 ✗ 593295 * 7 = 4153065 nD × mD ✗ Calculate 24 x 79 ✓ 24 * 79 = 24 * (70 + 9) = 24 * 70 + \ 24 * 9 = 1680 + 216 = 1896 DIV nD ÷ 1D ✓ Please tell 3651803/7 ✗ 3651803 / 7 = 521686 R 1 nD ÷ mD ✗ What is 2546/38? ✓ 2546 - 38 * 60 = 2546 - 2280 = 266 266 - 38 * 7 = 266 - 266 = 0 Therefore, 2546 / 38 = 67 Table 2: Examples of prompts and targets for fine-tuning LLaMA. “\nAnswer: ” is appended at the end of each prompt. It should be noted that there are a few special cases when CoT is not required (see Appendix E). to obtain the least positive remainder in the case adaptability to diverse question formats, such as when it is not divisible. randomly removing spaces between numbers and In Section 5.2, we present an analysis showcas- symbols in the arithmetic expression, replacing “*” ing the limited extrapolation capabilities of fine- with “x” or “times”, etc. tuned LLMs. Consequently, input data that falls Goat-7B can be easily fine-tuned using LoRA on outside the distribution of the training data is un- a 24GB VRAM GPU. In particular, the fine-tuning likely to yield reasonable answers. Our method process for a specific arithmetic sub-task, such as potentially applies to numbers with more digits, 8-digit addition using 100K instances, takes only though the training cost will increase correspond- approximately 1.5 hours on an A10 GPU to achieve ingly. near-perfect accuracy. The training hyperparame- ters are listed in Appendix A. 3.7 Dataset We generate the dataset synthetically using a 4 Experiments Python script. The dataset consists of around 1 mil- lion question-answer pairs. The answer contains We evaluate our model using BIG-bench arithmetic the proposed CoT as well as the final numerical out- dataset (Srivastava et al., 2022), as well as our extra put. The numbers are randomly generated, hence selected tasks. The results are shown in Table 3. ensuring a very low probability of instances being Notably, in a zero-shot setting, Goat-7B achieves duplicated, although small numbers may be sam- comparable or even higher accuracy on BIG-bench pled multiple times. We sample from log space to compared to the few-shot PaLM-540B. ensure the numbers are equally likely to be sampled from different orders of magnitude, which is simi- 4.1 Metric lar to the sampling method used by Lee and Kim (2023). The details of the dataset are presented in We first compute the accuracy based on the stan- Appendix F. dard exact string match (Appendix C). We observe that GPT-4’s accuracy under exact string match 3.8 Fine-tuning is almost identically zero on tasks involving large To enable the model to solve arithmetic problems numbers. 
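To make the decomposition of Section 3.4 and Table 2 concrete, here is a minimal sketch of how a multiplication CoT target string could be generated for the synthetic dataset. The function name and exact formatting are our own illustrative choices; the authors' released Python script may differ.

```python
# Minimal sketch of a CoT target for n-digit x m-digit multiplication,
# following the split / expansion / product / add-term-by-term steps
# described in Section 3.4. Names and formatting are illustrative only.
def multiplication_cot(a: int, b: int) -> str:
    big, small = max(a, b), min(a, b)            # "ordering": larger operand first
    # "split": decompose the smaller operand into place values, e.g. 397 -> 300 + 90 + 7
    parts = [int(d) * 10 ** i for i, d in enumerate(reversed(str(small))) if d != "0"]
    parts = sorted(parts, reverse=True)
    steps = [f"{a} * {b}"]
    steps.append(f"{big} * ({' + '.join(str(p) for p in parts)})")   # split
    steps.append(" + ".join(f"{big} * {p}" for p in parts))          # expansion
    products = [big * p for p in parts]
    steps.append(" + ".join(str(p) for p in products))               # products
    # "adding term by term": fold the sum two terms at a time
    while len(products) > 1:
        products = [products[0] + products[1]] + products[2:]
        steps.append(" + ".join(str(p) for p in products))
    return " = ".join(steps)

print(multiplication_cot(397, 4429))
# 397 * 4429 = 4429 * (300 + 90 + 7) = 4429 * 300 + 4429 * 90 + 4429 * 7
#            = 1328700 + 398610 + 31003 = 1727310 + 31003 = 1758313
```

The printed trace matches the multiplication example shown in Figure 1.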
However, in many cases where the fi- based on instructions and facilitate natural lan- nal answer is incorrect, the majority of digits in guage question answering, we generate hundreds the generated answer align with the target number, of instruction templates using ChatGPT (Table 6). with only a few digits being incorrect. Inspired During the instruction tuning process, we randomly by recent study on the emergent abilities of LLMs select a template for each arithmetic input from the (Schaeffer et al., 2023), we include a digit match training set, and fine-tune LLaMA-7B similar to metric that can reflect the per-token error rate of the method used in Alpaca (Taori et al., 2023). We the output, as each digit is uniquely represented by apply various techniques to enhance the model’s a token in LLaMA. Task BIG-bench Extra Tasks ADD 1D 2D 3D 4D 5D 8D+8D 16D+8D 16D+16D GPT-4 100/100 100/100 99.6/99.9 98.8/99.6 94.1/98.5 92.1/98.3 9.4/70.4 94.1/99.5 Goat-7B 100/100 100/100 99.4/99.8 98.3/99.5 98.1/99.4 97.8/99.4 97.1/99.6 97.6/99.7 SUB 1D 2D 3D 4D 5D 8D−8D 16D−8D 16D−16D GPT-4 100/100 100/100 99.2/99.6 98.9/99.6 92.4/98.1 70.5/91.5 10.6/68.8 59.6/88.2 Goat-7B 100/100 100/100 99.7/99.9 98.6/99.6 98.4/99.5 96.8/99.3 95.8/99.2 96.3/99.3 MUL 1D 2D 3D 4D 5D 1D×16D 4D×8D 6D×6D GPT-4 100/100 99.4/99.8 30.3/83.0 5.3/61.8 0.0/47.9 61.5/92.3 0.0/45.9 0.0/49.8 Goat-7B 100/100 100/100 97.8/99.4 96.9/99.2 96.7/99.3 99.7/99.9 88.1/97.8 96.8/99.5 DIV 1D 2D 3D 4D 5D 16D÷1D 6D÷3D 12D÷6D GPT-4 100/100 100/100 94.5/96.3 90.9/92.1 53.4/73.2 54.0/84.3 6.4/48.6 0.0/29.5 Goat-7B 100/100 100/100 99.5/99.7 99.0/99.5 96.5/98.1 99.0/99.7 94.1/96.1 89.3/93.5 Table 3: The result of GPT-4 and Goat-7B on BIG-bench Arithmetic sub-task and extra selected arithmetic tasks, using metrics Exact String Match/Digit Match (Appendix C), shown in percentage. We test GPT-4 and Goat with exactly the same questions and prompts. We evaluate GPT-4 using the API version on May 10th. For Big-bench tasks, nD refers the n-digit by n-digit operation, except for division where nD means n-digit by m-digit where m ≤ n. BIG-bench only includes division operation without remainder, whereas in extra tasks we include the cases where the remainder is not zero and ask GPT-4 to output the answer in "quotient R remainder" format. It should be noted that we exclude the BIG-bench test data from our training dataset as much as possible, although the overlap is unavoidable for operations involving small numbers. 4.2 Comparison unclear, one possible factor could be GPT-4’s in- consistent number tokenization (Table 5), which Comparing the performance of Goat and GPT-4 makes it difficult to align the corresponding digits for large-number multiplication and division may of two numbers. seem unfair, as GPT-4 generates direct answers while Goat relies on CoT. Hence, we also evalu- 5 Analysis ate GPT-4’s performance with CoT by appending “Solve it step by step” at the end of each prompt. By 5.1 Ablation study default, GPT-4 uses long multiplication and long division methods. However, we observe that gen- full CoT no split no expansion no adding term by term no CoT erating CoT only leads to marginal improvement 1.00 in accuracy. In some cases, the intermediate steps 0.75 from long multiplication and division are incorrect, but surprisingly the final answer is correct. This 0.50 implies that GPT-4 does not effectively take ad- vantage of intermediate supervision from CoT to 0.25 improve the final output. 
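The two metrics reported in Table 3 (and defined in Appendix C) can be made concrete with a small sketch. The paper computes the character error rate with the torchmetrics package; the hand-rolled edit distance below is only a self-contained stand-in for illustration, and the example strings are our own.

```python
# Minimal sketch of the evaluation metrics used in Table 3: exact string
# match and digit match (1 - character error rate). The paper uses the
# torchmetrics package for CER; this edit-distance version is a stand-in.
def exact_string_match(pred: str, target: str) -> float:
    return float(pred.strip() == target.strip())

def char_error_rate(pred: str, target: str) -> float:
    """Levenshtein distance between prediction and target, normalized by target length."""
    m, n = len(pred), len(target)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                               # deletion
                        dp[j - 1] + 1,                           # insertion
                        prev + (pred[i - 1] != target[j - 1]))   # substitution
            prev = cur
    return dp[n] / max(n, 1)

def digit_match(pred: str, target: str) -> float:
    return 1.0 - char_error_rate(pred, target)

print(exact_string_match("4021526460", "4021526460"))      # 1.0
print(round(digit_match("4021526960", "4021526460"), 2))   # one wrong digit out of ten -> 0.9
```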
We identify the following 3 common errors from GPT-4's solution, which result in incorrect final answers: (1) the alignment of corresponding digits, (2) copying of numbers, and (3) the intermediate result from n-digit by 1-digit multiplication.

Additionally, we observe that GPT-4 performs reasonably well on 8D+8D and 16D+16D tasks, but fails on most 16D+8D tasks, though intuitively 16D+8D should be relatively easier than 16D+16D. While the exact reason for this remains unclear, one possible factor could be GPT-4's inconsistent number tokenization (Table 5), which makes it difficult to align the corresponding digits of two numbers.

Figure 2: Accuracy (exact string match) against the number of samples seen during the training of the 4D × 4D task. Evaluated on the same randomly generated unseen test set using training checkpoints.

Here we want to study the usefulness and effectiveness of each intermediate decomposition step. Specifically, for multiplication (Fig. 2), we compare the accuracy of 4-digit by 4-digit multiplication by removing one particular step in the CoT, including split, expansion, and adding term by term (referring to Appendix G), as well as no CoT. For division (Fig. 3), we compare the accuracy of 6-digit by 3-digit division after removing the middle step that computes the product (referring to Appendix G), as well as no CoT. To minimize the impact caused by natural language, we conduct an ablation study in a simplified synthetic environment (Table 7).

The multiplication results suggest that the "adding term by term" step plays a crucial role in obtaining the final answer. In contrast, the "split" and "expand" steps have minimal impact, and can potentially be omitted for generating more concise CoT. This can be attributed to the nature of these two intermediate steps, which primarily involve simple and learnable tasks like copying and comparison. Nevertheless, we still retain these steps to ensure human interpretability.

Figure 3: Accuracy (exact string match) against the number of samples seen during the training of the 6D ÷ 3D task. Evaluated on the same randomly generated unseen test set using training checkpoints. (Curves: full CoT, no product, no CoT.)

Figure 4: Accuracy against the number of digits for the addition task. The model is trained up to 16D+16D, and tested on 17D+17D onward. (Curves: exact string match, digit match; x-axis: No. of Digits.)

reported in (Kim et al., 2021), highlighting a limitation of our fine-tuned model and underscoring the significance of training data distribution.

5.3 Comparison with Other LLMs

We conduct comprehensive experiments on a variety of LLMs, including Bloom, OPT, GPT-J, GPT-NeoX, and Pythia. These models are fine-tuned using the identical dataset as that for Goat, maintaining consistency in the training hyperparameters. Our experiment shows that they all struggle with arithmetic tasks. Even for tasks that are considered learnable for LLaMA, such as multi-digit addition, the loss during fine-tuning is significantly higher than that of LLaMA. This observation underscores the claim made in (Nogueira et al., 2021) that tokenization is a crucial factor in the performance of arithmetic tasks.

5.4 Few-shot Prompting with GPT-4

GPT-4 demonstrates powerful in-context learning abilities. We further examine the effectiveness of

The accuracy of exact string match without CoT remains consistently at zero for both 4D × 4D multiplication and 6D ÷ 3D division.
This further our proposed decomposition method for solving showcases the validity of our approach, as break- large-number multiplication and division by using ing down complex arithmetic tasks into a series few-shot prompting with GPT-4 (see Appendix H). of learnable tasks can indeed facilitate the training We observe that our decomposition method allows process for LLMs. GPT-4 to generate correct answers more frequently than using its default long multiplication and divi- 5.2 Extrapolation sion methods. This further supports the effective- Extrapolation refers to the ability of the model to ness and validity of our approach. Examples of the predict data that lies out-of-distribution (OOD) of prompt and output are shown in Appendix H. training data. We test addition for numbers larger 6 Limitations than those in the training data distribution. The re- sults reveal that the model has limited extrapolation Humans are capable of performing multiplication capabilities. There is a gradual drop in accuracy, and division on arbitrarily large numbers, providing as the test set deviates further from the training sufficient time and space for calculations. In con- set. This observation is consistent with the result trast, LLMs often suffer from extrapolation prob- lems. The models are unlikely to generate reason- Sébastien Bubeck, Varun Chandrasekaran, Ronen El- able answers if the input deviates significantly from dan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Pe- ter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, the distribution of training data. To enhance the Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, human interpretability of intermediate supervision, and Yi Zhang. 2023. Sparks of artificial general in- we use the straightforward CoT that follows simple telligence: Early experiments with gpt-4. basic arithmetic rules. However, this design may Wenhu Chen, Xueguang Ma, Xinyi Wang, and not be the most efficient way to facilitate the final William W Cohen. 2022. Program of thoughts answer generation. There are potentially more suit- prompting: Disentangling computation from reason- able multiplication and division algorithms for the ing for numerical reasoning tasks. arXiv preprint model to learn. Besides, our research only focuses arXiv:2211.12588. on elementary arithmetic operations involving inte- Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, gers. Nevertheless, we anticipate that our method Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. could be applicable to decimal computation as well. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. 7 Conclusion Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, In summary, we demonstrate the feasibility that Maarten Bosma, Gaurav Mishra, Adam Roberts, supervised fine-tuning alone can enable LLMs to Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling perform certain basic arithmetic operations with language modeling with pathways. arXiv preprint high accuracy. With our proposed CoT, our model arXiv:2204.02311. achieves state-of-the-art performance on various Hyung Won Chung, Le Hou, Shayne Longpre, Bar- elementary arithmetic tasks. Our research offers an ret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi excellent platform for investigating the mechanism Wang, Mostafa Dehghani, Siddhartha Brahma, et al. of working memory and the influence of intermedi- 2022. Scaling instruction-finetuned language models. 
ate supervision on text generation. Our method can arXiv preprint arXiv:2210.11416. be easily integrated with other instruction-tuned Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, LLMs and has the potential to further enhance Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias arithmetic reasoning abilities in solving math word Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math problems. word problems. arXiv preprint arXiv:2110.14168. Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, References Pengfei Liu, Yiming Yang, Jamie Callan, and Gra- ham Neubig. 2022. Pal: Program-aided language Stella Biderman, Hailey Schoelkopf, Quentin Anthony, models. arXiv preprint arXiv:2211.10435. Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mo- hammad Aflah Khan, Shivanshu Purohit, USVSN Sai Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wal- Prashanth, Edward Raff, et al. 2023. Pythia: A suite lace, Pieter Abbeel, Sergey Levine, and Dawn Song. for analyzing large language models across training 2023. Koala: A dialogue model for academic re- and scaling. arXiv preprint arXiv:2304.01373. search. Blog post, April, 1. Sidney Black, Stella Biderman, Eric Hallahan, Quentin Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Anthony, Leo Gao, Laurence Golding, Horace Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, He, Connor Leahy, Kyle McDonell, Jason Phang, and Weizhu Chen. 2021. Lora: Low-rank adap- Michael Pieler, Usvsn Sai Prashanth, Shivanshu Puro- tation of large language models. arXiv preprint hit, Laria Reynolds, Jonathan Tow, Ben Wang, and arXiv:2106.09685. Samuel Weinbach. 2022. GPT-NeoX-20B: An open- Jeonghwan Kim, Giwon Hong, Kyung-min Kim, Junmo source autoregressive language model. In Proceed- Kang, and Sung-Hyon Myaeng. 2021. Have you ings of BigScience Episode #5 – Workshop on Chal- seen that number? investigating extrapolation in lenges & Perspectives in Creating Large Language question answering models. In Proceedings of the Models, pages 95–136, virtual+Dublin. Association 2021 Conference on Empirical Methods in Natural for Computational Linguistics. Language Processing, pages 7031–7037, Online and Punta Cana, Dominican Republic. Association for Tom Brown, Benjamin Mann, Nick Ryder, Melanie Computational Linguistics. Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yu- Askell, et al. 2020. Language models are few-shot taka Matsuo, and Yusuke Iwasawa. 2022. Large lan- learners. Advances in neural information processing guage models are zero-shot reasoners. arXiv preprint systems, 33:1877–1901. arXiv:2205.11916. Soochan Lee and Gunhee Kim. 2023. Recursion of Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta thought: Divide and conquer reasoning with language Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola models. Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, and arXiv preprint arXiv:2302.04761. Kai-Wei Chang. 2022. A survey of deep learn- ing for mathematical reasoning. arXiv preprint Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, arXiv:2212.10535. Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Matteo Muffo, Aldo Cocco, and Enrico Bertino. 2022. Adrià Garriga-Alonso, et al. 2022. 
Beyond the Evaluating transformer language models on arith- imitation game: Quantifying and extrapolating the metic operations using number decomposition. In capabilities of language models. arXiv preprint Proceedings of the Thirteenth Language Resources arXiv:2206.04615. and Evaluation Conference, pages 291–297, Mar- seille, France. European Language Resources Asso- Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann ciation. Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Stanford alpaca: Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. An instruction-following llama model. GitHub repos- 2021. Investigating the limitations of transform- itory. ers with simple arithmetic tasks. arXiv preprint Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam arXiv:2102.13019. Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, 2022. Lamda: Language models for dialog applica- Henryk Michalewski, Jacob Austin, David Bieber, tions. arXiv preprint arXiv:2201.08239. David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. 2021. Show your work: Scratch- Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier pads for intermediate computation with language Martinet, Marie-Anne Lachaux, Timothée Lacroix, models. arXiv preprint arXiv:2112.00114. Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and effi- OpenAI. 2023. Gpt-4 technical report. cient foundation language models. arXiv preprint arXiv:2302.13971. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Al- Sandhini Agarwal, Katarina Slama, Alex Ray, et al. isa Liu, Noah A Smith, Daniel Khashabi, and Han- 2022. Training language models to follow instruc- naneh Hajishirzi. 2022. Self-instruct: Aligning lan- tions with human feedback. Advances in Neural guage model with self generated instructions. arXiv Information Processing Systems, 35:27730–27744. preprint arXiv:2212.10560. Baolin Peng, Chunyuan Li, Pengcheng He, Michel Gal- Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten ley, and Jianfeng Gao. 2023. Instruction tuning with Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. gpt-4. arXiv preprint arXiv:2304.03277. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903. Jing Qian, Hong Wang, Zekun Li, Shiyang Li, and Xifeng Yan. 2022. Limitations of language models Noam Wies, Yoav Levine, and Amnon Shashua. 2022. in arithmetic and symbolic induction. arXiv preprint Sub-task decomposition enables learning in sequence arXiv:2208.05051. to sequence tasks. arXiv preprint arXiv:2204.02892. Victor Sanh, Albert Webson, Colin Raffel, Stephen H Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine 2023. Baize: An open-source chat model with Chaffin, Arnaud Stiegler, Teven Le Scao, Arun parameter-efficient tuning on self-chat data. arXiv Raja, et al. 2021. Multitask prompted training en- preprint arXiv:2304.01196. ables zero-shot task generalization. arXiv preprint Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, arXiv:2110.08207. and Songfang Huang. 2023. How well do large lan- guage models perform in arithmetic tasks? arXiv Teven Le Scao, Angela Fan, Christopher Akiki, El- preprint arXiv:2304.02015. 
lie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.

Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. 2023. Are emergent abilities of large language models a mirage? arXiv preprint arXiv:2304.15004.

Li Yunxiang, Li Zihan, Zhang Kai, Dan Ruilong, and Zhang You. 2023. Chatdoctor: A medical chat model fine-tuned on llama model using medical domain knowledge. arXiv preprint arXiv:2303.14070.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Chi. 2022a. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625.

Hattie Zhou, Azade Nova, Hugo Larochelle, Aaron Courville, Behnam Neyshabur, and Hanie Sedghi. 2022b. Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066.

A Hyperparameters

Hyperparameter       Value
batch size           128
learning rate        0.0003
lora r               64
lora alpha           64
lora target module   q, v, k, o
lora dropout         0.05
epoch                1

Table 4: Hyperparameters for fine-tuning LLaMA-7B.

B Tokenization

Nogueira et al. (2021) demonstrate that models with inconsistent tokenization of numbers barely learn the addition of 2-digit numbers, and completely fail to learn the addition of larger numbers. Specifically, they have an accuracy of zero for 5 digits or more. They attribute this failure to the lack of systematic tokenization of individual digits. For instance, "123" might be tokenized as "12" and "3", while "234" might be tokenized as "2" and "34". Consequently, the model is required to learn that the embedding of a token may represent either a single digit or two digits and so on. Hence, it might be challenging for the model to learn to map an embedding to a number when the number of digits it represents changes irregularly. In Table 5, we compare number tokenization across different LLMs.

C Metric

Exact string match is defined as 1 if the output string exactly matches the target string, and 0 otherwise. Then we take the average of exact string match for each task. Char error rate (CER) is defined as the percentage of characters that were incorrectly predicted. We compute CER using the Python torchmetrics package. Then we define digit match accuracy as 1 − CER. We include this metric because, for difficult tasks, the exact string match could be identically zero, making it hard to evaluate the performance. In many cases, both GPT-4 and Goat may have very few incorrect digits in the middle of the generated answer, and the number of digits in the generated answer generally matches the target number.
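As a sketch of how the Table 4 hyperparameters could be expressed in code, the snippet below uses the peft library, assuming that "q, v, k, o" refer to LLaMA's attention projection modules (q_proj, k_proj, v_proj, o_proj) and that a Hugging Face checkpoint is available; the authors' actual training script may differ.

```python
# Minimal sketch of a LoRA setup matching Table 4, using the peft library.
# The target module names and checkpoint identifier are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=64,                      # lora r
    lora_alpha=64,             # lora alpha
    lora_dropout=0.05,         # lora dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # assumed checkpoint
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Training would then proceed with batch size 128, learning rate 3e-4, for 1 epoch (Table 4).
```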
Model Number Tokenization LLaMA 74815 [1, 29871, 29955, 29946, 29947, 29896, 29945] 7481 [1, 29871, 29955, 29946, 29947, 29896] 748 [1, 29871, 29955, 29946, 29947] 74 [1, 29871, 29955, 29946] 7 [1, 29871, 29955] GPT-4 74815 [20338, 868] 7481 [20338, 16] 748 74 7 Bloom 74815 [88241, 2057] 7481 [88241, 20] 748 74 7 OPT 74815 [2, 39373, 996] 7481 [2, 406, 34490] 748 [2, 39373] 74 [2, 5243] 7 [2, 406] Pythia 74815 [24, 2385, 1010] GPT-NeoX-20B 7481 [24, 34474] MPT-7B 748 [24, 2385] 74 7 GPT-J 74815 [48246, 1314] GPT-Neo 7481 [22, 40271] 748 74 7 ChatGLM 74815 [5, 25, 16, 23, 9, 15, 130001, 130004] 7481 [5, 25, 16, 23, 9, 130001, 130004] 748 [5, 25, 16, 23, 130001, 130004] 74 [5, 25, 16, 130001, 130004] 7 [5, 25, 130001, 130004] Table 5: Comparison of number tokenization of various LLMs. It should be noted that ChatGLM also splits each digit into an individual token. Evaluating ChatGLM’s arithmetic abilities will be left as future work. Index Template 1 {arithmetic} = 2 What is {arithmetic}? 3 Compute {arithmetic} 4 Solve {arithmetic} 5 Determine {arithmetic} 6 Find {arithmetic} 7 What is the result of {arithmetic}? 8 Please help me calculate {arithmetic}. 9 Solve the following problem: {arithmetic} 10 I am looking for the value of {arithmetic}. Can you help? 11 What is the numerical value of {arithmetic}? 12 Help me obtain {arithmetic} 13 Show me the result of {arithmetic}? 14 Kindly calculate {arithmetic} for me. 15 Determine the value for {arithmetic}. 16 Can you please compute {arithmetic}? 17 Find the numerical value of {arithmetic}? 18 I would appreciate it if you could assist me in calculating {arithmetic}. 19 Please work out {arithmetic}. 20 What is the answer to {arithmetic}?...... Table 6: Example templates to fine-tune arithmetic tasks with natural language instructions, generated by ChatGPT. During training, {arithmetic} is replaced by the randomly generated arithmetic expression, like 3425 ∗ 5823. D Simplified Synthetic Environment task is similar to learnable n-digit by 1-digit multiplication, with the zeros being copied at We use the simplified synthetic environment to the end of the product. study the effectiveness of various CoT, by avoiding many hard-to-control aspects of natural languages. The dividend is equal to the divisor. In that The difference between this and Goat is that we use case, the quotient is identically one. For ex- a more structured prompt without any instruction ample, 358 ÷ 358 = 1. template and a straightforward completion of the task. This enables easy comparison between the The dividend is less than the divisor. In model’s performance on different tasks, allowing that case, the quotient is zero and the re- us to examine the learnability of various sub-tasks mainder equals the dividend. For example, and explore the effectiveness of the proposed CoT. 423 ÷ 968 = 0 R 423. The input and output examples for the simplified synthetic environment are shown in Table 7. F Dataset In general, it is difficult to determine the optimal E Special Cases proportion for each task. The number and compo- In general, multi-digit multiplication and division sition of data samples also depend on the problem are considered unlearnable, and we use the decom- settings (see Section 3.6). We empirically find that position method to solve them. 
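Tying together the log-space sampling described in Section 3.7 and the instruction templates of Table 6 above, here is a minimal sketch of how a single training pair might be generated. The template list is truncated and the function names are our own illustrative choices; the released generation script may differ.

```python
# Minimal sketch of generating one instruction-tuning pair in the spirit of
# Sections 3.7-3.8: sample operands uniformly in log space so all magnitudes
# are covered, fill a random template from Table 6, and compute the target.
import math
import random

TEMPLATES = [
    "{arithmetic} = ",
    "What is {arithmetic}?",
    "Compute {arithmetic}",
    "Please help me calculate {arithmetic}.",
]

def sample_log_uniform(max_digits: int) -> int:
    """Sample an integer whose magnitude is uniform across orders of magnitude."""
    upper = math.log10(10 ** max_digits - 1)
    return int(10 ** random.uniform(0, upper))

def make_addition_example(max_digits: int = 16) -> dict:
    a, b = sample_log_uniform(max_digits), sample_log_uniform(max_digits)
    expr = f"{a} + {b}"
    prompt = random.choice(TEMPLATES).format(arithmetic=expr) + "\nAnswer: "
    return {"prompt": prompt, "target": f"{a} + {b} = {a + b}"}

print(make_addition_example())
```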
However, some n-digit by 1-digit multiplication and division may special cases within multi-digit multiplication and be easier than other tasks, as it requires fewer sam- division are learnable, and in these cases, we omit ples to reach the same level of accuracy as other CoT and generate the direct answer: tasks during task-specific fine-tuning in the simpli- fied synthetic environment. It is noteworthy that For multiplication, one of the two numbers the data samples are all randomly generated, so the contains only one non-zero digit, such as probability of the occurrence of duplicated samples 857483 × 400 = 342993200. This type of is very low for large numbers. Therefore, the train- Task CoT Prompt Target Addition ✗ 1463456 + 2107 1465563 Subtraction ✗ 2348233 - 483579 1864654 Multiplication nd × 1d ✗ 593295 * 7 4153065 nd × md ✓ 24 * 79 24 * (70 + 9) = 24 * 70 + 24 * 9 = 1680 + 216 = 1896 Division nd ÷ 1d ✗ 3651803 / 7 521686 R 1 nd ÷ md ✓ 2551 / 38 2546 - 38 * 60 = 2546 - 2280 = 266 266 - 38 * 7 = 266 - 266 = 0 Therefore, 2551 / 38 = 67 Table 7: Examples of input and output for training and testing in the simplified synthetic environment, which is used for testing the learnability of sub-tasks and ablation studies. Specifically, “+”, “-”, “*”, and “\” are used for addition, subtraction, multiplication, and division, respectively. Space is inserted between numbers and symbols. The input and output are formatted to mitigate the influence of natural language. Division n/m the ablation study as it is crucial for multi-digit 14.0% Addition 23.5% multiplication. Division n/1 7.5% Multiplication Calculate 397 x 4429 \nAnswer: Multiplication nxm Subtraction 23.9% 23.5% 397 × 4429 (1) Multiplication nx1 7.5% = 4429 × (300 + 90 + 7) (2) = 4429 × 300 + 4429 × 90 + 4429 × 7 Figure 5: Composition of tasks in the dataset. (3) ing loss can reflect the test accuracy on unseen the = 1328700 + 398610 + 31003 (4) test set, if the dataset is only trained for one epoch. = 1727310 + 31003 (5) Since the synthetic dataset can be generated very = 1758313 (6) easily, we first create a dataset that contains a suffi- cient number of data samples for training and then For division, the ablation study is performed by observe the training loss and apply early stopping. removing the middle step (bold) that computes the We observe that the training loss does not show any product for all iterations, while keeping other steps significant decrease after training on about one mil- unchanged. lion samples. It should be noted that convergence also depends on other hyper-parameters such as Division batch size and learning rate. Hence, it is recom- mended to use a dataset larger than what is neces- What is 8914/64? \nAnswer: sary and terminate the training process when the training loss no longer decreases. 8914 − 64 × 100 = 8914 − 6400 =2514 2514 − 64 × 30 = 2514 − 1920 =594 G Ablation Study 594 − 64 × 9 = 594 − 576 =18 We name the steps (shown in the box below) as Therefore, 8914 ÷ 64 = 139 R 18 (1) extraction, (2) split, (3) expansion, (4) product, and (5, 6,... ) adding term by term. The ablation study is performed by removing one particular step while keeping other steps unchanged. We exclude the (1) “extraction” and (4) “product” steps from H Few-shot Prompting with GPT-4 3-Shot prompting for division Prompt: Here are two examples showing 3-shot prompting Here are 3 examples of question-solution using GPT-4. pairs for solving arithmetic division prob- lem. 
Solve the last question using the same 3-Shot prompting for multiplication format given in the solutions. “question”: “440331 / 22 = ”, Prompt: “solution”: “440331 - 22 * 20000 = 440331 Here are 3 examples of question-solution - 440000 = 331\n331 - 22 * 10 = 331 - pairs for solving arithmetic multiplication 220 = 111\n111 - 22 * 5 = 111 - 110 = problems. Solve the last question using the 1\nTherefore, 440331 / 22 = 20015 R 1” same format given in the solutions. “question”: “115247727 / 970051 = ”, “question”: “235658 * 91204 = ”, “solution”: “115247727 - 970051 “solution”: “235658 * 91204 = 235658 * 100 = 115247727 - 97005100 = * (90000 + 1000 + 200 + 4) = 235658 * 18242627\n18242627 - 970051 * 10 = 90000 + 235658 * 1000 + 235658 * 200 + 18242627 - 9700510 = 8542117\n8542117 235658 * 4 = 21209220000 + 235658000 - 970051 * 8 = 8542117 - 7760408 = + 47131600 + 942632 = 21444878000 781709\nTherefore, 115247727 / 970051 = + 47131600 + 942632 = 21492009600 + 118 R 781709” 942632 = 21492952232” “question”: “4144442 / 234443 = ”, “question”: “59909636 * 6070 = ”, “solution”: “4144442 - 234443 * 10 = “solution”: “59909636 * 6070 = 59909636 * 4144442 - 2344430 = 1800012\n1800012 (6000 + 70) = 59909636 * 6000 + 59909636 - 234443 * 7 = 1800012 - 1641101 = * 70 = 359457816000 + 4193674520 = 158911\nTherefore, 4144442 / 234443 = 17 363651490520” R 158911” “question”: “7661 * 9461 = ”, “question”: “262122 / 591 = ” “solution”: “7661 * 9461 = 9461 * (7000 + 600 + 60 + 1) = 9461 * 7000 + 9461 * GPT-4: 600 + 9461 * 60 + 9461 * 1 = 66227000 To solve the division problem of 262122 + 5676600 + 567660 + 9461 = 71903600 divided by 591 in the same format, we’ll + 567660 + 9461 = 72471260 + 9461 = subtract multiples of 591 from 262122 until 72480721” we reach a number less than 591, then that’s our remainder. “question”: “58061 * 3569 = ”, So, “solution”: “262122 - 591 * 400 = 262122 - GPT-4: 236400 = 25722\n25722 - 591 * 40 = 25722 To solve the multiplication problem of - 23640 = 2082\n2082 - 591 * 3 = 2082 - 58061 multiplied by 3569 in the same for- 1773 = 309\nTherefore, 262122 / 591 = 443 mat, we’ll distribute 58061 across the sum R 309” of the digits of 3569 multiplied by their re- spective places. So, “solution”: “58061 * 3569 = 58061 * (3000 + 500 + 60 + 9) = 58061 * 3000 + 58061 * 500 + 58061 * 60 + 58061 * 9 = 174183000 + 29030500 + 3483660 + 522549 = 203213500 + 3483660 + 522549 = 206697160 + 522549 = 207219709”
