A Survey of Large Language Models PDF

1 A Survey of Large Language Models Wayne Xin Zhao, Kun Zhou*, Junyi Li*, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, J...

1 Wayne Zhang, arXiv:2303.18223v15 [cs.CL] 13 Oct 2024 1 “The L ANGUAGE express childhood however, ing and unless algorithms. to achieve communicate Technically, approaches In general, of word future 77 *37 *37 *37 *37 %(57 %(57 (a) Query=”Language Model” Fig. 1: The trends of the cumulative numbers of and “large language model” (since October 2019), the keyphrases in title or abstract by months. We models” have been explored at an earlier time. progress of LLMs. A sharp increase occurs after that contain “large language model” in title Task Specific task solving helper capacity n-gram models Statistical methods Fig. 2: An evolution process of the four generations Note that the time period for each stage may not date of the most representative studies at each two representative studies to name the two approaches: NPLM (“Natural language processing (almost) from scratch”). this figure. various NLP tasks. Furthermore, word2vec [19, 20] was proposed to build a simplified shallow neural network for learning distributed word representations, which were demonstrated to be very effective across a variety of NLP tasks. These studies have initiated the use of language models for representation learning (beyond word sequence modeling), having an important impact on the field of NLP. Pre-trained language models (PLM). As an early at- tempt, ELMo was proposed to capture context-aware word representations by first pre-training a bidirectional LSTM (biLSTM) network (instead of learning fixed word representations) and then fine-tuning the biLSTM network according to specific downstream tasks. Furthermore, based on the highly parallelizable Transformer architecture with self-attention mechanisms, BERT was proposed by pre-training bidirectional language models with specially have explored the performance limit by training an ever larger PLM (e.g., the 175B-parameter GPT-3 and the 540B- parameter PaLM). Although scaling is mainly conducted in model size (with similar architectures and pre-training tasks), these large-sized PLMs display different behaviors from smaller PLMs (e.g., 330M-parameter BERT and 1.5B- parameter GPT-2) and show surprising abilities (called emer- gent abilities ) in solving a series of complex tasks. For example, GPT-3 can solve few-shot tasks through in-context learning, whereas GPT-2 cannot do well. Thus, the research community coins the term “large language models (LLM)”1 for these large-sized PLMs [32–35], which attract increasing research attention (See Figure 1). A remarkable application of LLMs is ChatGPT2 that adapts the LLMs from the GPT series for dialogue, which presents an amazing conversation ability with humans. We can observe a sharp increase of the arXiv papers that are related to LLMs after the release of ChatGPT in Figure 1. As discussed before, language model is not a new tech- nical concept specially for LLMs, but has evolved with the advance of artificial intelligence over the decades. Early lan- guage models mainly aim to model and generate text data, while latest language models (e.g., GPT-4) focus on complex task solving. From language modeling to task solving, it is an important leap in scientific thinking, which is the key to understand the development of language models in the re- search history. From the perspective of task solving, the four generations of language models have exhibited different lev- els of model capacities. In Figure 2, we describe the evolu- tion process of language models in terms of the task solving capacity. At first, statistical language models mainly assisted in some specific tasks (e.g., retrieval or speech tasks), in which the predicted or estimated probabilities can enhance the performance of task-specific approaches. Subsequently, neural language models focused on learning task-agnostic representations (e.g., features), aiming to reduce the efforts for human feature engineering. Furthermore, pre-trained language models learned context-aware representations that can be optimized according to downstream tasks. For the latest generation of language model, LLMs are enhanced by exploring the scaling effect on model capacity, which can be considered as general-purpose task solvers. To summarize, in the evolution process, the task scope that can be solved by language models have been greatly extended, and the task performance attained by language models have been significantly enhanced. In the existing literature, PLMs have been widely dis- cussed and surveyed [36–39], while LLMs are seldom re- viewed in a systematic way. To motivate our survey, we first highlight three major differences between LLMs and PLMs. First, LLMs display some surprising emergent abilities that may not be observed in previous smaller PLMs. These abili- ties are key to the performance of language models on com- plex tasks, making AI algorithms unprecedently powerful and effective. Second, LLMs would revolutionize the way that humans develop and use AI algorithms. Unlike small PLMs, the major approach to accessing LLMs is through 1. Note that a LLM is not necessarily more capable and emergent abilities may not occur in some LLMs. 2. https://openai.com/blog/chatgpt/ conducts a literature review of the recent advances in LLMs from four major aspects, including pre-training (how to pre- train a capable LLM), adaptation (how to effectively adapt pre-trained LLMs for better use), utilization (how to use LLMs for solving various downstream tasks) and capability evaluation (how to evaluate the abilities of LLMs and existing empirical findings). We thoroughly comb the literature and summarize the key findings, techniques, and methods of LLMs. For this survey, we also create a GitHub project website by collecting the supporting resources for LLMs, at the link https://github.com/RUCAIBox/LLMSurvey. We are also aware of several related review articles on PLMs or LLMs [32, 36, 38, 39, 43, 48–54]. These papers discuss PLMs or some specific (or general) aspects Compared with them, we focus on the techniques and methods to develop and use LLMs and provide a relatively comprehensive reference to important aspects of LLMs. The remainder of this survey is organized as follows: Section 2 introduces the background for LLMs and the evo- lution of GPT-series models, followed by the summarizatio of available resources for developing LLMs in Section 3. Sections 4, 5, 6, and 7 review and summarize the recent progress from the four aspects of pre-training, adaptation, utilization, and capacity evaluation, respectively. Then, Sec- tion 8 discusses the practical guide for prompt design, and Section 9 reviews the applications of LLMs in several representative domains. Finally, we conclude the survey in Section 10 by summarizing the major findings and discuss the remaining issues for future work. 2 OVERVIEW In this section, we present an overview about the back- ground of LLMs and then summarize the technical evolu- tion of the GPT-series models. 2.1 Background for LLMs Typically, large language models (LLMs) refer to Transformer language models that contain hundreds of billions (or more) of parameters4 , which are trained on massive text data , such as GPT-3 , PaLM , Galactica , and LLaMA. LLMs exhibit strong capacities to un- derstand natural language and solve complex tasks (via text generation). To have a quick understanding of how LLMs work, this part introduces the basic background for LLMs, including scaling laws, emergent abilities and key techniques. Formulation of Scaling Laws for LLMs. Currently, LLMs are mainly built upon the Transformer architecture , where multi-head attention layers are stacked in a very deep neural network. Existing LLMs adopt similar Trans- former architectures and pre-training objectives guage modeling) as small language models. However, LLMs significantly extend the model size, data size, and total compute (orders of magnification). Extensive research has 4. In existing literature, there is no formal parameter scale for LLMs, since the model capacity is also related to data size and total compute. In this survey, we take a slightly loose definition of LLMs, and mainly focus on discussing language models with a model size larger than 10B. given an increase in compute budget, the KM scaling law favors a larger budget allocation in model size than the data size, while the Chinchilla scaling law argues that the two sizes should be increased in equal scales, i.e., having similar values for a and b in Equation (3). Discussion on Scaling Laws. After introducing the formu- lations, we continue to discuss scaling law in the following two aspects, to enhance its understanding: Predictable scaling. In practice, scaling law can be used to instruct the training of LLMs, and it has been proven feasible to reliably estimate the performance of larger mod- els based on that of smaller models, called predictable scal- ing. The benefits of predictable scaling for training LLMs are mainly twofold. Firstly, for large models, it is infeasible to rigorously examine various training tricks or variants, and it would be very helpful if experiences gained from small models could also apply to large models. For instance, small proxy models can be trained to find the optimal schedule of the data mixture for large models. Secondly, the training of large-scale models takes a long time, often suffering from issues such as training loss spike, and scaling law can be employed to monitor the training status of LLMs, e.g., identifying abnormal performance at an early time. Despite that scaling law characterizes a smooth trend of performance increase (or loss decrease), it also indicates that diminishing returns7 might occur as model scaling. An empirical study from the OpenAI team has shown that representation quality or semantic content can still effectively improve even if approaching the point of diminishing returns (i.e., approaching the irreducible loss). This finding suggests that training large models are promising for improving the performance of down- stream tasks. To further explore scaling effect, a potential issue is that the amount of available data for training LLMs is actually limited. With the ever-increasing model scale, the public text data would be soon “exhausted” for LLMs. Thus, it will be meaningful to study how scaling laws apply to a data-constrained regime , where data repetition or augmentation might be useful to alleviate data scarcity. Task-level predictability. Existing research of scaling laws are mostly conducted in terms of language modeling loss (e.g., per-token cross-entropy loss in nats ), while in practice we are more concerned about the performance of LLMs on actual tasks. Thus, a basic problem is that how the decrease of language modeling loss translates into the improvement of task performance. Intuitively, a model with a smaller language modeling loss tends to yield a better performance on downstream tasks, since language modeling loss can be considered as a general measure of the overall model capacity. GPT-4 has reported that some capabilities (e.g., coding ability) can be accurately predicted via scaling law. Despite that, readers should be aware that a direct decrease in language modeling loss does not always indicate an improvement of model performance on downstream tasks. Specially, the phenomenon of inverse scaling would occur for some tasks, where task performance surprisingly becomes worse as the language modeling loss decreases. Overall, it is more difficult to explore and 7. https://en.wikipedia.org/wiki/Diminishing returns multiple reasoning steps, e.g., mathematical word problems. In contrast, with the chain-of-thought (CoT) prompting strategy , LLMs can solve such tasks by utilizing the prompting mechanism that involves intermediate reasoning steps for deriving the final answer. This ability is speculated to be potentially obtained by training on code [33, 47]. An empirical study has shown that CoT prompting can bring performance gains (on arithmetic reasoning bench- marks) when applied to PaLM and LaMDA variants with a model size larger than 60B, while its advantage over the standard prompting becomes more evident when the model size exceeds 100B. Furthermore, the performance improvement with CoT prompting seems to be also varied for different tasks, e.g., GSM8K > MAWPS > SWAMP for PaLM. How Emergent Abilities Relate to Scaling Laws. In existing literature [30, 31, 34], scaling laws and emergent abilities provide two perspectives to understand the advantage of large models over small models. In general, scaling law (often measured by language modeling loss) describes pre- dictable performance relation with the potential effect of diminishing returns, while emergent abilities (often mea- sured by task performance) are unpredictable but very prof- itable once such abilities actually emerge. Since the two perspectives reflect different performance trends (continu- ous improvement v.s. sharp performance leap), they might lead to misaligned findings or observations. There are also extensive debates on the rationality of emergent abilities. A popular speculation is that emergent abilities might be partially attributed to the evaluation setting for special tasks (e.g., the discontinuous evaluation metrics) [70, 71]: when evaluation metrics are altered accordingly, the sharpness of the emergent ability curve would disappear. However, the performance of LLMs on most tasks are perceived by users naturally in a discontinuous way. For instance, end users prefer a reliable code generated by LLMs that can success- fully pass the test case, but are less interested in selecting a better code with fewer errors between two failed ones. More recently, a study proposes a new evaluation setting that can enlarge the resolution of task metrics, making task performance more predictable. Despite these efforts, more fundamental research (e.g., grokking10 ) about the working mechanism of LLMs is still in need to understand the emer- gence of certain abilities. The subtle relation between scaling law and emergent abilities can be explained by analogy with the ability acquisition of human11. Take the speaking ability as an example. For children, language development (espe- cially infants) can be also considered as a multi-level process where “emergent abilities” occur. Specially, the language ability would relatively stable within a time interval, but qualitative change only occurs when evolving into another ability level (e.g., from speaking simple words to speaking simple sentences). Such a learning process is essentially not smooth and stable (i.e., language ability does not develop at a constant rate over time), though a child actually grows 10. Grokking refers that “a pattern in the tion performance from random chance level to perfect quoted from the original paper. 11. This explanation is only for ease of understanding, not direct evidence to connect the two points. designs an effective tuning approach that enables LLMs to follow the expected instructions, which utilizes the tech- nique of reinforcement learning with human feedback [66, 79]. It incorporates human in the training loop with elaborately designed labeling strategies. ChatGPT is indeed developed on a similar technique to InstructGPT, which shows a strong alignment capacity in producing high-quality, harmless re- sponses, e.g., rejecting to answer insulting questions. Tools manipulation. In essence, LLMs are trained as text generators over massive plain text corpora, thus performing less well on the tasks that are not best expressed in the form of text (e.g., numerical computation). In addition, their capacities are also limited to the pre-training data, e.g., the inability to capture up-to-date information. To tackle these issues, a recently proposed technique is to employ external tools to compensate for the deficiencies of LLMs [80, 81]. For example, LLMs can utilize the calculator for accurate computation and employ search engines to retrieve unknown information. More recently, ChatGPT has enabled the mechanism of using external plugins (existing or newly created apps)12 , which are by analogy with the “eyes and ears” of LLMs. Such a mechanism can broadly expand the scope of capacities for LLMs. In addition, many other factors (e.g., the upgrade of hardware) also contribute to the success of LLMs. Currently, we limit our discussion to the major technical approaches and key findings for developing LLMs. 2.2 Technical Evolution of GPT-series Models Due to the excellent capacity in communicating with hu- mans, ChatGPT has ignited the excitement of the AI com- munity since its release. ChatGPT is developed based on the powerful GPT model with specially optimized conversation capacities. Considering the ever-growing interest in Chat- GPT and GPT models, we add a special discussion about the technical evolution of the GPT-series models, to briefly sum- marize the progress how they have been developed in the past years. Meanwhile, we drew a schematic diagram de- picting the technological evolution of the GPT-series models in Figure 4. The basic principle underlying GPT models is to compress the world knowledge into the decoder-only Transformer model by language modeling, such that it can recover (or memorize) the semantics of world knowledge and serve as a general-purpose task solver. Two key points to the success are (I) training decoder-only Transformer language models that can accurately predict the next word and (II) scaling up the size of language models. Overall, the research of OpenAI on LLMs can be roughly divided into the following stages13. Early Explorations. According to one interview with Ilya Sutskever14 (a co-founder and chief scientist of OpenAI), the idea of approaching intelligent systems with language 12. https://openai.com/blog/chatgpt-plugins 13. Note that the discussion of this part can The overall viewpoints and summaries are made based standing of the survey authors by reading the papers, interview reports and APIs released by OpenAI. 14. https://hackernoon.com/an-interview-with-ilya-sutskever-co- founder-of-openai TABLE 1: Statistics of large language models (having capacity evaluation, pre-training data scale (either In this table, we only include LLMs with a public date when the corresponding paper was officially publicly accessible while “Closed Source” means subsequent fine-tuning: IT denotes instruction tuning “Evaluation” indicates whether the model has denotes in-context learning and CoT denotes chain-of-thought. Release Size Base Model Time (B) T5 Oct-2019 11 - - - mT5 Oct-2020 13 - - - PanGu-α Apr-2021 13* - - - CPM-2 Jun-2021 198 - - - T0 Oct-2021 11 T5 ✓ - CodeGen Mar-2022 16 - GPT-NeoX-20B Apr-2022 20 - - - Tk-Instruct Apr-2022 11 T5 ✓ - UL2 May-2022 20 - - - OPT May-2022 175 - NLLB Jul-2022 54.5 - - - CodeGeeX Sep-2022 13 - GLM Oct-2022 130 - Flan-T5 Oct-2022 11 T5 ✓ - BLOOM Nov-2022 176 - mT0 Nov-2022 13 mT5 ✓ - Galactica Nov-2022 120 - BLOOMZ Nov-2022 176 BLOOM ✓ - Publicly OPT-IML Dec-2022 175 OPT ✓ - Available LLaMA Feb-2023 65 - Pythia Apr-2023 12 - CodeGen2 May-2023 16 - StarCoder May-2023 15.5 - - - LLaMA2 Jul-2023 70 - ✓ ✓ Baichuan2 Sep-2023 13 - QWEN Sep-2023 14 - ✓ ✓ FLM Sep-2023 101 - Skywork Oct-2023 13 - GPT-3 May-2020 175 - - GShard Jun-2020 600 - - Codex Jul-2021 12 GPT-3 - ERNIE 3.0 Jul-2021 10 - - Jurassic-1 Aug-2021 178 - - HyperCLOVA Sep-2021 82 - - FLAN Sep-2021 137 LaMDA-PT ✓ - Yuan 1.0 Oct-2021 245 - - Anthropic Dec-2021 52 - - WebGPT Dec-2021 175 GPT-3 - ✓ Gopher Dec-2021 280 - - ERNIE 3.0 Titan Dec-2021 260 - - - GLaM Dec-2021 1200 - - LaMDA Jan-2022 137 - - MT-NLG Jan-2022 530 - - Closed AlphaCode Feb-2022 41 - - Source InstructGPT Mar-2022 175 GPT-3 ✓ Chinchilla Mar-2022 70 - - PaLM Apr-2022 540 - - AlexaTM Aug-2022 20 - - Sparrow Sep-2022 70 - - ✓ WeLM Sep-2022 10 - - U-PaLM Oct-2022 540 PaLM - - Flan-PaLM Oct-2022 540 PaLM ✓ - Flan-U-PaLM Oct-2022 540 U-PaLM ✓ GPT-4 Mar-2023 - - ✓ PanGu-Σ Mar-2023 1085 PanGu-α - PaLM2 May-2023 16 - ✓ T5 2019 2020 2021 1-4 GPT-3 Codex 5-8 T0 Anthropic HyperCLOVA WebGPT Ernie 3.0 Titan InstructGPT Gopher CodeGen MT-NLG GLaM OPT CodeGeeX GPT-NeoX-20B BLOOM Tk-Instruct GLM mT0 Cohere AlexaTM BLOOMZ WeLM Galatica OPT-IML Fig. 3: A timeline of existing large language models established mainly according to the release date there was no corresponding paper, we set the date We mark the LLMs with publicly available model checkpoints include the LLMs with publicly reported evaluation GPT-1 GPT-2 2018.06 2019.02 decoder-only architecture unsupervised multitask learner generative pre-training scaling the model size code-davinci-002 +instruction text-davinci-002 2022.03 2022.03 capable code model instruction following Fig. 4: A brief illustration for the technical evolution blog articles and official APIs from OpenAI. Here, statement that a new model is developed based on lines denote a relatively weaker evolution relation. demonstrates a key capacity leap by scaling of the (nearly same) generative pre-training architecture. GPT-3. GPT-3 was released in 2020, which scaled the model parameters to an ever larger size of 175B. In the GPT-3’s paper, it formally introduced the concept of in-context learning (ICL)17 , which utilizes LLMs in a few- shot or zero-shot way. ICL can teach (or instruct) LLMs to understand the tasks in the form of natural language text. With ICL, the pre-training and utilization of LLMs converge to the same language modeling paradigm: pre-training pre- dicts the following text sequence conditioned on the context, while ICL predicts the correct task solution, which can be also formatted as a text sequence, given the task and demonstrations. GPT-3 not only demonstrates very ex- 17. GPT-2 essentially used ICL for unsupervised task learning, though it wasn’t called ICL at that time. ing on code data and alignment with human preference, which are detailed as follows. Training on code data. A major limitation GPT-3 model (pre-trained on plain text) lies in the reasoning ability on complex tasks, e.g., completing code and solving math problems. To enhance this ability, Codex was introduced by OpenAI in July 2021, which was a GPT model fine-tuned on a large corpus of GitHub code. It demonstrated that Codex can solve very difficult programming problems, and also lead to a significant per- formance improvement in solving math problems. Further, a contrastive approach to training text and code embedding was reported in January 2022, which was shown to improve a series of related tasks (i.e., linear- probe classification, text search and code search). Actually, the GPT-3.5 models are developed based on a code-based GPT model (i.e., code-davinci-002), which indicates that training on code data is a very useful practice to improve the model capacity of GPT models, especially the reasoning ability. Furthermore, there is also a speculation that train- ing on code data can greatly increase the chain-of-thought prompting abilities of LLMs , while it is still worth further investigation with more thorough verification. Human alignment. The related research of human alignment can be dated back to the year 2017 (or earlier) for OpenAI: a blog article entitled “learning preferences”18 was posted on the OpenAI blog describing a work that applied reinforcement learning (RL) to learn from the preference comparisons annotated by humans (similar to the reward training step in the aligning algorithm of InstructGPT in Figure 12). Shortly after the release of this RL paper , the paper of the Proximal Policy Optimiza- tion (PPO) was published in July 2017, which now has been the foundational RL algorithm for learning from hu- man preferences. Later in January 2020, GPT-2 was fine- tuned using the aforementioned RL algorithms [79, 128], which leveraged human preferences to improve the capac- ities of GPT-2 on NLP tasks. In the same year, another work trained a summarization model for optimizing human preferences in a similar way. Based on these prior work, InstructGPT was proposed in January 2022 to improve the GPT-3 model for human alignment, which formally established a three-stage reinforcement learning from human feedback (RLHF) algorithm. Note that it seems that the wording of “instruction tuning” has seldom been used in OpenAI’s paper and documentation, which is substituted by supervised fine-tuning on human demonstrations (i.e., the first step of the RLHF algorithm ). In addition to improving the instruction following capacity, the RLHF algorithm is particularly useful to mitigate the issues of generating harm or toxic content for LLMs, which is key to the safe deploy- ment of LLMs in practice. OpenAI describes their approach to alignment research in a technical article , which has summarized three promising directions: “training AI systems to use human feedback, to assist human evaluation and to do alignment research”. These enhancement techniques lead to the improved GPT-3 models with stronger capacities, which are called 18. https://openai.com/research/learning-from-human-preferences November 2023, OpenAI released an upgraded generation of GPT-4 model at DevDay, named GPT-4 Turbo, with a series of technical improvements. GPT-4 Turbo is featured by the improved model capacity (more capable than GPT- 4), the extended knowledge source (up to April 2023), long context window (up to 128k tokens), optimized model performance (cheaper price), and other useful functional- ity updates (function call, reproducible outputs, etc.). At the same time, Assistants API was launched to ease the rapid development of agent-like assistants. With this API, developers can easily create goal-oriented assistants within their applications, by leveraging specific instruction, extra knowledge and tool use. Furthermore, multimodal capaci- ties (see, hear, and speak) were also enhanced in this new release, supported by GPT-4 Turbo with vision, DALL·E 3, Text-to-speech (TTS), and Listen to voice samples. These improvements have greatly extended the capacity scope and enhanced the task performance of GPT models. More impor- tantly, the application ecosystem will be greatly strength- ened with the technology upgrade in improved models, APIs, and functionalities. Despite the huge progress, there are still limitations with these superior LLMs, e.g., generating hallucinations with factual errors or potentially risky response within some specific context. More limitations or issues of LLMs will be discussed in Section 7. It poses long-standing research challenges to develop more capable, safer LLMs. From the perspective of engineering, OpenAI has adopted an iterative deployment strategy to develop the models and products by following a five-stage development and deployment life-cycle, which aims to effectively reduce the potential risks of using the models. In the following, we will dive into the technical details in order to have a specific understanding of how they have been developed. 3 R ESOURCES OF LLM S It is by no means an easy job to develop or reproduce LLMs, considering the challenging technical issues and huge de- mands of computation resources. A feasible way is to learn experiences from existing LLMs and reuse publicly avail- able resources for incremental development or experimental study. In this section, we briefly summarize the publicly available resources for developing LLMs, including model checkpoints (or APIs), corpora and libraries. 3.1 Publicly Available Model Checkpoints or APIs Given the huge cost of model pre-training, well-trained A Survey of Large Language Models Xin Zhao, Kun Zhou*, Junyi Li*, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie and Ji-Rong Wen Abstract—Ever since the Turing Test was proposed in the 1950s, humans have explored the mastering of language intelligence by machine. Language is essentially a complex, intricate system of human expressions governed by grammatical rules. It poses a significant challenge to develop capable artificial intelligence (AI) algorithms for comprehending and grasping a language. As a major approach, language modeling has been widely studied for language understanding and generation in the past two decades, evolving from statistical language models to neural language models. Recently, pre-trained language models (PLMs) have been proposed by pre- training Transformer models over large-scale corpora, showing strong capabilities in solving various natural language processing (NLP) tasks. Since the researchers have found that model scaling can lead to an improved model capacity, they further investigate the scaling effect by increasing the parameter scale to an even larger size. Interestingly, when the parameter scale exceeds a certain level, these enlarged language models not only achieve a significant performance improvement, but also exhibit some special abilities (e.g., in- context learning) that are not present in small-scale language models (e.g., BERT). To discriminate the language models in different parameter scales, the research community has coined the term large language models (LLM) for the PLMs of significant size (e.g., containing tens or hundreds of billions of parameters). Recently, the research on LLMs has been largely advanced by both academia and industry, and a remarkable progress is the launch of ChatGPT (a powerful AI chatbot developed based on LLMs), which has attracted widespread attention from society. The technical evolution of LLMs has been making an important impact on the entire AI community, which would revolutionize the way how we develop and use AI algorithms. Considering this rapid technical progress, in this survey, we review the recent advances of LLMs by introducing the background, key findings, and mainstream techniques. In particular, we focus on four major aspects of LLMs, namely pre-training, adaptation tuning, utilization, and capacity evaluation. Furthermore, we also summarize the available resources for developing LLMs and discuss the remaining issues for future directions. This survey provides an up-to-date review of the literature on LLMs, which can be a useful resource for both researchers and engineers. Index Terms—Large Language Models; Emergent Abilities; Adaptation Tuning; Utilization; Alignment; Capacity Evaluation ✦ I NTRODUCTION limits of my language mean the limits of my world.” extensive attention in the literature, which can be divided —Ludwig Wittgenstein into four major development stages: Statistical language models (SLM). SLMs [6–9] are de- veloped based on statistical learning methods that rose in is a prominent ability in human beings to and communicate, which develops in early and evolves over a lifetime [3, 4]. Machines, the 1990s. The basic idea is to build the word prediction model based on the Markov assumption, e.g., predicting the next word based on the most recent context. The SLMs with cannot naturally grasp the abilities of understand- a fixed context length n are also called n-gram language communicating in the form of human language, models, e.g., bigram and trigram language models. SLMs equipped with powerful artificial intelligence (AI) have been widely applied to enhance task performance It has been a longstanding research challenge in information retrieval (IR) [10, 11] and natural language this goal, to enable machines to read, write, and processing (NLP) [12–14]. However, they often suffer from like humans. the curse of dimensionality: it is difficult to accurately language modeling (LM) is one of the major estimate high-order language models since an exponential to advancing language intelligence of machines. number of transition probabilities need to be estimated. LM aims to model the generative likelihood Thus, specially designed smoothing strategies such as back- sequences, so as to predict the probabilities of off estimation and Good–Turing estimation have (or missing) tokens. The research of LM has received been introduced to alleviate the data sparsity problem. Neural language models (NLM). NLMs [1, 17, 18] charac- Version: v14 (major update on September 25, 2024). terize the probability of word sequences by neural networks, GitHub link: https://github.com/RUCAIBox/LLMSurvey Chinese book link: lmbook-zh.github.io e.g., multi-layer perceptron (MLP) and recurrent neural net- * K. Zhou and J. Li contribute equally to this work. works (RNNs). As a remarkable contribution, the work in The authors are mainly with Gaoling School of Artificial Intelligence and introduced the concept of distributed representation of School of Information, Renmin University of China, Beijing, China; Jian- words and built the word prediction function conditioned Yun Nie is with DIRO, Université de Montréal, Canada. Contact e-mail: [email protected] on the aggregated context features (i.e., the distributed The authors of this survey paper reserve all the copyrights of the fig- word vectors). By extending the idea of learning effective ures/tables, and any use of these materials for publication purpose must be features for text data, a general neural network approach officially granted by the survey authors. was developed to build a unified, end-to-end solution for 2 *37 *37 //D0$ //D0$ &KDW*37 &KDW*37 *37 *37 ,QVWUXFW*37 ,QVWUXFW*37 //D0$ //D0$ &RGH[ &RGH[ &KDW*37 &KDW*37 *37 *37 ,QVWUXFW*37 ,QVWUXFW*37 &RGH[ &RGH[ 77 *37 *37 7L7LPPH H 7L7LPPHH (b) Query=”Large Language Model” arXiv papers that contain the keyphrases “language model” (since June 2018) respectively. The statistics are calculated using exact match by querying set different x-axis ranges for the two keyphrases, because “language We label the points corresponding to important landmarks in the research the release of ChatGPT: the average number of published arXiv papers or abstract goes from 0.40 per day to 8.58 per day (Figure 1(b)). General-purpose Transferable task solver Task-agnostic NLP task solver GPT-3/4!ChatGPT!Claude Scaling language models feature learner ELMO!BERT!GPT-1/2 Prompt based completion Word2vec (NPLM)!NLPS Context-aware representations Solve various real-world tasks Static word representations Pre-training + fine-tuning Solve various NLP tasks LLM Neural context modeling Probability estimation Solve typical NLP tasks Pre-trained LM Assist in specific tasks Neural LM Statistical LM 1990s 2013 2018 2020 of language models (LM) from the perspective of task solving capacity. be very accurate, and we set the time mainly according to the publish stage. For neural language models, we abbreviate the paper titles of (“A neural probabilistic language model”) and NLPS Due to the space limitation, we don’t list all representative studies in designed pre-training tasks on large-scale unlabeled cor- pora. These pre-trained context-aware word representations are very effective as general-purpose semantic features, which have largely raised the performance bar of NLP tasks. This study has inspired a large number of follow-up work, which sets the “pre-training and fine-tuning” learning paradigm. Following this paradigm, a great number of stud- ies on PLMs have been developed, introducing either differ- ent architectures [24, 25] (e.g., GPT-2 and BART ) or improved pre-training strategies [27–29]. In this paradigm, it often requires fine-tuning the PLM for adapting to different downstream tasks. Large language models (LLM). Researchers find that scaling PLM (e.g., scaling model size or data size) often leads to an improved model capacity on downstream tasks (i.e., following the scaling law ). A number of studies 3 the prompting interface (e.g., GPT-4 API). Humans have to understand how LLMs work and format their tasks in a way that LLMs can follow. Third, the development of LLMs no longer draws a clear distinction between research and en- gineering. The training of LLMs requires extensive practical experiences in large-scale data processing and distributed parallel training. To develop capable LLMs, researchers have to solve complicated engineering issues, working with engineers or being engineers. Nowadays, LLMs are posing a significant impact on the AI community, and the advent of ChatGPT and GPT-4 leads to the rethinking of the possibilities of artificial general intelligence (AGI). OpenAI has published a technical article entitled “Planning for AGI and beyond”, which discusses the short-term and long-term plans to approach AGI , and a more recent paper has argued that GPT-4 might be considered as an early version of an AGI system. The research areas of AI are being revolutionized by the rapid progress of LLMs. In the field of NLP, LLMs can serve as a general-purpose language task solver (to some extent), and the research paradigm has been shifting towards the use of LLMs. In the field of IR, traditional search engines are challenged by the new information seeking way through AI chatbots (i.e., ChatGPT), and New Bing3 presents an initial attempt that enhances the search results based on LLMs. In the field of CV, the researchers try to develop ChatGPT-like vision-language models that can better serve multimodal dialogues [42–45], and GPT-4 has supported multi- modal input by integrating the visual information. This new wave of technology would potentially lead to a prosperous ecosystem of real-world applications based on LLMs. For instance, Microsoft 365 is being empowered by LLMs (i.e., Copilot) to automate the office work, and OpenAI supports the use of plugins in ChatGPT for implementing special functions. Despite the progress and impact, the underlying prin- ciples of LLMs are still not well explored. Firstly, it is mysterious why emergent abilities occur in LLMs, instead of smaller PLMs. As a more general issue, there lacks a deep, detailed investigation of the key factors that contribute to the superior abilities of LLMs. It is important to study when and how LLMs obtain such abilities. Although there are some meaningful discussions about this problem [31, 47], more principled investigations are needed to uncover the “secrets“ of LLMs. Secondly, it is difficult for the research community to train capable LLMs. Due to the huge de- mand of computation resources, it is very costly to carry out repetitive, ablating studies for investigating the effect of various strategies for training LLMs. Indeed, LLMs are mainly trained by industry, where many important training details (e.g., data collection and cleaning) are not revealed to the public. Thirdly, it is challenging to align LLMs with human values or preferences. Despite the capacities, LLMs are also likely to produce toxic, fictitious, or harmful con- tents. It requires effective and efficient control approaches to eliminating the potential risk of the use of LLMs. Faced with both opportunities and challenges, it needs more attention on the research and development of LLMs. In than a small PLM, order to provide a basic understanding of LLMs, this survey 3. https://www.bing.com/new 4 shown that scaling can largely improve the model capacity of LLMs [26, 55, 56]. Thus, it is useful to establish a quantita- tive approach to characterizing the scaling effect. Next, we introduce two representative scaling laws for Transformer language models [30, 34]. KM scaling law5. In 2020, Kaplan et al. (the OpenAI team) firstly proposed to model the power-law relationship of model performance with respective to three major factors, namely model size (N ), dataset size (D), and the amount of training compute (C ), for neural language models. Given a compute budget c, they empirically presented three basic formulas for the scaling law6 : either α of LLMs. Nc N L(N ) = , αN ∼ 0.076, Nc ∼ 8.8 × 1013 (1) N α Dc D L(D) = , αD ∼ 0.095, Dc ∼ 5.4 × 1013 D αC Cc L(C) = , αC ∼ 0.050, Cc ∼ 3.1 × 108 C n where L(·) denotes the cross entropy loss in nats, and a follow-up study from OpenAI has shown that the language modeling loss can be decomposed into two parts, namely irreducible loss (the entropy of the true data distri- bution) and reducible loss (an estimate of the KL divergence between the true and model distributions). The three laws were derived by fitting the model performance with varied data sizes (22M to 23B tokens), model sizes (768 to 1.5B non- embedding parameters) and training compute, under some assumptions (e.g., the analysis of one factor should be not bottlenecked by the other two factors). They showed that the model performance has a strong dependence relation on the three factors. Chinchilla scaling law. As another representative study, Hoffmann et al. (the Google DeepMind team) proposed an alternative form for scaling laws to instruct the compute- optimal training for LLMs. They conducted rigorous exper- iments by varying a larger range of model sizes (70M to 16B) and data sizes (5B to 500B tokens), and fitted a similar scaling law yet with different coefficients as below : A B L(N, D) = E + + β, (2) Nα D where E = 1.69, A = 406.4, B = 410.7, α = 0.34 and β = 0.28. By optimizing the loss L(N, D) under the con- straint C ≈ 6N D, they showed that the optimal allocation of compute budget to model size and data size can be derived as follows: a b C C Nopt (C) = G , Dopt (C) = G−1 , (3) 6 6 α β where a = α+β , b = α+β and G is a scaling coefficient that (e.g., lan- can be computed by A, B , α and β. As analyzed in , 5. Since there was not a model trained following this law in the original paper, we took the last names of the two co-first authors to name this scaling law. consensus on the minimum 6. Here, Nc , Dc and Cc are measured in the number of non- embedding parameters, the number of training tokens and the number of FP-days, respectively. According to the original paper , Cc and C should be denoted by Ccmin and Cmin , corresponding to the optimal use of compute. We use the simplified notations for ease of discussions. 5 characterize task-level scaling laws, since it might be also dependent on task-related information (task metric, task difficulty, etc.). Furthermore, some capacities (e.g., in-context learning ) are unpredictable according to the scaling law, which can be observed only when the model size exceeds a certain level (as discussed below). Emergent Abilities of LLMs. In the literature , emergent abilities of LLMs are formally defined as “the abilities that are not present in small models but arise in large models”, which is one of the most prominent features that distin- guish LLMs from previous PLMs. It further introduces a notable characteristic when emergent abilities occur : performance rises significantly above random when the scale reaches a certain level. By analogy, such an emergent pattern has close connections with the phenomenon of phase transition in physics [31, 63]. In principle, emergent abilities can be defined in relation to some complex tasks [31, 64], while we are more concerned with general abilities that can be applied to solve a variety of tasks. Here, we briefly introduce three typical emergent abilities for LLMs and representative models that possess such an ability8. In-context learning. The in-context learning (ICL) ability is formally introduced by GPT-3 : assuming that the language model has been provided with a natural language instruction and/or several task demonstrations, it can gen- erate the expected output for the test instances by com- pleting the word sequence of input text, without requiring additional training or gradient update9. Among the GPT- series models, the 175B GPT-3 model exhibited a strong ICL ability in general, but not the GPT-1 and GPT-2 models. Such an ability also depends on the specific downstream task. For example, the ICL ability can emerge on the arithmetic tasks (e.g., the 3-digit addition and subtraction) for the 13B GPT-3, but 175B GPT-3 even cannot work well on the Persian QA task. Instruction following. By fine-tuning with a mixture of multi-task datasets formatted via natural language descrip- tions (called instruction tuning), LLMs are shown to perform well on unseen tasks that are also described in the form of instructions [28, 66, 67]. With instruction tuning, LLMs are enabled to follow the task instructions for new tasks without using explicit examples, thus having an improved generalization ability. According to the experiments in , instruction-tuned LaMDA-PT started to significantly outperform the untuned one on unseen tasks when the model size reached 68B, but not for 8B or smaller model sizes. A recent study found that a model size of 62B is at least required for PaLM to perform well on various tasks in four evaluation benchmarks (i.e., MMLU, BBH, TyDiQA and MGSM), though a much smaller size might suffice for some specific tasks (e.g., MMLU). Step-by-step reasoning. For small language models, it is usually difficult to solve complex tasks that involve 8. It is difficult to accurately examine the critical size for emergent abilities of LLMs (i.e., the minimum size to possess an ability), since it might vary for different models or tasks. Also, existing studies often test emergent abilities on very limited model sizes for a specific LLM. For example, PaLM is often tested with three sizes of 8B, 62B and 540B. It is unclear about the model performance of the untested sizes. 9. In a recent study , it also shows that in-context learning implic- itly performs meta-optimization through the attention mechanism. 6 every day. It is interesting that young parents would be often surprised by unexpected progress of the speaking ability exhibited by their babies. Key Techniques for LLMs. It has been a long way that LLMs evolve into the current state: general and capable learners. In the development process, a number of impor- tant techniques are proposed, which largely improve the capacity of LLMs. Here, we briefly list several important techniques that (potentially) lead to the success of LLMs, as follows. Scaling. As discussed in previous parts, there exists an evident scaling effect in Transformer language mod- els: larger model/data sizes and more training compute typically lead to an improved model capacity [30, 34]. As two representative models, GPT-3 and PaLM explored the scaling limits by increasing the model size to 175B and 540B, respectively. Since compute budget is usually limited, scaling laws can be further employed to conduct a more compute-efficient allocation of the compute resources. For example, Chinchilla (with more training tokens) outper- forms its counterpart model Gopher (with a larger model size) by increasing the data scale with the same compute budget. In addition, data scaling should be with careful cleaning process, since the quality of pre-training data plays a key role in the model capacity. Training. Due to the huge model size, it is very chal- lenging to successfully train a capable LLM. Distributed training algorithms are needed to learn the network param- eters of LLMs, in which various parallel strategies are of- ten jointly utilized. To support distributed training, several optimization frameworks have been released to facilitate the implementation and deployment of parallel algorithms, such as DeepSpeed and Megatron-LM [75–77]. Also, op- timization tricks are also important for training stability and model performance, e.g., restart to overcome training loss spike and mixed precision training. More recently, GPT-4 proposes to develop special infrastructure and optimization methods that reliably predict the performance of large models with much smaller models. Ability eliciting. After being pre-trained on large-scale corpora, LLMs are endowed with potential abilities as general-purpose task solvers. These abilities might not be explicitly exhibited when LLMs perform some specific tasks. As the technical approach, it is useful to design suitable task instructions or specific in-context learning strategies to elicit such abilities. For instance, chain-of-thought prompting has been shown to be useful to solve complex reasoning tasks by including intermediate reasoning steps. Furthermore, we can perform instruction tuning on LLMs with task descriptions expressed in natural language, for improving the generalizability of LLMs on unseen tasks. These eliciting techniques mainly correspond to the emergent abilities of LLMs, which may not show the same effect on small lan- guage models. Alignment tuning. Since LLMs are trained to capture the data characteristics of pre-training corpora (including data, improving generaliza- both high-quality and low-quality data), they are likely to generalization”, generate toxic, biased, or even harmful content for humans. and there is It is necessary to align LLMs with human values, e.g., helpful, honest, and harmless. For this purpose, InstructGPT 7 models was already explored in the early days of Ope- nAI, while it was attempted with recurrent neural net- works (RNN). With the advent of Transformer, OpenAI developed two initial GPT models, namely GPT-1 and GPT-2 , which can be considered as the foundation to more powerful models subsequently i.e., GPT-3 and GPT-4. GPT-1. In 2017, the Transformer model was intro- duced by Google, and the OpenAI team quickly adapted their language modeling work to this new neural network architecture. They released the first GPT model in 2018, i.e., GPT-1 , and coined the abbreviation term GPT as the model name, standing for Generative Pre-Training. GPT-1 was developed based on a generative, decoder-only Transformer architecture, and adopted a hybrid approach of unsupervised pre-training and supervised fine-tuning. GPT- 1 has set up the core architecture for the GPT-series models and established the underlying principle to model natural language text, i.e., predicting the next word. GPT-2. Following a similar architecture of GPT-1, GPT-2 increased the parameter scale to 1.5B, which was trained with a large webpage dataset WebText. As claimed in the paper of GPT-2, it sought to perform tasks via unsupervised language modeling, without explicit fine-tuning using labeled data. To motivate the approach, they introduced a probabilistic form for multi-task solving, i.e., p(output|input, task) (similar approaches have been adopted in ), which predicts the output conditioned on the input and task information. To model this conditional probability, language text can be naturally employed as a unified way to format input, output and task information. In this way, the process of solving a task can be cast as a word prediction problem for generating the solution text. Further, they introduced a more formal claim for this idea: “Since the (task-specific) supervised objective is the same as the unsupervised (language modeling) objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective (for various tasks)” 15. A basic understanding of this claim is that each (NLP) task can be considered as the word prediction problem based on a subset of the world text. Thus, unsupervised language modeling could be capable in solving various tasks, if it was trained to have sufficient capacity in recovering the world text. These early discussion in GPT-2’s paper echoed in the interview of Ilya Sutskever by Jensen Huang: “What the neural network learns is some representation of the process that produced the text. This text is actually a projection of the world...the more accurate you are in predicting the next word, the higher the fidelity, the more resolution you get in this process...”16. Capacity Leap. Although GPT-2 is intended to be an “un- supervised multitask learner”, it overall has an inferior performance compared with supervised fine-tuning state- of-the-art methods. Because it has a relatively small model size, it has been widely fine-tuned in downstream tasks, be somewhat subjective. especially the dialog tasks [124, 125]. Based on GPT-2, GPT-3 on the under- blog articles, 15. To better understand this sentence, we put some explanation words in parentheses. 16. https://lifearchitect.ai/ilya/ 8 a size larger than 10B in this survey) in recent years, including the in the number of tokens or storage size) and hardware resource costs. paper about the technical details. Here, “Release Time” indicates the released. “Publicly Available” means that the model checkpoints can be the opposite. “Adaptation” indicates whether the model has been with and RLHF denotes reinforcement learning with human feedback. been evaluated with corresponding abilities in their original paper: ICL “*” denotes the largest publicly available version. Adaptation Pre-train Latest Data Hardware Training Evaluation Model IT RLHF Data Scale Timestamp (GPUs / TPUs) Time ICL CoT 1T tokens Apr-2019 1024 TPU v3 - ✓ - 1T tokens - - - ✓ - 1.1TB - 2048 Ascend 910 - ✓ - 2.6TB - - - - - - - 512 TPU v3 27 h ✓ - - - 577B tokens - - - ✓ - 825GB - 96 40G A100 - ✓ - - - 256 TPU v3 4h ✓ - 1T tokens Apr-2019 512 TPU v4 - ✓ ✓ - - 180B tokens - 992 80G A100 - ✓ - - - - - ✓ - - - 850B tokens - 1536 Ascend 910 60 d ✓ - - - 400B tokens - 768 40G A100 60 d ✓ - - - - - ✓ ✓ - - 366B tokens - 384 80G A100 105 d ✓ - - - - - ✓ - - - 106B tokens - - - ✓ ✓ - - - - ✓ - - - 128 40G A100 - ✓ ✓ - - 1.4T tokens - 2048 80G A100 21 d ✓ - - - 300B tokens - 256 40G A100 - ✓ - - - 400B tokens - - - ✓ - 1T tokens - 512 40G A100 - ✓ ✓ 2T tokens - 2000 80G A100 - ✓ - ✓ ✓ 2.6T tokens - 1024 A800 - ✓ - 3T tokens - - - ✓ - ✓ - 311B tokens - 192 A800 22 d ✓ - - - 3.2T tokens - 512 80G A800 - ✓ - - 300B tokens - - - ✓ - - 1T tokens - 2048 TPU v3 4d - - - 100B tokens May-2020 - - ✓ - - 375B tokens - 384 V100 - ✓ - - 300B tokens - 800 GPU - ✓ - - 300B tokens - 1024 A100 13.4 d ✓ - - - 128 TPU v3 60 h ✓ - - 180B tokens - 2128 GPU - ✓ - - 400B tokens - - - ✓ - - - - - ✓ - - 300B tokens - 4096 TPU v3 920 h ✓ - - - - - ✓ - - 280B tokens - 1024 TPU v4 574 h ✓ - - 768B tokens - 1024 TPU v3 57.7 d - - - 270B tokens - 4480 80G A100 - ✓ - - 967B tokens Jul-2021 - - - - ✓ - - - - ✓ - - 1.4T tokens - - - ✓ - - 780B tokens - 6144 TPU v4 - ✓ ✓ - 1.3T tokens - 128 A100 120 d ✓ ✓ - - 64 TPU v3 - ✓ - - 300B tokens - 128 A100 40G 24 d ✓ - - - 512 TPU v4 5d ✓ ✓ - - 512 TPU v4 37 h ✓ ✓ - - - - - ✓ ✓ ✓ - - - - ✓ ✓ - 329B tokens - 512 Ascend 910 100 d ✓ - - 100B tokens - - - ✓ ✓ 9 GShard Publicly Available mT5 PanGu-𝛂 Ernie 3.0 YuLan-Chat Jurassic-1 PLUG StarCoder CPM-2 FLAN CodeGen2 9-10 LaMDA Yuan 1.0 ChatGLM AlphaCode 11-12 Falcon Chinchilla 2022 PaLM2 UL2 Sparrow Pythia InternLM 1-3 Qwen2 PaLM Flan-T5 Qwen Vicuna DeepSeek-V2 YaLM Flan-PaLM Mistral PanGu-Σ LLaMA3 4-6 Luminous Bard Deepseek MiniCPM 7-10 NLLB LLaMA Mixtral Gemma 11-12 2023 1-6 7-12 2024 1-6 ChatGPT GPT-4 LLaMA2 (having a size larger than 10B) in recent years. The timeline was (e.g., the submission date to arXiv) of the technical paper for a model. If of a model as the earliest time of its public release or announcement. in yellow color. Due to the space limit of the figure, we only results. ChatGPT GPT-3 +code Codex GPT-3.5 GPT-4 2020.05 2021.07 2022.03 2023.03 in-context learning code pre-training strong reasoning ability exploring scaling limits GPT-4 Turbo 2023.09 longer context window +RLHF text-davinci-003 +chat gpt-3.5-turbo 2022.09 2023.03 GPT-4 Turbo with vision 2023.09 human alignment excellent comprehensive ability multimodal ability of GPT-series models. We plot this figure mainly based on the papers, solid lines denote that there exists an explicit evidence (e.g., the official a base model) on the evolution path between two models, while dashed cellent performance in a variety of NLP tasks, but also on a number of specially designed tasks that require the abilities of reasoning or domain adaptation. Although the GPT-3’s paper does not explicitly discuss the emergent abilities of LLMs, we can observe large performance leap that might transcend the basic scaling law , e.g., larger models have significantly stronger ICL ability (illustrated in the original Figure 1.2 of the GPT-3’s paper ). Overall, GPT-3 can be viewed as a remarkable landmark in the journey evolving from PLMs to LLMs. It has empirically proved that scaling the neural networks to a significant size can lead to a huge increase in model capacity. description Capacity Enhancement. Due to the strong capacities, GPT- 3 has been the base model to develop even more capable LLMs for OpenAI. Overall, OpenAI has explored two major approaches to further improving the GPT-3 model, i.e., train- 10 GPT-3.5 models by OpenAI (see the discussion about the OpenAI API in Section 3.1). of the original The Milestones of Language Models. Based on all the ex- the lack of ploration efforts, two major milestones have been achieved the by OpenAI, namely ChatGPT and GPT-4 , which have largely raised the capacity bar of existing AI systems. ChatGPT. In November 2022, OpenAI released the conversation model ChatGPT, based on the GPT models (GPT-3.5 and GPT-4). As the official blog article intro- duced , ChatGPT was trained in a similar way as InstructGPT (called “a sibling model to InstructGPT” in the original post), while specially optimized for dialogue. They reported a difference between the training of ChatGPT and InstructGPT in the data collection setup: human-generated conversations (playing both the roles of user and AI) are combined with the InstructGPT dataset in a dialogue format for training ChatGPT. ChatGPT exhibited superior capaci- ties in communicating with humans: possessing a vast store of knowledge, skill at reasoning on mathematical problems, tracing the context accurately in multi-turn dialogues, and aligning well with human values for safe use. Later on, the plugin mechanism has been supported in ChatGPT, which further extends the capacities of ChatGPT with existing tools or apps. So far, it seems to be the ever most powerful chatbot in the AI history. The launch of ChatGPT has a significant from human impact on the AI research in the future, which sheds light on the exploration of human-like AI systems. GPT-4. As another remarkable progress, GPT-4 was released in March 2023, which extended the text input to multimodal signals. Overall, GPT-4 has stronger capac- ities in solving complex tasks than GPT-3.5, showing a large performance improvement on many evaluation tasks. A recent study investigated the capacities of GPT- 4 by conducting qualitative tests with human-generated problems, spanning a diverse range of difficult tasks, and showed that GPT-4 can achieve more superior performance than prior GPT models. Furthermore, GPT-4 responds more safely to malicious or provocative queries, due to a six- month iterative alignment (with an additional safety re- ward signal in the RLHF training). In the technical report, OpenAI has emphasized how to safely develop GPT-4 and applied a number of intervention strategies to mitigate the possible issues of LLMs, such as hallucinations, privacy and overreliance. For example, they introduced the mech- anism called red teaming to reduce the harm or toxic content generation. As another important aspect, GPT-4 has been developed on a well-established deep learning infrastructure with improved optimization methods. They introduced a new mechanism called predictable scaling that can accurately predict the final performance with a small proportion of compute during model training. GPT-4V, GPT-4 turbo, and beyond. Based on the work done for GPT-4 , OpenAI further released GPT-4V in September 2023, which focused on the safe deployment of the vision capabilities of GPT-4. In the GPT-4V’s system card , it has extensively discussed the assessment and mitigation of risks related to visually augmented inputs. Specially, GPT-4V exhibited strong vision capacities in var- ious application scenarios, showing the great potential as a powerful multimodal learning system. More recently, in 11 model, and its performance evaluation in downstream tasks. For more details of LLMs, see Table 1. LLaMA. The LLaMA series of models has gained im- mense popularity and widespread attention due to its open- ness and effectiveness. From LLaMA , LLaMA-2 , LLaMA-3 to LLaMA-3.1 , continuous updates have been made and the development is still ongoing. With increased parameters (the largest version has 405B), more pre-training tokens (15T tokens), and an extended context window (128K), LLaMA-3.1 has significantly enhanced its capabilities, and it also integrates additional components that work in synergy with the model, including new se- curity and safety tools. In evaluation, LLaMa-3.1 (405B ver- sion) achieves competitive performance against prominent closed-source LLMs, such as GPT-4, GPT-4o, and Claude 3.5 Sonnet in various benchmarks (e.g., MMLU, GSM8k, and HumanEval). The pre-training of LLaMA (65B version) involves 2,048 A100-80G GPUs, whereas LLaMA-3.1 (405B version) involves more than 16,000 H100 GPUs. Mistral. The Mistral series [137, 138] consist of Mis- tral (7B), Mistral NeMo (12B), Mistral Large 2 (123B), and Mixtral (8×7B and 8×22B), which have been widely known for their strong performance on various mainstream bench- marks (e.g., MMLU and GSM8k). Mistral NeMo is featured with a long context window of 128K at the parameter scale of 12B. Although Mistral NeMo is trained with quantization awareness, it enables FP8 inference without sacrificing per- formance. Mistral Large 2 is the largest and most powerful model of the Mistral series, which supports 11 natural languages and more than 80 programming languages. Mix- tral is a kind of sparse Mixture-of-Experts (SMoE) model that activates only part of the parameters during inference, making it more efficient compared to dense models of the same size. Gemma. Gemma [139, 140] is a series of lightweight, strong, and open models, consisting of Gemma-1 (2B and 7B) and Gemma-2 (2B, 9B, and 27B). During the pre-training stage, Gemma-2 2B, 9B, and 27B versions are trained on 2T, 8T, and 13T primarily English tokens, respectively. The largest version of Gemma-2 is trained on 6144 TPUv5p chips. Gemma-2 has achieved excellent performance in mul- tiple benchmarks (e.g., ARC-c, MMLU, and GSM8k). Qwen. Qwen [141, 142] is an open-source large model series consisting of Qwen (raging from 7B to 72B), Qwen1.5 (raging from 0.5B to 110B), Qwen2 (ranging from 0.5B to 72B), and Qwen2.5 (ranging from 0.5B to 72B). Qwen2.5 is the newest LLM collection of Qwen, which is pre-trained on up to 18T tokens. Compared to Qwen2, Qwen2.5 demonstrates a significant increase

A Survey of Large Language Models PDF

Document Details

Tags

Related

Summary

Full Transcript

Upgrade to continue