A Survey of Large Language Models
Wayne Xin Zhao, Kun Zhou*, Junyi Li*, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie and Ji-Rong Wen

arXiv:2303.18223v15 [cs.CL] 13 Oct 2024

Version: v14 (major update on September 25, 2024). GitHub link: https://github.com/RUCAIBox/LLMSurvey. Chinese book link: lmbook-zh.github.io.
* K. Zhou and J. Li contribute equally to this work.
The authors are mainly with Gaoling School of Artificial Intelligence and School of Information, Renmin University of China, Beijing, China; Jian-Yun Nie is with DIRO, Université de Montréal, Canada. Contact e-mail: [email protected]
The authors of this survey paper reserve all the copyrights of the figures/tables, and any use of these materials for publication purposes must be officially granted by the survey authors.

Abstract—Ever since the Turing Test was proposed in the 1950s, humans have explored the mastering of language intelligence by machines. Language is essentially a complex, intricate system of human expressions governed by grammatical rules. It poses a significant challenge to develop capable artificial intelligence (AI) algorithms for comprehending and grasping a language. As a major approach, language modeling has been widely studied for language understanding and generation in the past two decades, evolving from statistical language models to neural language models. Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora, showing strong capabilities in solving various natural language processing (NLP) tasks. Since researchers have found that model scaling can lead to an improved model capacity, they have further investigated the scaling effect by increasing the parameter scale to an even larger size. Interestingly, when the parameter scale exceeds a certain level, these enlarged language models not only achieve a significant performance improvement, but also exhibit some special abilities (e.g., in-context learning) that are not present in small-scale language models (e.g., BERT). To distinguish language models of different parameter scales, the research community has coined the term large language models (LLM) for the PLMs of significant size (e.g., containing tens or hundreds of billions of parameters). Recently, the research on LLMs has been largely advanced by both academia and industry, and a remarkable progress is the launch of ChatGPT (a powerful AI chatbot developed based on LLMs), which has attracted widespread attention from society. The technical evolution of LLMs has been making an important impact on the entire AI community, which would revolutionize the way we develop and use AI algorithms. Considering this rapid technical progress, in this survey, we review the recent advances of LLMs by introducing the background, key findings, and mainstream techniques. In particular, we focus on four major aspects of LLMs, namely pre-training, adaptation tuning, utilization, and capacity evaluation. Furthermore, we also summarize the available resources for developing LLMs and discuss the remaining issues for future directions. This survey provides an up-to-date review of the literature on LLMs, which can be a useful resource for both researchers and engineers.

Index Terms—Large Language Models; Emergent Abilities; Adaptation Tuning; Utilization; Alignment; Capacity Evaluation

1 INTRODUCTION

“The limits of my language mean the limits of my world.”
—Ludwig Wittgenstein

LANGUAGE is a prominent ability in human beings to express and communicate, which develops in early childhood and evolves over a lifetime [3, 4]. Machines, however, cannot naturally grasp the abilities of understanding and communicating in the form of human language, unless equipped with powerful artificial intelligence (AI) algorithms. It has been a longstanding research challenge to achieve this goal: to enable machines to read, write, and communicate like humans.

Technically, language modeling (LM) is one of the major approaches to advancing the language intelligence of machines. In general, LM aims to model the generative likelihood of word sequences, so as to predict the probabilities of future (or missing) tokens. The research of LM has received extensive attention in the literature, which can be divided into four major development stages:

Statistical language models (SLM). SLMs [6–9] are developed based on statistical learning methods that rose in the 1990s. The basic idea is to build the word prediction model based on the Markov assumption, e.g., predicting the next word based on the most recent context. The SLMs with a fixed context length n are also called n-gram language models, e.g., bigram and trigram language models. SLMs have been widely applied to enhance task performance in information retrieval (IR) [10, 11] and natural language processing (NLP) [12–14]. However, they often suffer from the curse of dimensionality: it is difficult to accurately estimate high-order language models, since an exponential number of transition probabilities need to be estimated. Thus, specially designed smoothing strategies such as back-off estimation and Good–Turing estimation have been introduced to alleviate the data sparsity problem.
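To make the Markov assumption concrete, the following minimal sketch (with a toy corpus of our own) builds a bigram model; add-one smoothing is used here as a simple stand-in for the back-off and Good–Turing estimators mentioned above:

```python
from collections import defaultdict

# Minimal bigram (2-gram) language model: under the Markov assumption,
# the next word depends only on the most recent word.
corpus = "the cat sat on the mat . the cat ran . the dog sat on the rug .".split()
vocab = set(corpus)

bigram_counts = defaultdict(lambda: defaultdict(int))
prev_counts = defaultdict(int)
for prev, word in zip(corpus, corpus[1:]):
    bigram_counts[prev][word] += 1
    prev_counts[prev] += 1

def p_next(word, prev):
    """P(word | prev) with add-one (Laplace) smoothing over the vocabulary."""
    return (bigram_counts[prev][word] + 1) / (prev_counts[prev] + len(vocab))

# Predict the most likely continuation of "the" in this toy corpus.
print(max(vocab, key=lambda w: p_next(w, "the")))  # -> 'cat'
```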
Fig. 1: The trends of the cumulative numbers of arXiv papers that contain the keyphrases “language model” (since June 2018) and “large language model” (since October 2019), respectively: (a) Query = “Language Model”; (b) Query = “Large Language Model”. The statistics are calculated using exact match by querying the keyphrases in title or abstract by months. We set different x-axis ranges for the two keyphrases, because “language models” have been explored at an earlier time. We label the points corresponding to important landmarks in the research progress of LLMs. A sharp increase occurs after the release of ChatGPT: the average number of published arXiv papers that contain “large language model” in title or abstract goes from 0.40 per day to 8.58 per day (Figure 1(b)).

Neural language models (NLM). NLMs [1, 17, 18] characterize the probability of word sequences by neural networks, e.g., multi-layer perceptron (MLP) and recurrent neural networks (RNNs). As a remarkable contribution, the work in [1] introduced the concept of distributed representation of words and built the word prediction function conditioned on the aggregated context features (i.e., the distributed word vectors). By extending the idea of learning effective features for text data, a general neural network approach was developed to build a unified, end-to-end solution for various NLP tasks. Furthermore, word2vec [19, 20] was proposed to build a simplified shallow neural network for learning distributed word representations, which were demonstrated to be very effective across a variety of NLP tasks. These studies have initiated the use of language models for representation learning (beyond word sequence modeling), having an important impact on the field of NLP.
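As a rough illustration of the NLM idea just described, the sketch below shows how distributed word vectors are aggregated into context features that condition a next-word distribution. Weights and dimensions are random placeholders, so this only shows the shape of the computation, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, context_len, hidden_dim = 10, 8, 3, 16

C = rng.normal(size=(vocab_size, embed_dim))           # distributed word vectors
W1 = rng.normal(size=(context_len * embed_dim, hidden_dim))
W2 = rng.normal(size=(hidden_dim, vocab_size))

def next_word_probs(context_ids):
    x = C[context_ids].reshape(-1)      # concatenated context features
    h = np.tanh(x @ W1)                 # hidden representation
    logits = h @ W2
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

probs = next_word_probs([1, 4, 7])
print(probs.argmax(), probs.sum())      # predicted word id; probabilities sum to 1.0
```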
Fig. 2: An evolution process of the four generations of language models (LM) from the perspective of task solving capacity. Note that the time period for each stage may not be very accurate, and we set the time mainly according to the publish date of the most representative studies at each stage. For neural language models, we abbreviate the paper titles of two representative studies to name the two approaches: NPLM (“A neural probabilistic language model”) and NLPS (“Natural language processing (almost) from scratch”). Due to the space limitation, we don’t list all representative studies in this figure.

Pre-trained language models (PLM). As an early attempt, ELMo was proposed to capture context-aware word representations by first pre-training a bidirectional LSTM (biLSTM) network (instead of learning fixed word representations) and then fine-tuning the biLSTM network according to specific downstream tasks. Furthermore, based on the highly parallelizable Transformer architecture with self-attention mechanisms, BERT was proposed by pre-training bidirectional language models with specially designed pre-training tasks on large-scale unlabeled corpora. These pre-trained context-aware word representations are very effective as general-purpose semantic features, which have largely raised the performance bar of NLP tasks. This study has inspired a large number of follow-up works, which set the “pre-training and fine-tuning” learning paradigm. Following this paradigm, a great number of studies on PLMs have been developed, introducing either different architectures [24, 25] (e.g., GPT-2 and BART) or improved pre-training strategies [27–29]. In this paradigm, it is often required to fine-tune the PLM for adapting to different downstream tasks.

Large language models (LLM). Researchers find that scaling a PLM (e.g., scaling the model size or data size) often leads to an improved model capacity on downstream tasks (i.e., following the scaling law). A number of studies have explored the performance limit by training ever larger PLMs (e.g., the 175B-parameter GPT-3 and the 540B-parameter PaLM). Although scaling is mainly conducted in model size (with similar architectures and pre-training tasks), these large-sized PLMs display different behaviors from smaller PLMs (e.g., the 330M-parameter BERT and the 1.5B-parameter GPT-2) and show surprising abilities (called emergent abilities) in solving a series of complex tasks. For example, GPT-3 can solve few-shot tasks through in-context learning, whereas GPT-2 cannot do well. Thus, the research community has coined the term “large language models (LLM)”1 for these large-sized PLMs [32–35], which attract increasing research attention (see Figure 1). A remarkable application of LLMs is ChatGPT2, which adapts the LLMs from the GPT series for dialogue and presents an amazing conversation ability with humans. We can observe a sharp increase of the arXiv papers that are related to LLMs after the release of ChatGPT in Figure 1.

As discussed before, language model is not a new technical concept specially for LLMs, but has evolved with the advance of artificial intelligence over the decades. Early language models mainly aim to model and generate text data, while the latest language models (e.g., GPT-4) focus on complex task solving. From language modeling to task solving, this is an important leap in scientific thinking, which is the key to understanding the development of language models in the research history. From the perspective of task solving, the four generations of language models have exhibited different levels of model capacities. In Figure 2, we describe the evolution process of language models in terms of task solving capacity. At first, statistical language models mainly assisted in some specific tasks (e.g., retrieval or speech tasks), in which the predicted or estimated probabilities can enhance the performance of task-specific approaches. Subsequently, neural language models focused on learning task-agnostic representations (e.g., features), aiming to reduce the effort of human feature engineering. Furthermore, pre-trained language models learned context-aware representations that can be optimized according to downstream tasks. For the latest generation of language models, LLMs are enhanced by exploring the scaling effect on model capacity, and they can be considered as general-purpose task solvers. To summarize, in the evolution process, the task scope that can be solved by language models has been greatly extended, and the task performance attained by language models has been significantly enhanced.

In the existing literature, PLMs have been widely discussed and surveyed [36–39], while LLMs are seldom reviewed in a systematic way. To motivate our survey, we first highlight three major differences between LLMs and PLMs. First, LLMs display some surprising emergent abilities that may not be observed in previous smaller PLMs. These abilities are key to the performance of language models on complex tasks, making AI algorithms unprecedentedly powerful and effective. Second, LLMs would revolutionize the way that humans develop and use AI algorithms. Unlike small PLMs, the major approach to accessing LLMs is through the prompting interface (e.g., the GPT-4 API). Humans have to understand how LLMs work and format their tasks in a way that LLMs can follow. Third, the development of LLMs no longer draws a clear distinction between research and engineering. The training of LLMs requires extensive practical experience in large-scale data processing and distributed parallel training. To develop capable LLMs, researchers have to solve complicated engineering issues, working with engineers or being engineers.

Nowadays, LLMs are posing a significant impact on the AI community, and the advent of ChatGPT and GPT-4 leads to the rethinking of the possibilities of artificial general intelligence (AGI). OpenAI has published a technical article entitled “Planning for AGI and beyond”, which discusses the short-term and long-term plans to approach AGI, and a more recent paper has argued that GPT-4 might be considered as an early version of an AGI system. The research areas of AI are being revolutionized by the rapid progress of LLMs. In the field of NLP, LLMs can serve as a general-purpose language task solver (to some extent), and the research paradigm has been shifting towards the use of LLMs. In the field of IR, traditional search engines are challenged by the new information seeking way through AI chatbots (i.e., ChatGPT), and New Bing3 presents an initial attempt that enhances the search results based on LLMs. In the field of CV, researchers try to develop ChatGPT-like vision-language models that can better serve multimodal dialogues [42–45], and GPT-4 has supported multimodal input by integrating visual information. This new wave of technology would potentially lead to a prosperous ecosystem of real-world applications based on LLMs. For instance, Microsoft 365 is being empowered by LLMs (i.e., Copilot) to automate office work, and OpenAI supports the use of plugins in ChatGPT for implementing special functions.

Despite the progress and impact, the underlying principles of LLMs are still not well explored. Firstly, it is mysterious why emergent abilities occur in LLMs instead of smaller PLMs. As a more general issue, there lacks a deep, detailed investigation of the key factors that contribute to the superior abilities of LLMs. It is important to study when and how LLMs obtain such abilities. Although there are some meaningful discussions about this problem [31, 47], more principled investigations are needed to uncover the “secrets” of LLMs. Secondly, it is difficult for the research community to train capable LLMs. Due to the huge demand of computation resources, it is very costly to carry out repetitive, ablating studies for investigating the effect of various strategies for training LLMs. Indeed, LLMs are mainly trained by industry, where many important training details (e.g., data collection and cleaning) are not revealed to the public. Thirdly, it is challenging to align LLMs with human values or preferences. Despite their capacities, LLMs are also likely to produce toxic, fictitious, or harmful content. It requires effective and efficient control approaches to eliminate the potential risks of using LLMs.

Faced with both opportunities and challenges, the research and development of LLMs needs more attention. In order to provide a basic understanding of LLMs, this survey conducts a literature review of the recent advances in LLMs from four major aspects, including pre-training (how to pre-train a capable LLM), adaptation (how to effectively adapt pre-trained LLMs for better use), utilization (how to use LLMs for solving various downstream tasks) and capability evaluation (how to evaluate the abilities of LLMs and existing empirical findings). We thoroughly comb the literature and summarize the key findings, techniques, and methods of LLMs. For this survey, we also create a GitHub project website by collecting the supporting resources for LLMs, at the link https://github.com/RUCAIBox/LLMSurvey. We are also aware of several related review articles on PLMs or LLMs [32, 36, 38, 39, 43, 48–54]. These papers either discuss PLMs or some specific (or general) aspects of LLMs. Compared with them, we focus on the techniques and methods to develop and use LLMs and provide a relatively comprehensive reference to important aspects of LLMs.

The remainder of this survey is organized as follows: Section 2 introduces the background for LLMs and the evolution of GPT-series models, followed by the summarization of available resources for developing LLMs in Section 3. Sections 4, 5, 6, and 7 review and summarize the recent progress from the four aspects of pre-training, adaptation, utilization, and capacity evaluation, respectively. Then, Section 8 discusses the practical guide for prompt design, and Section 9 reviews the applications of LLMs in several representative domains. Finally, we conclude the survey in Section 10 by summarizing the major findings and discussing the remaining issues for future work.

1. Note that an LLM is not necessarily more capable than a small PLM, and emergent abilities may not occur in some LLMs.
2. https://openai.com/blog/chatgpt/
3. https://www.bing.com/new
2 OVERVIEW

In this section, we present an overview of the background of LLMs and then summarize the technical evolution of the GPT-series models.

2.1 Background for LLMs

Typically, large language models (LLMs) refer to Transformer language models that contain hundreds of billions (or more) of parameters4, which are trained on massive text data, such as GPT-3, PaLM, Galactica, and LLaMA. LLMs exhibit strong capacities to understand natural language and solve complex tasks (via text generation). To have a quick understanding of how LLMs work, this part introduces the basic background for LLMs, including scaling laws, emergent abilities and key techniques.

Formulation of Scaling Laws for LLMs. Currently, LLMs are mainly built upon the Transformer architecture, where multi-head attention layers are stacked in a very deep neural network. Existing LLMs adopt similar Transformer architectures and pre-training objectives (e.g., language modeling) as small language models. However, LLMs significantly extend the model size, data size, and total compute (by orders of magnitude). Extensive research has shown that scaling can largely improve the model capacity of LLMs [26, 55, 56]. Thus, it is useful to establish a quantitative approach to characterizing the scaling effect. Next, we introduce two representative scaling laws for Transformer language models [30, 34].

KM scaling law5. In 2020, Kaplan et al. (the OpenAI team) firstly proposed to model the power-law relationship of model performance with respect to three major factors, namely model size (N), dataset size (D), and the amount of training compute (C), for neural language models. Given a compute budget c, they empirically presented three basic formulas for the scaling law6:

$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \sim 0.076, \quad N_c \sim 8.8 \times 10^{13}$  (1)
$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \sim 0.095, \quad D_c \sim 5.4 \times 10^{13}$
$L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}, \quad \alpha_C \sim 0.050, \quad C_c \sim 3.1 \times 10^{8}$

where L(·) denotes the cross entropy loss in nats, and a follow-up study from OpenAI has shown that the language modeling loss can be decomposed into two parts, namely irreducible loss (the entropy of the true data distribution) and reducible loss (an estimate of the KL divergence between the true and model distributions). The three laws were derived by fitting the model performance with varied data sizes (22M to 23B tokens), model sizes (768 to 1.5B non-embedding parameters) and training compute, under some assumptions (e.g., the analysis of one factor should not be bottlenecked by the other two factors). They showed that the model performance has a strong dependence relation on the three factors.

Chinchilla scaling law. As another representative study, Hoffmann et al. (the Google DeepMind team) proposed an alternative form for scaling laws to instruct the compute-optimal training for LLMs. They conducted rigorous experiments by varying a larger range of model sizes (70M to 16B) and data sizes (5B to 500B tokens), and fitted a similar scaling law yet with different coefficients as below:

$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$  (2)

where E = 1.69, A = 406.4, B = 410.7, α = 0.34 and β = 0.28. By optimizing the loss L(N, D) under the constraint C ≈ 6ND, they showed that the optimal allocation of the compute budget to model size and data size can be derived as follows:

$N_{opt}(C) = G\left(\frac{C}{6}\right)^{a}, \quad D_{opt}(C) = G^{-1}\left(\frac{C}{6}\right)^{b}$  (3)

where $a = \frac{\alpha}{\alpha+\beta}$, $b = \frac{\beta}{\alpha+\beta}$ and G is a scaling coefficient that can be computed by A, B, α and β. As analyzed in [34], given an increase in compute budget, the KM scaling law favors a larger budget allocation in model size than the data size, while the Chinchilla scaling law argues that the two sizes should be increased in equal scales, i.e., having similar values for a and b in Equation (3).

4. In existing literature, there is no formal consensus on the minimum parameter scale for LLMs, since the model capacity is also related to data size and total compute. In this survey, we take a slightly loose definition of LLMs, and mainly focus on discussing language models with a model size larger than 10B.
5. Since there was not a model trained following this law in the original paper, we took the last names of the two co-first authors to name this scaling law.
6. Here, Nc, Dc and Cc are measured in the number of non-embedding parameters, the number of training tokens and the number of FP-days, respectively. According to the original paper, Cc and C should be denoted by Cc_min and C_min, corresponding to the optimal use of compute. We use the simplified notations for ease of discussion.
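As a hedged illustration, the snippet below plugs the fitted constants of Equations (1)–(3) into plain Python. The helper names are our own; the closed form used for G is the one given in the Chinchilla paper, which the survey text does not spell out:

```python
# KM scaling law, Eq. (1): loss as a power law in each factor.
def km_loss_n(n): return (8.8e13 / n) ** 0.076   # L(N)
def km_loss_d(d): return (5.4e13 / d) ** 0.095   # L(D)
def km_loss_c(c): return (3.1e8 / c) ** 0.050    # L(C)

# Chinchilla scaling law, Eq. (2).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n, d):
    return E + A / n**alpha + B / d**beta

# Compute-optimal allocation under C ~= 6*N*D, following Eq. (3) as given.
a = alpha / (alpha + beta)
b = beta / (alpha + beta)
G = ((alpha * A) / (beta * B)) ** (1 / (alpha + beta))  # assumption: Chinchilla's closed form

def optimal_allocation(compute_flops):
    n_opt = G * (compute_flops / 6) ** a
    d_opt = (1 / G) * (compute_flops / 6) ** b
    return n_opt, d_opt

n, d = optimal_allocation(1e23)
print(f"N_opt ~ {n:.2e} params, D_opt ~ {d:.2e} tokens")
print(f"L(70B, 1.4T) ~ {chinchilla_loss(70e9, 1.4e12):.3f} nats")
```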
Discussion on Scaling Laws. After introducing the formulations, we continue to discuss scaling laws in the following two aspects, to enhance the understanding of them:

Predictable scaling. In practice, scaling laws can be used to instruct the training of LLMs, and it has been proven feasible to reliably estimate the performance of larger models based on that of smaller models, called predictable scaling. The benefits of predictable scaling for training LLMs are mainly twofold. Firstly, for large models, it is infeasible to rigorously examine various training tricks or variants, and it would be very helpful if experiences gained from small models could also apply to large models. For instance, small proxy models can be trained to find the optimal schedule of the data mixture for large models. Secondly, the training of large-scale models takes a long time, often suffering from issues such as training loss spikes, and scaling laws can be employed to monitor the training status of LLMs, e.g., identifying abnormal performance at an early time. Despite that scaling laws characterize a smooth trend of performance increase (or loss decrease), they also indicate that diminishing returns7 might occur with model scaling. An empirical study from the OpenAI team has shown that representation quality or semantic content can still effectively improve even when approaching the point of diminishing returns (i.e., approaching the irreducible loss). This finding suggests that training large models is promising for improving the performance of downstream tasks. To further explore the scaling effect, a potential issue is that the amount of available data for training LLMs is actually limited. With the ever-increasing model scale, the public text data would soon be “exhausted” for LLMs. Thus, it will be meaningful to study how scaling laws apply to a data-constrained regime, where data repetition or augmentation might be useful to alleviate data scarcity.

7. https://en.wikipedia.org/wiki/Diminishing_returns
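A minimal sketch of the predictable-scaling idea: since the KM law is linear in log-log space, losses measured on small models can be fitted with a least-squares line and extrapolated to a larger model. The "observations" below are synthetic stand-ins for real measurements:

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])    # small proxy-model sizes
# Synthetic losses drawn from the KM-style law with a little noise.
losses = (8.8e13 / sizes) ** 0.076 * np.exp(rng.normal(0, 0.003, sizes.size))

# In log space: log L = -alpha * log N + alpha * log Nc, i.e., a line.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
alpha_hat = -slope
nc_hat = np.exp(intercept / alpha_hat)

predicted = (nc_hat / 1e11) ** alpha_hat        # extrapolate to a 100B model
print(f"alpha ~ {alpha_hat:.3f}, predicted loss at 100B params ~ {predicted:.3f}")
```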
Task-level predictability. Existing research on scaling laws is mostly conducted in terms of language modeling loss (e.g., per-token cross-entropy loss in nats), while in practice we are more concerned about the performance of LLMs on actual tasks. Thus, a basic problem is how the decrease of language modeling loss translates into the improvement of task performance. Intuitively, a model with a smaller language modeling loss tends to yield a better performance on downstream tasks, since language modeling loss can be considered as a general measure of the overall model capacity. GPT-4 has reported that some capabilities (e.g., coding ability) can be accurately predicted via scaling laws. Despite that, readers should be aware that a direct decrease in language modeling loss does not always indicate an improvement of model performance on downstream tasks. Specially, the phenomenon of inverse scaling would occur for some tasks, where task performance surprisingly becomes worse as the language modeling loss decreases. Overall, it is more difficult to explore and characterize task-level scaling laws, since task performance might also be dependent on task-related information (task metric, task difficulty, etc.). Furthermore, some capacities (e.g., in-context learning) are unpredictable according to the scaling law, and can be observed only when the model size exceeds a certain level (as discussed below).

Emergent Abilities of LLMs. In the literature, emergent abilities of LLMs are formally defined as “the abilities that are not present in small models but arise in large models”, which is one of the most prominent features that distinguish LLMs from previous PLMs. The literature further introduces a notable characteristic of when emergent abilities occur: performance rises significantly above random when the scale reaches a certain level. By analogy, such an emergent pattern has close connections with the phenomenon of phase transition in physics [31, 63]. In principle, emergent abilities can be defined in relation to some complex tasks [31, 64], while we are more concerned with general abilities that can be applied to solve a variety of tasks. Here, we briefly introduce three typical emergent abilities for LLMs and representative models that possess such an ability8.

In-context learning. The in-context learning (ICL) ability is formally introduced by GPT-3: assuming that the language model has been provided with a natural language instruction and/or several task demonstrations, it can generate the expected output for the test instances by completing the word sequence of the input text, without requiring additional training or gradient update9. Among the GPT-series models, the 175B GPT-3 model exhibited a strong ICL ability in general, but not the GPT-1 and GPT-2 models. Such an ability also depends on the specific downstream task. For example, the ICL ability can emerge on arithmetic tasks (e.g., 3-digit addition and subtraction) for the 13B GPT-3, but even the 175B GPT-3 cannot work well on the Persian QA task.

8. It is difficult to accurately examine the critical size for emergent abilities of LLMs (i.e., the minimum size to possess an ability), since it might vary for different models or tasks. Also, existing studies often test emergent abilities on very limited model sizes for a specific LLM. For example, PaLM is often tested with three sizes of 8B, 62B and 540B. It is unclear about the model performance of the untested sizes.
9. In a recent study, it is also shown that in-context learning implicitly performs meta-optimization through the attention mechanism.
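A minimal sketch of how an ICL prompt is assembled: task demonstrations plus a test instance are packed into one text sequence, and the (frozen) model simply continues it. The task, demonstrations, and the `generate` placeholder are illustrative, not any specific API:

```python
demonstrations = [
    ("great movie, loved it", "positive"),
    ("utterly boring", "negative"),
    ("a masterpiece of suspense", "positive"),
]

def build_icl_prompt(demos, test_input, instruction="Classify the sentiment."):
    lines = [instruction]
    for text, label in demos:
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {test_input}\nSentiment:")   # model completes this line
    return "\n\n".join(lines)

prompt = build_icl_prompt(demonstrations, "I fell asleep halfway through")
print(prompt)
# answer = generate(prompt)  # no gradient update: the model only reads the context
```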
Instruction following. By fine-tuning with a mixture of multi-task datasets formatted via natural language descriptions (called instruction tuning), LLMs are shown to perform well on unseen tasks that are also described in the form of instructions [28, 66, 67]. With instruction tuning, LLMs are enabled to follow the task instructions for new tasks without using explicit examples, thus having an improved generalization ability. According to reported experiments, instruction-tuned LaMDA-PT started to significantly outperform the untuned one on unseen tasks when the model size reached 68B, but not for 8B or smaller model sizes. A recent study found that a model size of at least 62B is required for PaLM to perform well on various tasks in four evaluation benchmarks (i.e., MMLU, BBH, TyDiQA and MGSM), though a much smaller size might suffice for some specific tasks (e.g., MMLU).

Step-by-step reasoning. For small language models, it is usually difficult to solve complex tasks that involve multiple reasoning steps, e.g., mathematical word problems. In contrast, with the chain-of-thought (CoT) prompting strategy, LLMs can solve such tasks by utilizing the prompting mechanism that involves intermediate reasoning steps for deriving the final answer. This ability is speculated to be potentially obtained by training on code [33, 47]. An empirical study has shown that CoT prompting can bring performance gains (on arithmetic reasoning benchmarks) when applied to PaLM and LaMDA variants with a model size larger than 60B, while its advantage over standard prompting becomes more evident when the model size exceeds 100B. Furthermore, the performance improvement with CoT prompting seems to vary for different tasks, e.g., GSM8K > MAWPS > SWAMP for PaLM.
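The contrast between standard and CoT prompting can be shown with the classic arithmetic demonstration; the strings below are illustrative only, and the only difference is that the CoT demonstration spells out intermediate reasoning steps:

```python
standard_demo = (
    "Q: Roger has 5 balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: The answer is 11."
)

cot_demo = (
    "Q: Roger has 5 balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11."
)

question = "Q: A baker had 23 cupcakes and sold 17. How many are left?\nA:"

standard_prompt = standard_demo + "\n\n" + question
cot_prompt = cot_demo + "\n\n" + question  # tends to elicit worked steps from large models
print(cot_prompt)
```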
How Emergent Abilities Relate to Scaling Laws. In the existing literature [30, 31, 34], scaling laws and emergent abilities provide two perspectives to understand the advantage of large models over small models. In general, scaling laws (often measured by language modeling loss) describe a predictable performance relation with the potential effect of diminishing returns, while emergent abilities (often measured by task performance) are unpredictable but very profitable once such abilities actually emerge. Since the two perspectives reflect different performance trends (continuous improvement v.s. sharp performance leap), they might lead to misaligned findings or observations. There are also extensive debates on the rationality of emergent abilities. A popular speculation is that emergent abilities might be partially attributed to the evaluation setting for special tasks (e.g., the discontinuous evaluation metrics) [70, 71]: when evaluation metrics are altered accordingly, the sharpness of the emergent ability curve would disappear. However, the performance of LLMs on most tasks is perceived by users naturally in a discontinuous way. For instance, end users prefer a reliable code generated by LLMs that can successfully pass the test case, but are less interested in selecting a better code with fewer errors between two failed ones. More recently, a study proposes a new evaluation setting that can enlarge the resolution of task metrics, making task performance more predictable. Despite these efforts, more fundamental research (e.g., grokking10) about the working mechanism of LLMs is still needed to understand the emergence of certain abilities. The subtle relation between scaling laws and emergent abilities can be explained by analogy with the ability acquisition of humans11. Take the speaking ability as an example. For children, language development (especially for infants) can also be considered as a multi-level process where “emergent abilities” occur. Specially, the language ability would be relatively stable within a time interval, and a qualitative change only occurs when evolving into another ability level (e.g., from speaking simple words to speaking simple sentences). Such a learning process is essentially not smooth and stable (i.e., language ability does not develop at a constant rate over time), though a child actually grows every day. It is interesting that young parents would often be surprised by the unexpected progress of the speaking ability exhibited by their babies.

10. Grokking refers to “a pattern in the data, improving generalization performance from random chance level to perfect generalization”, quoted from the original paper.
11. This explanation is only for ease of understanding, and there is no direct evidence to connect the two points.
Key Techniques for LLMs. It has been a long way for LLMs to evolve into the current state: general and capable learners. In the development process, a number of important techniques have been proposed, which largely improve the capacity of LLMs. Here, we briefly list several important techniques that (potentially) lead to the success of LLMs, as follows.

Scaling. As discussed in previous parts, there exists an evident scaling effect in Transformer language models: larger model/data sizes and more training compute typically lead to an improved model capacity [30, 34]. As two representative models, GPT-3 and PaLM explored the scaling limits by increasing the model size to 175B and 540B, respectively. Since compute budget is usually limited, scaling laws can be further employed to conduct a more compute-efficient allocation of the compute resources. For example, Chinchilla (with more training tokens) outperforms its counterpart model Gopher (with a larger model size) by increasing the data scale with the same compute budget. In addition, data scaling should come with a careful cleaning process, since the quality of pre-training data plays a key role in the model capacity.

Training. Due to the huge model size, it is very challenging to successfully train a capable LLM. Distributed training algorithms are needed to learn the network parameters of LLMs, in which various parallel strategies are often jointly utilized. To support distributed training, several optimization frameworks have been released to facilitate the implementation and deployment of parallel algorithms, such as DeepSpeed and Megatron-LM [75–77]. Also, optimization tricks are important for training stability and model performance, e.g., restarts to overcome training loss spikes and mixed precision training. More recently, GPT-4 proposes to develop special infrastructure and optimization methods that reliably predict the performance of large models with much smaller models.

Ability eliciting. After being pre-trained on large-scale corpora, LLMs are endowed with potential abilities as general-purpose task solvers. These abilities might not be explicitly exhibited when LLMs perform some specific tasks. As the technical approach, it is useful to design suitable task instructions or specific in-context learning strategies to elicit such abilities. For instance, chain-of-thought prompting has been shown to be useful for solving complex reasoning tasks by including intermediate reasoning steps. Furthermore, we can perform instruction tuning on LLMs with task descriptions expressed in natural language, for improving the generalizability of LLMs on unseen tasks. These eliciting techniques mainly correspond to the emergent abilities of LLMs, which may not show the same effect on small language models.

Alignment tuning. Since LLMs are trained to capture the data characteristics of pre-training corpora (including both high-quality and low-quality data), they are likely to generate toxic, biased, or even harmful content for humans. It is necessary to align LLMs with human values, e.g., helpful, honest, and harmless. For this purpose, InstructGPT designs an effective tuning approach that enables LLMs to follow the expected instructions, which utilizes the technique of reinforcement learning with human feedback [66, 79]. It incorporates humans in the training loop with elaborately designed labeling strategies. ChatGPT is indeed developed on a similar technique to InstructGPT, which shows a strong alignment capacity in producing high-quality, harmless responses, e.g., rejecting to answer insulting questions.

Tools manipulation. In essence, LLMs are trained as text generators over massive plain text corpora, thus performing less well on tasks that are not best expressed in the form of text (e.g., numerical computation). In addition, their capacities are also limited to the pre-training data, e.g., the inability to capture up-to-date information. To tackle these issues, a recently proposed technique is to employ external tools to compensate for the deficiencies of LLMs [80, 81]. For example, LLMs can utilize a calculator for accurate computation and employ search engines to retrieve unknown information. More recently, ChatGPT has enabled the mechanism of using external plugins (existing or newly created apps)12, which are by analogy the “eyes and ears” of LLMs. Such a mechanism can broadly expand the scope of capacities for LLMs.

12. https://openai.com/blog/chatgpt-plugins
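A toy sketch of the tool-manipulation loop just described: a controller scans a model draft for marked tool calls, executes the tool, and splices the result back. The [[tool: args]] syntax, the `calculator` tool, and the overall protocol are our own illustration, not ChatGPT's actual plugin mechanism:

```python
import re

def calculator(expression: str) -> str:
    # Restrict to arithmetic characters before evaluating.
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expression):
        raise ValueError("unsupported expression")
    return str(eval(expression))

TOOLS = {"calculator": calculator}

def run_with_tools(draft: str) -> str:
    """Replace every [[tool: args]] span in the draft with the tool's output."""
    def dispatch(match):
        name, args = match.group(1), match.group(2)
        return TOOLS[name](args.strip())
    return re.sub(r"\[\[(\w+):([^\]]*)\]\]", dispatch, draft)

draft = "The total cost is [[calculator: 137 * 4 + 250]] dollars."
print(run_with_tools(draft))   # -> "The total cost is 798 dollars."
```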
In addition, many other factors (e.g., the upgrade of hardware) also contribute to the success of LLMs. Currently, we limit our discussion to the major technical approaches and key findings for developing LLMs.

2.2 Technical Evolution of GPT-series Models

Due to the excellent capacity in communicating with humans, ChatGPT has ignited the excitement of the AI community since its release. ChatGPT is developed based on the powerful GPT model with specially optimized conversation capacities. Considering the ever-growing interest in ChatGPT and GPT models, we add a special discussion about the technical evolution of the GPT-series models, to briefly summarize how they have been developed in the past years. Meanwhile, we draw a schematic diagram depicting the technological evolution of the GPT-series models in Figure 4. The basic principle underlying GPT models is to compress the world knowledge into the decoder-only Transformer model by language modeling, such that it can recover (or memorize) the semantics of world knowledge and serve as a general-purpose task solver. Two key points to the success are (I) training decoder-only Transformer language models that can accurately predict the next word and (II) scaling up the size of language models. Overall, the research of OpenAI on LLMs can be roughly divided into the following stages13.

Early Explorations. According to one interview with Ilya Sutskever14 (a co-founder and chief scientist of OpenAI), the idea of approaching intelligent systems with language models was already explored in the early days of OpenAI, while it was attempted with recurrent neural networks (RNN). With the advent of the Transformer, OpenAI developed two initial GPT models, namely GPT-1 and GPT-2, which can be considered as the foundation of the subsequent, more powerful models, i.e., GPT-3 and GPT-4.

GPT-1. In 2017, the Transformer model was introduced by Google, and the OpenAI team quickly adapted their language modeling work to this new neural network architecture. They released the first GPT model in 2018, i.e., GPT-1, and coined the abbreviation GPT as the model name, standing for Generative Pre-Training. GPT-1 was developed based on a generative, decoder-only Transformer architecture, and adopted a hybrid approach of unsupervised pre-training and supervised fine-tuning. GPT-1 has set up the core architecture for the GPT-series models and established the underlying principle to model natural language text, i.e., predicting the next word.

GPT-2. Following a similar architecture to GPT-1, GPT-2 increased the parameter scale to 1.5B, and was trained with a large webpage dataset, WebText. As claimed in the paper of GPT-2, it sought to perform tasks via unsupervised language modeling, without explicit fine-tuning using labeled data. To motivate the approach, they introduced a probabilistic form for multi-task solving, i.e., p(output|input, task) (similar approaches have been adopted in other work), which predicts the output conditioned on the input and the task information. To model this conditional probability, language text can be naturally employed as a unified way to format the input, output and task information. In this way, the process of solving a task can be cast as a word prediction problem for generating the solution text. Further, they introduced a more formal claim for this idea: “Since the (task-specific) supervised objective is the same as the unsupervised (language modeling) objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective (for various tasks)”15. A basic understanding of this claim is that each (NLP) task can be considered as the word prediction problem based on a subset of the world text. Thus, unsupervised language modeling could be capable of solving various tasks, if it was trained to have sufficient capacity in recovering the world text. This early discussion in GPT-2’s paper echoed the interview of Ilya Sutskever by Jensen Huang: “What the neural network learns is some representation of the process that produced the text. This text is actually a projection of the world... the more accurate you are in predicting the next word, the higher the fidelity, the more resolution you get in this process...”16.

13. Note that the discussion of this part can be somewhat subjective. The overall viewpoints and summaries are made based on the understanding of the survey authors by reading the papers, blog articles, interview reports and APIs released by OpenAI.
14. https://hackernoon.com/an-interview-with-ilya-sutskever-co-founder-of-openai
15. To better understand this sentence, we put some explanation words in parentheses.
16. https://lifearchitect.ai/ilya/
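The p(output|input, task) idea can be made concrete by serializing every task as plain text, so that solving it reduces to next-word prediction. The templates below are illustrative rather than the exact formats observed in WebText:

```python
def format_task(task: str, text: str) -> str:
    """Serialize (input, task) into one text sequence for a language model."""
    templates = {
        "translate_en_fr": f"translate English to French: {text} =>",
        "summarize": f"{text}\nTL;DR:",
        "answer": f"question: {text}\nanswer:",
    }
    return templates[task]

prompt = format_task("translate_en_fr", "the cat sat on the mat")
print(prompt)
# A language model trained on enough text could continue this prompt with
# the French translation -- no task-specific output head is required.
# completion = generate(prompt)
```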
Capacity Leap. Although GPT-2 is intended to be an “unsupervised multitask learner”, it overall has an inferior performance compared with supervised fine-tuning state-of-the-art methods. Because it has a relatively small model size, it has been widely fine-tuned in downstream tasks, especially dialog tasks [124, 125]. Based on GPT-2, GPT-3 demonstrates a key capacity leap by scaling the (nearly same) generative pre-training architecture.

TABLE 1: Statistics of large language models (having a size larger than 10B in this survey) in recent years, including the capacity evaluation, pre-training data scale (either in the number of tokens or storage size) and hardware resource costs. In this table, we only include LLMs with a public paper about the technical details. Here, “Release Time” indicates the date when the corresponding paper was officially released. “Publicly Available” means that the model checkpoints can be publicly accessible, while “Closed Source” means the opposite. “Adaptation” indicates whether the model has undergone subsequent fine-tuning: IT denotes instruction tuning and RLHF denotes reinforcement learning with human feedback. “Evaluation” indicates whether the model has been evaluated with corresponding abilities in their original paper: ICL denotes in-context learning and CoT denotes chain-of-thought. “*” denotes the largest publicly available version.
Model | Release Time | Size (B) | Base Model | IT | RLHF | Pre-train Data Scale | Latest Data Timestamp | Hardware (GPUs / TPUs) | Training Time | ICL | CoT

Publicly Available:
T5 | Oct-2019 | 11 | - | - | - | 1T tokens | Apr-2019 | 1024 TPU v3 | - | ✓ | -
mT5 | Oct-2020 | 13 | - | - | - | 1T tokens | - | - | - | ✓ | -
PanGu-α | Apr-2021 | 13* | - | - | - | 1.1TB | - | 2048 Ascend 910 | - | ✓ | -
CPM-2 | Jun-2021 | 198 | - | - | - | 2.6TB | - | - | - | - | -
T0 | Oct-2021 | 11 | T5 | ✓ | - | - | - | 512 TPU v3 | 27 h | ✓ | -
CodeGen | Mar-2022 | 16 | - | - | - | 577B tokens | - | - | - | ✓ | -
GPT-NeoX-20B | Apr-2022 | 20 | - | - | - | 825GB | - | 96 40G A100 | - | ✓ | -
Tk-Instruct | Apr-2022 | 11 | T5 | ✓ | - | - | - | 256 TPU v3 | 4 h | ✓ | -
UL2 | May-2022 | 20 | - | - | - | 1T tokens | Apr-2019 | 512 TPU v4 | - | ✓ | ✓
OPT | May-2022 | 175 | - | - | - | 180B tokens | - | 992 80G A100 | - | ✓ | -
NLLB | Jul-2022 | 54.5 | - | - | - | - | - | - | - | ✓ | -
CodeGeeX | Sep-2022 | 13 | - | - | - | 850B tokens | - | 1536 Ascend 910 | 60 d | ✓ | -
GLM | Oct-2022 | 130 | - | - | - | 400B tokens | - | 768 40G A100 | 60 d | ✓ | -
Flan-T5 | Oct-2022 | 11 | T5 | ✓ | - | - | - | - | - | ✓ | ✓
BLOOM | Nov-2022 | 176 | - | - | - | 366B tokens | - | 384 80G A100 | 105 d | ✓ | -
mT0 | Nov-2022 | 13 | mT5 | ✓ | - | - | - | - | - | ✓ | -
Galactica | Nov-2022 | 120 | - | - | - | 106B tokens | - | - | - | ✓ | ✓
BLOOMZ | Nov-2022 | 176 | BLOOM | ✓ | - | - | - | - | - | ✓ | -
OPT-IML | Dec-2022 | 175 | OPT | ✓ | - | - | - | 128 40G A100 | - | ✓ | ✓
LLaMA | Feb-2023 | 65 | - | - | - | 1.4T tokens | - | 2048 80G A100 | 21 d | ✓ | -
Pythia | Apr-2023 | 12 | - | - | - | 300B tokens | - | 256 40G A100 | - | ✓ | -
CodeGen2 | May-2023 | 16 | - | - | - | 400B tokens | - | - | - | ✓ | -
StarCoder | May-2023 | 15.5 | - | - | - | 1T tokens | - | 512 40G A100 | - | ✓ | ✓
LLaMA2 | Jul-2023 | 70 | - | ✓ | ✓ | 2T tokens | - | 2000 80G A100 | - | ✓ | -
Baichuan2 | Sep-2023 | 13 | - | ✓ | ✓ | 2.6T tokens | - | 1024 A800 | - | ✓ | -
QWEN | Sep-2023 | 14 | - | ✓ | ✓ | 3T tokens | - | - | - | ✓ | -
FLM | Sep-2023 | 101 | - | ✓ | - | 311B tokens | - | 192 A800 | 22 d | ✓ | -
Skywork | Oct-2023 | 13 | - | - | - | 3.2T tokens | - | 512 80G A800 | - | ✓ | -

Closed Source:
GPT-3 | May-2020 | 175 | - | - | - | 300B tokens | - | - | - | ✓ | -
GShard | Jun-2020 | 600 | - | - | - | 1T tokens | - | 2048 TPU v3 | 4 d | - | -
Codex | Jul-2021 | 12 | GPT-3 | - | - | 100B tokens | May-2020 | - | - | ✓ | -
ERNIE 3.0 | Jul-2021 | 10 | - | - | - | 375B tokens | - | 384 V100 | - | ✓ | -
Jurassic-1 | Aug-2021 | 178 | - | - | - | 300B tokens | - | 800 GPU | - | ✓ | -
HyperCLOVA | Sep-2021 | 82 | - | - | - | 300B tokens | - | 1024 A100 | 13.4 d | ✓ | -
FLAN | Sep-2021 | 137 | LaMDA-PT | ✓ | - | - | - | 128 TPU v3 | 60 h | ✓ | -
Yuan 1.0 | Oct-2021 | 245 | - | - | - | 180B tokens | - | 2128 GPU | - | ✓ | -
Anthropic | Dec-2021 | 52 | - | - | - | 400B tokens | - | - | - | ✓ | -
WebGPT | Dec-2021 | 175 | GPT-3 | - | ✓ | - | - | - | - | ✓ | -
Gopher | Dec-2021 | 280 | - | - | - | 300B tokens | - | 4096 TPU v3 | 920 h | ✓ | -
ERNIE 3.0 Titan | Dec-2021 | 260 | - | - | - | - | - | - | - | ✓ | -
GLaM | Dec-2021 | 1200 | - | - | - | 280B tokens | - | 1024 TPU v4 | 574 h | ✓ | -
LaMDA | Jan-2022 | 137 | - | - | - | 768B tokens | - | 1024 TPU v3 | 57.7 d | - | -
MT-NLG | Jan-2022 | 530 | - | - | - | 270B tokens | - | 4480 80G A100 | - | ✓ | -
AlphaCode | Feb-2022 | 41 | - | - | - | 967B tokens | Jul-2021 | - | - | - | -
InstructGPT | Mar-2022 | 175 | GPT-3 | ✓ | ✓ | - | - | - | - | ✓ | -
Chinchilla | Mar-2022 | 70 | - | - | - | 1.4T tokens | - | - | - | ✓ | -
PaLM | Apr-2022 | 540 | - | - | - | 780B tokens | - | 6144 TPU v4 | - | ✓ | ✓
AlexaTM | Aug-2022 | 20 | - | - | - | 1.3T tokens | - | 128 A100 | 120 d | ✓ | ✓
Sparrow | Sep-2022 | 70 | - | - | ✓ | - | - | 64 TPU v3 | - | ✓ | -
WeLM | Sep-2022 | 10 | - | - | - | 300B tokens | - | 128 A100 40G | 24 d | ✓ | -
U-PaLM | Oct-2022 | 540 | PaLM | - | - | - | - | 512 TPU v4 | 5 d | ✓ | ✓
Flan-PaLM | Oct-2022 | 540 | PaLM | ✓ | - | - | - | 512 TPU v4 | 37 h | ✓ | ✓
Flan-U-PaLM | Oct-2022 | 540 | U-PaLM | ✓ | - | - | - | - | - | ✓ | ✓
GPT-4 | Mar-2023 | - | - | ✓ | ✓ | - | - | - | - | ✓ | ✓
PanGu-Σ | Mar-2023 | 1085 | PanGu-α | - | - | 329B tokens | - | 512 Ascend 910 | 100 d | ✓ | -
PaLM2 | May-2023 | 16 | - | ✓ | - | 100B tokens | - | - | - | ✓ | ✓
Fig. 3: A timeline of existing large language models (having a size larger than 10B) in recent years. The timeline was established mainly according to the release date (e.g., the submission date to arXiv) of the technical paper for a model. If there was no corresponding paper, we set the date of a model as the earliest time of its public release or announcement. We mark the LLMs with publicly available model checkpoints in yellow color. Due to the space limit of the figure, we only include the LLMs with publicly reported evaluation results.

Fig. 4: A brief illustration of the technical evolution of GPT-series models: GPT-1 (2018.06), GPT-2 (2019.02), GPT-3 (2020.05), Codex (2021.07), GPT-3.5 (2022.03), GPT-4 (2023.03), and GPT-4 Turbo / GPT-4 Turbo with vision (2023.09). We plot this figure mainly based on the papers, blog articles and official APIs from OpenAI. Here, solid lines denote that there exists explicit evidence (e.g., the official statement that a new model is developed based on a base model) on the evolution path between two models, while dashed lines denote a relatively weaker evolution relation.

GPT-3. GPT-3 was released in 2020, which scaled the model parameters to an ever larger size of 175B. In the GPT-3’s paper, it formally introduced the concept of in-context learning (ICL)17, which utilizes LLMs in a few-shot or zero-shot way. ICL can teach (or instruct) LLMs to understand the tasks in the form of natural language text. With ICL, the pre-training and utilization of LLMs converge to the same language modeling paradigm: pre-training predicts the following text sequence conditioned on the context, while ICL predicts the correct task solution, which can also be formatted as a text sequence, given the task description and demonstrations. GPT-3 not only demonstrates very excellent performance in a variety of NLP tasks, but also on a number of specially designed tasks that require the abilities of reasoning or domain adaptation. Although the GPT-3’s paper does not explicitly discuss the emergent abilities of LLMs, we can observe a large performance leap that might transcend the basic scaling law, e.g., larger models have a significantly stronger ICL ability (illustrated in the original Figure 1.2 of the GPT-3’s paper). Overall, GPT-3 can be viewed as a remarkable landmark in the journey evolving from PLMs to LLMs. It has empirically proved that scaling the neural networks to a significant size can lead to a huge increase in model capacity.

Capacity Enhancement. Due to the strong capacities, GPT-3 has been the base model to develop even more capable LLMs for OpenAI.

17. GPT-2 essentially used ICL for unsupervised task learning, though it wasn’t called ICL at that time.
Overall, OpenAI has explored two major approaches to further improving the GPT-3 model, i.e., training on code data and alignment with human preferences, which are detailed as follows.

Training on code data. A major limitation of the original GPT-3 model (pre-trained on plain text) lies in the lack of reasoning ability on complex tasks, e.g., completing code and solving math problems. To enhance this ability, Codex was introduced by OpenAI in July 2021, which was a GPT model fine-tuned on a large corpus of GitHub code. It demonstrated that Codex can solve very difficult programming problems, and also leads to a significant performance improvement in solving math problems. Further, a contrastive approach to training text and code embeddings was reported in January 2022, which was shown to improve a series of related tasks (i.e., linear-probe classification, text search and code search). Actually, the GPT-3.5 models are developed based on a code-based GPT model (i.e., code-davinci-002), which indicates that training on code data is a very useful practice to improve the model capacity of GPT models, especially the reasoning ability. Furthermore, there is also a speculation that training on code data can greatly increase the chain-of-thought prompting abilities of LLMs, while it is still worth further investigation with more thorough verification.

Human alignment. The related research of human alignment can be dated back to the year 2017 (or earlier) for OpenAI: a blog article entitled “learning from human preferences”18 was posted on the OpenAI blog describing a work that applied reinforcement learning (RL) to learn from the preference comparisons annotated by humans (similar to the reward training step in the aligning algorithm of InstructGPT in Figure 12). Shortly after the release of this RL paper, the paper of Proximal Policy Optimization (PPO) was published in July 2017, which has now become the foundational RL algorithm for learning from human preferences. Later, in January 2020, GPT-2 was fine-tuned using the aforementioned RL algorithms [79, 128], which leveraged human preferences to improve the capacities of GPT-2 on NLP tasks. In the same year, another work trained a summarization model for optimizing human preferences in a similar way. Based on these prior works, InstructGPT was proposed in January 2022 to improve the GPT-3 model for human alignment, which formally established a three-stage reinforcement learning from human feedback (RLHF) algorithm. Note that the wording of “instruction tuning” has seldom been used in OpenAI’s papers and documentation; it is substituted by supervised fine-tuning on human demonstrations (i.e., the first step of the RLHF algorithm). In addition to improving the instruction following capacity, the RLHF algorithm is particularly useful to mitigate the issues of generating harmful or toxic content for LLMs, which is key to the safe deployment of LLMs in practice. OpenAI describes their approach to alignment research in a technical article, which has summarized three promising directions: “training AI systems to use human feedback, to assist human evaluation and to do alignment research”.

These enhancement techniques lead to the improved GPT-3 models with stronger capacities, which are called GPT-3.5 models by OpenAI (see the discussion about the OpenAI API in Section 3.1).

18. https://openai.com/research/learning-from-human-preferences
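The three-stage RLHF pipeline (supervised fine-tuning, reward modeling, RL optimization) is described above. As one concrete piece, the snippet below sketches the pairwise comparison loss commonly used in the reward-modeling step; the scores are invented, and this is our illustration rather than OpenAI's code:

```python
import numpy as np

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected): penalizes a reward model
    that scores the human-preferred response below the rejected one."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

print(pairwise_loss(1.2, -0.3))  # small loss: the preference is already satisfied
print(pairwise_loss(-0.5, 0.9))  # large loss: the ranking is violated
```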
The Milestones of Language Models. Based on all the exploration efforts, two major milestones have been achieved by OpenAI, namely ChatGPT and GPT-4, which have largely raised the capacity bar of existing AI systems.

ChatGPT. In November 2022, OpenAI released the conversation model ChatGPT, based on the GPT models (GPT-3.5 and GPT-4). As the official blog article introduced, ChatGPT was trained in a similar way as InstructGPT (called “a sibling model to InstructGPT” in the original post), while specially optimized for dialogue. They reported a difference between the training of ChatGPT and InstructGPT in the data collection setup: human-generated conversations (playing both the roles of user and AI) are combined with the InstructGPT dataset in a dialogue format for training ChatGPT. ChatGPT exhibited superior capacities in communicating with humans: possessing a vast store of knowledge, skill at reasoning on mathematical problems, tracing the context accurately in multi-turn dialogues, and aligning well with human values for safe use. Later on, the plugin mechanism has been supported in ChatGPT, which further extends the capacities of ChatGPT with existing tools or apps. So far, it seems to be the most powerful chatbot in AI history. The launch of ChatGPT has had a significant impact on future AI research, shedding light on the exploration of human-like AI systems.

GPT-4. As another remarkable progress, GPT-4 was released in March 2023, which extended the text input to multimodal signals. Overall, GPT-4 has stronger capacities in solving complex tasks than GPT-3.5, showing a large performance improvement on many evaluation tasks. A recent study investigated the capacities of GPT-4 by conducting qualitative tests with human-generated problems, spanning a diverse range of difficult tasks, and showed that GPT-4 can achieve superior performance to prior GPT models. Furthermore, GPT-4 responds more safely to malicious or provocative queries, due to a six-month iterative alignment (with an additional safety reward signal in the RLHF training). In the technical report, OpenAI has emphasized how to safely develop GPT-4 and applied a number of intervention strategies to mitigate possible issues of LLMs, such as hallucinations, privacy and overreliance. For example, they introduced the mechanism called red teaming to reduce the generation of harmful or toxic content. As another important aspect, GPT-4 has been developed on a well-established deep learning infrastructure with improved optimization methods. They introduced a new mechanism called predictable scaling that can accurately predict the final performance with a small proportion of compute during model training.

GPT-4V, GPT-4 Turbo, and beyond. Based on the work done for GPT-4, OpenAI further released GPT-4V in September 2023, which focused on the safe deployment of the vision capabilities of GPT-4. In the GPT-4V’s system card, it has extensively discussed the assessment and mitigation of risks related to visually augmented inputs. Specially, GPT-4V exhibited strong vision capacities in various application scenarios, showing the great potential as a powerful multimodal learning system. More recently, in November 2023, OpenAI released an upgraded generation of the GPT-4 model at DevDay, named GPT-4 Turbo, with a series of technical improvements.
GPT-4 Turbo is featured by the improved model capacity (more capable than GPT-4), the extended knowledge source (up to April 2023), a long context window (up to 128k tokens), optimized model performance (cheaper price), and other useful functionality updates (function call, reproducible outputs, etc.). At the same time, the Assistants API was launched to ease the rapid development of agent-like assistants. With this API, developers can easily create goal-oriented assistants within their applications, by leveraging specific instructions, extra knowledge and tool use. Furthermore, multimodal capacities (see, hear, and speak) were also enhanced in this new release, supported by GPT-4 Turbo with vision, DALL·E 3, Text-to-speech (TTS), and Listen to voice samples. These improvements have greatly extended the capacity scope and enhanced the task performance of GPT models. More importantly, the application ecosystem will be greatly strengthened with the technology upgrade in improved models, APIs, and functionalities.

Despite the huge progress, there are still limitations with these superior LLMs, e.g., generating hallucinations with factual errors or potentially risky responses within some specific context. More limitations or issues of LLMs will be discussed in Section 7. It poses long-standing research challenges to develop more capable, safer LLMs. From the perspective of engineering, OpenAI has adopted an iterative deployment strategy to develop the models and products by following a five-stage development and deployment life-cycle, which aims to effectively reduce the potential risks of using the models. In the following, we will dive into the technical details in order to have a specific understanding of how they have been developed.

3 RESOURCES OF LLMS

It is by no means an easy job to develop or reproduce LLMs, considering the challenging technical issues and huge demands of computation resources. A feasible way is to learn experiences from existing LLMs and reuse publicly available resources for incremental development or experimental study. In this section, we briefly summarize the publicly available resources for developing LLMs, including model checkpoints (or APIs), corpora and libraries.

3.1 Publicly Available Model Checkpoints or APIs

Given the huge cost of model pre-training, well-trained model checkpoints are critical for the study and development of LLMs. When selecting a model, it is also useful to consider the pre-training data scale of the model, and its performance evaluation in downstream tasks. For more details of LLMs, see Table 1.

LLaMA. The LLaMA series of models has gained immense popularity and widespread attention due to its openness and effectiveness. From LLaMA, LLaMA-2, LLaMA-3 to LLaMA-3.1, continuous updates have been made and the development is still ongoing. With increased parameters (the largest version has 405B), more pre-training tokens (15T tokens), and an extended context window (128K), LLaMA-3.1 has significantly enhanced its capabilities, and it also integrates additional components that work in synergy with the model, including new security and safety tools. In evaluation, LLaMA-3.1 (405B version) achieves competitive performance against prominent closed-source LLMs, such as GPT-4, GPT-4o, and Claude 3.5 Sonnet, on various benchmarks (e.g., MMLU, GSM8k, and HumanEval). The pre-training of LLaMA (65B version) involved 2,048 A100-80G GPUs, whereas LLaMA-3.1 (405B version) involved more than 16,000 H100 GPUs.

Mistral. The Mistral series [137, 138] consists of Mistral (7B), Mistral NeMo (12B), Mistral Large 2 (123B), and Mixtral (8×7B and 8×22B), which have been widely known for their strong performance on various mainstream benchmarks (e.g., MMLU and GSM8k). Mistral NeMo features a long context window of 128K at the parameter scale of 12B. Since Mistral NeMo is trained with quantization awareness, it enables FP8 inference without sacrificing performance. Mistral Large 2 is the largest and most powerful model of the Mistral series, which supports 11 natural languages and more than 80 programming languages. Mixtral is a kind of sparse Mixture-of-Experts (SMoE) model that activates only part of the parameters during inference, making it more efficient compared to dense models of the same size.
Gemma [139, 140] is a series of lightweight, strong, and open models, consisting of Gemma-1 (2B and 3 R ESOURCES OF LLM S 7B) and Gemma-2 (2B, 9B, and 27B). During the pre-training It is by no means an easy job to develop or reproduce LLMs, stage, Gemma-2 2B, 9B, and 27B versions are trained on considering the challenging technical issues and huge de- 2T, 8T, and 13T primarily English tokens, respectively. The mands of computation resources. A feasible way is to learn largest version of Gemma-2 is trained on 6144 TPUv5p experiences from existing LLMs and reuse publicly avail- chips. Gemma-2 has achieved excellent performance in mul- able resources for incremental development or experimental tiple benchmarks (e.g., ARC-c, MMLU, and GSM8k). study. In this section, we briefly summarize the publicly Qwen. Qwen [141, 142] is an open-source large available resources for developing LLMs, including model model series consisting of Qwen (raging from 7B to 72B), checkpoints (or APIs), corpora and libraries. Qwen1.5 (raging from 0.5B to 110B), Qwen2 (ranging from 0.5B to 72B), and Qwen2.5 (ranging from 0.5B to 72B). 3.1 Publicly Available Model Checkpoints or APIs Qwen2.5 is the newest LLM collection of Qwen, which is pre-trained on up to 18T tokens. Compared to Qwen2, Given the huge cost of model pre-training, well-trained Qwen2.5 demonstrates a significant increase