
A Comprehensive Overview of Large Language Models

Humza Naveed (a), Asad Ullah Khan (a,∗), Shi Qiu (b,∗), Muhammad Saqib (c,d,∗), Saeed Anwar (e,f), Muhammad Usman (e,f), Naveed Akhtar (g,i), Nick Barnes (h), Ajmal Mian (i)

a University of Engineering and Technology (UET), Lahore, Pakistan
b The Chinese University of Hong Kong (CUHK), HKSAR, China
c University of Technology Sydney (UTS), Sydney, Australia
d Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, Australia
e King Fahd University of Petroleum and Minerals (KFUPM), Dhahran, Saudi Arabia
f SDAIA-KFUPM Joint Research Center for Artificial Intelligence (JRCAI), Dhahran, Saudi Arabia
g The University of Melbourne (UoM), Melbourne, Australia
h Australian National University (ANU), Canberra, Australia
i The University of Western Australia (UWA), Perth, Australia

arXiv:2307.06435v9 [cs.CL] 9 Apr 2024

Abstract

Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond. This success of LLMs has led to a large influx of research contributions in this direction. These works encompass diverse topics such as architectural innovations, better training strategies, context length improvements, fine-tuning, multi-modal LLMs, robotics, datasets, benchmarking, efficiency, and more. With the rapid development of techniques and regular breakthroughs in LLM research, it has become considerably challenging to perceive the bigger picture of the advances in this direction. Considering the rapidly emerging plethora of literature on LLMs, it is imperative that the research community is able to benefit from a concise yet comprehensive overview of the recent developments in this field. This article provides an overview of the existing literature on a broad range of LLM-related concepts. Our self-contained comprehensive overview of LLMs discusses relevant background concepts along with covering the advanced topics at the frontier of research in LLMs. This review article is intended to not only provide a systematic survey but also a quick comprehensive reference for researchers and practitioners to draw insights from extensive informative summaries of the existing works to advance LLM research.

Keywords: Large Language Models, LLMs, chatGPT, Augmented LLMs, Multimodal LLMs, LLM training, LLM Benchmarking

1. Introduction

Language plays a fundamental role in facilitating communication and self-expression for humans, and their interaction with machines. The need for generalized models stems from the growing demand for machines to handle complex language tasks, including translation, summarization, information retrieval, conversational interactions, etc. Recently, significant breakthroughs have been witnessed in language models, primarily attributed to transformers, increased computational capabilities, and the availability of large-scale training data. These developments have brought about a revolutionary transformation by enabling the creation of LLMs that can approximate human-level performance on various tasks [2, 3].

∗ Equal contribution. Email addresses: [email protected] (Humza Naveed), [email protected] (Asad Ullah Khan), [email protected] (Shi Qiu), [email protected] (Muhammad Saqib), [email protected] (Saeed Anwar), [email protected] (Muhammad Usman), [email protected] (Naveed Akhtar), [email protected] (Nick Barnes), [email protected] (Ajmal Mian)

Figure 1: The trend of papers released over the years containing the keywords "Large Language Model", "Large Language Model + Fine-Tuning", and "Large Language Model + Alignment".
Preprint submitted to Elsevier, April 11, 2024

Figure 2: Chronological display of LLM releases (2019–2024): blue cards represent 'pre-trained' models, while orange cards correspond to 'instruction-tuned' models. Models on the upper half signify open-source availability, whereas those on the bottom half are closed-source. The chart illustrates the increasing trend towards instruction-tuned models and open-source models, highlighting the evolving landscape and trends in natural language processing research.

Large Language Models (LLMs) have emerged as cutting-edge artificial intelligence systems that can process and generate text with coherent communication, and generalize to multiple tasks [5, 6].

The historical progress in natural language processing (NLP) evolved from statistical to neural language modeling and then from pre-trained language models (PLMs) to LLMs. While conventional language modeling (LM) trains task-specific models in supervised settings, PLMs are trained in a self-supervised setting on a large corpus of text [7, 8, 9] with the aim of learning a generic representation that is shareable among various NLP tasks. After fine-tuning for downstream tasks, PLMs surpass the performance gains of traditional language modeling (LM). The larger PLMs bring more performance gains, which has led to the transitioning of PLMs to LLMs by significantly increasing model parameters (tens to hundreds of billions) and training dataset (many GBs and TBs) [10, 11]. Following this development, numerous LLMs have been proposed in the literature [10, 11, 12, 6, 13, 14, 15].
An increasing trend in the number of released LLMs and names of a few significant LLMs proposed over the years are shown in Fig 1 and Fig 2, respectively.

The early work on LLMs, such as T5 and mT5, employed transfer learning until GPT-3 showed that LLMs are zero-shot transferable to downstream tasks without fine-tuning. LLMs accurately respond to task queries when prompted with task descriptions and examples. However, pre-trained LLMs fail to follow user intent and perform worse in zero-shot settings than in few-shot. Fine-tuning them with task instruction data [16, 17, 18, 19] and aligning with human preferences [20, 21] enhances generalization to unseen tasks, improving zero-shot performance significantly and reducing misaligned behavior.

In addition to better generalization and domain adaptation, LLMs appear to have emergent abilities, such as reasoning, planning, decision-making, in-context learning, answering in zero-shot settings, etc. These abilities are known to be acquired by them due to their gigantic scale even when the pre-trained LLMs are not trained specifically to possess these attributes [22, 23, 24]. Such abilities have led LLMs to be widely adopted in diverse settings, including multi-modal, robotics, tool manipulation, question answering, autonomous agents, etc. Various improvements have also been suggested in these areas, either by task-specific training [25, 26, 27, 28, 29, 30, 31] or better prompting.

The LLMs' ability to solve diverse tasks with human-level performance comes at a cost of slow training and inference, extensive hardware requirements, and higher running costs. Such requirements have limited their adoption and opened up opportunities to devise better architectures [15, 33, 34, 35] and training strategies [36, 37, 21, 38, 39, 40, 41]. Parameter efficient tuning [38, 41, 40], pruning [42, 43], quantization [44, 45], knowledge distillation, and context length interpolation [46, 47, 48, 49], among others, are some of the methods widely studied for efficient LLM utilization.

Due to the success of LLMs on a wide variety of tasks, the research literature has recently experienced a large influx of LLM-related contributions. Researchers have organized the LLMs literature in surveys [50, 51, 52, 53], and topic-specific surveys in [54, 55, 56, 57, 58]. In contrast to these surveys, our contribution focuses on providing a comprehensive yet concise overview of the general direction of LLM research. This article summarizes architectural and training details of pre-trained LLMs and delves deeper into the details of concepts like fine-tuning, multi-modal LLMs, augmented LLMs, datasets, evaluation, applications, challenges, and others to provide a self-contained comprehensive overview. Our key contributions are summarized as follows.

• We present a survey on the developments in LLM research, providing a concise comprehensive overview of the direction.
• We present extensive summaries of pre-trained models that include fine-grained details of architecture and training.
• We summarize major findings of the popular contributions and provide a detailed discussion on the key design and development aspects of LLMs to help practitioners effectively leverage this technology.
• In this self-contained article, we cover a range of concepts to present the general direction of LLMs comprehensively, including background, pre-training, fine-tuning, multi-modal LLMs, augmented LLMs, LLMs-powered agents, datasets, evaluation, etc.

Figure 3: A broader overview of LLMs, dividing LLMs into seven branches: 1. Pre-Training 2. Fine-Tuning 3. Efficient 4. Inference 5. Evaluation 6. Applications 7. Challenges

We loosely follow the existing terminology to ensure a standardized outlook of this research direction. For instance, following the literature, our survey discusses pre-trained LLMs with 10B parameters or more. We refer readers interested in smaller pre-trained models to [51, 52, 53].

The organization of this paper is as follows. Section 2 discusses the background of LLMs. Section 3 focuses on the LLMs overview, architectures, training pipelines and strategies, fine-tuning, and utilization in different domains. Section 4 highlights the configuration and parameters that play a crucial role in the functioning of these models. Summary and discussions are presented in section 3.8. The LLM training and evaluation, datasets, and benchmarks are discussed in section 5, followed by challenges and future directions and conclusion in sections 7 and 8, respectively.
2. Background

We provide the relevant background to understand the fundamentals related to LLMs in this section. We briefly discuss the necessary components of LLMs and refer readers interested in further details to the original works.

2.1. Tokenization

Tokenization is an essential pre-processing step in LLM training that parses the text into non-decomposing units called tokens. Tokens can be characters, subwords, symbols, or words, depending on the tokenization process. Some of the commonly used tokenization schemes in LLMs include wordpiece, byte pair encoding (BPE), and unigramLM. Readers are encouraged to refer to the cited survey for a detailed treatment.

2.2. Encoding Positions

The transformer processes input sequences in parallel and independently of each other. Moreover, the attention module in the transformer does not capture positional information. As a result, positional encodings were introduced in the transformer, where a positional embedding vector is added to the token embedding. Variants of positional embedding include absolute, relative, or learned positional encodings. Within relative encoding, ALiBi and RoPE are two widely used positional embeddings in LLMs.

ALiBi: It subtracts a scalar bias from the attention score that increases with the distance between token positions. This favors using recent tokens for attention.

RoPE: It rotates query and key representations at an angle proportional to the token's absolute position in the input sequence, resulting in a relative positional encoding scheme which decays with the distance between the tokens.
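To make the relative-position idea concrete, the sketch below adds an ALiBi-style distance penalty to raw attention scores. The single slope value, the tensor shapes, and the omission of the causal mask are illustrative simplifications rather than the formulation of any particular surveyed model.

```python
# Minimal sketch of an ALiBi-style positional bias (illustrative; real models use
# a fixed set of per-head slopes and apply a separate causal mask).
import torch

def alibi_bias(seq_len: int, slope: float = 0.0625) -> torch.Tensor:
    """Return a (seq_len, seq_len) bias that grows more negative with distance."""
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs()  # |i - j| distance between positions
    return -slope * dist                         # subtracted from the attention scores

def attention_scores_with_alibi(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """q, k: (seq_len, head_dim). Distant tokens receive lower scores."""
    scores = q @ k.T / (q.shape[-1] ** 0.5)
    return scores + alibi_bias(q.shape[0])

q = torch.randn(8, 64)
k = torch.randn(8, 64)
print(attention_scores_with_alibi(q, k).shape)  # torch.Size([8, 8])
```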
2.3. Attention in LLMs

Attention assigns weights to input tokens based on importance so that the model gives more emphasis to relevant tokens. Attention in transformers calculates query, key, and value mappings for input sequences, where the attention score is obtained by multiplying the query and key, and later used to weight values. We discuss different attention strategies used in LLMs below.

Self-Attention: Calculates attention using queries, keys, and values from the same block (encoder or decoder).

Cross Attention: It is used in encoder-decoder architectures, where the decoder provides the queries and the key-value pairs come from the encoder outputs.

Sparse Attention: Self-attention has O(n^2) time complexity, which becomes infeasible for large sequences. To speed up the computation, sparse attention iteratively calculates attention in sliding windows for speed gains.

Flash Attention: Memory access is the major bottleneck in calculating attention using GPUs. To speed up, flash attention employs input tiling to minimize the memory reads and writes between the GPU high bandwidth memory (HBM) and the on-chip SRAM.

2.4. Activation Functions

The activation functions serve a crucial role in the curve-fitting abilities of neural networks. We discuss the activation functions used in LLMs in this section.

ReLU: The Rectified Linear Unit (ReLU) is defined as

ReLU(x) = max(0, x).  (1)

GeLU: The Gaussian Error Linear Unit (GeLU) is the combination of ReLU, dropout, and zoneout.

GLU variants: The Gated Linear Unit (GLU) is a neural network layer defined as the element-wise product (⊗) of a linear transformation and a sigmoid-transformed (σ) linear projection of the input:

GLU(x, W, V, b, c) = (xW + b) ⊗ σ(xV + c),  (2)

where x is the input of the layer and W, b, V, and c are learned parameters. Other GLU variants used in LLMs are:

ReGLU(x, W, V, b, c) = max(0, xW + b) ⊗ (xV + c),
GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c),
SwiGLU(x, W, V, b, c, β) = Swish_β(xW + b) ⊗ (xV + c).
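For concreteness, the sketch below implements the SwiGLU variant from the equations above as a gated feed-forward layer. The hidden size and the use of torch's built-in SiLU (Swish with β = 1) are illustrative assumptions, not the exact configuration of any surveyed model.

```python
# Minimal sketch of a SwiGLU feed-forward block (assumes Swish beta = 1,
# i.e., torch's SiLU); dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, d_model: int = 512, d_hidden: int = 1365):
        super().__init__()
        self.w = nn.Linear(d_model, d_hidden, bias=False)     # gate branch (xW)
        self.v = nn.Linear(d_model, d_hidden, bias=False)     # value branch (xV)
        self.out = nn.Linear(d_hidden, d_model, bias=False)   # projection back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = Swish(xW) ⊗ (xV), followed by an output projection
        return self.out(F.silu(self.w(x)) * self.v(x))

x = torch.randn(2, 16, 512)          # (batch, seq_len, d_model)
print(SwiGLUFeedForward()(x).shape)  # torch.Size([2, 16, 512])
```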
2.5. Layer Normalization

Layer normalization leads to faster convergence and is an integrated component of transformers. In addition to LayerNorm and RMSNorm, LLMs use pre-layer normalization, applying it before multi-head attention (MHA). Pre-norm is shown to provide training stability in LLMs. Another normalization variant, DeepNorm, fixes the issue with larger gradients in pre-norm.

2.6. Distributed LLM Training

This section briefly describes distributed LLM training approaches. More details are available in [13, 37, 80, 81].

Data Parallelism: Data parallelism replicates the model on multiple devices, where data in a batch gets divided across devices. At the end of each training iteration, weights are synchronized across all devices.

Tensor Parallelism: Tensor parallelism shards a tensor computation across devices. It is also known as horizontal parallelism or intra-layer model parallelism.

Pipeline Parallelism: Pipeline parallelism shards model layers across different devices. This is also known as vertical parallelism.

Model Parallelism: A combination of tensor and pipeline parallelism is known as model parallelism.

3D Parallelism: A combination of data, tensor, and model parallelism is known as 3D parallelism.

Optimizer Parallelism: Optimizer parallelism, also known as zero redundancy optimizer (ZeRO), implements optimizer state partitioning, gradient partitioning, and parameter partitioning across devices to reduce memory consumption while keeping the communication costs as low as possible.

2.7. Libraries

Some commonly used libraries for LLM training are:

Transformers: The library provides access to various pre-trained transformer models with APIs to train, fine-tune, infer, and develop custom models.

DeepSpeed: A library for scalable distributed training and inference of deep learning models.

Megatron-LM: It provides GPU-optimized techniques for large-scale training of LLMs.

JAX: A Python library for high-performance numerical computing and scalable machine learning. It can differentiate native Python and NumPy functions and execute them on GPUs.

Colossal-AI: A collection of components to write distributed deep learning models.

BMTrain: A library to write efficient stand-alone LLM training code.

FastMoE: Provides an API to build mixture-of-experts (MoE) models in PyTorch.

MindSpore: A deep learning training and inference framework extendable to mobile, edge, and cloud computing.

PyTorch: A framework developed by the Facebook AI Research lab (FAIR) to build deep learning models. The main features of PyTorch include a dynamic computation graph and a pythonic coding style.

TensorFlow: A deep learning framework written by Google. The key features of TensorFlow are graph-based computation, eager execution, scalability, etc.

MXNet: Apache MXNet is a deep learning framework with support to write programs in multiple languages, including Python, C++, Scala, R, etc. It also provides support for dynamic and static computation graphs.
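To connect the distributed-training ideas in Section 2.6 with one of the libraries above, the snippet below sketches how a ZeRO-style run could be configured with DeepSpeed. The config keys shown are commonly used DeepSpeed options, but the specific values are illustrative assumptions (not tuned settings from any surveyed work), and exact keys can vary across DeepSpeed versions.

```python
# Hedged sketch: data-parallel training with ZeRO stage-2 partitioning via DeepSpeed.
# Values are illustrative; a real run is launched under the deepspeed launcher.
import deepspeed
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},                       # mixed precision to cut memory
    "zero_optimization": {"stage": 2},               # partition optimizer states and gradients
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# deepspeed.initialize wraps the model and optimizer for distributed execution
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```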
2.8. Data Preprocessing

This section briefly summarizes the data preprocessing techniques used in LLM training.

Quality Filtering: For better results, training data quality is essential. Some approaches to filtering data are: 1) classifier-based and 2) heuristics-based. Classifier-based approaches train a classifier on high-quality data and predict the quality of text for filtering, whereas heuristics-based approaches employ rules for filtering, such as language, metrics, statistics, and keywords.

Data Deduplication: Duplicated data can affect model performance and increase data memorization; therefore, data deduplication is one of the preprocessing steps used to train LLMs. This can be performed at multiple levels, like sentences, documents, and datasets.

Privacy Reduction: Most of the training data for LLMs is collected through web sources. This data contains private information; therefore, many LLMs employ heuristics-based methods to filter information such as names, addresses, and phone numbers to avoid learning personal information.

2.9. Architectures

Here we discuss the variants of the transformer architectures used in LLMs. The difference arises due to the application of the attention and the connection of transformer blocks. An illustration of the attention patterns of these architectures is shown in Figure 4.

Figure 4: An example of attention patterns in language models (image taken from the cited source).

Encoder-Decoder: This architecture processes inputs through the encoder and passes the intermediate representation to the decoder to generate the output. Here, the encoder sees the complete sequence utilizing self-attention, whereas the decoder processes the sequence one element after the other using cross-attention.

Causal Decoder: A type of architecture that does not have an encoder and processes and generates output using a decoder, where the predicted token depends only on the previous time steps.

Prefix Decoder: It is also known as a non-causal decoder, where the attention calculation is not strictly dependent on past information and the attention is bidirectional. An example of a non-causal attention mask is shown in Figure 4.

Mixture-of-Experts: It is a variant of the transformer architecture with parallel independent experts and a router to route tokens to experts. These experts are feed-forward layers after the attention block. Mixture-of-Experts (MoE) is an efficient sparse architecture that offers performance comparable to dense models and allows increasing the model size without increasing the computational cost by activating only a few experts at a time [91, 92].

2.10. Pre-Training Objectives

This section describes LLM pre-training objectives. For more details, see the cited paper.

Figure 5: An example of language model training objectives (image from the cited source).

Full Language Modeling: An autoregressive language modeling objective where the model is asked to predict future tokens given the previous tokens; an example is shown in Figure 5.

Prefix Language Modeling: A non-causal training objective, where a prefix is chosen randomly and only the remaining target tokens are used to calculate the loss. An example is shown in Figure 5.

Masked Language Modeling: In this training objective, tokens or spans (a sequence of tokens) are masked randomly and the model is asked to predict the masked tokens given the past and future context. An example is shown in Figure 5.

Unified Language Modeling: Unified language modeling is a combination of causal, non-causal, and masked language training objectives. Here, in masked language modeling, the attention is not bidirectional but unidirectional, attending to either the left-to-right or right-to-left context.
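As a simplified illustration of how the full, prefix, and masked objectives above differ in practice, the sketch below builds the corresponding attention masks and masked targets for a short sequence. The shapes and the prefix length are illustrative assumptions, not the setup of any specific model.

```python
# Simplified sketch of masks for three pre-training objectives.
# True = position may be attended to; this is illustrative, not a full training loop.
import torch

def causal_mask(n: int) -> torch.Tensor:
    """Full (causal) language modeling: each token sees only itself and the past."""
    return torch.tril(torch.ones(n, n, dtype=torch.bool))

def prefix_mask(n: int, prefix_len: int) -> torch.Tensor:
    """Prefix LM: bidirectional attention inside the prefix, causal afterwards."""
    mask = causal_mask(n)
    mask[:prefix_len, :prefix_len] = True
    return mask

def masked_lm_targets(tokens: list, positions: set) -> list:
    """Masked LM: selected tokens are replaced and become prediction targets."""
    return [t if i not in positions else "[MASK]" for i, t in enumerate(tokens)]

print(causal_mask(4))
print(prefix_mask(4, prefix_len=2))
print(masked_lm_targets(["the", "cat", "sat", "down"], {1}))
```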
A model is defined to be an “aligned” model if the text either in a zero-shot or few-shot setting. model fulfills three criteria of helpful, honest, and harmless or Multi-Turn Instructions: Solving a complex task requires mul- “HHH”. tiple interactions with LLMs, where feedback and responses Researchers employ reinforcement learning with human feed- from the other tools are given as input to the LLM for the next back (RLHF) for model alignment. In RLHF, a fine-tuned rounds. This style of using LLMs in the loop is common in model on demonstrations is further trained with reward model- autonomous agents. ing (RM) and reinforcement learning (RL), shown in Figure 6. Below we briefly discuss RM and RL pipelines in RLHF. 3. Large Language Models Reward modeling: trains a model to rank generated responses according to human preferences using a classification objec- This section reviews LLMs, briefly describing their architec- tive. To train the classifier humans annotate LLMs generated tures, training objectives, pipelines, datasets, and fine-tuning responses based on the HHH criteria. details. Reinforcement learning: in combination with the reward model is used for alignment in the next stage. The previously trained 3.1. Pre-Trained LLMs reward model ranks LLM-generated responses into preferred vs. non-preferred, which is used to align the model with proxi- Here, we provide summaries of various well-known pre- mal policy optimization (PPO). This process repeats iteratively trained LLMs with significant discoveries, changing the course until convergence. of research and development in NLP. These LLMs have consid- erably improved the performance in NLU and NLG domains, 2.12.3. Prompting/Utilization and are widely fine-tuned for downstream tasks. Moreover, We Prompting is a method to query trained LLMs for generating also identify key findings and insights of pre-trained LLMs in responses, as illustrated in Figure 6. LLMs can be prompted in Table 1 and 2 that improve their performance. various prompt setups, where they can be adapted to the instruc- tions without fine-tuning and in other cases with fine-tuning on 3.1.1. General Purpose data containing different prompt styles [16, 101, 102]. A good T5 : An encoder-decoder model employing a unified text- guide on prompt engineering is available at. Below, we to-text training for all NLP problems is shown in Figure 7. T5 will discuss various widely used prompt setups. places layer normalization outside the residual path in a conven- Zero-Shot Prompting: LLMs are zero-shot learners and ca- tional transformer model. It uses masked language mod- pable of answering queries never seen before. This style of eling as a pre-training objective where spans (consecutive to- prompting requires LLMs to answer user questions without see- kens) are replaced with a single mask instead of separate masks ing any examples in the prompt. for each token. This type of masking speeds up the training as In-context Learning: Also known as few-shot learning, here, it produces shorter sequences. After pre-training, the model is multiple input-output demonstration pairs are shown to the fine-tuned using adapter layers for downstream tasks. model to generate the desired response. This adaptation style GPT-3 : The GPT-3 architecture is the same as the GPT- is also called few-shot learning. A discussion on formatting in- 2 but with dense and sparse attention in transformer layers context learning (ICL) templates is available in [54, 50, 18, 16]. similar to the Sparse Transformer. 
It shows that large mod- Reasoning in LLMs: LLMs are zero-shot reasoners and can els can train on larger batch sizes with a lower learning rate to be provoked to generate answers to logical problems, task decide the batch size during training, GPT-3 uses the gradient planning, critical thinking, etc. with reasoning. Generating noise scale as in. Overall, GPT-3 increases model param- reasons is possible only by using different prompting styles, eters to 175B showing that the performance of large language 7 plete fine-tuning and prompt fine-tuning as in where only prompt-related parameters are updated by inserting prompts at various positions, front, middle, and back. CPM-2 also pro- poses the INFMOE, a memory-efficient framework with a strat- egy to dynamically offload parameters to the CPU for inference at a 100B scale. It overlaps data movement with inference com- putation for lower inference time. ERNIE 3.0 : ERNIE 3.0 takes inspiration from multi- Figure 7: Unified text-to-text training example, source image from. task learning to build a modular architecture using Transformer- XL as the backbone. The universal representation mod- ule is shared by all the tasks, which serve as the basic block for task-specific representation modules, which are all trained jointly for natural language understanding, natural language generation, and knowledge extraction. This LLM is primar- ily focused on the Chinese language. It claims to train on the largest Chinese text corpora for LLM training, and achieved state-of-the-art in 54 Chinese NLP tasks. Jurassic-1 : A pair of auto-regressive language mod- els, including a 7B-parameter J1-Large model and a 178B- parameter J1-Jumbo model. The training vocabulary of Jurassic-1 comprise word pieces, complete words, and multi- word expressions without any word boundaries, where possible out-of-vocabulary instances are interpreted as Unicode bytes. Compared to the GPT-3 counterparts, the Jurassic-1 models Figure 8: The image is the article of , showing an example of PanGu-α apply a more balanced depth-to-width self-attention architec- architecture. ture and an improved tokenizer for a faster prediction based on broader resources, achieving a comparable perfor- models improves with the scale and is competitive with the fine- mance in zero-shot learning tasks and a superior performance in tuned models. few-shot learning tasks given the ability to feed more examples mT5 : A multilingual T5 model trained on the mC4 as a prompt. dataset with 101 languages. The dataset is extracted from the HyperCLOVA : A Korean language model with GPT-3 public common crawl scrape. The model uses a larger vocab- architecture. ulary size of 250,000 to cover multiple languages. To avoid Yuan 1.0 : Trained on a Chinese corpus with 5TB of over-fitting or under-fitting for a language, mT5 employs a data high-quality text collected from the Internet. A Massive Data sampling procedure to select samples from all languages. The Filtering System (MDFS) built on Spark is developed to pro- paper suggests using a small amount of pre-training datasets, cess the raw data via coarse and fine filtering techniques. To including all languages when fine-tuning for a task using En- speed up the training of Yuan 1.0 to save energy expenses and glish language data. This allows the model to generate correct carbon emissions, various factors that improve the performance non-English outputs. 
of distributed training are incorporated in architecture and train- PanGu-α : An autoregressive model that has a query ing: like increasing the hidden state size improves pipeline and layer at the end of standard transformer layers, example shown tensor parallelism performance, larger micro batches improve in Figure 8, to predict the next token. Its structure is similar to pipeline parallelism performance, and larger global batch size the transformer layer but with an additional embedding for the improve data parallelism performance. In practice, the Yuan 1.0 next position in the attention mechanism, given in Eq. 3. model performs well on text classification, Winograd Schema, natural language inference, and reading comprehension tasks. a = pn Whq Whk T HLT (3) Gopher : The Gopher family of models ranges from 44M to 280B parameters in size to study the effect of scale CPM-2 : Cost-efficient Pre-trained language Models on the LLMs performance. The 280B model beats GPT-3 , (CPM-2) pre-trains bilingual (English and Chinese) 11B and Jurrasic-1 , MT-NLG , and others on 81% of the 198B mixture-of-experts (MoE) models on the WuDaoCor- evaluated tasks. pus dataset. The tokenization process removes “_” white ERNIE 3.0 TITAN : ERNIE 3.0 Titan extends ERNIE 3.0 space tokens in the sentencepiece tokenizer. The models are by training a larger model with 26x the number of parameters trained with knowledge inheritance, starting with only the Chi- of the latter. This bigger model outperformed other state-of-the- nese language in the first stage and then adding English and art models in 68 NLP tasks. LLMs produce text with incorrect Chinese data. This trained model gets duplicated multiple times facts. In order to have control of the generated text with fac- to initialize the 198B MoE model. Moreover, to use the model tual consistency, ERNIE 3.0 Titan adds another task, Credible for downstream tasks, CPM-2 experimented with both com- and Controllable Generations, to its multi-task learning setup. 8 It introduces additional self-supervised adversarial and control- lable language modeling losses to the pre-training step, which enables ERNIE 3.0 Titan to beat other LLMs in their manually selected Factual QA task set evaluations. GPT-NeoX-20B : An auto-regressive model that largely follows GPT-3 with a few deviations in architecture design, trained on the Pile dataset without any data deduplication. GPT- NeoX has parallel attention and feed-forward layers in a trans- former block, given in Eq. 4, that increases throughput by 15%. It uses rotary positional embedding , applying it to only 25% of embedding vector dimension as in. This reduces the computation without performance degradation. As opposed Figure 9: The BLOOM architecture example sourced from. to GPT-3, which uses dense and sparse layers, GPT-NeoX-20B uses only dense layers. The hyperparameter tuning at this scale is difficult; therefore, the model chooses hyperparameters from relationship that model size should be doubled for every dou- the method and interpolates values between 13B and 175B bling of training tokens. Over 400 language models ranging models for the 20B model. The model training is distributed from 70 million to over 16 billion parameters on 5 to 500 bil- among GPUs using both tensor and pipeline parallelism. lion tokens are trained to get the estimates for compute-optimal training under a given budget. 
The authors train a 70B model x + Attn(LN1 (x)) + FF(LN2 (x)) (4) with the same compute budget as Gopher (280B) but with 4 times more data. It outperforms Gopher , GPT-3 , and OPT : It is a clone of GPT-3, developed to open-source others on various downstream tasks, after fine-tuning. a model that replicates GPT-3 performance. Training of OPT AlexaTM : An encoder-decoder model, where encoder employs dynamic loss scaling and restarts from an earlier weights and decoder embeddings are initialized with a pre- checkpoint with a lower learning rate whenever loss divergence trained encoder to speed up training. The encoder stays frozen is observed. Overall, the performance of OPT-175B models is for the initial 100k steps and is later unfrozen for end-to-end comparable to the GPT3-175B model. training. The model is trained on a combination of denoising BLOOM : A causal decoder model trained on the ROOTS and causal language modeling (CLM) objectives, concatenat- corpus to open-source an LLM. The architecture of BLOOM is ing a [CLM] token at the beginning for mode switching. Dur- shown in Figure 9, with differences like ALiBi positional em- ing training, the CLM task is applied for 20% of the time, which bedding, an additional normalization layer after the embedding improves the in-context learning performance. layer as suggested by the bitsandbytes1 library. These changes PaLM : A causal decoder with parallel attention and stabilize training with improved downstream performance. feed-forward layers similar to Eq. 4, speeding up training by GLaM : Generalist Language Model (GLaM) represents a a factor of 15. Additional changes to the conventional trans- family of language models using a sparsely activated decoder- former model include SwiGLU activation, RoPE embeddings, only mixture-of-experts (MoE) structure [121, 90]. To gain multi-query attention that saves computation cost during decod- more model capacity while reducing computation, the experts ing, and shared input-output embeddings. During training, loss are sparsely activated where only the best two experts are used spiking was observed, and to fix it, model training was restarted to process each input token. The largest GLaM model, GLaM from a 100-step earlier checkpoint by skipping 200-500 batches (64B/64E), is about 7× larger than GPT-3 , while only part of around the spike. Moreover, the model was found to memo- the parameters are activated per input token. The largest GLaM rize around 2.4% of the training data at the 540B model scale, (64B/64E) model achieves better overall results as compared whereas this number was lower for smaller models. to GPT-3 while consuming only one-third of GPT-3’s training PaLM-2 : A smaller multi-lingual variant of PaLM, energy. trained for larger iterations on a better quality dataset. PaLM- MT-NLG : A 530B causal decoder based on the GPT- 2 shows significant improvements over PaLM, while reducing 2 architecture that has roughly 3× GPT-3 model parameters. training and inference costs due to its smaller size. To lessen MT-NLG is trained on filtered high-quality data collected from toxicity and memorization, it appends special tokens with a various public datasets and blends various types of datasets in a fraction of pre-training data, which shows a reduction in gener- single batch, which beats GPT-3 on several evaluations. ating harmful responses. 
Chinchilla : A causal decoder trained on the same dataset U-PaLM : This method trains PaLM for 0.1% addi- as the Gopher but with a little different data sampling tional compute with the UL2 (also named as UL2Restore) ob- distribution (sampled from MassiveText). The model architec- jective , using the same dataset it outperforms the baseline ture is similar to the one used for Gopher, with the exception of significantly on various NLP tasks, including zero-shot, few- AdamW optimizer instead of Adam. Chinchilla identifies the shot, commonsense reasoning, CoT, etc. Training with UL2R involves converting a causal decoder PaLM to a non-causal de- coder PaLM and employing 50% sequential denoising, 25% 1 https://github.com/TimDettmers/bitsandbytes regular denoising, and 25% extreme denoising loss functions. 9 UL2 : An encoder-decoder architecture trained using a Codex : This LLM is trained on a subset of public Python mixture of denoisers (MoD) objective. Denoisers include 1) Github repositories to generate code from docstrings. Com- R-Denoiser: a regular span masking, 2) S-Denoiser: which cor- puter programming is an iterative process where the programs rupts consecutive tokens of a large sequence and 3) X-Denoiser: are often debugged and updated before fulfilling the require- which corrupts a large number of tokens randomly. During pre- ments. Similarly to this, Codex generates 100 versions of a training, UL2 includes a denoiser token from R, S , X to rep- program by repetitive sampling for a given description, which resent a denoising setup. It helps improve fine-tuning perfor- produces a working solution for 77.5% of the problems passing mance for downstream tasks that bind the task to one of the up- unit tests. Its powerful version powers Github Copilot2. stream training modes. This MoD style of training outperforms AlphaCode : A set of large language models, ranging the T5 model on many benchmarks. from 300M to 41B parameters, designed for competition-level GLM-130B : GLM-130B is a bilingual (English and Chi- code generation tasks. It uses the multi-query attention to nese) model trained using an auto-regressive mask infilling pre- reduce memory and cache costs. Since competitive program- training objective similar to the GLM. This training style ming problems highly require deep reasoning and an under- makes the model bidirectional as compared to GPT-3, which is standing of complex natural language algorithms, the Alpha- unidirectional. As opposed to GLM, the training of GLM-130B Code models are pre-trained on filtered GitHub code in popular includes a small amount of multi-task instruction pre-training languages and then fine-tuned on a new competitive program- data (5% of the total data) along with self-supervised mask in- ming dataset named CodeContests. The CodeContests dataset filling. To stabilize the training, it applies embedding layer gra- mainly contains problems, solutions, and test cases collected dient shrink. from the Codeforces platform3. The pre-training employs stan- LLaMA [127, 21]: A set of decoder-only language models dard language modeling objectives, while GOLD with varying from 7B to 70B parameters. LLaMA models series is tempering serves as the training objective for the fine- the most famous among the community for parameter efficiency tuning on CodeContests data. To evaluate the performance of and instruction tuning. 
AlphaCode, simulated programming competitions are hosted LLaMA-1 : Implements efficient causal attention on the Codeforces platform: overall, AlphaCode ranks at the by not storing and computing masked attention weights and top 54.3% among over 5000 competitors, where its Codeforces key/query scores. Another optimization is reducing the number rating is within the top 28% of recently participated users. of activations recomputed in the backward pass, as in. CodeT5+ : CodeT5+ is based on CodeT5 , with LLaMA-2 : This work is more focused on fine-tuning a shallow encoder and deep decoder, trained in multiple stages safer and better LLaMA-2-Chat model for dialogue generation. initially unimodal data (code) and later bimodal data (text-code The pre-trained model has 40% more training data with a larger pairs). Each training stage has different training objectives and context length and grouped-query attention. activates different model blocks encoder, decoder, or both ac- PanGu-Σ : An autoregressive model with parameters cording to the task. The unimodal pre-training includes span copied from PanGu-α and extended to a trillion scale with Ran- denoising and CLM objectives, whereas bimodal pre-training dom Routed Experts (RRE), the architectural diagram is shown objectives contain contrastive learning, matching, and CLM for in Figure 10. RRE is similar to the MoE architecture, with text-code pairs. CodeT5+ adds special tokens with the text to distinctions at the second level, where tokens are randomly enable task modes, for example, [CLS ] for contrastive loss, routed to experts in a domain instead of using a learnable gat- [Match] for text-code matching, etc. ing method. The model has bottom layers densely activated StarCoder : A decoder-only model with the SantaCoder and shared across all domains, whereas top layers are sparsely architecture, employing Flash attention to scale up the context activated according to the domain. This training style allows length to 8k. The StarCoder trains an encoder to filter names, extracting task-specific models and reduces catastrophic forget- emails, and other personal data from the training data. Its fine- ting effects in the case of continual learning. tuned variant outperforms PaLM, LLaMA, and LAMDA on HumanEval and MBPP benchmarks. 3.1.2. Coding 3.1.3. Scientific Knowledge CodeGen : CodeGen has similar architecture to Galactica : A large curated corpus of human scientific PaLM , i.e., parallel attention, MLP layers, and RoPE em- knowledge with 48 million papers, textbooks, lecture notes, beddings. The model is trained on both natural language and millions of compounds and proteins, scientific websites, en- programming language data sequentially (trained on the first cyclopedias, and more are trained using the metaseq library3, dataset, then the second and so on) on the following datasets which is built on PyTorch and fairscale. The model wraps 1) PILE, 2) BIGQUERY and 3) BIGPYTHON. CodeGen pro- reasoning datasets with the < work > token to provide step-by- posed a multi-step approach to synthesizing code. The purpose step reasoning context to the model, which has been shown to is to simplify the generation of long sequences where the previ- improve the performance on reasoning tasks. ous prompt and generated code are given as input with the next prompt to generate the next code sequence. CodeGen open- source a Multi-Turn Programming Benchmark (MTPB) to eval- 2 https://github.com/features/copilot uate multi-step program synthesis. 
3 https://codeforces.com/ 10 Figure 11: An example image shows an instance of the Flan training paradigm, taken from. P Figure 10: This example illustrates the PanGu- architecture, as depicted in with minimal compute increment, e.g., 0.2% of the total pre- the image sourced from. training for PaLM 540B. We review various fine-tuned LLMs and strategies for effective fine-tuning in this section. 3.1.4. Dialog LaMDA : A decoder-only model pre-trained on pub- 3.2.1. Instruction-Tuning with Manually Created Datasets lic dialog data, public dialog utterances, and public web doc- Numerous hand-crafted instruction-tuning datasets with uments, where more than 90% of the pre-training data is in different design choices are proposed in the literature to English. LaMDA is trained with the objective of producing re- instruction-tune LLMs. The performance of fine-tuned LLMs sponses that exhibit high levels of quality, safety, and grounded- depends on multiple factors, such as dataset, instruction diver- ness. To achieve this, discriminative and generative fine-tuning sity, prompting templates, model size, and training objectives. techniques are incorporated to enhance the model’s safety and Keeping this in view, diverse fine-tuned models have emerged quality aspects. As a result, the LaMDA models can be utilized in the literature using manually created datasets. as a general language model performing various tasks. The models T0 and mT0 (multi-lingual) employ templates to convert existing datasets into prompt datasets. 3.1.5. Finance They have shown improvements in generalization to zero-shot BloombergGPT : A non-causal decoder model trained and held-out tasks. Tk-Instruct fine-tuned the T5 model using both financial ("FINPILE" from the Bloomberg archive) with in-context instructions to study generalization on unseen and general-purpose datasets. The model’s architecture is sim- tasks when given in-context instructions during test time. The ilar to the BLOOM and OPT. It allocates 50B param- model outperformed Instruct-GPT, despite being smaller in eters to different blocks of the model using the approach. size, i.e., 11B parameters as compared to 175B of GPT-3. For effective training, BloombergGPT packs documents to- Increasing Tasks and Prompt Setups: Zero-shot and few-shot gether with < |endo f text| > to use the maximum sequence performance improves significantly by expanding task collec- length, uses warmup batch size starting from 1024 to 2048, and tion and prompt styles. OPT-IML and Flan curated manually reduces the learning rate multiple times during the larger 2k and 1.8k task datasets, respectively. While increasing training. task size alone is not enough, OPT-IML and Flan add more Xuan Yuan 2.0 : A Chinese financial chat model with prompting setups in their datasets, zero-shot, few-shot, and BLOOM’s architecture trained on a combination of general CoT. In continuation, CoT Collection fine-tunes Flan-T5 purpose, financial, general purpose instructions, and financial further on 1.88M CoT samples. Another method uses institutions datasets. Xuan Yuan 2.0 combined the pre-training symbolic tasks with tasks in T0, Flan, etc. and fine-tuning stages to avoid catastrophic forgetting. 3.2.2. Instruction-Tuning with LLMs Generated Datasets 3.2. Fine-Tuned LLMs Generating an instruction-tuning dataset requires carefully Pre-trained LLMs have excellent generalization abilities to writing instructions and input-output pairs, which are often unseen tasks. 
However, because they are generally trained with written by humans, smaller in size, and less diverse. To the objective of next token prediction, LLMs have limited ca- overcome this, self-instruct proposed an approach to pacity to follow user intent and are prone to generate unethical, prompt available LLMs to generate instruction-tuning datasets. toxic or inaccurate responses. For their effective utiliza- Self-instruct outperformed models trained on manually created tion, LLMs are fine-tuned to follow instructions [16, 17, 97] and dataset SUPER-NATURALINSTRUCTIONS (a dataset with generate safe responses , which also results in increasing 1600+ tasks) by 33%. It starts with a seed of 175 tasks, zero-shot, few-shot, and cross-task generalization [97, 16, 18], 1 instruction, and 1 sample per task and iteratively generates 11 Table 1: Noteworthy findings and insights of pre-trained Large Language Models. Models Findings & Insights Encoder and decoder with shared parameters perform equivalently when parameters are not shared T5 Fine-tuning model layers (adapter layers) work better than the conventional way of training on only classification layers Few-shot performance of LLMs is better than the zero-shot, suggesting that LLMs are meta- GPT-3 learners Large multi-lingual models perform equivalently to single language models on downstream tasks. mT5 However, smaller multi-lingual models perform worse PanGu-α LLMs have good few shot capabilities Prompt fine-tuning requires updating very few parameters while achieving performance compara- ble to full model fine-tuning Prompt fine-tuning takes more time to converge as compared to full model fine-tuning CPM-2 Inserting prompt tokens in-between sentences can allow the model to understand relations between sentences and long sequences In an analysis, CPM-2 finds that prompts work as a provider (additional context) and aggregator (aggregate information with the input text) for the model A modular LLM architecture with a universal representation module and task-specific representa- tion module helps in the finetuning phase ERNIE 3.0 Optimizing the parameters of a task-specific representation network during the fine-tuning phase is an efficient way to take advantage of the powerful pre-trained model The performance of LLM is highly related to the network size To improve runtime performance, more operations can be performed in parallel (width) rather than sequential (depth) Jurassic-1 To efficiently represent and fit more text in the same context length, the model uses a larger vo- cabulary to train a SentencePiece tokenizer without restricting it to word boundaries. This further benefits in few-shot learning tasks By employing prompt-based tuning, the performances of models can be improved, often surpassing HyperCLOVA those of state-of-the-art models when the backward gradients of inputs are accessible The model architecture that excels in pre-training and fine-tuning cases may exhibit contrasting Yuan 1.0 behavior in zero-shot and few-shot learning Gopher Relative encodings enable the model to evaluate for longer sequences than training. 
Additional self-supervised adversarial loss to distinguish between real and generated text improves ERNIE 3.0 Titan the model performance as compared to ERNIE 3.0 Parallel attention + FF layers speed-up training 15% with the same performance as with cascaded layers GPT-NeoX-20B Initializing feed-forward output layers before residuals with scheme in avoids activations from growing with increasing depth and width Training on Pile outperforms GPT-3 on five-shot Table Continued on Next Page 12 Models Findings & Insights Restart training from an earlier checkpoint with a lower learning rate if loss diverges OPT Model is prone to generate repetitive text and stuck in a loop Galactica’s performance has continued to improve across validation set, in-domain, and out-of- domain benchmarks, even with multiple repetitions of the corpus, which is superior to existing research on LLMs Galactica A working memory token approach can achieve strong performance over existing methods on mathematical MMLU and MATH benchmarks. It sets a new state-of-the-art on several downstream tasks such as PubMedQA (77.6%) and MedMCQA dev (52.9%) The model capacity can be maintained at reduced computation by replacing the feed-forward layer in each transformer layer with a mixture-of-experts (MoE) The model trained on filtered data shows consistently better performances on both NLG and NLU tasks, where the effect of filtering is more significant on the former tasks GLaM Filtered pretraining corpora play a crucial role in the generation capability of LLMs, especially for the downstream tasks The scaling of GLaM MoE models can be achieved by increasing the size or number of experts in the MoE layer. Given a fixed budget of computation, more experts contribute to a better perfor- mance LaMDA The model can be fine-tuned to learn to call different external information resources and tools For higher effectiveness and efficiency, a transformer model can be asymmetrically constructed with a shallower encoder and a deeper decoder To achieve better performances, it is necessary to employ strategies such as massively scaling AlphaCode upsampling, followed by the filtering and clustering of samples into a compact set The utilization of novel sampling-efficient transformer architectures designed to facilitate large- scale sampling is crucial Simplifying problem descriptions can effectively improve the model’s performance The model size and the number of training tokens should be scaled proportionately: for each dou- Chinchilla bling of the model size, the number of training tokens should be doubled as well English-centric models produce better translations when translating to English as compared to non- English Generalized models can have equivalent performance for language translation to specialized small PaLM models Larger models have a higher percentage of training data memorization Performance has not yet saturated even at 540B scale, which means larger models are likely to perform better Encoder-decoder architecture is more suitable to train LLMs given bidirectional attention to the context than decoder-only AlexaTM Causal Language Modeling (CLM) task can be added to benefit the model with efficient in-context learning Placing layer norm at the beginning of each transformer layer improves the training stability Table Continued on Next Page 13 Models Findings & Insights Training with a mixture of denoisers outperforms PaLM when trained further for a few more FLOPs U-PaLM Training with a mixture of denoisers improves the infilling 
ability and open-ended text generation diversity Mode switching training enables better performance on downstream tasks UL2 CoT prompting outperforms standard prompting for UL2 Pre-training data with a small proportion of multi-task instruction data improves the overall model GLM-130B performance Multi-step prompting for code synthesis leads to a better user intent understanding and code gen- CodeGen eration A constant performance improvement is observed when scaling the model LLaMA Smaller models can achieve good performances with more training data and computing time Sparse models provide the benefits of large models at a lower computation cost Randomly Routed Experts reduces catastrophic forgetting effects which in turn is essential for PanGu-Σ continual learning Randomly Routed Experts allow extracting a domain-specific sub-model in deployment which is cost-efficient while maintaining a performance similar to the original Pre-training with general-purpose and task-specific data improves task performance without hurt- BloombergGPT ing other model capabilities XuanYuan 2.0 Combining pre-training and fine-tuning stages in single training avoids catastrophic forgetting Causal LM is crucial for a model’s generation capability in encoder-decoder architectures CodeT5+ Multiple training objectives like span corruption, Causal LM, matching, etc complement each other for better performance StarCoder HHH prompt by Anthropic allows the model to follow instructions without fine-tuning Model trained on unfiltered data is more toxic but may perform better on downstream tasks after LLaMA-2 fine-tuning Model trained on unfiltered data requires fewer samples for safety alignment Data quality is important to train better models PaLM-2 Model and data size should be scaled with 1:1 proportions Smaller models trained for larger iterations outperform larger models 14 Table 2: Key insights and findings from the study of instruction-tuned Large Language Models. 
T0: Multi-task prompting enables zero-shot generalization and outperforms baselines. Even a single prompt per dataset task is enough to improve performance.
WebGPT: To aid the model in effectively filtering and utilizing relevant information, human labelers play a crucial role in answering questions regarding the usefulness of the retrieved documents. Coupling a fine-tuned language model with a text-based web-browsing environment can improve end-to-end retrieval and synthesis via imitation learning and reinforcement learning. Generating answers with references makes it easier for labelers to judge the factual accuracy of answers.
Tk-INSTRUCT: Instruction tuning leads to stronger generalization to unseen tasks. More tasks improve generalization, whereas only increasing task instances does not help. Supervised trained models are better than generalized models. Models pre-trained with instructions and examples perform well for different types of inputs.
mT0 and BLOOMZ: Instruction tuning enables zero-shot generalization to tasks never seen before. Multilingual training leads to even better zero-shot generalization for both English and non-English tasks. Training on machine-translated prompts improves performance on held-out tasks with non-English prompts. English-only fine-tuning of a multilingual pre-trained language model is enough to generalize to other pre-trained language tasks.
OPT-IML: Creating a batch with multiple task examples is important for better performance. Example-proportional sampling alone is not enough; training datasets should also be proportional for better generalization/performance. Performance on fully held-out and partially supervised tasks improves by scaling tasks or categories, whereas fully supervised tasks see no effect. Including a small amount, i.e., 5%, of pre-training data during fine-tuning is effective. Only 1% reasoning data improves the performance; adding more deteriorates it. Adding dialogue data makes the performance worse.
Sparrow: Labelers' judgment and well-defined alignment rules help the model generate better responses. Good dialogue goals can be broken down into detailed natural language rules for the agent and the raters. The combination of reinforcement learning (RL) with reranking yields optimal performance in terms of preference win rates and resilience against adversarial probing.
Flan: Fine-tuning with CoT improves performance on held-out tasks. Fine-tuning along with CoT data improves reasoning abilities. CoT tuning improves zero-shot reasoning. Performance improves with more tasks. Instruction fine-tuning improves usability, which is otherwise challenging for pre-trained models. Improving the model's performance with instruction tuning is compute-efficient. Multitask prompting enables zero-shot generalization abilities in LLMs.
WizardCoder: Fine-tuning with instruction-tuning data re-written into a more complex set improves performance.
LLaMA-2-Chat: The model learns to write safe responses with fine-tuning on safe demonstrations, while an additional RLHF step further improves model safety and makes it less prone to jailbreak attacks.
LIMA: A small amount of high-quality data is enough for fine-tuned model generalization.

new instructions (52k) and instances (82k input-output pairs) using GPT-3. Contrary to this, Dynosaur uses the meta-data of datasets on Huggingface to prompt LLMs to generate multiple task instruction-tuning datasets.
LLaMA Tuned: Various models in the literature instruction-tune LLaMA with GPT-3 or GPT-4 generated datasets. Among these, Alpaca, Vicuna, and LLaMA-GPT-4 are a few general-purpose fine-tuned models, where Alpaca is trained on 52k samples from text-davinci-003, Vicuna on 70k samples from ShareGPT.com, and LLaMA-GPT-4 by re-creating Alpaca instructions from GPT-4. Goat fine-tunes LLaMA for arithmetic tasks (1 million samples) by generating data from ChatGPT and outperforms GPT-4, PaLM, BLOOM, OPT, etc., attributing its success to LLaMA's consistent tokenization of numbers. HuaTuo is a medical knowledge model, fine-tuned with a generated QA dataset of 8k instructions.
Complex Instructions: Evol-Instruct [153, 154] prompts LLMs to convert given instructions into a more complex set. The instructions are iteratively evolved by re-writing them in more complex wording and by creating new instructions. With this style of automated instruction generation, WizardLM (LLaMA fine-tuned on 250k instructions) outperforms Vicuna and Alpaca, and WizardCoder (fine-tuned StarCoder) beats Claude-Plus, Bard, and others.
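To make the evolution loop concrete, below is a minimal sketch of Evol-Instruct-style instruction complexification; the prompt templates, the `complete` callable wrapping an arbitrary LLM API, and the number of rounds are illustrative assumptions rather than the exact recipe behind WizardLM or WizardCoder.

```python
# Minimal sketch of Evol-Instruct-style instruction evolution.
# Assumptions: `complete` is any callable that sends a prompt to an LLM and
# returns its text; the rewrite templates below are illustrative only.
import random

EVOLVE_TEMPLATES = [
    "Rewrite the instruction below so it requires an additional constraint:\n{instruction}",
    "Rewrite the instruction below so it requires multi-step reasoning:\n{instruction}",
    "Write a brand-new instruction of similar difficulty inspired by:\n{instruction}",
]

def evolve_instructions(seed_instructions, complete, rounds=3):
    """Iteratively rewrite instructions into a more complex set."""
    pool = list(seed_instructions)
    for _ in range(rounds):
        evolved = []
        for instruction in pool:
            template = random.choice(EVOLVE_TEMPLATES)
            candidate = complete(template.format(instruction=instruction))
            # Simple filter: drop empty or unchanged rewrites.
            if candidate and candidate.strip() != instruction.strip():
                evolved.append(candidate.strip())
        pool.extend(evolved)  # keep originals alongside evolved instructions
    return pool
```

The evolved pool, paired with LLM-generated responses, then serves as instruction-tuning data.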
3.2.3. Aligning with Human Preferences
Incorporating human preferences into LLMs presents a significant advantage in mitigating undesirable behaviors and ensuring accurate outputs. The initial work on alignment, such as InstructGPT, aligns GPT-3 using a 3-step approach: instruction-tuning, reward modeling, and fine-tuning with reinforcement learning (RL). The supervised fine-tuned GPT-3 is queried to generate responses, which human labelers rank according to human values, and a reward model is trained on the ranked data. Lastly, GPT-3 is trained with proximal policy optimization (PPO) using rewards on the generated data from the reward model. LLaMA 2-Chat improves alignment by dividing reward modeling into helpfulness and safety rewards and by using rejection sampling in addition to PPO. The initial four versions of LLaMA 2-Chat are fine-tuned with rejection sampling and then with PPO on top of rejection sampling.
Aligning with Supported Evidence: This style of alignment allows the model to generate responses with proofs and facts, reduces hallucination, and assists humans more effectively, which increases trust in the model's output. Similar to the RLHF training style, a reward model is trained to rank generated responses containing web citations in answers to questions, and it is later used to train the model, as in GopherCite, WebGPT, and Sparrow. The ranking model in Sparrow is divided into two branches, preference reward and rule reward, where human annotators adversarially probe the model to break a rule. These two rewards together rank a response to train with RL.
Aligning Directly with SFT: The PPO in the RLHF pipeline is complex, memory-intensive, and unstable, requiring multiple models: reward, value, policy, and reference models. Avoiding this sophisticated alignment pipeline is possible by incorporating minimal changes in the supervised fine-tuning (SFT) pipeline, as in [158, 159, 160], with better or comparable performance to PPO. Direct preference optimization (DPO) trains a model directly on the human-preferred responses to maximize the likelihood of preferred over unpreferred responses, with a per-sample importance weight. Reward ranked fine-tuning (RAFT) fine-tunes the model on responses ranked by the reward model. Preference ranking optimization (PRO) and RRHF penalize the model to rank responses with human preferences and a supervised loss. On the other hand, chain-of-hindsight (CoH) provides feedback to the model in language rather than as a reward, to learn good versus bad responses.
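As a concrete reference for the DPO objective mentioned above, the sketch below computes the loss from summed per-sequence log-probabilities under the trainable policy and a frozen reference (SFT) model; the function signature and the value of beta are illustrative assumptions, not a particular library's API.

```python
# Minimal sketch of the DPO loss. Assumes the per-token log-probabilities of the
# chosen (preferred) and rejected responses have already been summed per sequence.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit reward of each response: beta * log(pi_theta / pi_ref).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin of preferred over unpreferred responses.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

Minimizing this loss needs only the policy and a frozen reference model, avoiding the separate reward and value models of PPO.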
Aligning with Synthetic Feedback: Aligning LLMs with human feedback is slow and costly. The literature suggests a semi-automated process to align LLMs by prompting LLMs to generate helpful, honest, and ethical responses to the queries, and fine-tuning on the newly created dataset. Constitutional AI replaces human feedback in RLHF with AI, calling it RL from AI feedback (RLAIF). AlpacaFarm designs prompts to imitate human feedback using LLM APIs. Opposite to Constitutional AI, AlpacaFarm injects noise into the feedback to replicate human mistakes. Self-Align prompts the LLM with ICL examples, instructing the LLM about what the response should contain to be considered useful and ethical. The same LLM is later fine-tuned with the new dataset.
Aligning with Prompts: LLMs can be steered with prompts to generate desirable responses without training [165, 166]. Self-correction prompting concatenates instructions and CoT with questions, guiding the model to answer its instruction following a strategy that ensures moral safety before producing the actual answer. This strategy is shown to reduce the harm in generated responses significantly.
Red-Teaming/Jailbreaking/Adversarial Attacks: LLMs exhibit harmful behaviors, hallucinations, leaking of personal information, and other shortcomings under adversarial probing. The models are susceptible to generating harmful responses even though they are aligned for safety [167, 168]. Red-teaming is a common approach to address illicit outputs, where the LLMs are prompted to generate harmful outputs [168, 169]. The dataset collected through red-teaming is used to fine-tune models for safety. While red-teaming largely relies on human annotators, another work red-teams LLMs to find prompts that lead to harmful outputs for other LLMs.

3.2.4. Continue Pre-Training
Although fine-tuning boosts a model's performance, it leads to catastrophic forgetting of previously learned information. Concatenating fine-tuning data with a few randomly selected pre-training samples in every iteration avoids network forgetting [171, 142]. This is also effective in adapting LLMs for cases where the fine-tuning data is small and the original capacity is to be maintained. Prompt-based continued pre-training (PCP) trains the model with text and instructions related to tasks and then finally instruction-tunes the model for downstream tasks.
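A minimal sketch of the replay idea described above, where each fine-tuning batch is padded with a few randomly drawn pre-training samples so the model keeps seeing its original distribution; the batch size and mixing ratio are illustrative assumptions.

```python
# Minimal sketch: mix a small fraction of pre-training samples into every
# fine-tuning batch to mitigate catastrophic forgetting (ratio is illustrative).
import random

def mixed_batches(finetune_data, pretrain_data, batch_size=32, replay_fraction=0.1):
    finetune_data = list(finetune_data)          # avoid mutating the caller's list
    n_replay = max(1, int(batch_size * replay_fraction))
    n_new = batch_size - n_replay
    random.shuffle(finetune_data)
    for start in range(0, len(finetune_data), n_new):
        batch = finetune_data[start:start + n_new]
        batch += random.sample(pretrain_data, k=min(n_replay, len(pretrain_data)))
        random.shuffle(batch)
        yield batch
```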
3.2.5. Sample Efficiency
While fine-tuning data is generally many-fold smaller than the pre-training data, it still has to be large enough for acceptable performance [16, 97, 18] and requires proportional computing resources. Studying the effects on performance with less data, existing literature [173, 174] finds that models trained on less data can outperform models trained with more data. One study finds 25% of the total downstream data to be enough for state-of-the-art performance, and another shows that selecting a coreset-based 0.5% of the total instruction-tuning data improves the model performance by 2% compared to tuning on the complete data. Less is more for alignment (LIMA) uses only 1000 carefully created demonstrations to fine-tune the model and achieves performance comparable to GPT-4.

3.3. Increasing Context Window
LLMs are trained with limited context windows due to expensive attention and high memory requirements. A model trained on limited sequence lengths fails to generalize to unseen lengths at inference time [176, 49]. Alternatively, LLMs with ALiBi positional encodings can perform zero-shot length extrapolation. However, ALiBi has less expressive power and inferior performance on multiple benchmarks, and many LLMs use the RoPE positional embedding, which is unable to perform zero-shot extrapolation. A larger context length has benefits such as a better understanding of longer documents, more samples in in-context learning, and the execution of bigger reasoning processes. Expanding the context length during fine-tuning is slow, inefficient, and computationally expensive. Therefore, researchers employ various context window extrapolation techniques, discussed below.
Position Interpolation: Rather than extrapolating, one line of work shows that interpolating position encodings within the pre-trained context window is more effective. It demonstrates that only 1000 steps of fine-tuning are enough to achieve better results on larger windows without reducing performance compared to the original context size. Giraffe uses power scaling in RoPE, and YaRN proposes NTK-aware interpolation.
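To illustrate position interpolation on RoPE, the sketch below linearly rescales the position indices of a longer input into the pre-trained range before computing the rotary angles; the training length, target length, and base of 10000 are stated here as assumptions for the example.

```python
# Minimal sketch of position interpolation for RoPE: positions of a long input
# are squeezed into the pre-trained context window before the rotary angles are
# computed (train_len and the base of 10000 are illustrative).
import torch

def rope_angles(positions, head_dim, base=10000.0):
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return torch.outer(positions, inv_freq)            # (seq_len, head_dim // 2)

def interpolated_angles(seq_len, train_len, head_dim):
    positions = torch.arange(seq_len).float()
    if seq_len > train_len:
        positions = positions * (train_len / seq_len)  # interpolate, don't extrapolate
    return rope_angles(positions, head_dim)
```

The resulting angles feed the usual sin/cos rotations of queries and keys; a short fine-tuning run at the longer length then adapts the model to the extended window.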
Efficient Attention Mechanism: Dense global attention is one of the major constraints in training larger context window LLMs. Using efficient attention variants, such as local, sparse, and dilated attention, reduces the computation cost significantly. LongT5 proposes transient global attention (TGlobal), applying attention to local and global tokens (windowed token averaging). The model replaces the attention in T5 with TGlobal attention, pre-trains on 4098 sequence length, fine-tunes on larger window sizes, as large as 16k, and improves task performance on longer inputs. This shows the extrapolation ability of TGlobal attention with only fine-tuning. COLT5 uses two branches, one with lightweight and the other with heavyweight attention and feed-forward layers. All tokens are processed by the lightweight branch, and only important tokens are routed to the heavyweight branch. LongNet replaces standard attention with dilated attention, expanding the sequence length to 1 billion tokens. LongLoRA proposes shift-short attention, used during fine-tuning to reduce dense attention costs; during inference, however, the model uses dense attention and achieves performance similar to full-attention fine-tuning.
Extrapolation without Training: LM-Infinite and parallel context windows (PCW) show that length extrapolation is possible using pre-trained LLMs. LM-Infinite suggests Λ-shaped attention applied within the original context window limits. Likewise, PCW chunks larger inputs into the pre-trained context lengths and applies the same positional encodings to each chunk.

3.4. Augmented LLMs
LLMs are capable of learning from the examples concatenated with the input, known as context augmentation, in-context learning (ICL), or few-shot prompting. They show excellent generalization to unseen tasks with few-shot prompting, enabling LLMs to answer queries beyond the capacity acquired during training [6, 55]. These emergent abilities allow for adapting the model without fine-tuning, a costly process. Aside from this, hallucination, producing inaccurate, unsafe, or factually incorrect responses, is common for LLMs and is avoided by augmenting contextual data. While the user can provide in-context samples in the query [54, 32], here we specifically refer to the methods that access external storage programmatically, calling them augmented LLMs.
The literature suggests various external memory designs to augment LLMs: long-term [181, 182, 183, 184], short-term, symbolic, and non-symbolic [187, 188]. The memory can be maintained in different formats such as documents, vectors, or databases. A few systems maintain intermediate memory representations to retain information across multiple iterations [184, 182], while others extract important information from the datasets and save it in memory for recall. The memory read and write operations are performed either with or without the LLM's cooperation [182, 190, 184, 191], also acting as a feedback signal in some designs. We discuss different types of augmented LLMs below.

3.4.1. Retrieval Augmented LLMs
LLMs may have limited memory and outdated information, leading to inaccurate responses. Retrieving relevant information from external up-to-date storage enables the LLMs to answer accurately with references and to utilize more information. With retrieval augmentation, smaller models have been shown to perform at par with larger models. For instance, an 11B model can become competitive with the 540B PaLM, and a 7.5B model with the 280B Gopher. Retrieval augmented language modeling (RALM) has two major components, shown in Figure 12, namely: 1) a retriever and 2) a language model. In RALM, the retriever plays a crucial role in driving the LLM
Retrieved samples are ranked to build ground-truth data to train retrievers with contrastive learning in [196, 198]. RoBERTa is trained for downstream tasks for ICL sample retrieval. REPLUG trains the retriever with supervised signals from the frozen LLM-generated outputs.
Training Retriever and LLM: Further benefits are achieved by training both the retriever and the model in [25, 200, 201]. In this case
