Current best practices for training LLMs from scratch


Appendix

Large pre-trained transformer language models, or simply large language models (LLMs), are a recent breakthrough in machine learning that has vastly extended our capabilities in natural language processing (NLP). Built on transformer architectures, with as many as hundreds of billions of parameters, and trained on hundreds of terabytes of textual data, recent LLMs such as GPT-3 (OpenAI, 2020), GPT-NeoX (EleutherAI, 2022), PaLM (Google Brain, 2022), OPT (Meta AI, 2022), and Macaw (Allen Institute) have demonstrated significant improvements in the ability to perform a wide range of NLP tasks. Here's a brief introduction to the model architecture behind these systems:
"Large language models are computer programs that open new possibilities of text understanding and generation in software systems." (Cohere AI, Large Language Models)

TRANSFORMER MODEL ARCHITECTURE

Modern LLMs are based on the transformer architecture. The main architectural unit is a transformer block, which consists of (at a minimum) multi-headed self-attention, layer normalization, a dense two-layer feedforward network, and residual connections. A transformer stack is a sequence of such blocks. The diagram below shows a typical transformer architecture with an encoder-decoder structure:

The transformer model architecture. Source: Attention Is All You Need
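As a rough sketch of these components (and not the implementation of any particular model), a single transformer block might be written as follows in PyTorch. The dimensions, the GELU activation, and the pre-norm layout are illustrative assumptions rather than details taken from the models discussed here.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Minimal transformer block: multi-headed self-attention and a dense
    two-layer feedforward network, each with layer normalization and a
    residual connection."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x, attn_mask=None):
        # Multi-headed self-attention, followed by a residual connection.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out
        # Position-wise feedforward network, followed by a residual connection.
        return x + self.ff(self.ln2(x))

# A transformer stack is simply a sequence of such blocks.
stack = nn.Sequential(*[TransformerBlock() for _ in range(6)])
```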

Since the advent of transformers, many architectural variants have been proposed. These can vary by architecture (e.g., decoder-only models, encoder-decoder models), by pretraining objectives (e.g., full language modeling, prefix language modeling, masked language modeling), and other factors.

While the original transformer included a separate encoder that processes input text and a decoder that generates target text (encoder-decoder models), the most popular LLMs like GPT-3, OPT, PaLM, and GPT-NeoX are causal decoder-only models trained to autoregressively predict a text sequence.
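To make the autoregressive objective concrete, causal language modeling can be sketched as next-token prediction: the model scores each position using only the tokens to its left (enforced by a causal attention mask), and the training targets are simply the input ids shifted by one position. The helper below is an illustrative sketch under those assumptions, not the training code of any of the models named above.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Next-token prediction loss for a causal decoder-only model.

    logits:    (batch, seq_len, vocab_size), produced under a causal mask so
               that position t only sees tokens 0..t.
    token_ids: (batch, seq_len), the input token ids themselves.
    """
    # Position t predicts token t+1: drop the last logit, shift targets left.
    shift_logits = logits[:, :-1, :]
    shift_targets = token_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
    )
```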

In contrast with this trend, there is some research showing that encoder-decoder models outperform decoder-only LLMs for transfer learning (i.e., where a pre-trained model is fine-tuned on a single downstream task). For detailed architecture types and comparison, see What Language Model Architecture and Pre-training Objective Work Best for Zero-Shot Generalization.

Here are a few of the most popular pre-training architectures:

  • Encoder-decoder models: As originally proposed, the transformer consists of two stacks: an encoder and a decoder. The encoder is fed the sequence of input tokens and outputs a sequence of vectors of the same length as the input. Then, the decoder autoregressively predicts the target sequence, token by token, conditioned on the output of the encoder. Representative models of this type include T5 and BART.

  • Causal decoder-only models: These are decoder-only models trained to autoregressively predict a text sequence. “Causal” means that the model attends only to the left context, i.e., it performs next-step prediction. Representative examples of this type include GPT-3, GPT-J, GPT-NeoX, and OPT.

  • Non-causal decoder-only models: To allow decoder-only models to build richer, non-causal representations of the input text, the attention mask is modified so that the region of the input sequence corresponding to conditioning information has a non-causal mask (i.e., attention is not restricted to past tokens); see the mask sketch after this list. Representative models of this type include UniLM 1-2 and ERNIE-M.

  • Masked language models: These are normally encoder-only models pre-trained with a masked language modeling objective, which predicts masked pieces of text based on the surrounding context. Representative MLM models include BERT and ERNIE.
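To illustrate the difference between the causal and non-causal (prefix) attention described in the list above, the sketch below constructs both boolean masks. The convention that True marks positions a query may not attend to matches PyTorch's attn_mask argument; the helper names themselves are hypothetical.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Causal mask: True marks positions a query may NOT attend to,
    so each token sees only itself and the tokens to its left."""
    return torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

def prefix_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Non-causal (prefix) mask: the first `prefix_len` tokens, i.e. the
    conditioning input, are attended to bidirectionally; the rest of the
    sequence remains causal."""
    mask = causal_mask(seq_len)
    mask[:, :prefix_len] = False  # every position may see the whole prefix
    return mask

if __name__ == "__main__":
    # 2 conditioning tokens followed by 3 causally generated tokens.
    print(prefix_mask(seq_len=5, prefix_len=2).int())
```

Passing such a mask into a stack of blocks (for example, as the attn_mask argument of the block sketched earlier) is what distinguishes a causal language model from a prefix (non-causal) one; the block itself is unchanged.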

Community-driven open sourcing of GPT et al. Source: State of AI Report 2022