Current best practices for training LLMs from scratch

PRE-TRAINING STEPS

Training a multi-billion-parameter LLM is usually a highly experimental process with lots of trial and error. Normally, a team starts with a much smaller model, makes sure it’s promising, and then scales up to more and more parameters. Keep in mind that as you scale, issues will arise that simply aren’t present when training at smaller scales.
Let’s look at some common pre-training steps, starting with architecture.

Model Architecture

To reduce the risk of training instabilities, practitioners often start with the model architecture and hyperparameters of a popular predecessor such as GPT-2 or GPT-3, making informed adjustments along the way to improve training efficiency, scale the model in both depth and width, and enhance performance. Two examples follow. GPT-NeoX-20B (20B, EleutherAI) started from GPT-3’s architecture and made these changes:
  • Rotary embedding used for the first 25% of embedding vector dimensions instead of learned positional embeddings, to balance performance and computational efficiency.
  • Parallel attention and feed-forward layers instead of running them in series, primarily for computational efficiency (a minimal sketch contrasting the two layouts appears after this list).
  • While GPT-3 uses alternating dense and sparse layers, GPT-NeoX exclusively uses dense layers to reduce implementation complexity.
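To illustrate the parallel layout mentioned above, here is a minimal PyTorch sketch contrasting it with the serial, GPT-3-style arrangement. This is not GPT-NeoX’s actual implementation; the dimensions and module choices are placeholders.

```python
# Illustrative sketch (PyTorch): parallel attention + feed-forward residual block.
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.ln_attn = nn.LayerNorm(d_model)
        self.ln_mlp = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        # Serial (GPT-3 style):   x = x + attn(ln(x));  then  x = x + mlp(ln(x))
        # Parallel (GPT-NeoX style): both branches read the same input and are
        # summed, which allows their projections to be computed together.
        h = self.ln_attn(x)
        attn_out, _ = self.attn(h, h, h)
        return x + attn_out + self.mlp(self.ln_mlp(x))

x = torch.randn(2, 16, 512)        # (batch, sequence, hidden) placeholder input
print(ParallelBlock()(x).shape)    # torch.Size([2, 16, 512])
```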
OPT-175B (175B, Meta AI) also built on GPT-3 and adjusted:
    • Batch size for increased computational efficiency.
    • Learning rate schedule: Specifically, it follows a linear learning rate (LR) schedule, warming up from 0 to the maximum learning rate over the first 2000 steps in OPT-175B, or over 375M tokens in the smaller baselines, then decaying down to 10% of the maximum LR over 300B tokens (see the sketch after this list). A number of mid-flight changes to the LR were also required.
    • Token count: OPT-175B, despite having the same model size as GPT-3 (175B), was trained on a much smaller dataset of 180B tokens (compared to the 300B tokens used for GPT-3).
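To make the schedule described above concrete, here is a hedged sketch of a linear warmup followed by linear decay, expressed in optimizer steps. The peak LR, warmup length, and decay horizon here are placeholder values, not OPT’s exact configuration, which maps its token budgets to steps via its batch size.

```python
# Linear warmup from 0 to max_lr, then linear decay down to min_ratio * max_lr.
def linear_warmup_decay_lr(step, max_lr=1.2e-4, warmup_steps=2000,
                           decay_steps=150_000, min_ratio=0.1):
    if step < warmup_steps:
        # Warm up linearly over the first `warmup_steps` updates.
        return max_lr * step / warmup_steps
    # Then decay linearly until `decay_steps`, ending at min_ratio * max_lr.
    progress = min(1.0, (step - warmup_steps) / (decay_steps - warmup_steps))
    return max_lr * (1.0 - (1.0 - min_ratio) * progress)

for s in (0, 1000, 2000, 75_000, 150_000):
    print(s, f"{linear_warmup_decay_lr(s):.2e}")
```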

Experiments and Hyperparameter Search

As mentioned above, typical pre-training involves lots of experiments to find the setup that yields the best model performance. Experiments can involve any or all of the following: weight initialization, positional embeddings, optimizer, activation function, learning rate, weight decay, loss function, sequence length, number of layers, number of attention heads, number of parameters, dense vs. sparse layers, batch size, and dropout. A combination of manual trial and error over these choices and automatic hyperparameter optimization (HPO) is typically used to find a good configuration. Typical hyperparameters to search automatically include learning rate, batch size, and dropout. Hyperparameter search is expensive, often too costly to perform at full scale for multi-billion-parameter models, so it is common to choose hyperparameters from a mixture of experiments at smaller scales and interpolation from previously published work rather than searching from scratch (a minimal sketch of such a small-scale search appears after the list below). In addition, some hyperparameters need to be adjusted even during training to balance learning efficiency and convergence. Some examples:
  • Learning rate: can increase linearly during the early stages, then decay towards the end.
  • Batch size: it’s not uncommon to start with smaller batch sizes and gradually ramp up to larger ones.
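Here is the small-scale random search sketch referenced above. `train_and_eval` is a hypothetical stand-in for a short training run of a scaled-down model that returns a validation loss; in practice, each trial would be logged to an experiment tracker and only the best settings carried forward to the full-scale run.

```python
import random

# Placeholder search space; real spaces come from prior work and small-scale runs.
search_space = {
    "learning_rate": [1e-4, 3e-4, 6e-4, 1e-3],
    "batch_size":    [256, 512, 1024],
    "dropout":       [0.0, 0.1],
}

def train_and_eval(config):
    # Stand-in: replace with a short training run of a scaled-down model
    # that returns its validation loss.
    return random.random()

best_config, best_loss = None, float("inf")
for trial in range(8):  # a handful of cheap trials at small scale
    config = {k: random.choice(v) for k, v in search_space.items()}
    loss = train_and_eval(config)
    if loss < best_loss:
        best_config, best_loss = config, loss

print("best config:", best_config, "val loss:", best_loss)
```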
You’ll want to do a lot of this experimentation early in your pre-training process, largely because you’ll be working with smaller amounts of data, which makes experiments cheap; later on they become far more costly. Before we continue, it’s worth acknowledging a reality: you will likely run into issues when training LLMs. These are big projects and, like anything sufficiently large and complicated, things can go wrong.

Hardware Failure

During the course of training, a significant number of hardware failures can occur in your compute clusters, which will require manual or automatic restarts. In manual restarts, a training run is paused, and a series of diagnostics tests are conducted to detect problematic nodes. Flagged nodes should then be cordoned off before you resume training from the last saved checkpoint.
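A minimal sketch of such an automatic-restart wrapper is shown below. The helper functions are hypothetical stubs standing in for whatever your cluster scheduler, diagnostics, and checkpointing system actually provide; they are not a real API.

```python
import time

def latest_checkpoint():            # stub: path of the most recent checkpoint
    return "/checkpoints/latest"

def launch_training(resume_from):   # stub: launch the run, return its exit code
    print(f"resuming from {resume_from}")
    return 0

def diagnose_nodes():               # stub: nodes flagged by diagnostics tests
    return []

def cordon_node(node):              # stub: remove a flagged node from the pool
    print(f"cordoning {node}")

def resilient_training(max_restarts=10):
    for attempt in range(max_restarts + 1):
        exit_code = launch_training(resume_from=latest_checkpoint())
        if exit_code == 0:
            return                  # training completed normally
        # Non-zero exit: assume a node failure, cordon flagged nodes, then
        # resume from the last saved checkpoint on the next loop iteration.
        for node in diagnose_nodes():
            cordon_node(node)
        time.sleep(60)              # give the scheduler time to reschedule
    raise RuntimeError("exceeded maximum restart attempts")

resilient_training()
```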

Training Instability

Training stability is also a fundamental challenge. While training the model, you may notice that hyperparameters such as learning rate and weight initialization directly affect model stability. For example, when loss diverges, lowering the learning rate and restarting from an earlier checkpoint might allow the job to recover and continue training. Additionally, the bigger the model is, the more difficult it is to avoid loss spikes during training. These spikes are likely to occur at highly irregular intervals, sometimes late into training. There hasn’t been a lot of systematic analysis of principled strategies to mitigate spikes. Here are some best practices we have seen from the industry to effectively get models to converge:
  • Batch size: In general, using the biggest batch size that fits in your GPU memory is the best policy here.
  • Batch Normalization: Normalizing the activations within a mini-batch can speed up convergence and improve model performance (Transformers typically use layer normalization instead, but the principle is the same).
  • Learning Rate Scheduling: A high learning rate can cause the loss to oscillate or diverge, leading to loss spikes. By scheduling the learning rate to decrease over time, you can gradually reduce the magnitude of updates to the model’s parameters and improve stability. Common schedules include step decay, where the learning rate is decreased by a fixed factor after a fixed number of steps, and exponential decay, where the learning rate is decreased by a fixed factor each step (minimal implementations of both appear after this list). Note that it is not really possible to know ahead of time what learning rate (LR) to use, but you can try different LR schedules and see how your model responds.
  • Weight Initialization: Properly initializing the weights can help the model converge faster and improve performance. For example, it is common to use small Gaussian noise or, in the case of Transformers, the T-Fixup initialization. Techniques include random initialization, layer-wise initialization, and initialization from pretrained weights.
  • Model training starting point: Using a pretrained model that is trained on related tasks as a starting point can help the model converge faster and improve performance.
  • Regularization: Regularization techniques, such as dropout, weight decay, and L1/L2 regularization, can help the model converge better by reducing overfitting and improving generalization.
  • Data Augmentation: Augmenting the training data by applying transformations can help the model generalize better and reduce overfitting.
  • Hot-swapping during training: Hot-swapping of optimizers or activation functions is sometimes used during LLM training to fix issues as they appear. This can require a team working on the run almost 24/7, trying various heuristics to push training further.
  • Other simple strategies for mitigating instability when it is encountered: restart training from a previous checkpoint, and skip the data batches that were seen during the spike (the intuition is that spikes occur due to the combination of specific data batches with a particular model parameter state).
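As referenced in the learning rate scheduling bullet above, here are minimal, self-contained implementations of step decay and exponential decay. The constants are arbitrary examples, not recommended values.

```python
def step_decay(step, base_lr=3e-4, drop_factor=0.5, steps_per_drop=10_000):
    # Halve the learning rate every `steps_per_drop` optimizer steps.
    return base_lr * (drop_factor ** (step // steps_per_drop))

def exponential_decay(step, base_lr=3e-4, decay_rate=0.99995):
    # Multiply the learning rate by a fixed factor at every step.
    return base_lr * (decay_rate ** step)

for s in (0, 10_000, 20_000, 50_000):
    print(s, f"{step_decay(s):.2e}", f"{exponential_decay(s):.2e}")
```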
Note: Most of the above model convergence best practices apply not only to transformer training but also more broadly across deep learning architectures and use cases.

Finally, after your LLM training is completed, it is very important to ensure that your model training environment is saved and retained in that final state. That way, if you need to redo or replicate anything in the future, you can, because you have the training state preserved.

A team could also try some ablation studies. These allow you to see how pulling parts of the model out impacts performance, and can let you massively reduce the size of your model while still retaining most of its predictive power.
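As a concrete illustration of preserving the final training state mentioned above, here is a minimal sketch in PyTorch. The model, optimizer, and hyperparameters are tiny placeholders; in practice you would also archive the exact code, data, and environment/library versions alongside this file.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)                                   # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # placeholder optimizer

state = {
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
    "rng_state": torch.get_rng_state(),          # needed for exact reproducibility
    "config": {"lr": 3e-4, "batch_size": 512},   # hyperparameters actually used
    "step": 100_000,                             # training progress at save time
}
torch.save(state, "final_training_state.pt")

# Later: restore everything before replicating a run or starting ablations.
restored = torch.load("final_training_state.pt")
model.load_state_dict(restored["model_state"])
optimizer.load_state_dict(restored["optimizer_state"])
torch.set_rng_state(restored["rng_state"])
```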