Training a multi-billion parameter LLM is usually a highly experimental process with lots of trial and error. Normally, the team will start with a much smaller model, make sure it’s promising, and then scale up to more and more parameters. Keep in mind that as you scale, issues will surface that require addressing and that simply won’t be present at smaller scales.

Let’s look at some common pre-training steps, starting with architecture.
You’ll want to do most of this experimentation early in your pre-training process. At that stage you’re working with smaller amounts of data, so you can run far more experiments than you could later on, when each one becomes much more costly.
Before we continue, it’s worth being clear about one reality: you will likely run into issues when training LLMs. After all, these are big projects, and like anything sufficiently large and complicated, things can go wrong.
During training, a significant number of hardware failures can occur in your compute clusters, and these will require manual or automatic restarts. In a manual restart, the training run is paused and a series of diagnostic tests is run to detect problematic nodes; flagged nodes should then be cordoned off before you resume training from the last saved checkpoint.
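As a concrete illustration, here is a minimal, PyTorch-flavored sketch of that checkpoint-and-resume pattern. The checkpoint path, save cadence, and helper names are illustrative assumptions, not a prescribed implementation:

```python
import os
import torch

CKPT_DIR = "checkpoints"                      # illustrative location
CKPT_PATH = os.path.join(CKPT_DIR, "latest.pt")

def save_checkpoint(step, model, optimizer, scheduler):
    # Persist everything needed to resume: weights, optimizer state,
    # LR-schedule position, and the step counter.
    os.makedirs(CKPT_DIR, exist_ok=True)
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
        },
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer, scheduler):
    # After a manual or automatic restart, pick up from the last save.
    if not os.path.exists(CKPT_PATH):
        return 0  # fresh run
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["step"] + 1

# Inside the training loop, checkpoint on a fixed cadence so a node
# failure only costs the work done since the last save, e.g.:
#   if step % checkpoint_interval == 0:
#       save_checkpoint(step, model, optimizer, scheduler)
```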
Training stability is also a fundamental challenge. While training the model, you may notice that hyperparameters such as learning rate and weight initialization directly affect model stability. For example, when loss diverges, lowering the learning rate and restarting from an earlier checkpoint might allow the job to recover and continue training.
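One way that recovery logic might look in code, building on the load_checkpoint helper from the sketch above; the divergence threshold and learning-rate reduction factor are assumptions you would tune for your own run:

```python
DIVERGENCE_FACTOR = 3.0  # loss this far above its recent average counts as divergence
LR_REDUCTION = 0.5       # shrink the learning rate on recovery

def has_diverged(loss, recent_losses):
    # Treat a NaN loss, or a loss far above the recent average, as divergence.
    if loss != loss:  # NaN check
        return True
    if not recent_losses:
        return False
    baseline = sum(recent_losses) / len(recent_losses)
    return loss > DIVERGENCE_FACTOR * baseline

def recover(model, optimizer, scheduler):
    # Roll back to the last good checkpoint (see the sketch above) and
    # lower the learning rate before resuming. If an LR scheduler drives
    # the learning rate, adjust the scheduler rather than the optimizer.
    step = load_checkpoint(model, optimizer, scheduler)
    for group in optimizer.param_groups:
        group["lr"] *= LR_REDUCTION
    return step
```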
Additionally, the bigger the model is, the more difficult it is to avoid loss spikes during training. These spikes are likely to occur at highly irregular intervals, sometimes late into training.
There hasn’t been much systematic analysis of principled strategies for mitigating these spikes, but here are some best practices we have seen across the industry for getting models to converge:
Note: Most of the model convergence best practices above apply not only to transformer training but also more broadly across deep learning architectures and use cases.
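The details differ from team to team, but one widely used stabilization technique (not necessarily drawn from the list above) is gradient clipping, which caps the global gradient norm each step so that a single bad batch can’t produce an enormous update. A minimal PyTorch-style sketch, with an illustrative clipping threshold:

```python
import torch

MAX_GRAD_NORM = 1.0  # illustrative threshold; tune for your setup

def training_step(model, optimizer, batch, loss_fn):
    optimizer.zero_grad()
    loss = loss_fn(model(batch["inputs"]), batch["targets"])
    loss.backward()
    # Clip the global gradient norm before the optimizer step so one
    # outlier batch cannot blow up the parameter update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
    optimizer.step()
    return loss.item()
```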
Finally, after LLM training is complete, it is very important to ensure that your model training environment is saved and retained in that final state. That way, if you ever need to redo or replicate something, you can, because the training state has been preserved.
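A minimal sketch of what preserving that final state might look like, assuming a PyTorch run; the directory layout and the use of pip freeze are illustrative (you might equally snapshot a container image or log artifacts to your experiment tracker):

```python
import json
import os
import subprocess
import sys
import torch

def snapshot_final_state(model, optimizer, config, out_dir="final_run_state"):
    # Persist the final weights, optimizer state, run configuration, and
    # the exact package versions so the run can be replicated later.
    os.makedirs(out_dir, exist_ok=True)

    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
        os.path.join(out_dir, "final_checkpoint.pt"),
    )

    with open(os.path.join(out_dir, "config.json"), "w") as f:
        json.dump(config, f, indent=2)

    freeze = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"], capture_output=True, text=True
    )
    with open(os.path.join(out_dir, "requirements.txt"), "w") as f:
        f.write(freeze.stdout)
```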
A team could also try some ablation studies, which show how removing individual parts of the model affects performance. Ablations can let you substantially reduce the size of your model while still retaining most of its predictive power.
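As a sketch of how such a study might be organized: each variant disables or shrinks one component, every variant is scored the same way, and the scores are compared against the full baseline. The config keys below are hypothetical, and train_and_evaluate stands in for your own model-building, training, and validation code:

```python
def run_ablations(base_config, ablations, train_and_evaluate):
    # Train and score each variant with the same budget and metric
    # (e.g., validation loss, lower is better), then rank the results.
    results = {}
    for name, overrides in ablations.items():
        results[name] = train_and_evaluate({**base_config, **overrides})
    return dict(sorted(results.items(), key=lambda kv: kv[1]))

# Hypothetical variants: each removes or shrinks a single component so
# its contribution to model quality can be isolated.
ablations = {
    "baseline": {},
    "no_rotary_embeddings": {"use_rotary": False},
    "half_depth": {"num_layers": 12},  # vs. 24 layers in the baseline
}
```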