Current best practices for training LLMs from scratch

INSTRUCTION TUNING

At this point, let’s assume we have a pre-trained, general-purpose LLM. If we did our job well, our model can already handle domain-specific tasks without further tuning, in both few-shot and zero-shot settings. That said, zero-shot performance generally lags well behind its few-shot counterpart on many tasks, such as reading comprehension, question answering, and natural language inference. One potential reason is that, without few-shot examples, it’s harder for a model to perform well on prompts whose format differs from that of the pretraining data.

To solve this issue, we can use instruction tuning. Instruction tuning is a state-of-the-art fine-tuning technique that fine-tunes a pre-trained LLM on a collection of tasks phrased as natural-language instructions. It enables pre-trained LLMs to follow instructions better and reduces the need for few-shot examples at the prompting stage (i.e., it drastically improves zero-shot performance).
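As a concrete sketch of what “phrased as instructions” means, the snippet below renders a raw NLI example as an instruction-style prompt/target pair. The field names, template wording, and example values are all illustrative, not from any particular instruction dataset:

```python
def to_instruction(example, template):
    """Render a raw task example as a natural-language instruction prompt."""
    return template.format(**example)

# Hypothetical NLI example and template; field names are illustrative.
example = {
    "premise": "A dog is running in the park.",
    "hypothesis": "An animal is outdoors.",
    "label": "entailment",
}
template = (
    "Premise: {premise}\n"
    "Hypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Answer:"
)

prompt = to_instruction(example, template)  # model input at tuning time
target = example["label"]                   # supervised target
```

In practice, instruction-tuning datasets apply several such templates per task so the model doesn’t overfit to one phrasing.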

Instruction tuning gained huge popularity in 2022, as the technique considerably improves model performance without hurting its ability to generalize. Typically, a pre-trained LLM is tuned on one set of language tasks and then evaluated on its ability to perform a different set of language tasks unseen during tuning, demonstrating its generalizability and zero-shot capability. See the illustration below:
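The tune-on-some-tasks, evaluate-on-held-out-tasks protocol can be sketched as a simple split over task clusters. The cluster and dataset names below are hypothetical placeholders, not a real benchmark layout:

```python
# Hypothetical task clusters; names are illustrative only.
TASK_CLUSTERS = {
    "nli": ["anli", "rte", "cb"],
    "qa": ["boolq", "arc"],
    "translation": ["wmt_en_fr", "wmt_en_de"],
    "summarization": ["xsum", "cnn_dailymail"],
}

def split_clusters(clusters, held_out):
    """Tune on every cluster except `held_out`; evaluate zero-shot on it."""
    tuning = {name: tasks for name, tasks in clusters.items() if name != held_out}
    evaluation = clusters[held_out]
    return tuning, evaluation

tuning_tasks, eval_tasks = split_clusters(TASK_CLUSTERS, held_out="nli")
```

Holding out an entire cluster (rather than individual datasets) is what makes the evaluation a genuine test of zero-shot generalization: the model has never seen any task of that type phrased as an instruction.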

Comparing instruction tuning with pretrain–finetune and prompting (source: Finetuned Language Models are Zero-Shot Learners).
A few things to keep in mind about instruction tuning:

– Instruction tuning updates the full set of model parameters, as opposed to freezing some of them as in parameter-efficient fine-tuning. That means it doesn’t bring the cost benefits that come with parameter-efficient fine-tuning. However, because instruction tuning produces much more generalizable models than parameter-efficient fine-tuning, an instruction-tuned model can still serve as a general-purpose model for multiple downstream tasks. It often comes down to whether you have an instruction dataset available and the training budget to perform instruction tuning.

– Instruction tuning is effective for tasks that are naturally verbalized as instructions (e.g., NLI, QA, translation), but it is a little trickier for tasks like reasoning. To improve performance on these tasks, you’ll want to include chain-of-thought examples during tuning.
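To make the chain-of-thought point concrete, the sketch below formats a tuning example whose target contains the intermediate reasoning, not just the final answer. The question, rationale, and cue phrasing are illustrative assumptions:

```python
def format_cot_example(question, rationale, answer):
    """Build a chain-of-thought tuning pair: the target includes the
    step-by-step rationale before the final answer."""
    prompt = f"Q: {question}\nA: Let's think step by step."
    target = f" {rationale} The answer is {answer}."
    return prompt, target

# Hypothetical arithmetic example for illustration.
prompt, target = format_cot_example(
    "If there are 3 cars and each car has 4 wheels, how many wheels are there?",
    "Each of the 3 cars has 4 wheels, so there are 3 * 4 = 12 wheels.",
    "12",
)
```

Mixing such rationale-bearing targets into the instruction mixture is what preserves (and improves) reasoning performance, since tuning only on direct-answer examples tends to degrade it.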
Instruction tuning both with and without exemplars (i.e., few-shot and zero-shot) and with and without chain-of-thought, enabling generalization across a range of evaluation scenarios (source: Scaling Instruction-Finetuned Language Models).