Current best practices for training LLMs from scratch

MODEL EVALUATION

Typically, pre-trained models are evaluated on diverse language model datasets to assess their ability to perform logical reasoning, translation, natural language inference, question answering, and more. Machine learning practitioners have coalesced around a variety of standard evaluation benchmarks. A few popular examples include:
  • Open-Domain Question Answering tasks: TriviaQA, Natural Questions, Web Questions
  • Natural Language Inference (NLI): SNLI, QNLI
  • Reasoning tasks: Arithmetic reasoning tasks
  • Code tasks: HumanEval, MBPP (text-to-code); TransCoder (code-to-code)
  • Translation tasks: Translation BLEU score on WMT language pairs
  • BIG-bench: A collaborative benchmark aimed at producing challenging tasks for large language models, with 200+ tasks spanning diverse textual and programmatic problems.
  • LM Evaluation Harness: A library from EleutherAI for standardized evaluation of autoregressive LLMs across 200+ tasks. It has gained popularity for its systematic, robust evaluation framework (see the sketch below).
  • Cloze and Completion tasks: LAMBADA, HellaSwag, StoryCloze
  • Winograd-style tasks: Winograd, WinoGrande
  • Common Sense Reasoning: PIQA, ARC, OpenBookQA
  • In-context Reading Comprehension: DROP, CoQA, QuAC, SQuADv2, RACE, SuperGLUE
Datasets and task clusters in NLP, from "Finetuned Language Models are Zero-Shot Learners"
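
As a concrete illustration, a pre-trained checkpoint can be scored on several of the benchmarks above with EleutherAI's LM Evaluation Harness. The sketch below is minimal and assumes a recent release of the harness; the `hf` backend, the `lm_eval.simple_evaluate` call, the checkpoint name, and the task list are assumptions to verify against the installed version.

```python
# Minimal sketch: scoring a pre-trained checkpoint with EleutherAI's
# lm-evaluation-harness (pip install lm-eval). Details may vary by version.
import json
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face model backend
    model_args="pretrained=EleutherAI/pythia-1.4b",  # illustrative checkpoint
    tasks=["lambada_openai", "hellaswag", "piqa", "arc_easy"],
    num_fewshot=0,                                   # zero-shot evaluation
    batch_size=8,
)

# Per-task metrics (accuracy, perplexity, etc.) are collected under "results".
print(json.dumps(results["results"], indent=2))
```

Re-running the same task list with a non-zero `num_fewshot` yields the corresponding n-shot numbers discussed next.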

Another dimension of evaluation is n-shot learning. This is a task-agnostic dimension that refers to the number of supervised samples (demonstrations) provided to the model right before it is asked to perform a given task. The shots are typically supplied via prompting. These evaluations are often categorized into the following three groups:

  • Zero-shot: Evaluation on tasks without providing any supervised samples to the model at inference time.
  • One-shot: A special case of few-shot with n = 1, where a single supervised sample is provided to the model at inference time.
  • Few-shot: Evaluation where a few supervised samples are provided to the model at inference time (e.g., 5 samples provided → 5-shot).

Example of Few-shot Learning

Task: Sentiment Analysis
Prompt:
Tweet: “I hate it when my phone battery dies.”
Sentiment: Negative
Tweet: “My day has been amazing!”
Sentiment: Positive
Tweet: “This is the link to the article.”
Sentiment: Neutral
Tweet: “This new music video was incredible!”
Sentiment:
Answer: ______
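
Programmatically, n-shot evaluation usually amounts to prepending n labeled demonstrations to the query before sampling a completion. Below is a minimal sketch that rebuilds the prompt above; the template string and demonstration pool are illustrative assumptions rather than a fixed standard.

```python
# Minimal sketch of assembling an n-shot sentiment prompt.
# The demonstration pool and template are illustrative assumptions.
DEMONSTRATIONS = [
    ("I hate it when my phone battery dies.", "Negative"),
    ("My day has been amazing!", "Positive"),
    ("This is the link to the article.", "Neutral"),
]

def build_prompt(query: str, n_shots: int) -> str:
    """Prepend n labeled demonstrations to the query (n_shots=0 gives a zero-shot prompt)."""
    blocks = [
        f'Tweet: "{text}"\nSentiment: {label}'
        for text, label in DEMONSTRATIONS[:n_shots]
    ]
    blocks.append(f'Tweet: "{query}"\nSentiment:')
    return "\n".join(blocks)

# 3-shot prompt matching the example above; the model is asked to fill in the answer.
print(build_prompt("This new music video was incredible!", n_shots=3))
```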

Evaluation typically combines benchmark metrics on the tasks above with manual evaluation, in which the model is fed prompts and its completions are assessed by humans. Typically, both NLP engineers and subject matter experts (SMEs) are involved in the evaluation process and assess the model's performance from different angles:

NLP engineers are people with a background in NLP, computational linguistics, prompt engineering, etc., who can probe and assess the model’s semantic and syntactic shortcomings and define model failure classes for continuous improvement. A failure class example would be: “the LLM does not handle arithmetic with either integers (1, 2, 3, etc.) or their spelled-out forms (one, two, three).”
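
In practice, a failure class like the arithmetic one above is often turned into a small probe set that the engineer re-runs after every model iteration. The sketch below uses the Hugging Face `text-generation` pipeline; the checkpoint name, prompt phrasing, and expected answers are illustrative assumptions.

```python
# Minimal sketch: probing one failure class (arithmetic with digits vs.
# spelled-out numbers). Checkpoint and prompts are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/pythia-1.4b")

PROBES = [
    ("What is 2 + 3? Answer:", "5"),
    ("What is two plus three? Answer:", "five"),
    ("What is 7 + 6? Answer:", "13"),
    ("What is seven plus six? Answer:", "thirteen"),
]

for prompt, expected in PROBES:
    output = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
    answer = output[len(prompt):].strip()
    hit = expected.lower() in answer.lower()
    # Log per-probe results so regressions in this failure class are easy to track.
    print(f"{prompt!r} -> {answer!r} (expected {expected!r}, {'PASS' if hit else 'FAIL'})")
```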

Subject matter experts (SMEs), in contrast to the NLP engineers, are asked to probe specific classes of LLM output, fix errors where necessary, and “talk aloud” while doing so. The SMEs are required to explain, in a step-by-step fashion, the reasoning and logic behind their correct answer versus the incorrect machine-produced answer.