Another evaluation dimension is n-shot learning. This dimension is task-agnostic and refers to the number of supervised examples (demonstrations) provided to the model immediately before it is asked to perform a given task. These demonstrations are typically supplied through prompting. N-shot evaluations are commonly categorized into the following three groups:
Zero-shot learning: no demonstrations are provided; the model must perform the task from the instruction alone.
One-shot learning: a single demonstration is provided before the query.
Few-shot learning: several demonstrations are provided before the query.
For example, the following is a few-shot prompt for sentiment analysis:
Task: Sentiment Analysis
Prompt:
Tweet: “I hate it when my phone battery dies.”
Sentiment: Negative
Tweet: “My day has been amazing!”
Sentiment: Positive
Tweet: “This is the link to the article.”
Sentiment: Neutral
Tweet: “This new music video was incredible!”
Sentiment:
Answer: ______
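To make the n-shot dimension concrete, here is a minimal Python sketch (illustrative, not from the source) that assembles such a prompt from a list of demonstrations; setting n_shots to 0, 1, or a larger value corresponds to zero-, one-, and few-shot evaluation. The commented-out completion call at the end is a placeholder for whatever LLM client you use.

```python
# Illustrative sketch: building an n-shot sentiment-analysis prompt.
# The demonstrations mirror the example above; the model call is left as a
# placeholder since no specific LLM client is assumed here.

FEW_SHOT_EXAMPLES = [
    ("I hate it when my phone battery dies.", "Negative"),
    ("My day has been amazing!", "Positive"),
    ("This is the link to the article.", "Neutral"),
]

def build_prompt(query: str, n_shots: int = 3) -> str:
    """Assemble a prompt with n_shots labeled demonstrations followed by the unlabeled query."""
    lines = ["Task: Sentiment Analysis", ""]
    for tweet, sentiment in FEW_SHOT_EXAMPLES[:n_shots]:
        lines.append(f'Tweet: "{tweet}"')
        lines.append(f"Sentiment: {sentiment}")
    lines.append(f'Tweet: "{query}"')
    lines.append("Sentiment:")  # left blank for the model to complete
    return "\n".join(lines)

if __name__ == "__main__":
    prompt = build_prompt("This new music video was incredible!", n_shots=3)
    print(prompt)
    # completion = your_llm_client.complete(prompt)  # placeholder call
```

With n_shots=0 the same helper produces a zero-shot prompt, which makes it easy to sweep the n-shot dimension during evaluation.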
Evaluation typically combines benchmark metrics on the tasks above with more manual evaluation, in which the model is fed prompts and its completions are assessed by humans. Both NLP engineers and subject matter experts (SMEs) are usually involved in this process, each assessing model performance from a different angle; a sketch of how prompt/completion pairs can be collected for such review follows the list below:
NLP engineers are people with a background in NLP, computational linguistics, prompt engineering, and related fields who can probe and assess the model’s semantic and syntactic shortcomings and define model failure classes for continuous improvement. An example failure class would be: “the LLM does not handle arithmetic with either integers (1, 2, 3, etc.) or their spelled-out forms (one, two, three).”
Subject matter experts (SMEs), in contrast to the NLP engineers, are asked to probe specific classes of LLM output, fix errors where necessary, and “talk aloud” while doing so. The SMEs are asked to explain, step by step, the reasoning and logic behind their correct answer versus the incorrect machine-produced answer.
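One possible way to organize this human review (an illustrative sketch, not prescribed by the text) is to log prompt/completion pairs to a Weights & Biases Table so that NLP engineers and SMEs can inspect them and record verdicts and failure classes. The generate function and the project name below are placeholders.

```python
# Illustrative sketch: collecting completions for manual assessment in a W&B Table.
import wandb

def generate(prompt: str) -> str:
    # Placeholder: replace with a call to your LLM of choice.
    return "Positive"

prompts = [
    'Tweet: "This new music video was incredible!"\nSentiment:',
]

run = wandb.init(project="llm-human-eval")  # hypothetical project name
table = wandb.Table(columns=["prompt", "completion", "reviewer_verdict", "failure_class"])
for prompt in prompts:
    completion = generate(prompt)
    # Reviewer columns are left empty here; they are filled in during manual assessment.
    table.add_data(prompt, completion, "", "")
run.log({"manual_eval_samples": table})
run.finish()
```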