Current best practices for training LLMs from scratch

DATASET PRE-PROCESSING

In this section, we’ll cover both data adjustments (like deduplication and cleaning) and the pros and cons of various tokenization strategies. Let’s start with the former:

Dataset Handling

To ensure training data is high-quality and diverse, several pre-processing techniques can be used before the pre-training steps:

Data sampling

Certain data components can be up-sampled to obtain a more balanced data distribution. Some research down-samples lower-quality datasets such as unfiltered web crawl data. Other research up-samples data from specific domains depending on the model objectives.

There are also advanced methods to filter for high-quality data, such as applying a trained classifier model to the dataset.

For example, the model Galactica by Meta AI is built purposefully for science, specifically for storing, combining, and reasoning about scientific knowledge. Due to these goals, its pre-training dataset is composed of high-quality data drawn mainly from scientific resources such as papers, textbooks, lecture notes, and encyclopedias. The dataset is also highly curated, for example with task-specific datasets to facilitate the composition of this knowledge into new task contexts.
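As a concrete illustration of up- and down-sampling, the sketch below draws a training mixture by picking each example's source according to per-source weights. The source names, weights, and documents are all hypothetical; real pipelines operate over sharded corpora rather than in-memory lists.

```python
import random

def sample_mixture(sources, weights, n_examples, seed=0):
    """Draw a training mixture by sampling each example's source
    according to per-source weights (an illustrative sketch)."""
    rng = random.Random(seed)
    names = list(sources)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=n_examples)
    # Cycle within each source so a small, up-weighted source repeats
    # (up-sampling), while a large, down-weighted one is only partly seen.
    cursors = {n: 0 for n in names}
    mixture = []
    for name in picks:
        docs = sources[name]
        mixture.append(docs[cursors[name] % len(docs)])
        cursors[name] += 1
    return mixture

# Hypothetical mixture: down-weight raw web crawl, up-weight curated papers.
sources = {
    "web_crawl": [f"web_doc_{i}" for i in range(1000)],
    "papers": [f"paper_{i}" for i in range(100)],
}
weights = {"web_crawl": 0.3, "papers": 0.7}
mixture = sample_mixture(sources, weights, n_examples=10)
```

The weights here are per-example sampling probabilities; published models typically report their mixture as the effective number of epochs seen per source, which is a different but related parameterization.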

Data cleaning

Data cleaning and reformatting efforts are normally applied before training. Examples include removing boilerplate text and stripping HTML code or markup. In addition, some projects fix misspellings, handle cross-domain homographs, and/or remove biased or harmful speech to improve model performance. Other projects skip these techniques on the grounds that models should see a fair representation of the real world and learn to deal with misspellings and toxicity as part of their capabilities.
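A minimal cleaning pass along these lines might strip markup and known boilerplate strings. The boilerplate list below is invented for illustration; production pipelines use curated heuristics and libraries rather than a two-line regex.

```python
import re

# Hypothetical boilerplate fragments to strip; real pipelines use curated lists.
BOILERPLATE = ["All rights reserved.", "Click here to subscribe."]

def clean_document(text: str) -> str:
    """Sketch of a cleaning pass: drop HTML/markup tags, remove known
    boilerplate strings, and normalize whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)      # strip HTML/markup tags
    for phrase in BOILERPLATE:
        text = text.replace(phrase, " ")
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text

raw = "<p>Transformers scale well.</p> All rights reserved."
print(clean_document(raw))  # -> "Transformers scale well."
```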

Non-standard textual components handling

In some cases, it is important to convert non-standard textual components into text, e.g. converting an emoji into its text equivalent: ❄️ becomes "snowflake". This conversion can be done programmatically, of course.
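One programmatic approach is a simple lookup table from emoji to text placeholders, as sketched below. The two-entry mapping is purely illustrative; a real pipeline would use a full emoji database or a dedicated library.

```python
# Tiny illustrative mapping; real pipelines would cover the full emoji set.
EMOJI_TO_TEXT = {
    "❄️": ":snowflake:",
    "🙂": ":slightly_smiling_face:",
}

def replace_emoji(text: str) -> str:
    """Replace known emoji with text placeholders the tokenizer can handle."""
    for glyph, name in EMOJI_TO_TEXT.items():
        text = text.replace(glyph, name)
    return text

print(replace_emoji("Snow ❄️ today"))  # -> "Snow :snowflake: today"
```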

Data deduplication

Some researchers see significant benefits from deduplicating training data. Fuzzy deduplication methods such as locality-sensitive hashing (LSH) are commonly used here. See the paper Deduplicating Training Data Makes Language Models Better for details on deduplication.
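To make the fuzzy-matching idea concrete, the sketch below computes MinHash signatures, the building block of LSH-based deduplication: two documents' signatures agree in roughly the same fraction of positions as their Jaccard similarity over word shingles. This is a didactic pure-Python version; production systems band the signatures into LSH buckets and use optimized libraries.

```python
import hashlib

def shingles(text, n=3):
    """Set of overlapping n-word shingles from a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash(shingle_set, num_hashes=64):
    """MinHash signature: the minimum hash of the shingle set under
    num_hashes seeded hash functions (md5 used here for simplicity)."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def est_similarity(a, b):
    """Estimated Jaccard similarity of two documents via signature agreement."""
    sa, sb = minhash(shingles(a)), minhash(shingles(b))
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)
```

In a full LSH pipeline, documents whose signatures collide in enough bands are flagged as candidate near-duplicates, so only those pairs need an exact comparison.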

Downstream task data removal

Data leakage occurs when training data contains the very information the model will later be evaluated on. Downstream task data removal methods (such as n-gram overlap matching) are needed to remove training examples that also appear in the evaluation dataset.
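A simplified n-gram decontamination pass could look like the following: drop any training document that shares an n-gram with the evaluation set. The default n=8 is an arbitrary illustrative choice; published decontamination setups vary in both n and the drop-vs-excise policy.

```python
def ngrams(text, n=8):
    """Set of overlapping word n-grams from a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(train_docs, eval_docs, n=8):
    """Remove training documents that share any n-gram with the eval set
    (a sketch; real pipelines work over sharded corpora, not lists)."""
    eval_grams = set()
    for doc in eval_docs:
        eval_grams |= ngrams(doc, n)
    return [doc for doc in train_docs if not (ngrams(doc, n) & eval_grams)]

train = [
    "the model answers the question about paris correctly every time",
    "unrelated training text about cooking pasta at home",
]
evals = ["the model answers the question about paris correctly"]
kept = decontaminate(train, evals, n=5)  # first doc overlaps, so it is dropped
```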