To ensure the training data is high-quality and diverse, several pre-processing techniques can be applied before pre-training:
Certain data components can be up-sampled to obtain a more balanced data distribution. Some research down-samples lower-quality datasets such as unfiltered web crawl data, while other work up-samples data from specific domains depending on the model's objectives, as sketched below.
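As a minimal illustration, re-weighting a data mixture can be implemented as weighted sampling over sources. The sketch below is a toy Python example; the corpus names, documents, and mixture weights are all hypothetical and would come from the actual data pipeline.

```python
import random

# Hypothetical corpora with illustrative mixture weights: curated sources are
# up-sampled, unfiltered web crawl is down-sampled (names/weights are assumptions).
corpora = {
    "books":     {"docs": ["book_doc_1", "book_doc_2"], "weight": 2.0},
    "wikipedia": {"docs": ["wiki_doc_1", "wiki_doc_2"], "weight": 3.0},
    "web_crawl": {"docs": ["web_doc_1", "web_doc_2"],   "weight": 0.5},
}

def sample_documents(corpora, n, seed=0):
    """Draw n training documents according to per-corpus mixture weights."""
    rng = random.Random(seed)
    names = list(corpora)
    weights = [corpora[name]["weight"] for name in names]
    samples = []
    for _ in range(n):
        # First pick a source in proportion to its weight, then a document from it.
        source = rng.choices(names, weights=weights, k=1)[0]
        samples.append(rng.choice(corpora[source]["docs"]))
    return samples

print(sample_documents(corpora, n=5))
```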
There are also advanced methods for filtering high-quality data, such as applying a trained classifier model to the dataset. For example, the model Galactica by Meta AI is built purposefully for science, specifically for storing, combining, and reasoning about scientific knowledge. Due to these goals, its pre-training dataset is composed of high-quality data mainly from scientific resources such as papers, textbooks, lecture notes, and encyclopedias. The dataset is also highly curated, for example with task-specific datasets to facilitate the composition of this knowledge into new task contexts.
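Below is a minimal sketch of classifier-based quality filtering, assuming a small labeled sample of high- and low-quality text. It uses a TF-IDF plus logistic-regression classifier from scikit-learn as a stand-in for whatever quality model a real pipeline would train; the example texts and the 0.5 threshold are purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set: label 1 = high quality (e.g., curated text),
# label 0 = low quality (e.g., noisy web crawl). Real pipelines train on far
# larger labeled samples.
train_texts = [
    "The mitochondrion is the site of oxidative phosphorylation.",
    "We prove the theorem by induction on the length of the sequence.",
    "CLICK HERE!!! free prizes win now buy cheap",
    "asdf qwerty lorem random junk text text text",
]
train_labels = [1, 1, 0, 0]

# Word n-gram features plus a linear classifier is a common, cheap
# choice for scoring document quality.
quality_clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
quality_clf.fit(train_texts, train_labels)

def filter_high_quality(docs, threshold=0.5):
    """Keep documents whose predicted probability of 'high quality' exceeds threshold."""
    probs = quality_clf.predict_proba(docs)[:, 1]
    return [doc for doc, p in zip(docs, probs) if p >= threshold]

candidates = [
    "Photosynthesis converts light energy into chemical energy.",
    "WIN WIN WIN click now free free free",
]
print(filter_high_quality(candidates))
```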
Some researchers see significant benefits from deduplicating training data. Fuzzy deduplication methods such as locality-sensitive hashing (LSH) are commonly used here. See the paper Deduplicating Training Data Makes Language Models Better (Lee et al., 2022) for details regarding deduplication.
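To make the fuzzy approach concrete, here is a minimal MinHash-LSH sketch using the datasketch library. The word-level 3-gram shingles, the 0.5 similarity threshold, and the example documents are illustrative choices for this sketch, not the settings used in the paper.

```python
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    """Build a MinHash signature from word 3-gram shingles of a document."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))}
    m = MinHash(num_perm=num_perm)
    for s in shingles:
        m.update(s.encode("utf-8"))
    return m

docs = {
    "a": ("the training corpus contains many boilerplate pages that are copied "
          "verbatim across thousands of sites with only minor edits"),
    "b": ("the training corpus contains many boilerplate pages that are copied "
          "verbatim across thousands of sites with only small edits"),  # near-duplicate of "a"
    "c": "tokenization splits raw text into subword units before model training",
}

# Index documents one at a time; skip any whose signature already has a
# near-duplicate (estimated Jaccard similarity above the threshold) in the index.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
kept = []
for key, text in docs.items():
    sig = minhash(text)
    if lsh.query(sig):  # a near-duplicate is already indexed; drop this document
        continue
    lsh.insert(key, sig)
    kept.append(key)

print(kept)  # expected: ['a', 'c']; the near-duplicate 'b' is dropped
```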