Bad data leads to bad models. But careful processing of high-quality, high-volume, diverse datasets directly improves model performance on downstream tasks, as well as model convergence.
Dataset diversity is especially important for LLMs. That’s because diversity improves the cross-domain knowledge of the model, as well as its downstream generalization capability. Training on diverse examples effectively broadens the ability of your LLM to perform well on myriad nuanced tasks.
A typical training dataset comprises textual data from diverse sources: crawled public web data, online publications and book repositories, code from GitHub, Wikipedia, news, social media conversations, and more.
For example, consider The Pile, a popular text corpus created by EleutherAI for large-scale language modeling. It contains data from 22 data sources, coarsely broken down into five broad categories:

- Academic (e.g., arXiv, PubMed Central)
- Internet (e.g., Pile-CC, Stack Exchange)
- Prose (e.g., Books3, Project Gutenberg)
- Dialogue (e.g., Ubuntu IRC, movie subtitles)
- Miscellaneous (e.g., GitHub, Enron Emails)
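If you want a quick feel for that mix, the raw Pile release ships as JSONL shards in which each record labels its source under meta["pile_set_name"]. Here's a minimal sketch that tallies documents per source; the shard path is illustrative, and since hosting of The Pile has changed over time, you may need to track down a mirror first.

```python
import json
from collections import Counter

# Tally how many documents each Pile source contributes.
# Assumes a locally downloaded JSONL shard of the raw Pile release,
# where each record labels its source as meta["pile_set_name"].
# The shard path below is illustrative.
source_counts = Counter()
with open("pile/train/00.jsonl", encoding="utf-8") as f:
    for line in f:  # stream line by line; shards are far too large for memory
        record = json.loads(line)
        source_counts[record["meta"]["pile_set_name"]] += 1

for source, count in source_counts.most_common():
    print(f"{source:25} {count:>12,}")
```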
Note that The Pile is one of the very few large-scale text datasets that are freely available to the public. For most existing models, like GPT-3, PaLM, and Galactica, the training and evaluation datasets are not publicly available. Given the large-scale effort it takes to compile and pre-process these datasets for LLM training, most companies keep them in-house to maintain a competitive advantage. That makes datasets like The Pile, along with a few from AllenAI, extremely valuable for public large-scale NLP research.
Another thing worth mentioning: during dataset collection, general data can be gathered by non-experts, but data for specific domains typically needs to be collected, or at least vetted, by subject matter experts (SMEs) such as doctors, physicists, or lawyers. SMEs can flag thematic or conceptual gaps that NLP engineers might miss. NLP engineers should be heavily involved at this stage as well: given their knowledge of how an LLM “learns to represent data,” they can flag oddities or gaps in the data that SMEs might miss (a simple automated audit, like the sketch below, helps here).

Once you’ve identified the dataset(s) you’ll be using, you’ll want to prepare that data for your model. Let’s get into that now:
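As a first taste of that preparation work, here is a minimal, hypothetical sketch of the kind of automated audit an NLP engineer might run to surface oddities like the ones mentioned above: near-empty records, exact duplicates, and highly repetitive text. The audit helper and its thresholds are illustrative, not from any particular library.

```python
import hashlib
from collections import Counter

def audit(texts):
    """Count simple data oddities: near-empty documents, exact
    duplicates, and records with very low lexical variety."""
    seen = set()
    stats = Counter()
    for text in texts:
        stats["total"] += 1
        stripped = text.strip()
        if len(stripped) < 20:  # "too short" cutoff is illustrative
            stats["too_short"] += 1
        digest = hashlib.md5(stripped.encode("utf-8")).hexdigest()
        if digest in seen:
            stats["exact_duplicate"] += 1
        seen.add(digest)
        words = stripped.split()
        if words and len(set(words)) / len(words) < 0.3:  # repetitive text
            stats["repetitive"] += 1
    return stats

# Toy usage: one duplicated record, one repetitive record, one near-empty record.
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",
    "buy now buy now buy now buy now buy now buy now",
    "ok",
]
print(audit(corpus))
```

Real pipelines go much further (near-duplicate detection, language identification, toxicity filtering), but even a crude pass like this catches problems that are cheap to fix before training and expensive to discover after.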