BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining

Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC's high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster training than open web data and 2.7x faster than Nemotron-Synth. Remarkably, a 3B model trained for 180B tokens on BeyondWeb outperforms an 8B model trained for the same token budget on Cosmopedia. We also present several insights from BeyondWeb on synthetic data for pretraining: what drives its benefits, which data to rephrase and how, and the impact of model size and family on data quality. Overall, our work shows that there's no silver bullet for generating high-quality synthetic pretraining data. The best outcomes require jointly optimizing many factors, a challenging task that requires rigorous science and practical expertise. Naive approaches can yield modest improvements, potentially at great cost, while well-executed methods can yield transformative improvements, as exemplified by BeyondWeb.