
Abstract
Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present *the Pile*: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse, high-quality subsets, both existing and newly constructed, many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.
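The Pile was distributed as sharded, zstd-compressed JSON Lines files, with one document per line carrying a `text` field and a `meta` field that records which of the 22 subsets the document came from. A minimal sketch of reading one shard in that assumed format (the shard filename and field names follow the original distribution; treat them as assumptions if you use a repackaged copy):

```python
import io
import json

import zstandard as zstd  # pip install zstandard


def iter_pile_documents(shard_path):
    """Yield (text, subset_name) pairs from one Pile shard.

    Assumes the original distribution format: zstd-compressed JSON Lines,
    one document per line with "text" and "meta" fields, where the subset
    label lives under meta["pile_set_name"].
    """
    with open(shard_path, "rb") as fh:
        reader = zstd.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            doc = json.loads(line)
            yield doc["text"], doc.get("meta", {}).get("pile_set_name")


# Illustrative usage: count documents per subset in one shard
# (the path "00.jsonl.zst" is a placeholder, not an official filename).
# from collections import Counter
# counts = Counter(name for _, name in iter_pile_documents("00.jsonl.zst"))
```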
Benchmarks
| Benchmark | Model | Bits per byte |
|---|---|---|
| language-modelling-on-the-pile | GPT-3 Davinci 175B (pre-trained) | 0.7177 |
| language-modelling-on-the-pile | GPT-3 Curie 6.7B (pre-trained) | 0.7980 |
| language-modelling-on-the-pile | GPT-3 Babbage 1.3B (pre-trained) | 0.8718 |
| language-modelling-on-the-pile | GPT-3 Ada 350M (pre-trained) | 0.9631 |
| language-modelling-on-the-pile | GPT-2 XL 1.5B (pre-trained) | 1.0468 |
| language-modelling-on-the-pile | GPT-2 Large 774M (pre-trained) | 1.0828 |
| language-modelling-on-the-pile | GPT-2 Medium 355M (pre-trained) | 1.0928 |
| language-modelling-on-the-pile | GPT-2 Small 124M (pre-trained) | 1.2253 |
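The metric above is bits per byte (BPB, lower is better), which normalizes a model's negative log-likelihood by the length of the evaluated text in UTF-8 bytes rather than by token count, so that models with different tokenizers remain comparable. A minimal sketch of the conversion, assuming the summed loss is available in nats (the toy numbers in the comment are illustrative, not results from the paper):

```python
import math


def bits_per_byte(total_nll_nats, n_bytes):
    """Convert a summed negative log-likelihood (in nats) into bits per
    UTF-8 byte of the evaluated text.

    total_nll_nats: sum of -log p(token) over all tokens, natural log.
    n_bytes: length of the evaluated text in UTF-8 bytes.
    """
    return total_nll_nats / (n_bytes * math.log(2))


# Toy usage: a model averaging 2.0 nats/token over 1,000 tokens of text
# that occupies 4,000 UTF-8 bytes scores roughly 0.72 bits per byte.
# bits_per_byte(2.0 * 1000, 4000)  # ~0.72
```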