
The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao Stella Biderman Sid Black Laurence Golding Travis Hoppe Charles Foster Jason Phang Horace He Anish Thite Noa Nabeshima

Abstract

Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets -- both existing and newly constructed -- many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.

Code Repositories

ai21labs/lm-evaluation

Mentioned in GitHub

conceptofmind/lamda-rlhf-pytorch

pytorch

Mentioned in GitHub

suu990901/LLaMA-InfoEntropy-Loss

jax

Mentioned in GitHub

conceptofmind/LaMDA-pytorch

pytorch

Mentioned in GitHub

EleutherAI/gpt-neo

Mentioned in GitHub

ftramer/lm-extraction-benchmark

Mentioned in GitHub

Wikidepia/indonesia_dataset

Mentioned in GitHub

RossNordby/SoftPromptsForEvaluation

pytorch

Mentioned in GitHub

suu990901/InfoEntropy-Loss

jax

Mentioned in GitHub

alrope123/prompt-waywardness

pytorch

Mentioned in GitHub

google-research/lm-extraction-benchmark

Mentioned in GitHub

EleutherAI/GPTNeo

Mentioned in GitHub

thoppe/personal_cv

Mentioned in GitHub

ncoop57/gpt-code-clippy

jax

Mentioned in GitHub

neutralzz/billa

pytorch

Mentioned in GitHub

THUDM/GLM

pytorch

Mentioned in GitHub

EleutherAI/The-Pile

Official

jackbandy/bookcorpus-datasheet

Mentioned in GitHub

codedotal/gpt-code-clippy

jax

Mentioned in GitHub

yuchuantian/dijiang

pytorch

Mentioned in GitHub

nlpodyssey/verbaflow

Mentioned in GitHub

glassroom/heinsen_attention

pytorch

Mentioned in GitHub

Benchmarks

Benchmark	Methodology	Metrics
language-modelling-on-the-pile	GPT-3 Davinci 175B (pre-trained)	Bits per byte: 0.7177
language-modelling-on-the-pile	GPT-2 Medium 355M (pre-trained)	Bits per byte: 1.0928
language-modelling-on-the-pile	GPT-2 XL 1.5B (pre-trained)	Bits per byte: 1.0468
language-modelling-on-the-pile	GPT-2 Large 774M (pre-trained)	Bits per byte: 1.0828
language-modelling-on-the-pile	GPT-3 Curie 6.7B (pre-trained)	Bits per byte: 0.7980
language-modelling-on-the-pile	GPT-2 Small 124M (pre-trained)	Bits per byte: 1.2253
language-modelling-on-the-pile	GPT-3 Ada 350M (pre-trained)	Bits per byte: 0.9631
language-modelling-on-the-pile	GPT-3 Babbage 1.3B (pre-trained)	Bits per byte: 0.8718
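
The table above reports bits per byte, which normalizes a model's loss by the UTF-8 byte length of the evaluated text so that models with different tokenizers can be compared. As a minimal sketch, assuming the loss is available as a negative log-likelihood in nats, the conversion could look like the following. The function names are illustrative and are not taken from the Pile codebase or any evaluation library.

```python
import math


def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood (nats) into bits per UTF-8 byte.

    Dividing by ln(2) converts nats to bits; dividing by the byte count
    makes the result independent of how the text was tokenized.
    """
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    return total_nll_nats / (num_bytes * math.log(2))


def bits_per_byte_from_token_loss(
    mean_token_loss_nats: float, num_tokens: int, num_bytes: int
) -> float:
    """Same conversion, starting from a mean per-token loss (hypothetical helper).

    Equivalent to (num_tokens / num_bytes) * mean_token_loss_nats / ln(2).
    """
    return bits_per_byte(mean_token_loss_nats * num_tokens, num_bytes)


# Illustrative numbers only (not results from the table above).
text = "The Pile"
num_bytes = len(text.encode("utf-8"))  # byte length, not character count
example_bpb = bits_per_byte_from_token_loss(
    mean_token_loss_nats=2.0, num_tokens=2, num_bytes=num_bytes
)
```

Counting bytes with `len(text.encode("utf-8"))` rather than `len(text)` matters for non-ASCII text, where one character can occupy several bytes.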
