Language Models are Few-Shot Learners

Tom B. Brown; Benjamin Mann; Nick Ryder; Melanie Subbiah; Jared Kaplan; Prafulla Dhariwal; Arvind Neelakantan; Pranav Shyam; Girish Sastry; Amanda Askell; Sandhini Agarwal; Ariel Herbert-Voss; Gretchen Krueger; Tom Henighan; Rewon Child; Aditya Ramesh; Daniel M. Ziegler; Jeffrey Wu; Clemens Winter; Christopher Hesse; Mark Chen; Eric Sigler; Mateusz Litwin; Scott Gray; Benjamin Chess; Jack Clark; Christopher Berner; Sam McCandlish; Alec Radford; Ilya Sutskever; Dario Amodei

Abstract

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
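As the abstract notes, "few-shot" here means the task description and a handful of worked examples are placed directly in the model's context window and the model simply continues the text; no weights are updated. The sketch below illustrates that prompt format for the word-unscrambling task mentioned above. It is a minimal illustration, not code from the paper: complete() is a hypothetical placeholder for whatever autoregressive language-model completion interface you actually use.

# Minimal sketch of the few-shot, in-context evaluation protocol described above.
# No gradient updates: the task is conveyed entirely through the prompt text.
# complete(prompt) is a hypothetical stand-in for an autoregressive LM's
# text-completion call; swap in any concrete model or API.

def build_few_shot_prompt(task_description, demonstrations, query):
    """Concatenate a task description, k solved examples, and the new query."""
    lines = [task_description, ""]
    for question, answer in demonstrations:  # the k "shots"
        lines += [f"Q: {question}", f"A: {answer}", ""]
    lines += [f"Q: {query}", "A:"]           # the model continues after "A:"
    return "\n".join(lines)

demos = [
    ("Unscramble the letters: pplea", "apple"),
    ("Unscramble the letters: tca", "cat"),
]
prompt = build_few_shot_prompt(
    "Unscramble the letters to form an English word.",
    demos,
    "Unscramble the letters: odg",
)
# answer = complete(prompt)  # hypothetical call; a capable model should continue with "dog"
print(prompt)

The same pattern covers the zero-shot (no demonstrations) and one-shot (a single demonstration) settings, which is how the 0-shot, 1-shot, and k=32 rows in the benchmark table below differ.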

Code Repositories

The following GitHub repositories mention or implement this work; the framework tag, where one was listed, is given in parentheses.

ai21labs/lm-evaluation (tf)
juletx/lm-evaluation-harness (pytorch)
um-arm-lab/efficient-eng-2-ltl (pytorch)
haiyang-w/git (pytorch)
neuralmagic/lm-evaluation-harness (pytorch)
shreyashankar/gpt3-sandbox
EightRice/atn_GPT-3 (tf)
EleutherAI/gpt-neo (tf)
fywalter/label-bias (pytorch)
hazyresearch/ama_prompting
openai/gpt-3 (Official)
RUCAIBox/LLMBox
allenai/macaw (pytorch)
crazydigger/Callibration-of-GPT (pytorch)
smile-data/smile (pytorch)
karpathy/build-nanogpt (pytorch)
volcengine/vegiantmodel (pytorch)
asahi417/relbert
openbiolink/promptsource
facebookresearch/anli (pytorch)
ramanakshay/nanogpt (pytorch)
vilm-ai/viet-llm-eval (jax)
lambert-x/prolab (pytorch)
NVIDIA/NeMo-Curator
scrayish/ML_NLP (pytorch)
EleutherAI/lm_evaluation_harness (jax)
smarton-empower/smarton-ai
ncoop57/gpt-code-clippy (jax)
VachanVY/gpt.jax (jax)
nlx-group/overlapy
ggml-org/llama.cpp (pytorch)
ggerganov/llama.cpp (pytorch)
sambanova/lm-evaluation-harness (jax)
codedotal/gpt-code-clippy (jax)
grantslatton/llama.cpp
postech-ami/smile-dataset (pytorch)
Sypherd/lm-evaluation-harness (pytorch)
x-lance/neusym-rag
hilberthit/gpt-3
tonyzhaozh/few-shot-learning (pytorch)
gmum/dl-mo-2021
zphang/lm_evaluation_harness
contextlab/abstract2paper
turkunlp/megatron-deepspeed (pytorch)
karpathy/llm.c (pytorch)
ethanjperez/true_few_shot (pytorch)
longhao-chen/aicas2024 (pytorch)
EleutherAI/lm-evaluation-harness (jax)
opengptx/lm-evaluation-harness (pytorch)
bigscience-workshop/Megatron-DeepSpeed (pytorch)
asahi417/lmppl

Benchmarks

Benchmark | Methodology | Metrics
answerability-prediction-on-peerqa | GPT-3.5-Turbo-0613-16k | Macro F1: 0.3304
common-sense-reasoning-on-arc-challenge | GPT-3 175B (0-shot) | Accuracy: 51.4
common-sense-reasoning-on-arc-challenge | GPT-3 175B (1-shot) | Accuracy: 53.2
common-sense-reasoning-on-arc-easy | GPT-3 175B (1-shot) | Accuracy: 71.2
common-sense-reasoning-on-arc-easy | GPT-3 175B (0-shot) | Accuracy: 68.8
common-sense-reasoning-on-record | GPT-3 Large 760M (0-shot) | EM: 82.1
common-sense-reasoning-on-winogrande | GPT-3 Large 760M (0-shot) | Accuracy: 57.4
common-sense-reasoning-on-winogrande | GPT-3 175B (0-shot) | Accuracy: 70.2
coreference-resolution-on-winograd-schema | GPT-3 175B (few-shot) | Accuracy: 80.1
few-shot-learning-on-medconceptsqa | gpt-3.5-turbo | Accuracy: 41.476
language-modelling-on-lambada | GPT-3 175B (Few-Shot) | Accuracy: 86.4, Perplexity: 1.92
language-modelling-on-lambada | GPT-3 13B (Zero-Shot) | Accuracy: 72.5, Perplexity: 3.56
language-modelling-on-lambada | GPT-3 2.7B (Zero-Shot) | Accuracy: 67.1, Perplexity: 4.60
language-modelling-on-lambada | GPT-3 6.7B (Zero-Shot) | Accuracy: 70.3, Perplexity: 4.00
language-modelling-on-lambada | GPT-3 175B (Zero-Shot) | Accuracy: 76.2, Perplexity: 3.00
language-modelling-on-penn-treebank-word | GPT-3 (Zero-Shot) | Params: 175000M, Test perplexity: 20.5
multi-task-language-understanding-on-mmlu | GPT-3 175B (5-shot) | Average (%): 43.9
natural-language-inference-on-anli-test | GPT-3 | A1: 36.8, A2: 34, A3: 40.2
natural-language-inference-on-commitmentbank | GPT-3 175B (Few-Shot) | Accuracy: 75.6
natural-language-inference-on-commitmentbank | GPT-3 175B (few-shot, k=32) | F1: 52
natural-language-inference-on-rte | GPT-3 175B (few-shot, k=32) | Accuracy: 69%
question-answering-on-boolq | GPT-3 175B (few-shot, k=32) | Accuracy: 76.4
question-answering-on-boolq | GPT-3 175B (0-shot) | Accuracy: 60.5
question-answering-on-copa | GPT-3 175B (few-shot, k=32) | Accuracy: 92
question-answering-on-copa | GPT-3 Large 760M (0-shot) | Accuracy: 73.0
question-answering-on-copa | GPT-3 13B (few-shot, k=32) | Accuracy: 86
question-answering-on-copa | GPT-3 175B (0-shot) | Accuracy: 91
question-answering-on-copa | GPT-3 175B (1-shot) | Accuracy: 87
question-answering-on-coqa | GPT-3 175B (few-shot, k=32) | Overall: 85
question-answering-on-drop-test | GPT-3 175B (few-shot, k=32) | F1: 36.5
question-answering-on-multirc | GPT-3 175B (Few-Shot) | F1: 75.4
question-answering-on-natural-questions | GPT-3 175B (Few-Shot, k=64) | EM: 29.9
question-answering-on-obqa | GPT-3 175B (zero-shot) | Accuracy: 57.6
question-answering-on-openbookqa | GPT-3 175B (few-shot, k=32) | Accuracy: 65.4
question-answering-on-peerqa | GPT-3.5-Turbo-0613-16k | AlignScore: 0.1378, Prometheus-2 Answer Correctness: 3.0408, Rouge-L: 0.2414
question-answering-on-piqa | GPT-3 175B (0-shot) | Accuracy: 81.0
question-answering-on-piqa | GPT-3 Large 760M (0-shot) | Accuracy: 72.9
question-answering-on-quac | GPT-3 175B (few-shot, k=32) | F1: 44.3
question-answering-on-race | GPT-3 175B (few-shot, k=32) | RACE-m: 58.1
question-answering-on-race | GPT-3 175B (Few-Shot) | RACE-h: 46.8
question-answering-on-story-cloze | GPT-3 175B (Few-Shot) | Accuracy: 87.7
question-answering-on-storycloze | GPT-3 Large 760M (zero-shot) | Accuracy: 72.4
question-answering-on-triviaqa | GPT-3 175B (Few-Shot) | EM: 71.2
question-answering-on-webquestions | GPT-3 175B (Few-Shot) | EM: 41.5
question-answering-on-webquestions | GPT-3 175B (Zero-Shot) | EM: 14.4
question-answering-on-webquestions | GPT-3 175B (One-Shot) | EM: 25.3
question-answering-on-webquestions | Few-shot | EM: 44.7
reading-comprehension-on-race | GPT-3 175B (zero-shot) | Accuracy (High): 45.5
reading-comprehension-on-race | GPT-3 175B (0-shot) | Accuracy (Middle): 58.4
unsupervised-machine-translation-on-wmt2014-1 | GPT-3 175B (Few-Shot) | BLEU: 39.2
unsupervised-machine-translation-on-wmt2014-2 | GPT-3 175B (Few-Shot) | BLEU: 32.6
unsupervised-machine-translation-on-wmt2016 | GPT-3 175B (Few-Shot) | BLEU: 29.7
unsupervised-machine-translation-on-wmt2016-1 | GPT-3 175B (Few-Shot) | BLEU: 40.6
unsupervised-machine-translation-on-wmt2016-2 | GPT-3 175B (Few-Shot) | BLEU: 21
unsupervised-machine-translation-on-wmt2016-3 | GPT-3 175B (Few-Shot) | BLEU: 39.5
word-sense-disambiguation-on-words-in-context | GPT-3 175B (few-shot, k=32) | Accuracy: 49.4
zero-shot-learning-on-medconceptsqa | gpt-3.5-turbo | Accuracy: 37.058
