Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le

Abstract

This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning -- finetuning language models on a collection of tasks described via instructions -- substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
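To make the instruction-tuning recipe concrete, the sketch below shows how a single NLI example could be verbalized with several natural language instruction templates into (input, target) pairs for finetuning. It is a minimal sketch: the template wording, field names, and label set are illustrative assumptions, not the exact templates released with FLAN.

```python
# Minimal sketch of instruction-template verbalization for an NLI task.
# The templates and label strings below are illustrative assumptions,
# not the templates released with FLAN.

# Several phrasings of the same task add template diversity.
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis?\n{options}",
    "{premise}\nBased on the paragraph above, can we conclude that "
    "\"{hypothesis}\"?\n{options}",
    "Read the premise and decide whether the hypothesis follows.\n"
    "Premise: {premise}\nHypothesis: {hypothesis}\n{options}",
]

LABELS = ["yes", "it is not possible to tell", "no"]
OPTIONS = "OPTIONS:\n- " + "\n- ".join(LABELS)

def verbalize(example: dict, template: str) -> dict:
    """Turn a raw (premise, hypothesis, label) example into an
    instruction-following (input, target) pair."""
    prompt = template.format(
        premise=example["premise"],
        hypothesis=example["hypothesis"],
        options=OPTIONS,
    )
    return {"input": prompt, "target": LABELS[example["label"]]}

if __name__ == "__main__":
    raw = {
        "premise": "A cat is sleeping on the sofa.",
        "hypothesis": "An animal is resting indoors.",
        "label": 0,  # entailment
    }
    for t in NLI_TEMPLATES:
        pair = verbalize(raw, t)
        print(pair["input"], "->", pair["target"], "\n---")
```

During instruction tuning, many such verbalized datasets are mixed across task clusters, and the model is then evaluated zero-shot on task clusters that were held out of finetuning.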

Code Repositories

hiyouga/llama-efficient-tuning (PyTorch; mentioned in GitHub)
bigcode-project/starcoder (PyTorch; mentioned in GitHub)
openbiolink/promptsource (mentioned in GitHub)
google-research/flan (official; TensorFlow; mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metric
common-sense-reasoning-on-arc-challenge | FLAN 137B (zero-shot) | Accuracy: 63.1
common-sense-reasoning-on-arc-challenge | FLAN 137B (few-shot, k=13) | Accuracy: 63.8
common-sense-reasoning-on-arc-easy | FLAN 137B (few-shot, k=14) | Accuracy: 80.7
common-sense-reasoning-on-arc-easy | FLAN 137B (zero-shot) | Accuracy: 79.6
common-sense-reasoning-on-record | FLAN 137B (zero-shot) | EM: 72.5
common-sense-reasoning-on-record | FLAN 137B (prompt-tuned) | EM: 85.1
common-sense-reasoning-on-winogrande | FLAN 137B (few-shot, k=16) | Accuracy: 72.8
common-sense-reasoning-on-winogrande | FLAN 137B (zero-shot) | Accuracy: 71.2
coreference-resolution-on-winograd-schema | FLAN 137B (prompt-tuned) | Accuracy: 86.5
coreference-resolution-on-winograd-schema | FLAN 137B (zero-shot) | Accuracy: 80.8
machine-translation-on-wmt2014-english-french | FLAN 137B (few-shot, k=9) | BLEU score: 33.8
machine-translation-on-wmt2014-english-french | FLAN 137B (zero-shot) | BLEU score: 33.9
machine-translation-on-wmt2014-french-english | FLAN 137B (few-shot, k=9) | BLEU score: 37.9
machine-translation-on-wmt2014-french-english | FLAN 137B (zero-shot) | BLEU score: 35.9
machine-translation-on-wmt2016-english-1 | FLAN 137B (few-shot, k=9) | BLEU score: 20.5
machine-translation-on-wmt2016-english-1 | FLAN 137B (zero-shot) | BLEU score: 18.9
machine-translation-on-wmt2016-english-german | FLAN 137B (few-shot, k=11) | BLEU score: 26.1
machine-translation-on-wmt2016-english-german | FLAN 137B (zero-shot) | BLEU score: 27.0
machine-translation-on-wmt2016-german-english | FLAN 137B (zero-shot) | BLEU score: 38.9
machine-translation-on-wmt2016-german-english | FLAN 137B (few-shot, k=11) | BLEU score: 40.7
machine-translation-on-wmt2016-romanian | FLAN 137B (few-shot, k=9) | BLEU score: 38.1
machine-translation-on-wmt2016-romanian | FLAN 137B (zero-shot) | BLEU score: 37.3
natural-language-inference-on-rte | FLAN 137B (few-shot, k=8) | Accuracy: 84.5
natural-language-inference-on-rte | FLAN 137B (zero-shot) | Accuracy: 84.1
natural-language-inference-on-rte | FLAN 137B (prompt-tuned) | Accuracy: 91.7
natural-language-inference-on-wnli | FLAN 137B (few-shot, k=4) | Accuracy: 70.4
natural-language-inference-on-wnli | FLAN 137B (zero-shot) | Accuracy: 74.6
question-answering-on-boolq | FLAN 137B (few-shot, k=4) | Accuracy: 84.6
question-answering-on-boolq | FLAN 137B (zero-shot) | Accuracy: 82.9
question-answering-on-boolq | FLAN 137B (prompt-tuned) | Accuracy: 86.3
question-answering-on-copa | FLAN 137B (prompt-tuned) | Accuracy: 94
question-answering-on-copa | FLAN 137B (zero-shot) | Accuracy: 91
question-answering-on-copa | FLAN 137B (few-shot, k=16) | Accuracy: 87
question-answering-on-multirc | FLAN 137B (few-shot, k=1) | F1: 72.1
question-answering-on-multirc | FLAN 137B (prompt-tuned) | F1: 83.4
question-answering-on-multirc | FLAN 137B (zero-shot) | F1: 77.5
question-answering-on-naturalqa | FLAN 137B (zero-shot) | EM: 20.7
question-answering-on-obqa | FLAN 137B (few-shot, k=16) | Accuracy: 78.2
question-answering-on-obqa | FLAN 137B (zero-shot) | Accuracy: 78.4
question-answering-on-piqa | FLAN 137B (few-shot, k=10) | Accuracy: 81.7
question-answering-on-piqa | FLAN 137B (zero-shot) | Accuracy: 80.5
question-answering-on-storycloze | FLAN 137B (few-shot, k=10) | Accuracy: 94.7
question-answering-on-storycloze | FLAN 137B (zero-shot) | Accuracy: 93.4
question-answering-on-triviaqa | FLAN 137B (zero-shot) | EM: 56.7
sentiment-analysis-on-imdb | FLAN 137B (zero-shot) | Accuracy: 94.3
sentiment-analysis-on-imdb | FLAN 137B (few-shot, k=2) | Accuracy: 95
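For the classification-style benchmarks above (for example ARC, BoolQ, and StoryCloze), zero-shot accuracy for a language model is commonly computed by rank classification: scoring every answer option under the model and choosing the most likely one. The snippet below is a minimal sketch of that idea, assuming a Hugging Face causal language model (gpt2 stands in for the 137B model, which is not publicly released); it is not the FLAN evaluation code.

```python
# Hedged sketch of zero-shot scoring by rank classification; the model
# (gpt2) and this exact recipe are assumptions for illustration, not the
# FLAN evaluation pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def option_log_likelihood(prompt: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to `option` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits          # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    # Option tokens occupy positions prompt_len .. seq_len-1 of full_ids;
    # the prediction for position i comes from the logits at position i-1.
    target_ids = full_ids[0, prompt_ids.shape[1]:]
    positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(log_probs[0, pos, tok].item()
               for pos, tok in zip(positions, target_ids))

# Options carry a leading space so the prompt tokenization stays a prefix
# of the full tokenization under GPT-2's BPE.
prompt = "Question: Which gas do plants absorb from the air?\nAnswer:"
options = [" carbon dioxide", " oxygen", " nitrogen", " helium"]
prediction = max(options, key=lambda o: option_log_likelihood(prompt, o))
print(prediction.strip())
```

Generation-style benchmarks in the table (translation BLEU, TriviaQA and Natural Questions EM) are instead scored by decoding the model's output and comparing it against references.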
