Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le

Abstract
This paper presents a simple and effective method for improving the zero-shot learning ability of language models. We show that instruction tuning, that is, finetuning a language model on a collection of tasks described via natural-language instruction templates, substantially improves zero-shot performance on unseen tasks. Starting from a pretrained language model with 137 billion parameters, we instruction-tune it on more than 60 NLP tasks, each verbalized through natural-language instruction templates. We call the resulting model FLAN and evaluate it on task types it has not seen during finetuning. FLAN substantially outperforms its unmodified counterpart and surpasses zero-shot GPT-3 (175B parameters) on 20 of the 25 tasks we evaluate. FLAN even beats few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies show that the number of finetuning datasets, model scale, and the design of the natural-language instructions are key to the success of instruction tuning.
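The core recipe described above is to verbalize each training task through several natural-language instruction templates and finetune the model to emit the answer as text. Below is a minimal, illustrative Python sketch of that templating step; the template wordings, label words, and function names are placeholders of our own for illustration, not FLAN's actual templates or code (those live in the official google-research/flan repository).

```python
# Minimal sketch of the instruction-templating idea behind instruction tuning.
# Templates and label words below are illustrative placeholders, not FLAN's.
import random

# Several natural-language phrasings of the same NLI task; {premise} and
# {hypothesis} are placeholders filled from each raw example.
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? OPTIONS: yes, maybe, no",
    "{premise}\nBased on the paragraph above, can we conclude that "
    "\"{hypothesis}\"? OPTIONS: yes, maybe, no",
    "Read the premise and decide whether the hypothesis follows.\n"
    "Premise: {premise}\nHypothesis: {hypothesis}\nOPTIONS: yes, maybe, no",
]

# entailment / neutral / contradiction mapped to answer words
LABEL_WORDS = {0: "yes", 1: "maybe", 2: "no"}


def render_example(example: dict, templates: list) -> dict:
    """Turn one raw (premise, hypothesis, label) example into an
    (instruction input, target text) pair using a randomly chosen template."""
    template = random.choice(templates)
    return {
        "inputs": template.format(**example),   # unused keys (e.g. label) are ignored
        "targets": LABEL_WORDS[example["label"]],
    }


if __name__ == "__main__":
    raw = {
        "premise": "A man is playing a guitar on stage.",
        "hypothesis": "A musician is performing.",
        "label": 0,
    }
    pair = render_example(raw, NLI_TEMPLATES)
    print(pair["inputs"])
    print("->", pair["targets"])
```

Randomizing over several template phrasings per example is one way to increase instruction diversity during finetuning, which the paper's ablations identify (alongside dataset count and model scale) as important to the method's success.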
Code Repositories
| Repository | Framework | Notes | 
|---|---|---|
| hiyouga/llama-efficient-tuning | pytorch | Mentioned in GitHub | 
| hojjat-mokhtarabadi/promptsource | | Mentioned in GitHub | 
| bigcode-project/starcoder | pytorch | Mentioned in GitHub | 
| openbiolink/promptsource | | Mentioned in GitHub | 
| MS-P3/code6/tree/main/finetune | mindspore | | 
| google-research/flan | tf | Official; Mentioned in GitHub | 
| bigscience-workshop/promptsource | | Mentioned in GitHub | 
| ukplab/arxiv2025-inherent-limits-plms | | Mentioned in GitHub |
Benchmarks
| Benchmark | Method | Metric | 
|---|---|---|
| common-sense-reasoning-on-arc-challenge | FLAN 137B (zero-shot) | Accuracy: 63.1  | 
| common-sense-reasoning-on-arc-challenge | FLAN 137B (few-shot, k=13) | Accuracy: 63.8  | 
| common-sense-reasoning-on-arc-easy | FLAN 137B (few-shot, k=14) | Accuracy: 80.7  | 
| common-sense-reasoning-on-arc-easy | FLAN 137B (0-shot) | Accuracy: 79.6  | 
| common-sense-reasoning-on-record | FLAN 137B (zero-shot) | EM: 72.5  | 
| common-sense-reasoning-on-record | FLAN 137B (prompt-tuned) | EM: 85.1  | 
| common-sense-reasoning-on-winogrande | FLAN 137B (few-shot, k=16) | Accuracy: 72.8  | 
| common-sense-reasoning-on-winogrande | FLAN 137B (0-shot) | Accuracy: 71.2  | 
| coreference-resolution-on-winograd-schema | FLAN 137B (prompt-tuned) | Accuracy: 86.5  | 
| coreference-resolution-on-winograd-schema | FLAN 137B (zero-shot) | Accuracy: 80.8  | 
| machine-translation-on-wmt2014-english-french | FLAN 137B (few-shot, k=9) | BLEU score: 33.8  | 
| machine-translation-on-wmt2014-english-french | FLAN 137B (zero-shot) | BLEU score: 33.9  | 
| machine-translation-on-wmt2014-french-english | FLAN 137B (few-shot, k=9) | BLEU score: 37.9  | 
| machine-translation-on-wmt2014-french-english | FLAN 137B (zero-shot) | BLEU score: 35.9  | 
| machine-translation-on-wmt2016-english-1 | FLAN 137B (few-shot, k=9) | BLEU score: 20.5  | 
| machine-translation-on-wmt2016-english-1 | FLAN 137B (zero-shot) | BLEU score: 18.9  | 
| machine-translation-on-wmt2016-english-german | FLAN 137B (few-shot, k=11) | BLEU score: 26.1  | 
| machine-translation-on-wmt2016-english-german | FLAN 137B (zero-shot) | BLEU score: 27.0  | 
| machine-translation-on-wmt2016-german-english | FLAN 137B (zero-shot) | BLEU score: 38.9  | 
| machine-translation-on-wmt2016-german-english | FLAN 137B (few-shot, k=11) | BLEU score: 40.7  | 
| machine-translation-on-wmt2016-romanian | FLAN 137B (few-shot, k=9) | BLEU score: 38.1  | 
| machine-translation-on-wmt2016-romanian | FLAN 137B (zero-shot) | BLEU score: 37.3  | 
| natural-language-inference-on-rte | FLAN 137B (8-shot) | Accuracy: 84.5%  | 
| natural-language-inference-on-rte | FLAN 137B (0-shot) | Accuracy: 84.1%  | 
| natural-language-inference-on-rte | FLAN 137B (prompt-tuned) | Accuracy: 91.7%  | 
| natural-language-inference-on-wnli | FLAN 137B (few-shot, k=4) | Accuracy: 70.4  | 
| natural-language-inference-on-wnli | FLAN 137B (zero-shot) | Accuracy: 74.6  | 
| question-answering-on-boolq | FLAN 137B (4-shot) | Accuracy: 84.6  | 
| question-answering-on-boolq | FLAN 137B (0-shot) | Accuracy: 82.9  | 
| question-answering-on-boolq | FLAN 137B (prompt-tuned) | Accuracy: 86.3  | 
| question-answering-on-copa | FLAN 137B (prompt-tuned) | Accuracy: 94  | 
| question-answering-on-copa | FLAN 137B (zero-shot) | Accuracy: 91  | 
| question-answering-on-copa | FLAN 137B (few-shot, k=16) | Accuracy: 87  | 
| question-answering-on-multirc | FLAN 137B (1-shot) | F1: 72.1  | 
| question-answering-on-multirc | FLAN 137B (prompt-tuned) | F1: 83.4  | 
| question-answering-on-multirc | FLAN 137B (zero-shot) | F1: 77.5  | 
| question-answering-on-naturalqa | FLAN 137B (zero-shot) | EM: 20.7  | 
| question-answering-on-obqa | FLAN 137B (few-shot, k=16) | Accuracy: 78.2  | 
| question-answering-on-obqa | FLAN 137B (zero-shot) | Accuracy: 78.4  | 
| question-answering-on-piqa | FLAN 137B (few-shot, k=10) | Accuracy: 81.7  | 
| question-answering-on-piqa | FLAN 137B (0-shot) | Accuracy: 80.5  | 
| question-answering-on-storycloze | FLAN 137B (few-shot, k=10) | Accuracy: 94.7  | 
| question-answering-on-storycloze | FLAN 137B (zero-shot) | Accuracy: 93.4  | 
| question-answering-on-triviaqa | FLAN 137B (zero-shot) | EM: 56.7  | 
| sentiment-analysis-on-imdb | FLAN 137B (zero-shot) | Accuracy: 94.3  | 
| sentiment-analysis-on-imdb | FLAN 137B (few-shot, k=2) | Accuracy: 95  |