Hyung Won Chung; Le Hou; Shayne Longpre; Barret Zoph; Yi Tay; William Fedus; Yunxuan Li; Xuezhi Wang; Mostafa Dehghani; Siddhartha Brahma; Albert Webson; Shixiang Shane Gu; Zhuyun Dai; Mirac Suzgun; Xinyun Chen; Aakanksha Chowdhery; Alex Castro-Ros; Marie Pellat; Kevin Robinson; Dasha Valter; Sharan Narang; Gaurav Mishra; Adams Yu; Vincent Zhao; Yanping Huang; Andrew Dai; Hongkun Yu; Slav Petrov; Ed H. Chi; Jeff Dean; Jacob Devlin; Adam Roberts; Denny Zhou; Quoc V. Le; Jason Wei

Abstract
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. This paper explores three aspects of instruction finetuning in particular: (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning along these dimensions dramatically improves performance across a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B, instruction-finetuned on 1.8K tasks, outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B also achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. In addition, we publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
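As a concrete illustration of using the publicly released checkpoints, here is a minimal sketch of zero-shot instruction prompting with Flan-T5 via the Hugging Face `transformers` library. The checkpoint name (`google/flan-t5-base`), the prompt, and the generation settings are illustrative choices, not prescribed by the paper.

```python
# A minimal sketch: zero-shot prompting with a released Flan-T5 checkpoint.
# "google/flan-t5-base" is one of the public Flan-T5 checkpoints on the
# Hugging Face Hub; prompt and decoding settings here are illustrative.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Instruction-style prompt: the model was finetuned on tasks phrased as
# natural-language instructions, so no in-context examples are required.
prompt = "Answer the following question. What is the boiling point of water in Celsius?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```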
Code Repositories
- declare-lab/flan-alpaca (PyTorch) · Mentioned on GitHub
- joelniklaus/lawinstruct · Mentioned on GitHub
- formulamonks/llm-benchmarker-suite (PyTorch) · Mentioned on GitHub
- google-research/flan (TensorFlow) · Mentioned on GitHub
- theoremone/llm-benchmarker-suite (PyTorch) · Mentioned on GitHub
- zchuz/timebench · Mentioned on GitHub
- kapllan/zeroshot_lexglue · Mentioned on GitHub
- coastalcph/zeroshot_lexglue · Mentioned on GitHub
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| coreference-resolution-on-winograd-schema | Flan-T5 XXL (zero-shot) | Accuracy: 89.82 |
| cross-lingual-question-answering-on-tydiqa | Flan-PaLM 540B (direct-prompting) | EM: 67.8  | 
| cross-lingual-question-answering-on-tydiqa | Flan-U-PaLM 540B (direct-prompting) | EM: 68.3  | 
| multi-task-language-understanding-on-bbh-alg | Flan-PaLM 540B (3-shot, fine-tuned, CoT) | Average (%): 61.3  | 
| multi-task-language-understanding-on-bbh-alg | PaLM 540B (CoT) | Average (%): 57.6  | 
| multi-task-language-understanding-on-bbh-alg | Flan-PaLM 540B (3-shot, fine-tuned, CoT + SC) | Average (%): 66.5  | 
| multi-task-language-understanding-on-bbh-alg | PaLM 540B | Average (%): 38.3  | 
| multi-task-language-understanding-on-bbh-alg | Flan-PaLM 540B (3-shot, fine-tuned) | Average (%): 48.2  | 
| multi-task-language-understanding-on-bbh-alg | PaLM 540B (CoT + self-consistency) | Average (%): 62.2  | 
| multi-task-language-understanding-on-bbh-nlp | PaLM 540B (CoT) | Average (%): 71.2  | 
| multi-task-language-understanding-on-bbh-nlp | PaLM 540B | Average (%): 62.7  | 
| multi-task-language-understanding-on-bbh-nlp | Flan-PaLM 540B (5-shot, fine-tuned) | Average (%): 70.0 |
| multi-task-language-understanding-on-bbh-nlp | Flan-PaLM 540B (3-shot, fine-tuned, CoT + SC) | Average (%): 78.4  | 
| multi-task-language-understanding-on-bbh-nlp | PaLM 540B (CoT + self-consistency) | Average (%): 78.2  | 
| multi-task-language-understanding-on-bbh-nlp | Flan-PaLM 540B (3-shot, fine-tuned, CoT) | Average (%): 72.4  | 
| multi-task-language-understanding-on-mgsm | Flan-U-PaLM 540B (CoT) | Average (%): 60.4  | 
| multi-task-language-understanding-on-mgsm | Flan-PaLM 540B (8-shot, fine-tuned, CoT + SC) | Average (%): 72.0  | 
| multi-task-language-understanding-on-mgsm | code-davinci-002 | Average (%): 35  | 
| multi-task-language-understanding-on-mgsm | Flan-PaLM 540B (8-shot, fine-tuned, CoT) | Average (%): 57.0  | 
| multi-task-language-understanding-on-mgsm | GPT-3 Davinci 175B | Average (%): 5.7  | 
| multi-task-language-understanding-on-mgsm | text-davinci-003 | Average (%): 36  | 
| multi-task-language-understanding-on-mgsm | Flan-PaLM 540B (8-shot, fine-tuned) | Average (%): 21.2  | 
| multi-task-language-understanding-on-mgsm | text-davinci-002 | Average (%): 23.7  | 
| multi-task-language-understanding-on-mmlu | Flan-T5-Base 250M (CoT) | Average (%): 33.7  | 
| multi-task-language-understanding-on-mmlu | LLaMA 2 (65B) | Average (%): 73.5 |
| multi-task-language-understanding-on-mmlu | Flan-T5-Small 80M | Average (%): 28.7  | 
| multi-task-language-understanding-on-mmlu | GPT-3 Davinci 175B (CoT) | Average (%): 59.5  | 
| multi-task-language-understanding-on-mmlu | Flan-T5-Large 780M | Average (%): 45.1  | 
| multi-task-language-understanding-on-mmlu | Flan-T5-XL 3B (CoT) | Average (%): 45.5  | 
| multi-task-language-understanding-on-mmlu | Flan-T5-Base 250M | Average (%): 35.9  | 
| multi-task-language-understanding-on-mmlu | Flan-PaLM 540B (5-shot, fine-tuned) | Average (%): 72.2 |
| multi-task-language-understanding-on-mmlu | Flan-T5-Large 780M (CoT) | Average (%): 40.5  | 
| multi-task-language-understanding-on-mmlu | GPT-3 Davinci 175B (5-shot) | Average (%): 39.7  |
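Several entries in the table above use "CoT + SC", i.e. chain-of-thought prompting combined with self-consistency: sampling multiple reasoning paths at nonzero temperature and taking a majority vote over the final answers. The sketch below shows that voting step under stated assumptions: `generate_cot` is a hypothetical stand-in for any sampling-based CoT generation call, and the last-number answer extraction is a common heuristic for math benchmarks such as MGSM, not the paper's exact procedure.

```python
# A minimal sketch of chain-of-thought self-consistency ("CoT + SC").
import re
from collections import Counter

def extract_answer(completion: str) -> str:
    """Take the last number in the completion as the final answer
    (a common heuristic for math benchmarks such as MGSM)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else ""

def self_consistency(generate_cot, prompt: str, num_samples: int = 16) -> str:
    """Sample `num_samples` CoT completions (generate_cot must sample at
    nonzero temperature) and return the most frequent final answer."""
    answers = [extract_answer(generate_cot(prompt)) for _ in range(num_samples)]
    answers = [a for a in answers if a]  # drop completions with no parseable answer
    return Counter(answers).most_common(1)[0][0] if answers else ""

# Toy usage; a real generate_cot would call a language model with sampling:
# self_consistency(lambda p: "Step 1 ... so the answer is 42.", "Q: ...")  -> "42"
```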