Large Language Models Can Self-Improve

Jiaxin Huang Shixiang Shane Gu Le Hou Yuexin Wu Xuezhi Wang Hongkun Yu Jiawei Han

Abstract

Large Language Models (LLMs) have achieved excellent performance in various tasks. However, fine-tuning an LLM requires extensive supervision. Humans, on the other hand, may improve their reasoning abilities by self-thinking without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate "high-confidence" rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%->82.1% on GSM8K, 78.2%->83.0% on DROP, 90.0%->94.4% on OpenBookQA, and 63.4%->67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground-truth labels. We conduct ablation studies and show that fine-tuning on reasoning is critical for self-improvement.
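The data-generation step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the LLM has already been sampled several times per unlabeled question with Chain-of-Thought prompting, and that each sample's final answer has been parsed out of its rationale. Self-consistency then takes a majority vote over the final answers, and the rationales that reach the majority answer become fine-tuning targets. The names `SampledPath`, `majority_vote`, `build_self_training_data` and the `min_confidence` threshold of 0.5 are hypothetical choices made for this sketch only.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class SampledPath:
    """One sampled Chain-of-Thought output (hypothetical container for this sketch)."""
    rationale: str  # full reasoning text produced by the model
    answer: str     # final answer parsed from the rationale


def majority_vote(paths):
    """Self-consistency: return the most frequent final answer and its vote share."""
    counts = Counter(p.answer for p in paths)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(paths)


def build_self_training_data(question, paths, min_confidence=0.5):
    """Turn sampled paths for one unlabeled question into fine-tuning examples.

    Assumption: a question counts as "high-confidence" when the majority
    answer's vote share reaches min_confidence (0.5 is illustrative only).
    """
    answer, confidence = majority_vote(paths)
    if confidence < min_confidence:
        return []  # too little agreement among samples: skip this question
    # Keep every rationale that reaches the majority answer as a target output.
    return [
        {"input": question, "target": p.rationale}
        for p in paths
        if p.answer == answer
    ]
```

For example, if three sampled paths answer "7", "12" and "7", the vote share for "7" is 2/3, which passes the 0.5 threshold, so the two rationales ending in "7" are kept as training targets and the "12" rationale is discarded.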

Benchmarks

Arithmetic reasoning on GSM8K (PaLM 540B, 540 billion parameters; accuracy in %)
Standard-Prompting: 17.9
CoT Prompting: 56.5
Self Consistency: 74.4
Self Improvement, Standard-Prompting: 32.2
Self Improvement, CoT Prompting: 73.5
Self Improvement, Self Consistency: 82.1

Common-sense reasoning on ARC-Challenge (PaLM 540B; accuracy in %)
Standard-Prompting: 87.1
CoT Prompting: 85.2
Self Consistency: 88.7
Self Improvement, Standard-Prompting: 87.2
Self Improvement, CoT Prompting: 88.3
Self Improvement, Self Consistency: 89.8

Natural language inference on the ANLI test sets (PaLM 540B; accuracy in % on rounds A2 / A3)
Standard-Prompting: A2 55.8, A3 55.8
CoT Prompting: A2 58.9, A3 60.6
Self Consistency: A2 64.5, A3 63.4
Self Improvement, Standard-Prompting: A2 64.8, A3 66.9
Self Improvement, CoT Prompting: A2 65.3, A3 67.3
Self Improvement, Self Consistency: A2 66.5, A3 67.9

Question answering on DROP (PaLM 540B; accuracy in %)
Standard-Prompting: 60.0
CoT Prompting: 70.6
Self Consistency: 78.2
Self Improvement, Standard-Prompting: 71.7
Self Improvement, CoT Prompting: 76.2
Self Improvement, Self Consistency: 83.0

Question answering on OpenBookQA (PaLM 540B; accuracy in %)
Standard-Prompting: 84.4
CoT Prompting: 86.4
Self Consistency: 90.0
Self Improvement, Standard-Prompting: 92.0
Self Improvement, CoT Prompting: 93.0
Self Improvement, Self Consistency: 94.4
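The gains quoted in the abstract are the differences between the Self Consistency results and the Self Improvement, Self Consistency results for each benchmark. The short sketch below recomputes them from those reported numbers. It also computes the share of remaining errors removed, a derived view that the paper itself does not report; it shows, for instance, that a 4.4-point gain on OpenBookQA removes a larger fraction of errors than a similar gain on ANLI-A3, because the OpenBookQA baseline is already close to 100%. The function names are hypothetical, chosen for this illustration.

```python
# Accuracy (%) of PaLM 540B with self-consistency, before and after
# self-improvement, as reported in the abstract.
results = {
    "GSM8K": (74.4, 82.1),
    "DROP": (78.2, 83.0),
    "OpenBookQA": (90.0, 94.4),
    "ANLI-A3": (63.4, 67.9),
}


def absolute_gain(before, after):
    """Improvement in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


def relative_error_reduction(before, after):
    """Fraction of the remaining errors removed (derived metric, not from the paper)."""
    err_before, err_after = 100.0 - before, 100.0 - after
    return (err_before - err_after) / err_before


if __name__ == "__main__":
    for task, (before, after) in results.items():
        print(f"{task}: +{absolute_gain(before, after)} points, "
              f"{relative_error_reduction(before, after):.1%} of errors removed")
```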
