RAFT: A Real-World Few-Shot Text Classification Benchmark

Abstract

Large pre-trained language models have shown promise for few-shot learning, completing text-based tasks given only a few task-specific examples. Will models soon solve classification tasks that have so far been reserved for human research assistants? Existing benchmarks are not designed to measure progress in applied settings, and so don't directly answer this question. The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment. Baseline evaluations on RAFT reveal areas current techniques struggle with: reasoning over long texts and tasks with many classes. Human baselines show that some classification tasks are difficult for non-expert humans, reflecting that real-world value sometimes depends on domain expertise. Yet even non-expert human baseline F1 scores exceed GPT-3 by an average of 0.11. The RAFT datasets and leaderboard will track which model improvements translate into real-world benefits at https://raft.elicit.org.

Benchmarks

Benchmark: few-shot-text-classification-on-raft

| Methodology | Over | ADE | Avg | B77 | NIS | OSE | SOT | SRI | TAI | TC | TEH | ToS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-3 zero-shot | 0.378 | 0.163 | 0.292 | 0.000 | 0.572 | 0.323 | 0.628 | 0.027 | 0.362 | 0.290 | 0.303 | 0.164 |
| Plurality-class | 0.337 | 0.446 | 0.331 | 0.000 | 0.353 | 0.164 | 0.271 | 0.493 | 0.344 | 0.391 | 0.366 | 0.471 |
| GPT-2 | 0.498 | 0.600 | 0.458 | 0.121 | 0.561 | 0.245 | 0.380 | 0.492 | 0.612 | 0.723 | 0.311 | 0.498 |
| AdaBoost | 0.838 | 0.543 | 0.514 | 0.023 | 0.626 | 0.475 | 0.455 | 0.506 | 0.556 | 0.625 | 0.443 | 0.560 |
| BART MNLI zero-shot | 0.462 | 0.234 | 0.382 | 0.332 | 0.615 | 0.360 | 0.644 | 0.026 | 0.469 | 0.400 | 0.543 | 0.122 |
| GPT-3 | 0.937 | 0.686 | 0.627 | 0.299 | 0.679 | 0.431 | 0.769 | 0.516 | 0.656 | 0.821 | 0.526 | 0.574 |
| GPT-Neo | 0.681 | 0.452 | 0.481 | 0.149 | 0.408 | 0.343 | 0.406 | 0.493 | 0.605 | 0.636 | 0.554 | 0.565 |
| Human (crowdsourced) | 0.917 | 0.830 | 0.735 | 0.607 | 0.857 | 0.646 | 0.908 | 0.468 | 0.609 | 0.897 | 0.722 | 0.627 |
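
The abstract's claim that crowdsourced human scores exceed GPT-3 by an average of 0.11 can be checked against the per-task scores listed above. The following is a minimal sketch, assuming the listed values are the F1 scores the abstract refers to and that Avg is the unweighted mean over the 11 tasks; the variable and function names are illustrative and not part of any RAFT tooling.

```python
# Per-task scores copied from the Human (crowdsourced) and GPT-3 entries above.
# Assumption: these are F1 scores and every task carries equal weight.
TASKS = ["Over", "ADE", "B77", "NIS", "OSE", "SOT", "SRI", "TAI", "TC", "TEH", "ToS"]
HUMAN = [0.917, 0.830, 0.607, 0.857, 0.646, 0.908, 0.468, 0.609, 0.897, 0.722, 0.627]
GPT3 = [0.937, 0.686, 0.299, 0.679, 0.431, 0.769, 0.516, 0.656, 0.821, 0.526, 0.574]


def mean_gap(a, b):
    """Unweighted mean of the element-wise differences a[i] - b[i]."""
    return sum(x - y for x, y in zip(a, b)) / len(a)


gap = mean_gap(HUMAN, GPT3)

# Tasks on which GPT-3 scores higher than the crowdsourced human baseline.
gpt3_wins = [task for task, h, g in zip(TASKS, HUMAN, GPT3) if g > h]

print(f"mean human minus GPT-3 gap: {gap:.3f}")
print("tasks where GPT-3 scores higher:", gpt3_wins)
```

Under these assumptions the mean gap rounds to 0.11, in line with the abstract, and GPT-3 scores higher than the crowdsourced baseline on three tasks (Over, SRI, TAI).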
