
Abstract
Large pre-trained language models have shown promise for few-shot learning, completing text-based tasks given only a few task-specific examples. Will models soon solve classification tasks that have so far been reserved for human research assistants? Existing benchmarks are not designed to measure progress in applied settings, and so don't directly answer this question. The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment. Baseline evaluations on RAFT reveal areas that current techniques struggle with: reasoning over long texts and tasks with many classes. Human baselines show that some classification tasks are difficult for non-expert humans, reflecting that real-world value sometimes depends on domain expertise. Yet even non-expert human baseline F1 scores exceed those of GPT-3 by an average of 0.11. The RAFT datasets and leaderboard will track which model improvements translate into real-world benefits at https://raft.elicit.org.
Benchmarks
| Benchmark | Methodology | Over | ADE | Avg | B77 | NIS | OSE | SOT | SRI | TAI | TC | TEH | ToS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| few-shot-text-classification-on-raft | GPT-3 zero-shot | 0.378 | 0.163 | 0.292 | 0.000 | 0.572 | 0.323 | 0.628 | 0.027 | 0.362 | 0.290 | 0.303 | 0.164 |
| few-shot-text-classification-on-raft | Plurality-class | 0.337 | 0.446 | 0.331 | 0.000 | 0.353 | 0.164 | 0.271 | 0.493 | 0.344 | 0.391 | 0.366 | 0.471 |
| few-shot-text-classification-on-raft | GPT-2 | 0.498 | 0.600 | 0.458 | 0.121 | 0.561 | 0.245 | 0.380 | 0.492 | 0.612 | 0.723 | 0.311 | 0.498 |
| few-shot-text-classification-on-raft | AdaBoost | 0.838 | 0.543 | 0.514 | 0.023 | 0.626 | 0.475 | 0.455 | 0.506 | 0.556 | 0.625 | 0.443 | 0.560 |
| few-shot-text-classification-on-raft | BART MNLI zero-shot | 0.462 | 0.234 | 0.382 | 0.332 | 0.615 | 0.360 | 0.644 | 0.026 | 0.469 | 0.400 | 0.543 | 0.122 |
| few-shot-text-classification-on-raft | GPT-3 | 0.937 | 0.686 | 0.627 | 0.299 | 0.679 | 0.431 | 0.769 | 0.516 | 0.656 | 0.821 | 0.526 | 0.574 |
| few-shot-text-classification-on-raft | GPT-Neo | 0.681 | 0.452 | 0.481 | 0.149 | 0.408 | 0.343 | 0.406 | 0.493 | 0.605 | 0.636 | 0.554 | 0.565 |
| few-shot-text-classification-on-raft | Human (crowdsourced) | 0.917 | 0.830 | 0.735 | 0.607 | 0.857 | 0.646 | 0.908 | 0.468 | 0.609 | 0.897 | 0.722 | 0.627 |
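
The abstract's claim that the non-expert human baseline exceeds GPT-3 by an average of 0.11 can be checked against the GPT-3 and Human (crowdsourced) rows above. The following is a minimal sketch, assuming that the Avg column is the unweighted mean of the other eleven task scores; that assumption is not stated on this page, although the numbers in the table are consistent with it. The variable names are illustrative only.

```python
# Per-task scores copied from the GPT-3 and Human (crowdsourced) rows of the
# table above. The "Avg" column is left out because it is recomputed here.
TASKS = ["Over", "ADE", "B77", "NIS", "OSE", "SOT", "SRI", "TAI", "TC", "TEH", "ToS"]
gpt3 = [0.937, 0.686, 0.299, 0.679, 0.431, 0.769, 0.516, 0.656, 0.821, 0.526, 0.574]
human = [0.917, 0.830, 0.607, 0.857, 0.646, 0.908, 0.468, 0.609, 0.897, 0.722, 0.627]


def mean(values):
    """Unweighted arithmetic mean (assumed to be how "Avg" is defined)."""
    return sum(values) / len(values)


gpt3_avg = mean(gpt3)
human_avg = mean(human)
gap = human_avg - gpt3_avg

# Tasks on which GPT-3 scores higher than the crowdsourced human baseline.
tasks_where_gpt3_leads = [t for t, g, h in zip(TASKS, gpt3, human) if g > h]

print(f"GPT-3 mean: {gpt3_avg:.3f}")   # 0.627, matching the Avg column
print(f"Human mean: {human_avg:.3f}")  # 0.735, matching the Avg column
print(f"Gap:        {gap:.2f}")        # 0.11, matching the abstract
print("GPT-3 leads on:", tasks_where_gpt3_leads)
```

The averaged gap hides per-task variation: on the scores listed here GPT-3 is ahead on Over, SRI and TAI, while the human baseline leads on the remaining eight tasks.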