Reasoning
Performance metrics of mainstream AI models compared across tasks, showcasing the state of the art
AI Model Performance Benchmarks
ARC
50 papers | 0 benchmarks
Discrete Choice Models
50 papers | 0 benchmarks
3D Human Reconstruction
48 papers | 10 benchmarks
Causal Identification
46 papers | 0 benchmarks
Common Sense Reasoning
45 papers | 24 benchmarks
Task Planning
42 papers | 0 benchmarks
StrategyQA
39 papers | 0 benchmarks
Decision Making Under Uncertainty
38 papers | 0 benchmarks
Temporal Sequences
35 papers | 1 benchmark
Physical Intuition
33 papers | 1 benchmark
Assortment Optimization
32 papers | 0 benchmarks
Natural Language Visual Grounding
32 papers | 1 benchmark
Missing Labels
30 papers | 0 benchmarks
Model-based Reinforcement Learning
30 papers | 0 benchmarks
Abstract Argumentation
25 papers | 0 benchmarks
Zero-Shot Video Question Answer
25 papers | 16 benchmarks
Visual Reasoning
24 papers | 12 benchmarks
Systematic Generalization
22 papers | 0 benchmarks
Decision Making
20 papers | 1 benchmark
Geometry Problem Solving
20 papers | 0 benchmarks
Odd One Out
20 papers | 1 benchmark
Video-based Generative Performance Benchmarking
20 papers | 1 benchmark
Abstract Algebra
18 papers | 1 benchmark
Program Repair
18 papers | 3 benchmarks
Image Paragraph Captioning
17 papers | 1 benchmark
Navigate
16 papers | 0 benchmarks
Video-based Generative Performance Benchmarking (Contextual Understanding)
16 papers | 1 benchmark
Video-based Generative Performance Benchmarking (Correctness of Information)
15 papers | 1 benchmark
Video-based Generative Performance Benchmarking (Detail Orientation)
15 papers | 1 benchmark
Video-based Generative Performance Benchmarking (Temporal Understanding)
15 papers | 1 benchmark
Video-based Generative Performance Benchmarking (Consistency)
15 papers | 1 benchmark
Date Understanding
14 papers | 0 benchmarks
Visual Commonsense Reasoning
14 papers | 7 benchmarks
Formal Logic
13 papers | 1 benchmark
Automated Theorem Proving
11 papers | 9 benchmarks
Arithmetic Reasoning
9 papers | 5 benchmarks
Error Understanding
9 papers | 2 benchmarks
Logical Sequence
9 papers | 0 benchmarks
Mathematical Induction
9 papers | 1 benchmark
Physical Commonsense Reasoning
9 papers | 1 benchmark
Analogical Similarity
7 papers | 1 benchmark
Autonomous Web Navigation
7 papers | 0 benchmarks
Causal Judgment
7 papers | 0 benchmarks
Elementary Mathematics
7 papers | 1 benchmark
Logical Reasoning
7 papers | 10 benchmarks
Theory of Mind Modeling
7 papers | 0 benchmarks
GitHub issue resolution
6 papers | 0 benchmarks
Logical Fallacy Detection
6 papers | 0 benchmarks
Math Word Problem Solving
6 papers | 13 benchmarks
Multimodal Reasoning
6 papers | 3 benchmarks
Visual Entailment
6 papers | 3 benchmarks
Human Judgment Correlation
5 papers | 2 benchmarks
Winowhy
5 papers | 0 benchmarks
Checkmate In One
4 papers | 0 benchmarks
High School Mathematics
4 papers | 1 benchmark
Penguins In A Table
4 papers | 0 benchmarks
Anachronisms
3 papers | 0 benchmarks
College Mathematics
3 papers | 1 benchmark
Conformal Prediction
3 papers | 0 benchmarks
Crass AI
3 papers | 1 benchmark
Reasoning About Colored Objects
3 papers | 0 benchmarks
Analytic Entailment
2 papers | 1 benchmark
Crash Blossom
2 papers | 1 benchmark
Entailed Polarity
2 papers | 1 benchmark
Evaluating Information Essentiality
2 papers | 1 benchmark
Human Judgment Classification
2 papers | 1 benchmark
Identify Odd Metaphor
2 papers | 1 benchmark
Logical Args
2 papers | 1 benchmark
Metaphor Boolean
2 papers | 1 benchmark
Novel Concepts
2 papers | 0 benchmarks
Presuppositions As NLI
2 papers | 1 benchmark
Code Line Descriptions
1 paper | 0 benchmarks
Commonsense Reasoning for RL
1 paper | 1 benchmark
Pre-election ratings estimation
1 paper | 0 benchmarks
Professional Accounting
1 paper | 1 benchmark