Arithmetic Reasoning on GSM8K
Evaluation metrics: Accuracy, Parameters (Billion)
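As a point of reference for the Accuracy column: GSM8K is scored by exact match on the final numeric answer, and reference solutions end with a `#### <answer>` line. The sketch below illustrates that scoring step; the parsing rules are an illustrative assumption, not the evaluation code used by any entry in the table.

```python
import re

# Matches a number such as 12, -3.5, or 1,000 (thousands separators allowed).
NUMBER = r"-?\d[\d,]*(?:\.\d+)?"

def extract_final_answer(text: str) -> str | None:
    """Pull the final numeric answer from a solution string.

    Prefers an explicit '#### <answer>' marker (the GSM8K reference format);
    otherwise falls back to the last number mentioned in the text.
    """
    marked = re.findall(rf"####\s*({NUMBER})", text)
    candidates = marked or re.findall(NUMBER, text)
    return candidates[-1].replace(",", "") if candidates else None

def gsm8k_accuracy(predictions: list[str], references: list[str]) -> float:
    """Exact-match accuracy (%) over final numeric answers."""
    correct = 0
    for pred, ref in zip(predictions, references):
        p, r = extract_final_answer(pred), extract_final_answer(ref)
        if p is not None and r is not None and float(p) == float(r):
            correct += 1
    return 100.0 * correct / len(predictions)

# Toy example: one problem, answered correctly -> 100.0
print(gsm8k_accuracy(
    ["She bakes 3 * 4 = 12 cookies, so the answer is 12."],
    ["3 * 4 = 12\n#### 12"],
))
```

Entries in the leaderboard below differ mainly in how the predictions are produced (greedy decoding, few-shot prompting, code execution, or sampling with voting), rather than in this scoring step.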
Evaluation results
Performance of each model on this benchmark:
| Model Name | Accuracy (%) | Parameters (B) | Paper Title |
|---|---|---|---|
| Claude 3.5 Sonnet (HPT) | 97.72 | - | Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models |
| Qwen2-Math-72B-Instruct (greedy) | 96.7 | 72 | Qwen2 Technical Report |
| SFT-Mistral-7B (MetaMath, OVM, Smart Ensemble) | 96.4 | 7 | - |
| OpenMath2-Llama3.1-70B (majority@256) | 96.0 | - | OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data |
| Jiutian Large Model | 95.2 | 75 | - |
| DAMOMath-7B (MetaMath, OVM, BS, Ensemble) | 95.1 | 7 | - |
| Claude 3 Opus (0-shot chain-of-thought) | 95 | - | The Claude 3 Model Family: Opus, Sonnet, Haiku |
| OpenMath2-Llama3.1-70B | 94.9 | - | OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data |
| GPT-4 (Teaching-Inspired) | 94.8 | - | Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models |
| SFT-Mistral-7B (MetaMath + OVM + Ensemble) | 94.13 | 7 | - |
| OpenMath2-Llama3.1-8B (majority@256) | 94.1 | - | OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data |
| Qwen2-72B-Instruct-Step-DPO (0-shot CoT) | 94.0 | - | Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs |
| DAMOMath-7B (MetaMath, OVM, Ensemble) | 93.2 | 7 | - |
| Claude 3 Sonnet (0-shot chain-of-thought) | 92.3 | - | The Claude 3 Model Family: Opus, Sonnet, Haiku |
| AlphaLLM (with MCTS) | 92 | 70 | Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing |
| OpenMath2-Llama3.1-8B | 91.7 | - | OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data |
| PaLM 2 (few-shot, k=8, SC) | 91.0 | - | PaLM 2 Technical Report |
| GaC (Qwen2-72B-Instruct + Llama-3-70B-Instruct) | 90.91 | - | Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling |
| OpenMath-CodeLlama-70B (w/ code, SC, k=50) | 90.8 | 70 | OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset |
| DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code) | 90.4 | 70 | DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving |
Showing the top 20 of 160 leaderboard entries.
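Several model names in the table encode a voting-based decoding setup, e.g. "majority@256" or "SC, k=50" (self-consistency): the model is sampled k times per problem and the most frequent final answer is the one scored. A minimal sketch of that aggregation step, reusing `extract_final_answer` from the earlier snippet, is shown below; `sample_solutions` and the value of k are hypothetical placeholders, not the setup of any specific entry.

```python
from collections import Counter

def majority_vote(final_answers: list[str | None]) -> str | None:
    """Return the most frequent final answer among k sampled solutions."""
    answers = [a for a in final_answers if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None

# Hypothetical usage: sample_solutions(problem, k) stands in for k
# stochastic generations from whichever model is being evaluated.
# prediction = majority_vote(
#     [extract_final_answer(s) for s in sample_solutions(problem, k=256)]
# )
```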