Arithmetic Reasoning on GSM8K
Evaluation metrics: Accuracy, Parameters (Billion)
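As a point of reference for the Accuracy column: GSM8K is scored by exact match on the final numeric answer, and reference solutions end with a `#### <answer>` line. The sketch below illustrates that scoring step; the parsing rules are an illustrative assumption, not the evaluation code used by any entry in the table.

```python
import re

# Matches a number such as 12, -3.5, or 1,000 (thousands separators allowed).
NUMBER = r"-?\d[\d,]*(?:\.\d+)?"

def extract_final_answer(text: str) -> str | None:
    """Pull the final numeric answer from a solution string.

    Prefers an explicit '#### <answer>' marker (the GSM8K reference format);
    otherwise falls back to the last number mentioned in the text.
    """
    marked = re.findall(rf"####\s*({NUMBER})", text)
    candidates = marked or re.findall(NUMBER, text)
    return candidates[-1].replace(",", "") if candidates else None

def gsm8k_accuracy(predictions: list[str], references: list[str]) -> float:
    """Exact-match accuracy (%) over final numeric answers."""
    correct = 0
    for pred, ref in zip(predictions, references):
        p, r = extract_final_answer(pred), extract_final_answer(ref)
        if p is not None and r is not None and float(p) == float(r):
            correct += 1
    return 100.0 * correct / len(predictions)

# Toy example: one problem, answered correctly -> 100.0
print(gsm8k_accuracy(
    ["She bakes 3 * 4 = 12 cookies, so the answer is 12."],
    ["3 * 4 = 12\n#### 12"],
))
```

Entries in the leaderboard below differ mainly in how the predictions are produced (greedy decoding, few-shot prompting, code execution, or sampling with voting), rather than in this scoring step.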
Evaluation results
Performance of each model on this benchmark:
| Model Name | Accuracy (%) | Parameters (B) | Paper Title |
|---|---|---|---|
| Claude 3.5 Sonnet (HPT) | 97.72 | - | Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models |
| Qwen2-Math-72B-Instruct (greedy) | 96.7 | 72 | Qwen2 Technical Report |
| SFT-Mistral-7B (MetaMath, OVM, Smart Ensemble) | 96.4 | 7 | - |
| OpenMath2-Llama3.1-70B (majority@256) | 96.0 | - | OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data |
| Jiutian Large Model | 95.2 | 75 | - |
| DAMOMath-7B (MetaMath, OVM, BS, Ensemble) | 95.1 | 7 | - |
| Claude 3 Opus (0-shot chain-of-thought) | 95 | - | The Claude 3 Model Family: Opus, Sonnet, Haiku |
| OpenMath2-Llama3.1-70B | 94.9 | - | OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data |
| GPT-4 (Teaching-Inspired) | 94.8 | - | Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models |
| SFT-Mistral-7B (MetaMath + OVM + Ensemble) | 94.13 | 7 | - |
| OpenMath2-Llama3.1-8B (majority@256) | 94.1 | - | OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data |
| Qwen2-72B-Instruct-Step-DPO (0-shot CoT) | 94.0 | - | Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs |
| DAMOMath-7B (MetaMath, OVM, Ensemble) | 93.2 | 7 | - |
| Claude 3 Sonnet (0-shot chain-of-thought) | 92.3 | - | The Claude 3 Model Family: Opus, Sonnet, Haiku |
| AlphaLLM (with MCTS) | 92 | 70 | Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing |
| OpenMath2-Llama3.1-8B | 91.7 | - | OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data |
| PaLM 2 (few-shot, k=8, SC) | 91.0 | - | PaLM 2 Technical Report |
| GaC (Qwen2-72B-Instruct + Llama-3-70B-Instruct) | 90.91 | - | Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling |
| OpenMath-CodeLlama-70B (w/ code, SC, k=50) | 90.8 | 70 | OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset |
| DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code) | 90.4 | 70 | DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving |
Showing the top 20 of 160 leaderboard entries.
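Several model names in the table encode a voting-based decoding setup, e.g. "majority@256" or "SC, k=50" (self-consistency): the model is sampled k times per problem and the most frequent final answer is the one scored. A minimal sketch of that aggregation step, reusing `extract_final_answer` from the earlier snippet, is shown below; `sample_solutions` and the value of k are hypothetical placeholders, not the setup of any specific entry.

```python
from collections import Counter

def majority_vote(final_answers: list[str | None]) -> str | None:
    """Return the most frequent final answer among k sampled solutions."""
    answers = [a for a in final_answers if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None

# Hypothetical usage: sample_solutions(problem, k) stands in for k
# stochastic generations from whichever model is being evaluated.
# prediction = majority_vote(
#     [extract_final_answer(s) for s in sample_solutions(problem, k=256)]
# )
```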