Multi-Task Language Understanding on MMLU
Evaluation metric: Average (%)

Evaluation results: the table below lists each model's performance on this benchmark.
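Average (%) is the model's multiple-choice accuracy averaged over MMLU's 57 subject tests; whether a paper reports a micro-average over all questions or a macro-average over subjects varies, and the two are usually close. Below is a minimal sketch of the macro-averaged variant, assuming the Hugging Face `datasets` library and the `cais/mmlu` dataset id; `predict` is a hypothetical placeholder for the model under evaluation.

```python
# Minimal sketch: macro-averaged MMLU accuracy.
# Assumes the Hugging Face `datasets` library and the `cais/mmlu` dataset id;
# `predict` is a hypothetical stand-in for the model being evaluated.
from collections import defaultdict
from datasets import load_dataset

def mmlu_average(predict) -> float:
    """Return macro-average accuracy (%) over all MMLU subjects.

    `predict(question, choices)` should return the index (0-3) of the
    chosen answer option.
    """
    test = load_dataset("cais/mmlu", "all", split="test")
    correct, total = defaultdict(int), defaultdict(int)
    for ex in test:
        subject = ex["subject"]
        total[subject] += 1
        if predict(ex["question"], ex["choices"]) == ex["answer"]:
            correct[subject] += 1
    per_subject = [correct[s] / total[s] for s in total]
    return 100.0 * sum(per_subject) / len(per_subject)

# Example: a trivial baseline that always picks option A scores roughly 25%.
if __name__ == "__main__":
    print(f"{mmlu_average(lambda question, choices: 0):.1f}")
```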
| Model | Average (%) | Paper Title | Repository |
|---|---|---|---|
| Claude 3.5 Sonnet (5-shot) | 88.7 | Claude 3.5 Sonnet Model Card Addendum | - |
| DeepSeek-R1 (671B) | 87.5 | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | |
| GPT-4 o1 (300B) | 87 | GPT-4o as the Gold Standard: A Scalable and General Purpose Approach to Filter Language Model Pretraining Data | - |
| Llama 3.1 (405B) | 86.6 | Llama 3 Meets MoE: Efficient Upcycling | |
| Llama 3.1 (70B) | 86.0 | Llama 3 Meets MoE: Efficient Upcycling | |
| Gemini Ultra (5-shot) | 83.7 | - | - |
| Qwen2-72B-Instruct | 83.54 | Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling | |
| Claude 3 Sonnet (5-shot) | 79 | The Claude 3 Model Family: Opus, Sonnet, Haiku | - |
| Qwen1.5 72B (5-shot) | 77.5 | - | - |
| Leeroo (5-shot) | 75.9 | Routoo: Learning to Route to Large Language Models Effectively | |
| Camelidae-8×34B (5-shot) | 75.6 | Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | |
| Claude 3 Haiku (5-shot) | 75.2 | The Claude 3 Model Family: Opus, Sonnet, Haiku | - |
| DBRX Instruct 132B (5-shot) | 73.7 | The Llama 3 Herd of Models | |
| Llama 2 (65B) | 73.5 | Scaling Instruction-Finetuned Language Models | |
| Claude Instant 1.1 (5-shot) | 73.4 | Model Card and Evaluations for Claude Models | - |
| Llama 3.1 8B (CoT) | 73.0 | The Llama 3 Herd of Models | |
| Flan-PaLM (5-shot, finetuned) | 72.2 | Scaling Instruction-Finetuned Language Models | |
| Gemini Pro (5-shot) | 71.8 | - | - |
| Mixtral 8x7B (5-shot) | 70.6 | Mixtral of Experts | |
| Falcon 180B (5-shot) | 70.6 | The Falcon Series of Open Language Models | - |
The table above shows the top 20 of 61 leaderboard entries.
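Most entries are marked (5-shot), meaning five worked examples from the corresponding subject's dev split are prepended to each test question before the model answers; the (CoT) entry instead elicits a chain-of-thought rationale before the final answer letter. Below is a rough sketch of one common 5-shot prompt layout, again assuming the `cais/mmlu` dataset; the exact template and header wording differ between papers.

```python
# Rough sketch of a 5-shot MMLU prompt, assuming the `cais/mmlu` dev split
# (which holds five examples per subject). The template follows the common
# "Question / A-D choices / Answer:" layout; exact wording varies by paper.
from datasets import load_dataset

LETTERS = ["A", "B", "C", "D"]

def format_example(ex, include_answer=True):
    # Render one question with its lettered options and, optionally, its answer.
    lines = [ex["question"]]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(ex["choices"])]
    answer = f" {LETTERS[ex['answer']]}" if include_answer else ""
    lines.append(f"Answer:{answer}")
    return "\n".join(lines)

def five_shot_prompt(subject, test_example):
    # Five solved dev examples from the same subject precede the test question.
    dev = load_dataset("cais/mmlu", subject, split="dev")
    header = (
        f"The following are multiple choice questions (with answers) "
        f"about {subject.replace('_', ' ')}.\n\n"
    )
    shots = "\n\n".join(format_example(ex) for ex in dev.select(range(5)))
    return header + shots + "\n\n" + format_example(test_example, include_answer=False)
```

The model's completion after the final "Answer:" is then matched against the gold option letter to score the question.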