Code Generation On Humaneval
Metrics
Pass@1
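The page does not say how each entry computed Pass@1, and protocols can differ between submissions. As a rough sketch, the estimator below is the unbiased pass@k formula commonly used with HumanEval: for a problem with n sampled solutions, c of which pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over all problems. The function names `pass_at_k` and `benchmark_pass_at_k` are illustrative, not part of any leaderboard tooling.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled solutions, c: number that pass the tests,
    k: sample budget being scored.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every size-k draw contains a passing one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems, as a percentage.

    results: one (n, c) pair per benchmark problem (hypothetical input format).
    """
    if not results:
        raise ValueError("results must not be empty")
    scores = [pass_at_k(n, c, k) for n, c in results]
    return 100.0 * sum(scores) / len(scores)
```

With a single greedy sample per problem (n = 1, k = 1), this reduces to the percentage of problems solved. HumanEval has 164 problems, so under that assumption a score of 98.2 would correspond to 161 solved problems.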
Results
Performance results of various models on this benchmark
| Model Name | Pass@1 | Paper Title | Repository |
|---|---|---|---|
| MGDebugger (DeepSeek-Coder-V2-Lite) | 96.3 | From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging | |
| LLMDebugger (GPT 4o) | 98.2 | Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step | |
| Llama-3 8B (HPT) | 100 | Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | - |
| Claude 3.5 Sonnet (HPT) | 100 | Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | - |
| AFlow (GPT-4o-mini) | 94.7 | AFlow: Automating Agentic Workflow Generation | |
| CodeSim (GPT-4o and LDB Debugger) | 97.6 | CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | |
| Claude 3.5 Sonnet (0-shot) | 92.0 | - | - |
| CodeSim (o3-mini) | 98.8 | CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | |
| L2MAC (GPT-4) | 90.2 | L2MAC: Large Language Model Automatic Computer for Extensive Code Generation | |
| AgentCoder (GPT-4) | 96.3 | AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation | |
| CodeSim (GPT-4o) | 95.1 | CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | |
| MapCoder (GPT-4) | 93.9 | MapCoder: Multi-Agent Code Generation for Competitive Problem Solving | |
| OctoCoder (GPT-4) | 86.6 | OctoPack: Instruction Tuning Code Large Language Models | - |
| FractalResearch : Pioneer-SWO (GPT-4-turbo) | 91.65 | - | - |
| LLMDebugger (OpenAI o1) | 99.4 | Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step | |
| QualityFlow (Sonnet-3.5) | 98.8 | QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks | - |
| Nexus (Claude 3.5 Sonnet) | 98.8 | Nexus: A Lightweight and Scalable Multi-Agent Framework for Complex Tasks Automation | |
| LPW (GPT-4o) | 98.2 | Planning-Driven Programming: A Large Language Model Programming Workflow | |
| Spark_FP16_medium_v4.1.1 | 85.97 | - | - |
| GPT-4o (0-shot) | 90.2 | Claude 3.5 Sonnet Model Card Addendum | - |