HyperAI

Code Generation on HumanEval

Metrics

Pass@1
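
As a rough illustration of the metric, the sketch below computes pass@k with the unbiased estimator commonly used for HumanEval, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated for a problem and c is the number that pass the unit tests. This page does not state how each submission computed its score, so treat this as an assumption about the usual convention rather than a description of any listed method. The function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains at least one correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-problem pass@k over (n, c) pairs, returned as a percentage."""
    if not results:
        raise ValueError("results must not be empty")
    scores = [pass_at_k(n, c, k) for n, c in results]
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    # Assumed setup: 164 problems, one sample each, 161 of them solved.
    results = [(1, 1)] * 161 + [(1, 0)] * 3
    print(round(benchmark_pass_at_k(results, k=1), 1))  # prints 98.2
```

With a single sample per problem, Pass@1 reduces to the fraction of problems solved, which is why scores on a 164-problem set move in steps of roughly 0.6 points.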

Results

Performance results of various models on this benchmark

| Model Name | Pass@1 | Paper Title | Repository |
| --- | --- | --- | --- |
| MGDebugger (DeepSeek-Coder-V2-Lite) | 96.3 | From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging | |
| LLMDebugger (GPT 4o) | 98.2 | Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step | |
| Llama-3 8B (HPT) | 100 | Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | - |
| Claude 3.5 Sonnet (HPT) | 100 | Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | - |
| AFlow(GPT-4o-mini) | 94.7 | AFlow: Automating Agentic Workflow Generation | |
| CodeSim (GPT-4o and LDB Debugger) | 97.6 | CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | |
| Claude 3.5 Sonnet (0-shot) | 92.0 | - | - |
| CodeSim (o3-mini) | 98.8 | CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | |
| L2MAC (GPT-4) | 90.2 | L2MAC: Large Language Model Automatic Computer for Extensive Code Generation | |
| AgentCoder (GPT-4) | 96.3 | AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation | |
| CodeSim (GPT-4o) | 95.1 | CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | |
| MapCoder (GPT-4) | 93.9 | MapCoder: Multi-Agent Code Generation for Competitive Problem Solving | |
| OctorCoder (GPT-4) | 86.6 | OctoPack: Instruction Tuning Code Large Language Models | - |
| FractalResearch : Pioneer-SWO (GPT-4-turbo) | 91.65 | - | - |
| LLMDebugger (OpenAI o1) | 99.4 | Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step | |
| QualityFlow (Sonnet-3.5) | 98.8 | QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks | - |
| Nexus (Claude 3.5 Sonnet) | 98.8 | Nexus: A Lightweight and Scalable Multi-Agent Framework for Complex Tasks Automation | |
| LPW (GPT-4o) | 98.2 | Planning-Driven Programming: A Large Language Model Programming Workflow | |
| Spark_FP16_medium_v4.1.1 | 85.97 | - | - |
| GPT-4o (0-shot) | 90.2 | Claude 3.5 Sonnet Model Card Addendum | - |