Code Generation on HumanEval
Evaluation Metric
Pass@1
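Pass@1 is the share of problems for which a generated solution passes the problem's unit tests. The sketch below assumes the unbiased pass@k estimator introduced with HumanEval in the Codex paper (Chen et al., 2021). This page does not say how each entry sampled its solutions (for example, one greedy sample or many sampled completions), so treat this as an illustration rather than the exact procedure behind any row. The helper names `pass_at_k` and `benchmark_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem (illustrative sketch).

    n: samples generated for the problem
    c: samples that pass all unit tests
    k: evaluation budget (k = 1 for Pass@1)
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    # 1 - P(all k drawn samples fail) = 1 - C(n - c, k) / C(n, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average per-problem pass@k over the benchmark, as a percentage.

    results: one (n, c) pair per problem (hypothetical input format).
    """
    total = sum(pass_at_k(n, c, k) for n, c in results)
    return 100.0 * total / len(results)


if __name__ == "__main__":
    # One greedy sample per problem on 164 problems, 158 of them solved.
    demo = [(1, 1)] * 158 + [(1, 0)] * 6
    print(round(benchmark_pass_at_k(demo), 1))
```

For k = 1 the estimator reduces to c / n for each problem. With a single greedy sample per problem, Pass@1 is therefore the percentage of the benchmark's 164 problems that are solved.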
Evaluation Results
Performance of each model on this benchmark
Comparison Table
Model name | Pass@1 (%) |
---|---|
from-code-to-correctness-closing-the-last | 96.3 |
ldb-a-large-language-model-debugger-via | 98.2 |
hierarchical-prompting-taxonomy-a-universal | 100 |
hierarchical-prompting-taxonomy-a-universal | 100 |
aflow-automating-agentic-workflow-generation | 94.7 |
codesim-multi-agent-code-generation-and-1 | 97.6 |
Model 7 | 92.0 |
codesim-multi-agent-code-generation-and-1 | 98.8 |
l2mac-large-language-model-automatic-computer | 90.2 |
agentcoder-multi-agent-based-code-generation | 96.3 |
codesim-multi-agent-code-generation-and-1 | 95.1 |
mapcoder-multi-agent-code-generation-for | 93.9 |
octopack-instruction-tuning-code-large | 86.6 |
Model 14 | 91.65 |
ldb-a-large-language-model-debugger-via | 99.4 |
qualityflow-an-agentic-workflow-for-program | 98.8 |
nexus-a-lightweight-and-scalable-multi-agent | 98.8 |
planning-driven-programming-a-large-language | 98.2 |
Model 19 | 85.97 |
claude-3-5-sonnet-model-card-addendum | 90.2 |
metagpt-meta-programming-for-multi-agent | 85.9 |