Long Context Understanding on Ada-LEval
Evaluation Metrics
1k
2k
4k
6k
8k
12k
16k
Evaluation Results
Performance of each model on this benchmark
Comparison Table
Model Name | 1k | 2k | 4k | 6k | 8k | 12k | 16k |
---|---|---|---|---|---|---|---|
judging-llm-as-a-judge-with-mt-bench-and-1 | 37.0 | 11.1 | 5.8 | 3.2 | 1.8 | 1.9 | 1.0 |
Model 2 | 65.0 | 43.5 | 23.5 | 15.0 | 17.0 | 12.0 | 11.0 |
judging-llm-as-a-judge-with-mt-bench-and-1 | 32.4 | 10.7 | 5.7 | 3.1 | 1.9 | 1.6 | 0.8 |
judging-llm-as-a-judge-with-mt-bench-and-1 | 53.4 | 29.2 | 13.1 | 4.3 | 2.2 | 1.4 | 0.9 |
Model 5 | 61.5 | 48.5 | 41.5 | 29.5 | 17.0 | 2.5 | 2.5 |
glm-130b-an-open-bilingual-pre-trained-model | 39.8 | 18.8 | 9.0 | 5.0 | 3.4 | 0.9 | 0.5 |
glm-130b-an-open-bilingual-pre-trained-model | 31.2 | 10.9 | 4.5 | 1.6 | 1.6 | 0.0 | 0.3 |
internlm2-technical-report | 58.6 | 49.5 | 33.9 | 12.3 | 13.4 | 2.0 | 0.8 |
gpt-4-technical-report-1 | 73.5 | 73.5 | 65.5 | 63.0 | 56.5 | 52.0 | 44.5 |
gpt-4-technical-report-1 | 74.0 | 73.5 | 67.5 | 59.5 | 53.5 | 49.5 | 44.0 |