HyperAI
HyperAI超神经
Question Answering on TriviaQA
Evaluation Metric
EM (Exact Match)
Evaluation Results
Performance of each model on this benchmark.
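For reference, EM on open-domain QA benchmarks like TriviaQA is typically computed with SQuAD-style answer normalization (lowercasing, stripping punctuation, articles, and extra whitespace) and counts a prediction as correct if it matches any gold answer alias. The sketch below illustrates that common recipe; individual papers in the table may normalize slightly differently.

```python
import re
import string


def normalize_answer(s: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace (SQuAD-style)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)  # drop English articles
    return " ".join(s.split())


def exact_match(prediction: str, gold_answers: list[str]) -> float:
    """1.0 if the normalized prediction equals any normalized gold alias, else 0.0."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(g) for g in gold_answers))


# Dataset-level EM is the mean per-question score, reported as a percentage.
preds = ["The Eiffel Tower", "paris"]
golds = [["Eiffel Tower"], ["London"]]
em = 100.0 * sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(preds)
# em == 50.0: the first prediction matches after normalization, the second does not.
```

Note that TriviaQA supplies multiple acceptable aliases per question, which is why `exact_match` takes a list of gold answers rather than a single string.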
| Model | EM | Paper Title | Repository |
| --- | --- | --- | --- |
| Claude 2 (few-shot, k=5) | 87.5 | Model Card and Evaluations for Claude Models | - |
| GPT-4-0613 | 87 | - | - |
| Claude 1.3 (few-shot, k=5) | 86.7 | Model Card and Evaluations for Claude Models | - |
| RankRAG-llama3-70b (Zero-Shot, KILT) | 86.5 | RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs | - |
| PaLM 2-L (one-shot) | 86.1 | PaLM 2 Technical Report | - |
| ChatQA-1.5-llama3-70b (Zero-Shot, KILT) | 85.6 | ChatQA: Surpassing GPT-4 on Conversational QA and RAG | - |
| LLaMA 2 70B (one-shot) | 85 | Llama 2: Open Foundation and Fine-Tuned Chat Models | - |
| GPT-4-0613 (Zero-shot) | 84.8 | GPT-4 Technical Report | - |
| RankRAG-llama3-8b (Zero-Shot, KILT) | 82.9 | RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs | - |
| PaLM 2-M (one-shot) | 81.7 | PaLM 2 Technical Report | - |
| PaLM-540B (One-Shot) | 81.4 | PaLM: Scaling Language Modeling with Pathways | - |
| PaLM-540B (Few-Shot) | 81.4 | PaLM: Scaling Language Modeling with Pathways | - |
| ChatQA-1.5-llama3-8B (Zero-Shot, KILT) | 81.0 | ChatQA: Surpassing GPT-4 on Conversational QA and RAG | - |
| GaC (Qwen2-72B-Instruct + Llama-3-70B-Instruct) | 79.29 | Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling | - |
| Claude Instant 1.1 (few-shot, k=5) | 78.9 | Model Card and Evaluations for Claude Models | - |
| code-davinci-002 175B + REPLUG LSR (Few-Shot) | 77.3 | REPLUG: Retrieval-Augmented Black-Box Language Models | - |
| PaLM-540B (Zero-Shot) | 76.9 | PaLM: Scaling Language Modeling with Pathways | - |
| code-davinci-002 175B + REPLUG (Few-Shot) | 76.8 | REPLUG: Retrieval-Augmented Black-Box Language Models | - |
| GLaM 62B/64E (Few-shot) | 75.8 | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | - |
| GLaM 62B/64E (One-shot) | 75.8 | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | - |
Top entries shown; the full leaderboard contains 56 results.