Natural Questions on TheoremQA

Metrics

Accuracy

Results

Performance results of various models on this benchmark

Model	Accuracy (%)	Paper Title
GPT-4 (PoT)	52.4	TheoremQA: A Theorem-driven Question Answering dataset
GPT-4 (CoT)	43.8	TheoremQA: A Theorem-driven Question Answering dataset
GPT-3.5-turbo (PoT)	35.6	TheoremQA: A Theorem-driven Question Answering dataset
DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code)	32.5	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving
DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code)	32.2	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving
PaLM-2-unicorn (CoT)	31.8	TheoremQA: A Theorem-driven Question Answering dataset
GPT-3.5-turbo (CoT)	30.2	TheoremQA: A Theorem-driven Question Answering dataset
DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code)	28.2	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving
DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code)	27.4	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving
Claude-v1 (PoT)	25.9	TheoremQA: A Theorem-driven Question Answering dataset
Claude-v1 (CoT)	24.9	TheoremQA: A Theorem-driven Question Answering dataset
code-davinci-002	23.9	TheoremQA: A Theorem-driven Question Answering dataset
Claude-instant (CoT)	23.6	TheoremQA: A Theorem-driven Question Answering dataset
text-davinci-003	22.8	TheoremQA: A Theorem-driven Question Answering dataset
PaLM-2-bison (CoT)	21.0	TheoremQA: A Theorem-driven Question Answering dataset
DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code)	19.4	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving
DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code)	17.0	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving
DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code)	16.4	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving
DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code)	15.4	DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving
