TheoremQA: A Theorem-driven Question Answering Dataset
Wenhu Chen; Ming Yin; Max Ku; Pan Lu; Yixin Wan; Xueguang Ma; Jianyu Xu; Xinyi Wang; Tony Xia

Abstract
Recent LLMs like GPT-4 and PaLM-2 have made tremendous progress on fundamental math benchmarks such as GSM8K, achieving over 90% accuracy. However, their ability to solve more challenging math problems that require domain-specific knowledge (i.e., theorems) has yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed to evaluate AI models' capability to apply theorems to solve challenging science problems. TheoremQA is curated by domain experts and contains 800 high-quality questions covering 350 theorems (e.g., Taylor's theorem, Lagrange's theorem, Huffman coding, Quantum Theorem, Elasticity Theorem) from Math, Physics, EE&CS, and Finance. We evaluate a wide spectrum of 16 large language and code models with different prompting strategies, such as Chain-of-Thoughts (CoT) and Program-of-Thoughts (PoT). We find that GPT-4's ability to solve these problems is unparalleled, reaching an accuracy of 51% with Program-of-Thoughts prompting, while all existing open-source models score below 15%, barely surpassing the random-guess baseline. Given its diversity and broad coverage, we believe TheoremQA can serve as a better benchmark for evaluating LLMs' capability to solve challenging science problems. The data and code are released at https://github.com/wenhuchen/TheoremQA.
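To make the Program-of-Thoughts (PoT) setup concrete, below is a minimal Python sketch of how such an evaluation loop might look: the model is prompted to emit a program that stores its final answer in a variable, and the harness executes that program to read the prediction. The prompt wording, the `query_model` stub, and the `ans` variable convention are illustrative assumptions, not the paper's exact harness.

```python
# Minimal sketch of Program-of-Thoughts (PoT) evaluation for a
# TheoremQA-style question. The prompt format, the query_model stub,
# and the `ans` variable convention are illustrative assumptions.

POT_TEMPLATE = """Read the question, then write a Python program that
computes the answer and stores it in a variable named `ans`.

Question: {question}

# Python program:
"""

def query_model(prompt: str) -> str:
    """Stub for an LLM call; replace with a real API client.
    Returns a canned program for the sample question below."""
    return (
        "# P(at least one shared birthday) = 1 - P(all distinct)\n"
        "n = 23\n"
        "p_distinct = 1.0\n"
        "for k in range(n):\n"
        "    p_distinct *= (365 - k) / 365\n"
        "ans = 1 - p_distinct\n"
    )

def run_pot(question: str) -> float:
    """Generate a program with the model, execute it in a fresh
    namespace, and read back the `ans` variable as the prediction."""
    program = query_model(POT_TEMPLATE.format(question=question))
    namespace: dict = {}
    exec(program, namespace)  # no sandboxing; acceptable for a sketch only
    return namespace["ans"]

if __name__ == "__main__":
    q = "What is the probability that at least two of 23 people share a birthday?"
    print(f"Predicted answer: {run_pot(q):.4f}")  # ~0.5073
```

By contrast, Chain-of-Thoughts prompting asks the model to reason in natural language and state the final answer directly, with no program-execution step.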
Code Repositories
- wenhuchen/TheoremQA: https://github.com/wenhuchen/TheoremQA
Benchmarks
All results are on the TheoremQA benchmark (leaderboard: natural-questions-on-theoremqa), sorted by accuracy.

| Model (Prompting) | Accuracy (%) |
|---|---|
| GPT-4 (PoT) | 52.4 |
| GPT-4 (CoT) | 43.8 |
| GPT-3.5-turbo (PoT) | 35.6 |
| PaLM-2-unicorn (CoT) | 31.8 |
| GPT-3.5-turbo (CoT) | 30.2 |
| Claude-v1 (PoT) | 25.9 |
| Claude-v1 (CoT) | 24.9 |
| code-davinci-002 | 23.9 |
| Claude-instant (CoT) | 23.6 |
| text-davinci-003 | 22.8 |
| PaLM-2-bison (CoT) | 21.0 |
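Since the leaderboard metric is accuracy over (mostly numeric) answers, here is a small sketch of how predictions might be scored; the 1% relative tolerance is an assumption for illustration, not necessarily the paper's exact matching rule.

```python
import math

def is_correct(pred: float, gold: float, rel_tol: float = 1e-2) -> bool:
    """Score a numeric prediction against the gold answer.
    The 1% relative tolerance is an illustrative assumption."""
    return math.isclose(pred, gold, rel_tol=rel_tol)

def accuracy(preds: list[float], golds: list[float]) -> float:
    """Fraction of questions answered correctly."""
    return sum(is_correct(p, g) for p, g in zip(preds, golds)) / len(golds)

# Example: 2 of 3 predictions fall within tolerance -> accuracy ~0.667
print(accuracy([0.5073, 3.14, 10.0], [0.5073, 2.71, 10.02]))
```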