TheoremQA: A Theorem-driven Question Answering dataset

Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, Tony Xia

Abstract

Recent LLMs like GPT-4 and PaLM-2 have made tremendous progress on fundamental math benchmarks like GSM8K, achieving over 90% accuracy. However, their ability to solve more challenging math problems that require domain-specific knowledge (i.e., theorems) has yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed to evaluate AI models' capabilities to apply theorems to solve challenging science problems. TheoremQA is curated by domain experts and contains 800 high-quality questions covering 350 theorems (e.g., Taylor's theorem, Lagrange's theorem, Huffman coding, Quantum Theorem, Elasticity Theorem) from Math, Physics, EE&CS, and Finance. We evaluate a wide spectrum of 16 large language and code models with different prompting strategies like Chain-of-Thoughts and Program-of-Thoughts. We find that GPT-4's capability to solve these problems is unparalleled, achieving 51% accuracy with Program-of-Thoughts prompting. All existing open-source models score below 15%, barely surpassing the random-guess baseline. Given the diversity and broad coverage of TheoremQA, we believe it can serve as a better benchmark for evaluating LLMs' capabilities to solve challenging science problems. The data and code are released at https://github.com/wenhuchen/TheoremQA.
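The Program-of-Thoughts (PoT) strategy mentioned in the abstract asks the model to emit a short program whose execution yields the answer, rather than free-form reasoning. A minimal sketch of that loop is below; the prompt template and helper names are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of Program-of-Thoughts (PoT) prompting: the model writes
# Python code and the final answer is obtained by executing it.
# The template and variable name `ans` are assumptions for illustration.

POT_TEMPLATE = (
    "Read the question, then write Python code that computes the answer.\n"
    "Store the final result in a variable named `ans`.\n\n"
    "Question: {question}\n"
    "# Python solution:\n"
)

def build_pot_prompt(question: str) -> str:
    """Fill the PoT template with a concrete question."""
    return POT_TEMPLATE.format(question=question)

def execute_pot_program(program: str):
    """Run a model-generated program and return its `ans` variable."""
    scope: dict = {}
    exec(program, scope)  # in practice this call should be sandboxed
    return scope.get("ans")

# Example: a program a model might return for a calculus question.
generated = "import math\nans = round(math.exp(1), 4)"
print(execute_pot_program(generated))  # 2.7183
```

Executing the generated program sidesteps arithmetic slips in free-text reasoning, which is one plausible reason PoT outscores CoT for the strongest models in the table below.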

Code Repositories

wenhuchen/theoremqa (official, PyTorch)

Benchmarks

Benchmark: natural-questions-on-theoremqa

Methodology             Accuracy
GPT-4 (PoT)             52.4
GPT-4 (CoT)             43.8
GPT-3.5-turbo (PoT)     35.6
PaLM-2-unicorn (CoT)    31.8
GPT-3.5-turbo (CoT)     30.2
Claude-v1 (PoT)         25.9
Claude-v1 (CoT)         24.9
code-davinci-002        23.9
Claude-instant (CoT)    23.6
text-davinci-003        22.8
PaLM-2-bison (CoT)      21.0
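Since TheoremQA answers are largely numeric, accuracy like that reported above is typically scored by comparing a predicted number to the gold answer within a tolerance. The sketch below shows one such grader; the 1% relative tolerance is an assumption for illustration, not the paper's exact grading rule.

```python
import math

def accuracy(preds, golds, rel_tol=0.01):
    """Fraction of numeric predictions matching gold answers.

    A prediction counts as correct if it is within `rel_tol`
    relative tolerance of the gold answer (assumed threshold).
    """
    hits = sum(math.isclose(p, g, rel_tol=rel_tol)
               for p, g in zip(preds, golds))
    return hits / len(golds)

# 3.14 and 10.0 fall within 1% of their gold answers; 2.0 does not.
print(accuracy([3.14, 2.0, 10.0], [3.14159, 2.5, 10.05]))  # 2 of 3 correct
```

A tolerance-based comparison avoids penalizing answers that differ from the reference only in rounding or floating-point precision.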
