5 months ago

LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages

Andrew M. Bean; Simi Hellsten; Harry Mayne; Jabez Magomere; Ethan A. Chi; Ryan Chi; Scott A. Hale; Hannah Rose Kirk

Abstract

In this paper, we present the LingOly benchmark, a novel benchmark for advanced reasoning abilities in large language models. Using challenging Linguistic Olympiad puzzles, we evaluate (i) capabilities for in-context identification and generalisation of linguistic patterns in very low-resource or extinct languages, and (ii) abilities to follow complex task instructions. The LingOly benchmark covers more than 90 mostly low-resource languages, minimising issues of data contamination, and contains 1,133 problems across 6 formats and 5 levels of human difficulty. We assess performance with both direct accuracy and comparison to a no-context baseline to penalise memorisation. Scores from 11 state-of-the-art LLMs show that the benchmark is challenging, and models perform poorly on the higher-difficulty problems. On harder problems, even the top model achieved only 38.7% accuracy, a 24.7% improvement over the no-context baseline. Large closed models typically outperform open models, and in general, the higher-resource the language, the better the scores. These results indicate that, in the absence of memorisation, true multi-step out-of-domain reasoning remains a challenge for current language models.
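The two scores described in the abstract, direct accuracy and the comparison to a no-context baseline, can be illustrated with a minimal Python sketch. It assumes that accuracy is a plain exact-match rate and that the baseline comparison is a simple difference between the with-context and no-context scores. The function names, the whitespace trimming, and the toy answers are all hypothetical and are not taken from the official repository, whose scoring may differ.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions equal to their reference answer.

    Assumption: answers are compared after trimming surrounding whitespace;
    the benchmark's real normalisation rules may be different.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)


def delta_no_context(
    with_context: Sequence[str],
    no_context: Sequence[str],
    references: Sequence[str],
) -> float:
    """Accuracy with the puzzle context minus accuracy without it.

    Assumption: the baseline comparison is a simple difference of the two
    exact-match scores. A model that answers from memory scores similarly
    in both settings, so its delta stays close to zero.
    """
    return exact_match_accuracy(with_context, references) - exact_match_accuracy(
        no_context, references
    )


# Toy data, invented purely for illustration.
references = ["a", "b", "c", "d"]
with_context_answers = ["a", "b", "x", "d"]  # 3 of 4 correct
no_context_answers = ["a", "y", "z", "w"]    # 1 of 4 correct

accuracy = exact_match_accuracy(with_context_answers, references)
delta = delta_no_context(with_context_answers, no_context_answers, references)
print(f"accuracy={accuracy:.2f} delta={delta:.2f}")
```

Under this reading, a high accuracy paired with a small delta would suggest that answers were recalled rather than reasoned out from the puzzle context.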

Code Repositories

am-bean/lingOly (official; mentioned in GitHub)

Benchmarks

| Benchmark | Methodology | Delta_NoContext | Exact Match Accuracy |
|---|---|---|---|
| logical-reasoning-on-lingoly | Gemini 1.5 Pro | 23.4% | 32.1% |
| logical-reasoning-on-lingoly | GPT-4 | 21.5% | 33.4% |
| logical-reasoning-on-lingoly | GPT-3.5 | 11.2% | 21.2% |
| logical-reasoning-on-lingoly | Claude Opus | 28.8% | 46.3% |
| logical-reasoning-on-lingoly | Command R+ | 11.6% | 21.5% |
| logical-reasoning-on-lingoly | Llama 3 8B | 4.9% | 11.4% |
| logical-reasoning-on-lingoly | Llama 3 70B | 2.9% | 10.3% |
| logical-reasoning-on-lingoly | Llama 2 70B | 1.1% | 6.4% |
| logical-reasoning-on-lingoly | GPT-4o | 25.1% | 37.6% |
| logical-reasoning-on-lingoly | Mixtral 8x7B | 6.4% | 14.2% |
| logical-reasoning-on-lingoly | Gemma 7B | 2.2% | 4.9% |
