LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages
Andrew M. Bean; Simi Hellsten; Harry Mayne; Jabez Magomere; Ethan A. Chi; Ryan Chi; Scott A. Hale; Hannah Rose Kirk

Abstract
In this paper, we present the LingOly benchmark, a novel benchmark for advanced reasoning abilities in large language models. Using challenging Linguistic Olympiad puzzles, we evaluate (i) capabilities for in-context identification and generalisation of linguistic patterns in very low-resource or extinct languages, and (ii) abilities to follow complex task instructions. The LingOly benchmark covers more than 90 mostly low-resource languages, minimising issues of data contamination, and contains 1,133 problems across 6 formats and 5 levels of human difficulty. We assess performance with both direct accuracy and comparison to a no-context baseline to penalise memorisation. Scores from 11 state-of-the-art LLMs show that the benchmark is challenging, and models perform poorly on the higher-difficulty problems. On harder problems, even the top model achieved only 38.7% accuracy, a 24.7% improvement over the no-context baseline. Large closed models typically outperform open models, and in general, scores are better for higher-resource languages. These results indicate that, in the absence of memorisation, true multi-step out-of-domain reasoning remains a challenge for current language models.
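The two scoring ideas named in the abstract, direct accuracy and comparison to a no-context baseline, can be illustrated with a minimal sketch. It assumes that answers are scored by exact string match after trimming whitespace and that the baseline comparison is the accuracy with the puzzle context minus the accuracy without it. The function names and the toy answers are hypothetical and are not taken from the benchmark's code.

```python
from typing import Sequence


def exact_match(prediction: str, reference: str) -> bool:
    """Assumed scoring rule: strings must match exactly after trimming whitespace."""
    return prediction.strip() == reference.strip()


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one reference answer is required")
    matches = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return matches / len(references)


def delta_no_context(
    with_context: Sequence[str],
    no_context: Sequence[str],
    references: Sequence[str],
) -> float:
    """Assumed definition: accuracy with the puzzle context minus accuracy without it.

    A model that answers from memorised knowledge scores similarly in both
    settings, so its delta stays close to zero.
    """
    return accuracy(with_context, references) - accuracy(no_context, references)


# Toy data (hypothetical): four questions, answered with and without context.
references = ["a", "b", "c", "d"]
with_context = ["a", "b", "c", "x"]
no_context = ["a", "y", "z", "w"]

score = accuracy(with_context, references)
delta = delta_no_context(with_context, no_context, references)
print(f"exact match: {score:.2f}, delta over no-context: {delta:.2f}")
```

Under these assumptions, a high exact-match score paired with a small delta would suggest that the answers came from prior exposure rather than from reasoning over the puzzle.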
Benchmarks
| Benchmark | Model | Delta_NoContext | Exact Match Accuracy |
|---|---|---|---|
| logical-reasoning-on-lingoly | Gemini 1.5 Pro | 23.4% | 32.1% |
| logical-reasoning-on-lingoly | GPT-4 | 21.5% | 33.4% |
| logical-reasoning-on-lingoly | GPT-3.5 | 11.2% | 21.2% |
| logical-reasoning-on-lingoly | Claude Opus | 28.8% | 46.3% |
| logical-reasoning-on-lingoly | Command R+ | 11.6% | 21.5% |
| logical-reasoning-on-lingoly | Llama 3 8B | 4.9% | 11.4% |
| logical-reasoning-on-lingoly | Llama 3 70B | 2.9% | 10.3% |
| logical-reasoning-on-lingoly | Llama 2 70B | 1.1% | 6.4% |
| logical-reasoning-on-lingoly | GPT-4o | 25.1% | 37.6% |
| logical-reasoning-on-lingoly | Mixtral 8x7B | 6.4% | 14.2% |
| logical-reasoning-on-lingoly | Gemma 7B | 2.2% | 4.9% |