Unreasonable Effectiveness of Rule-Based Heuristics in Solving Russian SuperGLUE Tasks

Tatyana Iazykova Denis Kapelyushnik Olga Bystrova Andrey Kutuzov

Abstract

Leaderboards like SuperGLUE are seen as important incentives for the active development of NLP, since they provide standard benchmarks for fair comparison of modern language models. They have driven the world's best engineering teams, along with their resources, to collaborate on solving a set of tasks for general language understanding. The resulting performance scores are often claimed to be close to or even higher than human performance. These results have encouraged more thorough analysis of whether the benchmark datasets contain statistical cues that machine-learning-based language models can exploit. English datasets have been shown to often contain annotation artifacts, which allow certain tasks to be solved with very simple rules while still achieving competitive rankings. In this paper, we carry out a similar analysis for Russian SuperGLUE (RSG), a recently published benchmark set and leaderboard for Russian natural language understanding. We show that its test datasets are vulnerable to shallow heuristics. Approaches based on simple rules often outperform or come close to the results of well-known pre-trained language models such as GPT-3 or BERT. The simplest explanation, and a likely one, is that a significant part of the SOTA models' performance on the RSG leaderboard comes from exploiting these shallow heuristics, which has nothing in common with real language understanding. We provide a set of recommendations on how to improve these datasets, making the RSG leaderboard even more representative of real progress in Russian NLU.

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| common-sense-reasoning-on-parus | majority_class | Accuracy: 0.498 |
| common-sense-reasoning-on-parus | heuristic majority | Accuracy: 0.478 |
| common-sense-reasoning-on-parus | Random weighted | Accuracy: 0.48 |
| common-sense-reasoning-on-rucos | majority_class | Average F1: 0.25; EM: 0.247 |
| common-sense-reasoning-on-rucos | heuristic majority | Average F1: 0.26; EM: 0.257 |
| common-sense-reasoning-on-rucos | Random weighted | Average F1: 0.25; EM: 0.247 |
| common-sense-reasoning-on-rwsd | heuristic majority | Accuracy: 0.669 |
| common-sense-reasoning-on-rwsd | Random weighted | Accuracy: 0.597 |
| common-sense-reasoning-on-rwsd | majority_class | Accuracy: 0.669 |
| natural-language-inference-on-lidirus | majority_class | MCC: 0 |
| natural-language-inference-on-lidirus | Random weighted | MCC: 0 |
| natural-language-inference-on-lidirus | heuristic majority | MCC: 0.147 |
| natural-language-inference-on-rcb | heuristic majority | Accuracy: 0.438; Average F1: 0.4 |
| natural-language-inference-on-rcb | Random weighted | Accuracy: 0.374; Average F1: 0.319 |
| natural-language-inference-on-rcb | majority_class | Accuracy: 0.484; Average F1: 0.217 |
| natural-language-inference-on-terra | Random weighted | Accuracy: 0.483 |
| natural-language-inference-on-terra | heuristic majority | Accuracy: 0.549 |
| natural-language-inference-on-terra | majority_class | Accuracy: 0.513 |
| question-answering-on-danetqa | majority_class | Accuracy: 0.503 |
| question-answering-on-danetqa | Random weighted | Accuracy: 0.52 |
| question-answering-on-danetqa | heuristic majority | Accuracy: 0.642 |
| reading-comprehension-on-muserc | Random weighted | Average F1: 0.45; EM: 0.071 |
| reading-comprehension-on-muserc | heuristic majority | Average F1: 0.671; EM: 0.237 |
| reading-comprehension-on-muserc | majority_class | Average F1: 0.0; EM: 0.0 |
| word-sense-disambiguation-on-russe | heuristic majority | Accuracy: 0.595 |
| word-sense-disambiguation-on-russe | majority_class | Accuracy: 0.587 |
| word-sense-disambiguation-on-russe | Random weighted | Accuracy: 0.528 |
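
The page does not define the "majority_class" and "Random weighted" methods, so the following is only a minimal sketch under the usual reading of those names: the first always predicts the most frequent training label, the second samples labels in proportion to their training frequency. All function names and the toy labels are hypothetical and are not taken from the paper or from the RSG data. The sketch also computes binary MCC with the common convention of returning 0 when the denominator is zero, which may be why a constant predictor is reported with MCC: 0 on LiDiRus.

```python
import math
import random
from collections import Counter


def majority_class_predict(train_labels, n):
    """Predict the most frequent training label for all n test items."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return [most_common_label] * n


def random_weighted_predict(train_labels, n, seed=0):
    """Sample n labels at random, weighted by training-label frequency."""
    counts = Counter(train_labels)
    labels = sorted(counts)
    weights = [counts[label] for label in labels]
    rng = random.Random(seed)
    return rng.choices(labels, weights=weights, k=n)


def accuracy(gold, pred):
    """Fraction of positions where the prediction equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def mcc_binary(gold, pred, positive=1):
    """Matthews correlation coefficient for a binary task.

    Returns 0.0 when the denominator is zero (an assumed convention),
    which is the case whenever the predictor outputs a single class.
    """
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    tn = sum(g != positive and p != positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Toy labels for illustration only; not RSG data.
    train = [1, 1, 1, 0, 0]
    gold = [1, 0, 1, 1, 0]

    majority_pred = majority_class_predict(train, len(gold))
    print("majority accuracy:", accuracy(gold, majority_pred))
    print("majority MCC:", mcc_binary(gold, majority_pred))

    random_pred = random_weighted_predict(train, len(gold))
    print("random weighted accuracy:", accuracy(gold, random_pred))
```

Under these assumptions, a constant predictor can still score reasonably on accuracy when the label distribution is skewed, while a metric such as MCC stays at zero. Baselines like these give a floor against which both the heuristic results and the model scores can be read.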
