
RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark

Tatiana Shavrina, Alena Fenogenova, Anton Emelyanov, Denis Shevelev, Ekaterina Artemova, Valentin Malykh, Vladislav Mikhailov, Maria Tikhonova, Andrey Chertok, Andrey Evlampiev

Abstract

In this paper, we introduce an advanced Russian general language understanding evaluation benchmark, RussianSuperGLUE. Recent advances in universal language models and transformers call for a methodology for their broad diagnostics and for testing general intellectual skills: natural language inference, commonsense reasoning, and the ability to perform simple logical operations regardless of text subject or lexicon. For the first time, a benchmark of nine tasks, collected and organized analogously to the SuperGLUE methodology, was developed from scratch for the Russian language. We provide baselines, a human-level evaluation, an open-source framework for evaluating models (https://github.com/RussianNLP/RussianSuperGLUE), and an overall leaderboard of transformer models for the Russian language. In addition, we present the first results of comparing multilingual models on the adapted diagnostic test set, and we offer first steps toward further expanding the benchmark or assessing state-of-the-art models independently of language.
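To make the evaluation setup concrete, here is a minimal sketch of scoring a trivial predictor on one benchmark task. It assumes the "russian_super_glue" dataset on the Hugging Face Hub mirrors the official task splits; the official evaluation framework lives in the repository linked above, and this is not its code.

```python
# Minimal sketch (not the official evaluation framework) of loading one
# RussianSuperGLUE task and scoring predictions on its validation split.
# Assumption: the "russian_super_glue" dataset on the Hugging Face Hub
# mirrors the official task data.
from datasets import load_dataset
from sklearn.metrics import accuracy_score

# TERRa is the benchmark's textual-entailment (NLI) task.
val = load_dataset("russian_super_glue", "terra", split="validation")
labels = val["label"]

# Trivial majority-class predictor, standing in for a real model.
majority = max(set(labels), key=labels.count)
preds = [majority] * len(labels)

print("TERRa validation accuracy:", accuracy_score(labels, preds))
```

A real submission would replace the majority-class predictor with model outputs for each of the nine tasks and report the per-task metrics listed in the table below.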

Code Repositories

RussianNLP/MOROCCO (PyTorch)

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| common-sense-reasoning-on-parus | Baseline TF-IDF1.1 | Accuracy: 0.486 |
| common-sense-reasoning-on-parus | Human Benchmark | Accuracy: 0.982 |
| common-sense-reasoning-on-rucos | Human Benchmark | Average F1: 0.93, EM: 0.89 |
| common-sense-reasoning-on-rucos | Baseline TF-IDF1.1 | Average F1: 0.26, EM: 0.252 |
| common-sense-reasoning-on-rwsd | Baseline TF-IDF1.1 | Accuracy: 0.662 |
| common-sense-reasoning-on-rwsd | Human Benchmark | Accuracy: 0.84 |
| natural-language-inference-on-lidirus | Human Benchmark | MCC: 0.626 |
| natural-language-inference-on-lidirus | Baseline TF-IDF1.1 | MCC: 0.06 |
| natural-language-inference-on-rcb | Human Benchmark | Accuracy: 0.702, Average F1: 0.68 |
| natural-language-inference-on-rcb | Baseline TF-IDF1.1 | Accuracy: 0.441, Average F1: 0.301 |
| natural-language-inference-on-terra | Human Benchmark | Accuracy: 0.92 |
| natural-language-inference-on-terra | Baseline TF-IDF1.1 | Accuracy: 0.471 |
| question-answering-on-danetqa | Human Benchmark | Accuracy: 0.915 |
| question-answering-on-danetqa | Baseline TF-IDF1.1 | Accuracy: 0.621 |
| reading-comprehension-on-muserc | Baseline TF-IDF1.1 | Average F1: 0.587, EM: 0.242 |
| reading-comprehension-on-muserc | Human Benchmark | Average F1: 0.806, EM: 0.42 |
| word-sense-disambiguation-on-russe | Baseline TF-IDF1.1 | Accuracy: 0.57 |
| word-sense-disambiguation-on-russe | Human Benchmark | Accuracy: 0.805 |
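
For orientation, the "Baseline TF-IDF1.1" rows above refer to a simple bag-of-words baseline. Below is a hedged sketch of what such a baseline might look like for the TERRa entailment task: TF-IDF features over the concatenated premise and hypothesis, fed to a logistic-regression classifier. The paper's exact feature set and classifier may differ, and the dataset name rests on the same Hugging Face Hub assumption as the sketch above.

```python
# Hedged sketch of a TF-IDF baseline in the spirit of the table's
# "Baseline TF-IDF1.1" rows; the paper's exact pipeline may differ.
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

terra = load_dataset("russian_super_glue", "terra")
train, val = terra["train"], terra["validation"]

def join_texts(split):
    # Combine premise and hypothesis into one bag-of-words document
    # (field names assume the TERRa schema: premise / hypothesis / label).
    return [p + " " + h for p, h in zip(split["premise"], split["hypothesis"])]

# Word unigrams and bigrams, dropping hapax features.
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X_train = vec.fit_transform(join_texts(train))
X_val = vec.transform(join_texts(val))

clf = LogisticRegression(max_iter=1000).fit(X_train, train["label"])
print("TERRa validation accuracy:",
      accuracy_score(val["label"], clf.predict(X_val)))
```

Comparing the resulting accuracy against the table's TF-IDF (0.471) and human (0.92) rows for TERRa shows how much headroom the benchmark leaves above surface-level lexical features.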
