Coreference Resolution On Winograd Schema

Metrics

Accuracy
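Each Winograd schema poses a binary choice between two candidate antecedents for an ambiguous pronoun, so accuracy is the percentage of schemas resolved correctly, and random guessing scores 50%. A minimal sketch of the computation (the function name and example data below are illustrative, not from any of the cited papers):

```python
def wsc_accuracy(predictions, gold):
    """Percentage of Winograd schemas whose pronoun was resolved correctly."""
    assert len(predictions) == len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)

# Two schemas, one resolved correctly -> 50.0, i.e. chance level.
print(wsc_accuracy(["the trophy", "the suitcase"], ["the trophy", "the trophy"]))
```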

Results

Accuracy (%) of various models on this benchmark, sorted from highest to lowest.

| Model | Accuracy (%) | Paper |
| --- | --- | --- |
| Turing NLR v5 XXL 5.4B (fine-tuned) | 97.3 | Toward Efficient Language Model Pretraining and Downstream Adaptation via Self-Evolution: A Case Study on SuperGLUE |
| DeBERTa-1.5B | 95.9 | DeBERTa: Decoding-enhanced BERT with Disentangled Attention |
| RoBERTa-WinoGrande 355M | 90.1 | WinoGrande: An Adversarial Winograd Schema Challenge at Scale |
| PaLM 540B (1-shot) | 86.3 | PaLM: Scaling Language Modeling with Pathways |
| GPT-3 175B (few-shot) | 80.1 | Language Models are Few-Shot Learners |
| RoBERTa-large 354M | 73.9 | Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema |
| GPT-2-XL 1.5B | 73.3 | LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions |
| BERTwiki 340M (fine-tuned on WSCR) | 72.5 | A Surprisingly Robust Trick for Winograd Schema Challenge |
| T5-Large 738M | 66.7 | LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions |
| GPT-2 Medium 774M (full scoring) | 64.5 | How Reasonable are Common-Sense Reasoning Tasks: A Case-Study on the Winograd Schema Challenge and SWAG |
| LaMini-F-T5 783M | 64.1 | LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions |
| UDSSM-II (ensemble) | 62.4 | Unsupervised Deep Structured Semantic Models for Commonsense Reasoning |
| BERT-base 110M (fine-tuned on WSCR) | 62.3 | A Surprisingly Robust Trick for Winograd Schema Challenge |
| KEE+NKAM (WSC2016 winner) | 58.3 | Commonsense Knowledge Enhanced Embeddings for Solving Pronoun Disambiguation Problems in Winograd Schema Challenge |
| Char-level CNN+LSTM (partial scoring) | 57.9 | A Simple Method for Commonsense Reasoning |
| WKH | 57.1 | WinoGrande: An Adversarial Winograd Schema Challenge at Scale |
| UDSSM-I (ensemble) | 57.1 | Unsupervised Deep Structured Semantic Models for Commonsense Reasoning |
| USSM + Supervised DeepNet + KB | 52.8 | Attention Is (not) All You Need for Commonsense Reasoning |
| Random chance baseline | 50.0 | Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema |
| Hybrid H3 125M (3-shot, logit scoring) | 43.3 | Hungry Hungry Hippos: Towards Language Modeling with State Space Models |
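Several entries above ("full scoring", "partial scoring", "logit scoring") refer to zero-shot language-model likelihood scoring in the style of A Simple Method for Commonsense Reasoning: substitute each candidate antecedent for the ambiguous pronoun and predict the candidate whose sentence the LM assigns the higher probability; partial scoring counts only the tokens after the substitution point. Below is a minimal sketch of the full-scoring variant, assuming the Hugging Face transformers library; the model choice (gpt2) and the example schema are illustrative only, and are not the exact setups used by the papers in the table:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(text: str) -> float:
    """Total log-probability of `text` under the LM ("full scoring")."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # loss = mean NLL over predicted tokens
    n_predicted = ids.shape[1] - 1    # labels are shifted by one inside the model
    return -out.loss.item() * n_predicted

schema = "The trophy doesn't fit into the suitcase because {} is too large."
candidates = ["the trophy", "the suitcase"]

# Score the sentence with each candidate substituted for the pronoun
# and pick the candidate the LM finds more plausible.
scores = {c: sentence_log_prob(schema.format(c)) for c in candidates}
print(scores)
print("prediction:", max(scores, key=scores.get))  # expected: "the trophy"
```

Accuracy on the benchmark is then the fraction of schemas where this argmax matches the gold antecedent, which is why a model with no commonsense signal lands near the 50.0 chance baseline.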