Coreference Resolution On Winograd Schema

Metrics

Accuracy
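Each Winograd schema poses a binary choice between two candidate antecedents for an ambiguous pronoun, so accuracy is the percentage of schemas resolved correctly, and random guessing scores 50%. A minimal sketch of the computation (the function name and example data below are illustrative, not from any of the cited papers):

```python
def wsc_accuracy(predictions, gold):
    """Percentage of Winograd schemas whose pronoun was resolved correctly."""
    assert len(predictions) == len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)

# Two schemas, one resolved correctly -> 50.0, i.e. chance level.
print(wsc_accuracy(["the trophy", "the suitcase"], ["the trophy", "the trophy"]))
```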

Results

Accuracy (%) of various models on this benchmark, sorted from highest to lowest.

| Model | Accuracy (%) | Paper |
| --- | --- | --- |
| Turing NLR v5 XXL 5.4B (fine-tuned) | 97.3 | Toward Efficient Language Model Pretraining and Downstream Adaptation via Self-Evolution: A Case Study on SuperGLUE |
| DeBERTa-1.5B | 95.9 | DeBERTa: Decoding-enhanced BERT with Disentangled Attention |
| RoBERTa-WinoGrande 355M | 90.1 | WinoGrande: An Adversarial Winograd Schema Challenge at Scale |
| PaLM 540B (1-shot) | 86.3 | PaLM: Scaling Language Modeling with Pathways |
| GPT-3 175B (few-shot) | 80.1 | Language Models are Few-Shot Learners |
| RoBERTa-large 354M | 73.9 | Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema |
| GPT-2-XL 1.5B | 73.3 | LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions |
| BERTwiki 340M (fine-tuned on WSCR) | 72.5 | A Surprisingly Robust Trick for Winograd Schema Challenge |
| T5-Large 738M | 66.7 | LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions |
| GPT-2 Medium 774M (full scoring) | 64.5 | How Reasonable are Common-Sense Reasoning Tasks: A Case-Study on the Winograd Schema Challenge and SWAG |
| LaMini-F-T5 783M | 64.1 | LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions |
| UDSSM-II (ensemble) | 62.4 | Unsupervised Deep Structured Semantic Models for Commonsense Reasoning |
| BERT-base 110M (fine-tuned on WSCR) | 62.3 | A Surprisingly Robust Trick for Winograd Schema Challenge |
| KEE+NKAM (WSC2016 winner) | 58.3 | Commonsense Knowledge Enhanced Embeddings for Solving Pronoun Disambiguation Problems in Winograd Schema Challenge |
| Char-level CNN+LSTM (partial scoring) | 57.9 | A Simple Method for Commonsense Reasoning |
| WKH | 57.1 | WinoGrande: An Adversarial Winograd Schema Challenge at Scale |
| UDSSM-I (ensemble) | 57.1 | Unsupervised Deep Structured Semantic Models for Commonsense Reasoning |
| USSM + Supervised DeepNet + KB | 52.8 | Attention Is (not) All You Need for Commonsense Reasoning |
| Random chance baseline | 50.0 | Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema |
| Hybrid H3 125M (3-shot, logit scoring) | 43.3 | Hungry Hungry Hippos: Towards Language Modeling with State Space Models |
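Several entries above ("full scoring", "partial scoring", "logit scoring") refer to zero-shot language-model likelihood scoring in the style of A Simple Method for Commonsense Reasoning: substitute each candidate antecedent for the ambiguous pronoun and predict the candidate whose sentence the LM assigns the higher probability; partial scoring counts only the tokens after the substitution point. Below is a minimal sketch of the full-scoring variant, assuming the Hugging Face transformers library; the model choice (gpt2) and the example schema are illustrative only, and are not the exact setups used by the papers in the table:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(text: str) -> float:
    """Total log-probability of `text` under the LM ("full scoring")."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # loss = mean NLL over predicted tokens
    n_predicted = ids.shape[1] - 1    # labels are shifted by one inside the model
    return -out.loss.item() * n_predicted

schema = "The trophy doesn't fit into the suitcase because {} is too large."
candidates = ["the trophy", "the suitcase"]

# Score the sentence with each candidate substituted for the pronoun
# and pick the candidate the LM finds more plausible.
scores = {c: sentence_log_prob(schema.format(c)) for c in candidates}
print(scores)
print("prediction:", max(scores, key=scores.get))  # expected: "the trophy"
```

Accuracy on the benchmark is then the fraction of schemas where this argmax matches the gold antecedent, which is why a model with no commonsense signal lands near the 50.0 chance baseline.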