USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation
Shikib Mehri Maxine Eskenazi

Abstract
The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research. Standard language generation metrics have been shown to be ineffective for evaluating dialog models. To this end, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. USR trains unsupervised models to measure several desirable qualities of dialog. USR is shown to strongly correlate with human judgment on both Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48, system-level: 1.0). USR additionally produces interpretable measures for several desirable properties of dialog.
Benchmarks
| Benchmark | Methodology | Pearson Correlation | Spearman Correlation |
|---|---|---|---|
| dialogue-evaluation-on-usr-personachat | USR - MLM | 0.0788 | 0.0795 |
| dialogue-evaluation-on-usr-personachat | USR - DR (x = f) | -0.0454 | -0.0495 |
| dialogue-evaluation-on-usr-personachat | USR - DR (x = c) | 0.6087 | 0.4814 |
| dialogue-evaluation-on-usr-personachat | USR | 0.4115 | 0.4693 |
| dialogue-evaluation-on-usr-topicalchat | USR | 0.4220 | 0.4192 |
| dialogue-evaluation-on-usr-topicalchat | USR - DR (x = c) | 0.4068 | 0.3245 |
| dialogue-evaluation-on-usr-topicalchat | USR - DR (x = f) | 0.3221 | 0.1419 |
| dialogue-evaluation-on-usr-topicalchat | USR - MLM | 0.3345 | 0.3086 |
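The table reports Pearson and Spearman correlations between metric scores and human judgments, and the abstract distinguishes turn-level from system-level correlation. The following is a minimal sketch of how such numbers could be computed, not the authors' evaluation code. It assumes a hypothetical record layout of `(system, metric_score, human_score)` per response; the function names are illustrative. Turn-level correlation is taken over individual responses, and system-level correlation over per-system averages.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


def system_means(records):
    """Average metric and human scores per system.

    `records` is an iterable of (system, metric_score, human_score);
    this layout is an assumption made for the sketch.
    """
    grouped = defaultdict(list)
    for system, metric_score, human_score in records:
        grouped[system].append((metric_score, human_score))
    metric_means = [sum(m for m, _ in v) / len(v) for v in grouped.values()]
    human_means = [sum(h for _, h in v) / len(v) for v in grouped.values()]
    return metric_means, human_means


if __name__ == "__main__":
    # Made-up toy scores for illustration only; not data from the paper.
    records = [
        ("system_a", 0.2, 1.0), ("system_a", 0.4, 3.0),
        ("system_b", 0.5, 2.0), ("system_b", 0.7, 4.0),
        ("system_c", 0.8, 5.0), ("system_c", 0.9, 4.0),
    ]
    metric = [m for _, m, _ in records]
    human = [h for _, _, h in records]
    print("turn-level Pearson:", pearson(metric, human))
    print("turn-level Spearman:", spearman(metric, human))
    print("system-level Spearman:", spearman(*system_means(records)))
```

Because system-level correlation is computed over only a handful of averaged points, it can reach 1.0 even when turn-level correlation is moderate, which is consistent with the gap between the two levels reported in the abstract.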