Kurt Shuster, Eric Michael Smith, Da Ju, Jason Weston

Abstract
Recent work in open-domain conversational agents has demonstrated that significant improvements in model engagingness and humanness metrics can be achieved via massive scaling in both pre-training data and model size (Adiwardana et al., 2020; Roller et al., 2020). However, if we want to build agents with human-like abilities, we must expand beyond handling just text. A particularly important topic is the ability to see images and communicate about what is perceived. With the goal of engaging humans in multi-modal dialogue, we investigate combining components from state-of-the-art open-domain dialogue agents with those from state-of-the-art vision models. We study incorporating different image fusion schemes and domain-adaptive pre-training and fine-tuning strategies, and show that our best resulting model outperforms strong existing models in multi-modal dialogue while simultaneously performing as well as its predecessor (text-only) BlenderBot (Roller et al., 2020) in text-based conversation. We additionally investigate and incorporate safety components in our final model, and show that such efforts do not diminish model performance with respect to engagingness metrics.
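The abstract mentions studying "different image fusion schemes" for combining a vision model's output with a dialogue transformer. As a rough illustration (not the paper's actual architecture; all names, dimensions, and weights here are hypothetical), one simple fusion scheme projects a pre-computed image feature vector into the token-embedding space and prepends it to the text sequence, so the transformer attends over the image representation alongside the dialogue tokens:

```python
# Hypothetical sketch of a simple image-fusion scheme: project a
# pre-computed image feature vector into the token-embedding space and
# prepend it to the text token embeddings. Dimensions are illustrative
# only and do not correspond to the paper's models.

def project(vec, weight):
    """Multiply a vector by an (out_dim x in_dim) weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def fuse_image_with_text(image_feat, text_embs, weight):
    """Prepend the projected image feature to the token embedding sequence."""
    image_token = project(image_feat, weight)
    return [image_token] + text_embs

# Toy example: 4-dim image feature, 2-dim embedding space, 3 text tokens.
image_feat = [1.0, 0.0, 2.0, 1.0]
weight = [[0.5, 0.0, 0.5, 0.0],   # 2 x 4 projection matrix
          [0.0, 1.0, 0.0, 1.0]]
text_embs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

fused = fuse_image_with_text(image_feat, text_embs, weight)
# fused[0] is the projected image token; the sequence grew by one position.
```

The attention layers downstream then treat the image token like any other position, which is one way a text-only architecture can be extended to condition on perception.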
Benchmarks
| Benchmark | Model | BLEU-4 | F1 | ROUGE-L |
|---|---|---|---|---|
| visual-dialog-on-blendedskilltalk | Multi-Modal BlenderBot | 1 | 17.8 | 19.3 |
| visual-dialog-on-convai2 | Multi-Modal BlenderBot | 1.1 | 18.4 | 22.6 |
| visual-dialog-on-empatheticdialogues | Multi-Modal BlenderBot | 1.5 | 19.2 | 24.5 |
| visual-dialog-on-image-chat | Multi-Modal BlenderBot | 40 | 13.1 | 18 |
| visual-dialog-on-wizard-of-wikipedia | Multi-Modal BlenderBot | 2.2 | 18.6 | 17.4 |
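The F1 column above is, in open-domain dialogue evaluation, typically a word-overlap F1 between the model's response and a reference response. A minimal sketch of that computation (assuming standard unigram-overlap F1; the leaderboard's exact tokenization and normalization may differ):

```python
from collections import Counter

def unigram_f1(prediction, reference):
    """Word-overlap F1 between a predicted and a reference response."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most
    # min(pred_count, ref_count) times.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = unigram_f1("the cat sat on the mat", "a cat sat on a mat")
```

BLEU-4 and ROUGE-L are likewise n-gram precision and longest-common-subsequence overlap measures, usually computed with standard scoring libraries rather than by hand.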