Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems

Hung Le; Doyen Sahoo; Nancy F. Chen; Steven C.H. Hoi

Abstract

Developing Video-Grounded Dialogue Systems (VGDS), where a dialogue is conducted based on the visual and audio aspects of a given video, is significantly more challenging than developing traditional image- or text-grounded dialogue systems because (1) the feature space of videos spans multiple picture frames, making it difficult to obtain semantic information; and (2) a dialogue agent must perceive and process information from different modalities (audio, video, caption, etc.) to obtain a comprehensive understanding. Most existing work is based on RNNs and sequence-to-sequence architectures, which are not very effective at capturing complex long-term dependencies (such as those in videos). To overcome this, we propose Multimodal Transformer Networks (MTN) to encode videos and incorporate information from different modalities. We also propose query-aware attention through an auto-encoder to extract query-aware features from non-text modalities. We develop a training procedure that simulates token-level decoding to improve the quality of generated responses during inference. We achieve state-of-the-art performance on the Dialogue System Technology Challenge 7 (DSTC7). Our model also generalizes to another multimodal visual-grounded dialogue task and obtains promising performance. We implemented our models in PyTorch, and the code is released at https://github.com/henryhungle/MTN.
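To make the idea of query-aware features more concrete, the following is a minimal sketch, not the authors' implementation (which is in PyTorch in the repository above). It assumes standard Transformer scaled dot-product attention, with text-query token vectors attending over per-frame video feature vectors. The function names `softmax` and `query_aware_attention`, the toy vectors, and the plain-Python list representation are all hypothetical choices made for illustration; the auto-encoder component described in the abstract is not modeled here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_aware_attention(query_tokens, video_frames):
    """Return one attended video feature vector per query token.

    query_tokens: list of d-dimensional lists (hypothetical query embeddings)
    video_frames: list of d-dimensional lists (hypothetical frame features)

    Assumption: scaled dot-product attention, with the query tokens as
    queries and the video frames as both keys and values.
    """
    d = len(video_frames[0])
    scale = math.sqrt(d)
    attended = []
    for q in query_tokens:
        # Similarity between this query token and every frame.
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / scale
                  for f in video_frames]
        weights = softmax(scores)
        # Weighted sum of frame features, computed one dimension at a time.
        context = [sum(w * f[j] for w, f in zip(weights, video_frames))
                   for j in range(d)]
        attended.append(context)
    return attended


# Toy example: two query tokens, three frames, feature dimension 2.
query = [[1.0, 0.0], [0.0, 1.0]]
frames = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
features = query_aware_attention(query, frames)
```

In this toy setup, each query token places most of its attention weight on the frame whose features align with it, so the resulting video representation depends on the query rather than being a fixed summary of the video.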

Code Repositories

henryhungle/MTN (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
dialogue-state-tracking-on-simmc2-0 | MTN | Act F1: 93.4; Slot F1: 76.7
response-generation-on-simmc2-0 | MTN | BLEU: 21.7
