A Visual Attention Grounding Neural Model for Multimodal Machine Translation
Mingyang Zhou; Runxiang Cheng; Yong Jae Lee; Zhou Yu

Abstract
We introduce a novel multimodal machine translation model that utilizes parallel visual and textual information. Our model jointly optimizes the learning of a shared visual-language embedding and a translator. The model leverages a visual attention grounding mechanism that links the visual semantics with the corresponding textual semantics. Our approach achieves competitive state-of-the-art results on the Multi30K and the Ambiguous COCO datasets. We also collected a new multilingual multimodal product description dataset to simulate a real-world international online shopping scenario. On this dataset, our visual attention grounding model outperforms other methods by a large margin.
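Below is a minimal sketch, not the authors' released code, of the two ideas named in the abstract: a visual attention mechanism grounded by the source text, and a joint objective that combines a translation loss with a shared visual-language embedding loss. All module names, dimensions, the max-margin formulation, and the weighting factor are illustrative assumptions rather than the paper's exact design.

```python
# Hypothetical sketch of text-grounded visual attention and a joint
# translation + shared-embedding objective. Sizes and the margin/weight
# values are assumptions, not values from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualAttentionGrounding(nn.Module):
    def __init__(self, text_dim=512, img_dim=2048, embed_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.att_score = nn.Linear(embed_dim, 1)

    def forward(self, text_states, img_regions):
        # text_states: (batch, src_len, text_dim)  source encoder hidden states
        # img_regions: (batch, n_regions, img_dim) spatial CNN region features
        sent = self.text_proj(text_states.mean(dim=1))            # (batch, embed_dim)
        regions = self.img_proj(img_regions)                      # (batch, n_regions, embed_dim)
        # Score each image region against the sentence summary,
        # i.e. attention over the image grounded by the text.
        scores = self.att_score(torch.tanh(regions + sent.unsqueeze(1)))
        weights = F.softmax(scores, dim=1)                        # (batch, n_regions, 1)
        img_embed = (weights * regions).sum(dim=1)                # attended image embedding
        return sent, img_embed


def joint_loss(sent_emb, img_emb, translation_loss, margin=0.1, lam=0.5):
    """Translation loss plus a max-margin loss that pulls matching
    sentence/image embeddings together in a shared space (a common
    choice for visual-language embeddings; the paper's exact loss
    may differ)."""
    sent_emb = F.normalize(sent_emb, dim=-1)
    img_emb = F.normalize(img_emb, dim=-1)
    sim = sent_emb @ img_emb.t()          # (batch, batch) cosine similarities
    pos = sim.diag().unsqueeze(1)         # matching pairs lie on the diagonal
    # Hinge: mismatched pairs should be at least `margin` less similar.
    cost = (margin + sim - pos).clamp(min=0)
    cost.fill_diagonal_(0)
    return translation_loss + lam * cost.mean()
```

In this reading, the attention weights tie image regions to the source-sentence semantics, and the margin term shapes the shared embedding space while the translation loss trains the decoder; the two losses are optimized jointly, as the abstract describes.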
Benchmarks
| Benchmark | Methodology | BLEU (EN-DE) | Meteor (EN-DE) | Meteor (EN-FR) |
|---|---|---|---|---|
| multimodal-machine-translation-on-multi30k | VAG-NMT | 31.6 | 52.2 | 70.3 |