Taehoon Kim; Gwangmo Song; Sihaeng Lee; Sangyun Kim; Yewon Seo; Soonyoung Lee; Seung Hwan Kim; Honglak Lee; Kyunghoon Bae

Abstract
Beyond modeling long-range interactions in natural language, transformers are becoming the de facto standard for many vision tasks thanks to their power and scalability. In cross-modal tasks between image and text in particular, vector quantized variational autoencoders (VQ-VAEs) are widely used to convert a raw RGB image into a sequence of feature vectors. To better leverage the correlation between image and text, we propose L-Verse, a novel architecture consisting of a feature-augmented variational autoencoder (AugVAE) and a bidirectional auto-regressive transformer (BiART) for image-to-text and text-to-image generation. Our AugVAE achieves state-of-the-art reconstruction performance on the ImageNet1K validation set and is robust to unseen images in the wild. Unlike other models, BiART can distinguish between an image (or text) used as a conditional reference and one used as a generation target. L-Verse can be used directly for image-to-text or text-to-image generation without any finetuning or extra object detection framework. In quantitative and qualitative experiments, L-Verse shows impressive results against previous methods in both image-to-text and text-to-image generation on MS-COCO Captions. We further assess the scalability of the L-Verse architecture on Conceptual Captions and present initial results of bidirectional vision-language representation learning in the general domain.
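To illustrate the vector quantization step the abstract refers to, the following is a minimal sketch of how a grid of feature vectors can be mapped to a sequence of discrete tokens by nearest-codebook lookup. It is not the AugVAE implementation: the function names `squared_distance` and `quantize`, the toy two-dimensional codebook, and the row-major flattening order are all assumptions chosen for illustration, and a real model would learn the encoder and codebook jointly.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: List[List[Vector]], codebook: List[Vector]) -> List[int]:
    """Map an H x W grid of feature vectors to a flat list of codebook indices.

    Each feature vector is replaced by the index of its nearest codebook
    entry. The grid is flattened in row-major order (an assumption made
    for this sketch), giving a token sequence a transformer could consume.
    """
    tokens = []
    for row in features:
        for vec in row:
            distances = [squared_distance(vec, code) for code in codebook]
            tokens.append(distances.index(min(distances)))
    return tokens


# Hypothetical toy data: a codebook with 3 entries and a 2 x 2 feature grid.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
features = [
    [[0.1, -0.1], [0.9, 1.2]],
    [[0.2, 0.8], [0.6, 0.7]],
]
tokens = quantize(features, codebook)
print(tokens)  # [0, 1, 2, 1]
```

The resulting index sequence plays the same role for an image that word-piece IDs play for text, which is what lets a single auto-regressive transformer treat both modalities as token sequences.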
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| image-captioning-on-coco-captions | L-Verse | BLEU-4: 39.9; METEOR: 31.4; ROUGE-L: 60.4; SPICE: 23.3 |
| image-reconstruction-on-imagenet-256x256 | AugVAE-SL | FID: 3.28 |
| image-reconstruction-on-imagenet-256x256 | AugVAE-ML | FID: 1.04 |
| text-to-image-generation-on-coco | L-Verse | FID: 45.8; FID-1: 41.9; FID-2: 35.5; FID-4: 30.2; FID-8: 29.83 |
| text-to-image-generation-on-coco | L-Verse-CC | FID: 37.2; FID-1: 31.6; FID-2: 25.7; FID-4: 21.4; FID-8: 21.1 |