Taming Transformers for High-Resolution Image Synthesis

Patrick Esser, Robin Rombach, Björn Ommer

Abstract

Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers and obtain the state of the art among autoregressive models on class-conditional ImageNet. Code and pretrained models can be found at https://github.com/CompVis/taming-transformers .
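
To make the two-stage recipe from the abstract concrete, here is a minimal, illustrative PyTorch sketch (not the official CompVis/taming-transformers implementation). A small convolutional autoencoder with a vector-quantized bottleneck plays the role of stage (i), learning a discrete vocabulary of image constituents, and a causally masked transformer over the resulting code indices plays the role of stage (ii), modeling their composition. All module names and sizes (VQBottleneck, ToyVQAutoencoder, ToyCodeTransformer, 512 codes, 32x32 toy inputs) are hypothetical choices made for the example.

```python
# Illustrative sketch of the two-stage idea: VQ autoencoder, then an
# autoregressive transformer over discrete code indices. Not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VQBottleneck(nn.Module):
    """Nearest-neighbour vector quantization of encoder features."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                   # z: (B, C, H, W)
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)         # (B*H*W, C)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        z_q = self.codebook(idx).view(b, h, w, c).permute(0, 3, 1, 2)
        z_q = z + (z_q - z).detach()                        # straight-through gradient
        return z_q, idx.view(b, h * w)


class ToyVQAutoencoder(nn.Module):
    """Stage (i): a CNN learns a compact, discrete vocabulary of image patches."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=2, padding=1),
        )
        self.quant = VQBottleneck(num_codes, dim)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(dim, 3, 4, stride=2, padding=1),
        )

    def forward(self, x):
        z_q, idx = self.quant(self.enc(x))
        return self.dec(z_q), idx


class ToyCodeTransformer(nn.Module):
    """Stage (ii): autoregressive transformer over the code-index sequence."""
    def __init__(self, num_codes=512, dim=128, seq_len=64):
        super().__init__()
        self.tok = nn.Embedding(num_codes, dim)
        self.pos = nn.Embedding(seq_len, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_codes)

    def forward(self, idx):                                 # idx: (B, L)
        L = idx.shape[1]
        h = self.tok(idx) + self.pos(torch.arange(L, device=idx.device))
        # causal mask: -inf above the diagonal blocks attention to future codes
        mask = torch.triu(torch.full((L, L), float("-inf"), device=idx.device), diagonal=1)
        return self.head(self.blocks(h, mask=mask))         # next-code logits


if __name__ == "__main__":
    x = torch.randn(2, 3, 32, 32)                           # toy 32x32 images
    ae = ToyVQAutoencoder()
    recon, idx = ae(x)                                      # idx: (2, 64) discrete codes
    gpt = ToyCodeTransformer(seq_len=idx.shape[1])
    logits = gpt(idx)                                       # (2, 64, 512)
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, 512), idx[:, 1:].reshape(-1))
    print(recon.shape, logits.shape, float(loss))
```

In the paper itself, stage one is a VQGAN trained with perceptual and patch-based adversarial losses rather than the plain reconstruction setup implied here, and stage two is a GPT-style decoder; the straight-through estimator above is the standard trick for passing gradients through the discrete codebook lookup.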

Code Repositories

joanrod/ocr-vqgan (PyTorch)
dome272/VQGAN (PyTorch)
xiaoiker/meta_dpm (PyTorch)
YvanG/VQGAN-CLIP (PyTorch)
hyn2028/llm-cxr (PyTorch)
joh-fischer/PlantLDM (PyTorch)
samb-t/unleashing-transformers (PyTorch)
v-iashin/SpecVQGAN (PyTorch)
dome272/vqgan-pytorch (PyTorch)
CompVis/taming-transformers (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
deepfake-detection-on-fakeavceleb-1 | VQGAN | AP: 55.0; ROC AUC: 51.8
image-generation-on-celeba-256x256 | VQGAN | FID: 10.2
image-generation-on-celeba-hq-256x256 | VQGAN+Transformer | FID: 10.2
image-generation-on-ffhq-256-x-256 | VQGAN+Transformer | FID: 9.6
image-generation-on-imagenet-256x256 | VQGAN+Transformer (k=600, p=1.0, a=0.05) | FID: 5.2
image-generation-on-imagenet-256x256 | VQGAN+Transformer (k=mixed, p=1.0, a=0.005) | FID: 6.59
image-outpainting-on-lhqc | Taming | Block-FID (Right Extend): 22.53; Block-FID (Down Extend): 26.38; Block-FID (Left Extend): -; Block-FID (Up Extend): -
image-reconstruction-on-imagenet | Taming-VQGAN (16x16) | FID: 3.64; LPIPS: 0.177; PSNR: 19.93; SSIM: 0.542
image-to-image-translation-on-ade20k-labels | VQGAN+Transformer | FID: 35.5
image-to-image-translation-on-coco-stuff | VQGAN+Transformer | FID: 22.4
text-to-image-generation-on-conceptual | VQ-GAN | FID: 28.86
text-to-image-generation-on-lhqc | Taming | Block-FID: 38.89
