PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
Junsong Chen Chongjian Ge Enze Xie Yue Wu Lewei Yao Xiaozhe Ren Zhongdao Wang Ping Luo Huchuan Lu Zhenguo Li

Abstract
In this paper, we introduce PixArt-Σ, a Diffusion Transformer (DiT) model capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α, it evolves from the "weaker" baseline to a "stronger" model by incorporating higher-quality data, a process we term "weak-to-strong training". The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and user prompt adherence with a significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ's capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.
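The idea of compressing keys and values before attention can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the compression is done, so average pooling over groups of adjacent tokens is assumed here, and the names `compress_tokens`, `kv_compressed_attention`, and `ratio` are hypothetical. The sketch also pools along a flat 1D token sequence, whereas an image model would more plausibly compress over the 2D spatial grid.

```python
import numpy as np

def compress_tokens(x, ratio):
    """Average-pool a (num_tokens, dim) array along the token axis.

    Assumption: plain average pooling stands in for whatever
    compression operator the paper actually uses.
    """
    n, d = x.shape
    assert n % ratio == 0, "num_tokens must be divisible by ratio"
    return x.reshape(n // ratio, ratio, d).mean(axis=1)

def kv_compressed_attention(q, k, v, ratio):
    """Single-head attention where only keys and values are compressed."""
    k_c = compress_tokens(k, ratio)            # (n / ratio, dim)
    v_c = compress_tokens(v, ratio)            # (n / ratio, dim)
    scores = q @ k_c.T / np.sqrt(q.shape[-1])  # (n, n / ratio)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v_c, weights              # output keeps all n query tokens

rng = np.random.default_rng(0)
n_tokens, dim, ratio = 64, 16, 4
q = rng.standard_normal((n_tokens, dim))
k = rng.standard_normal((n_tokens, dim))
v = rng.standard_normal((n_tokens, dim))

out, weights = kv_compressed_attention(q, k, v, ratio)
print(out.shape)      # (64, 16): one output per query token
print(weights.shape)  # (64, 16): each query attends to 64 / 4 = 16 compressed keys
```

Because the queries are left uncompressed, the output keeps its full token count, while the attention matrix shrinks from n × n to n × (n / ratio). That reduction is what makes very long token sequences, such as those of high-resolution images, cheaper to process.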
Benchmarks
| Benchmark | Methodology | Dataset | CLIP Score | FID | OCR Accuracy | OCR CER | OCR F1 Score |
|---|---|---|---|---|---|---|---|
| image-generation-on-textatlaseval | PixArt-Sigma | StyledTextSynth | 0.2764 | 82.83 | 0.42 | 0.90 | 0.62 |
| image-generation-on-textatlaseval | PixArt-Sigma | TextScenesHQ | 0.2347 | 72.62 | 0.34 | 0.91 | 0.53 |
| image-generation-on-textatlaseval | PixArt-Sigma | TextVisionBlend | 0.1891 | 81.29 | 2.40 | 0.83 | 1.57 |