Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation

We introduce Skywork UniPic, a 1.5-billion-parameter autoregressive model that unifies image understanding, text-to-image generation, and image editing within a single architecture, eliminating the need for task-specific adapters or inter-module connectors, and demonstrate that compact multimodal systems can achieve state-of-the-art performance on commodity hardware. Skywork UniPic achieves a GenEval score of 0.86, surpassing most existing unified models; sets a new DPG-Bench complex-generation record of 85.5; attains 5.83 on GEditBench-EN and 3.49 on ImgEdit-Bench for image editing; and generates 1024 x 1024 images with under 15 GB of GPU memory (e.g., on an RTX 4090). The approach combines three elements: (1) a decoupled encoding strategy that leverages a masked autoregressive encoder for synthesis and a SigLIP2 encoder for understanding, both feeding a shared autoregressive decoder; (2) a progressive, resolution-aware training schedule that scales from 256 x 256 to 1024 x 1024 while dynamically unfreezing parameters to balance capacity and stability; and (3) meticulously curated, 100-million-scale datasets augmented with task-specific reward models to refine generation and editing objectives. By demonstrating that high-fidelity multimodal integration need not incur prohibitive resource demands, Skywork UniPic establishes a practical paradigm for deployable, high-fidelity multimodal AI. Code and weights are publicly available at https://huggingface.co/Skywork/Skywork-UniPic-1.5B.