Next Visual Granularity Generation

We propose a novel approach to image generation by decomposing an image into a structured sequence, where each element in the sequence shares the same spatial resolution but differs in the number of unique tokens used, capturing a different level of visual granularity. Image generation is carried out through our newly introduced Next Visual Granularity (NVG) generation framework, which generates a visual granularity sequence beginning from an empty image and progressively refines it, from global layout to fine details, in a structured manner. This iterative process encodes a hierarchical, layered representation that offers fine-grained control over the generation process across multiple granularity levels. We train a series of NVG models for class-conditional image generation on the ImageNet dataset and observe clear scaling behavior. NVG consistently outperforms the VAR series in terms of FID scores (3.30 -> 3.03, 2.57 -> 2.44, 2.09 -> 2.06). We also conduct extensive analysis to showcase the capability and potential of the NVG framework. Our code and models will be released.
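
To make the idea of a granularity sequence concrete, the toy sketch below builds several maps of identical spatial resolution that differ only in how many unique tokens they use. It is an illustration under strong assumptions, not the NVG tokenizer: the uniform intensity binning, the helper names (`quantize`, `granularity_sequence`, `unique_tokens`), and the 4x4 grayscale grid are all hypothetical stand-ins chosen for brevity.

```python
from typing import List

Grid = List[List[int]]


def quantize(grid: Grid, k: int, max_value: int = 255) -> Grid:
    """Map each pixel to one of k token ids by uniform intensity binning.

    Assumption: uniform binning is only a stand-in for a learned tokenizer.
    """
    bin_width = (max_value + 1) / k
    return [[min(int(v / bin_width), k - 1) for v in row] for row in grid]


def granularity_sequence(grid: Grid, levels: List[int]) -> List[Grid]:
    """Return one token map per level, ordered from coarse to fine.

    Every map keeps the input's spatial resolution; only the size of the
    token vocabulary available at each stage changes.
    """
    return [quantize(grid, k) for k in levels]


def unique_tokens(grid: Grid) -> int:
    """Count the distinct token ids that actually appear in a token map."""
    return len({t for row in grid for t in row})


# Hypothetical 4x4 grayscale "image" with four visually distinct regions.
image = [
    [10, 20, 200, 210],
    [15, 25, 205, 220],
    [90, 100, 140, 150],
    [95, 110, 145, 160],
]

stages = granularity_sequence(image, levels=[1, 2, 4, 8])
counts = [unique_tokens(stage) for stage in stages]

for k, stage, n in zip([1, 2, 4, 8], stages, counts):
    print(f"vocabulary size {k}: {n} unique tokens in use")
    for row in stage:
        print("  ", row)
```

The first stage uses a single token everywhere, which loosely mirrors starting from an empty image; later stages separate the grid into progressively more regions while the 4x4 shape stays fixed.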