
Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan
Release Date: 5/21/2025
Abstract

Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, and data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/