Emerging Properties in Unified Multimodal Pretraining

Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, and data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/