Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective

Zhang, Siyue; Zhao, Yilun; Geng, Liyuan; Cohan, Arman; Luu, Anh Tuan; Zhao, Chen
Release Date: 5/22/2025
Abstract

Large language model (LLM)-based embedding models, benefiting from large-scale pre-training and post-training, have begun to surpass BERT- and T5-based models on general-purpose text embedding tasks such as document retrieval. However, a fundamental limitation of LLM embeddings lies in the unidirectional attention used during autoregressive pre-training, which misaligns with the bidirectional nature of text embedding tasks. To this end, we propose adopting diffusion language models for text embeddings, motivated by their inherent bidirectional architecture and their recent success in matching or surpassing LLMs, especially on reasoning tasks. We present the first systematic study of the diffusion language embedding model, which outperforms the LLM-based embedding model by 20% on long-document retrieval, 8% on reasoning-intensive retrieval, and 2% on instruction-following retrieval, and achieves competitive performance on traditional text embedding benchmarks. Our analysis verifies that bidirectional attention is crucial for encoding global context in long and complex text.
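To make the attention mismatch described in the abstract concrete, the following is a minimal illustrative sketch (not taken from the paper) contrasting the causal mask used in autoregressive pre-training with the full mask natural to bidirectional architectures such as diffusion language models. The function names are hypothetical.

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Lower-triangular mask: token i may attend only to tokens j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def bidirectional_mask(n: int) -> np.ndarray:
    """Full mask: every token may attend to every other token."""
    return np.ones((n, n), dtype=bool)

n = 4
causal = causal_mask(n)
bidir = bidirectional_mask(n)

# Under the causal mask, the first token attends only to itself, so its
# contextual representation carries no information about later tokens —
# the misalignment with embedding tasks that the abstract points to.
print(causal[0])  # [ True False False False]
print(bidir[0])   # [ True  True  True  True]
```

Under a causal mask, early positions cannot incorporate information from later ones, so a sequence-level embedding built from such representations sees the text asymmetrically; a bidirectional mask lets every position encode the full context.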