
Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing

Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, Zhijie Deng
Abstract

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks that barrier with a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks, enabling inter-block parallel decoding. In this way, vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than 2.5× higher inference speed than LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can be more than 50× while maintaining comparable output quality. The code is available at https://github.com/zhijie-group/Discrete-Diffusion-Forcing.
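The pipelined decoding idea admits a compact illustration. Below is a minimal, self-contained Python sketch of the decoding loop the abstract describes; it is not the authors' implementation (see the linked repository for that). Every name here, including `toy_denoise_step`, `CONF_THRESHOLD`, and the confidence-thresholded unmasking rule, is a hypothetical stand-in assumed for illustration.

```python
# Illustrative sketch of D2F-style pipelined block decoding (assumptions only;
# the real code is at https://github.com/zhijie-group/Discrete-Diffusion-Forcing).
import random

BLOCK_SIZE = 4        # tokens per block (hypothetical setting)
NUM_BLOCKS = 3        # number of blocks to generate
CONF_THRESHOLD = 0.9  # unmask a token once its confidence exceeds this
MASK = None           # placeholder for a still-masked position

def toy_denoise_step(blocks, visible_prefix):
    """Stand-in for one dLLM forward pass: propose a (token, confidence)
    guess for every masked position in every active block. A real model
    would condition on the KV cache of `visible_prefix` and on the
    partially decoded earlier blocks."""
    proposals = {}
    for b, block in enumerate(blocks):
        for i, tok in enumerate(block):
            if tok is MASK:
                proposals[(b, i)] = (f"tok_{b}_{i}", random.random())
    return proposals

def d2f_pipeline_decode():
    prefix = ["<bos>"]  # committed context, eligible for KV cache reuse
    blocks = []         # active blocks, each a list of tokens / MASK slots
    opened = 0
    while opened < NUM_BLOCKS or blocks:
        # Open the next block before earlier blocks finish: this
        # inter-block parallelism is the core of the pipelined decoding.
        if opened < NUM_BLOCKS:
            blocks.append([MASK] * BLOCK_SIZE)
            opened += 1
        # One denoising iteration refines *all* active blocks at once.
        for (b, i), (tok, conf) in toy_denoise_step(blocks, prefix).items():
            if conf > CONF_THRESHOLD:
                blocks[b][i] = tok
        # Fully decoded leading blocks are committed to the prefix, so
        # later iterations can reuse their KV cache instead of re-encoding.
        while blocks and MASK not in blocks[0]:
            prefix.extend(blocks.pop(0))
    return prefix

if __name__ == "__main__":
    print(" ".join(d2f_pipeline_decode()))
```

The pipeline structure is visible in the loop: a new block is opened before earlier ones finish, so several blocks are refined in each denoising pass, while fully decoded leading blocks are frozen into the prefix, where an AR-style KV cache can be reused, matching the block-wise autoregressive behavior described above.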