
DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization

Shuaijie She, Yu Bao, Yu Lu, Lu Xu, Tao Li, Wenhao Zhu, Shujian Huang, Shanbo Cheng, Lu Lu, Yuxuan Wang
Abstract

We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)'s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning's restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task's input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and the known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs' ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it improves average translation quality by 2.13 COMET over 756 directions, boosts mathematical reasoning accuracy by an average of 6.4 points on three challenging benchmarks, and lifts performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.
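To make the mechanism concrete, below is a minimal sketch (not the authors' code) of how a dual-reconstruction reward could be computed. It assumes a hypothetical `model.generate` wrapper that prompts a single LLM to act as either the primal or the dual task, and a placeholder `similarity` scorer for comparing the reconstruction to the hidden component; both names are illustrative, not part of the paper.

```python
def dupo_reward(model, known_part, unknown_part, similarity):
    """Score one primal output by how well the dual task recovers the hidden input."""
    # Primal task: produce an output from the full input (known + unknown parts).
    primal_output = model.generate(
        task="primal",
        known=known_part,
        unknown=unknown_part,
    )

    # Dual task: given the primal output and only the known component, try to
    # reconstruct the unknown component (e.g., recover a hidden variable from a
    # math solution, or back-translate a translation).
    reconstructed = model.generate(
        task="dual",
        known=known_part,
        primal_output=primal_output,
    )

    # The reconstruction quality serves as an annotation-free, self-supervised
    # reward for preference optimization over sampled primal outputs.
    return primal_output, similarity(reconstructed, unknown_part)
```

Under this sketch, several sampled primal outputs can be scored and the highest-reward one selected, which mirrors the inference-time reranking use of DuPO mentioned in the abstract; in training, the scores would instead rank candidates to form preference pairs.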
