Thinkless: LLM Learns When to Think

Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. However, applying elaborate reasoning to all queries often results in substantial computational inefficiencies, particularly when many problems admit straightforward solutions. This motivates an open question: Can LLMs learn when to think? To answer this, we propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning, based on both task complexity and the model's ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens, <short> for concise responses and <think> for detailed reasoning. At the core of our method is a Decoupled Group Relative Policy Optimization (DeGRPO) algorithm, which decomposes the learning objective of hybrid reasoning into two components: (1) a control token loss that governs the selection of the reasoning mode, and (2) a response loss that improves the accuracy of the generated answers. This decoupled formulation enables fine-grained control over the contributions of each objective, stabilizing training and effectively preventing the collapse observed in vanilla GRPO. Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless reduces the usage of long-chain thinking by 50%-90%, significantly improving the efficiency of Reasoning Language Models. The code is available at https://github.com/VainF/Thinkless
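To make the decoupled objective concrete, below is a minimal sketch (not the authors' implementation) of how the control-token term and the response term could be computed and weighted separately. It assumes the policy emits a single mode token (<short> or <think>) followed by the answer, and it uses a plain policy-gradient surrogate with group-normalized advantages rather than GRPO's clipped importance ratios; the function name `degrpo_loss_sketch` and the weight `alpha` are illustrative, not values from the paper.

```python
# Hypothetical sketch of a decoupled GRPO-style loss: the mode-selection token
# and the response tokens contribute through separate, explicitly weighted terms.
import torch


def degrpo_loss_sketch(logps, mode_token_idx, rewards, alpha=0.1):
    """
    logps: (G, T) per-token log-probs for a group of G rollouts of one prompt.
    mode_token_idx: position of the control token (<short>/<think>), e.g. 0.
    rewards: (G,) scalar rewards for each rollout.
    alpha: illustrative weight balancing mode selection against response quality.
    """
    # Group-relative advantage, as in GRPO: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)          # (G,)

    # Control-token loss: policy-gradient term on the mode token only.
    mode_logp = logps[:, mode_token_idx]                               # (G,)
    control_loss = -(adv * mode_logp).mean()

    # Response loss: policy-gradient term over the remaining answer tokens.
    resp_logp = torch.cat([logps[:, :mode_token_idx],
                           logps[:, mode_token_idx + 1:]], dim=1)      # (G, T-1)
    response_loss = -(adv.unsqueeze(1) * resp_logp).mean()

    # Decoupled objective: each component's contribution is controlled explicitly.
    return alpha * control_loss + response_loss


if __name__ == "__main__":
    # Toy usage: 4 rollouts of 16 tokens with fake log-probs and 0/1 rewards.
    torch.manual_seed(0)
    logps = torch.log(torch.rand(4, 16))
    rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
    print(degrpo_loss_sketch(logps, mode_token_idx=0, rewards=rewards))
```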