Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models

Instruction-following is essential for aligning large language models (LLMs) with user intent. While recent reasoning-oriented models exhibit impressive performance on complex mathematical problems, their ability to adhere to natural language instructions remains underexplored. In this work, we introduce MathIF, a dedicated benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability, as models that reason more effectively often struggle to comply with user directives. We find that models tuned on distilled long chains-of-thought or trained with reasoning-oriented reinforcement learning often degrade in instruction adherence, especially as generation length increases. Furthermore, we show that even simple interventions can partially recover obedience, though at the cost of reasoning performance. These findings highlight a fundamental tension in current LLM training paradigms and motivate the need for more instruction-aware reasoning models. We release the code and data at https://github.com/TingchenFu/MathIF.