Current long chain-of-thought LLMs spend unnecessarily long reasoning even on simple questions, driving up inference cost and carbon footprint.
This paper proposes a constrained reinforcement learning framework that controls how responses are distributed across reasoning-length groups, so the model adapts its reasoning length to query difficulty.
-----
https://arxiv.org/abs/2501.17974
📌 This paper smartly uses constrained reinforcement learning to address LLM inference inefficiency. It moves beyond fixed reasoning lengths, letting models adaptively allocate compute based on problem difficulty and improving token efficiency.
📌 Inference Budget-Constrained Policy Optimization reframes efficient reasoning as a resource allocation problem. By constraining response group densities, it directly optimizes for inference cost. This is a practical approach to make long-reasoning LLMs more deployable.
📌 The paper's strength lies in its simplicity. It distills a complex constrained reinforcement learning problem into a weighted supervised fine-tuning update, so the method can be implemented on top of existing SFT pipelines with little extra machinery.
----------
Methods Explored in this Paper 🔧:
→ The paper introduces Inference Budget-Constrained Policy Optimization (IBPO).
→ IBPO is a constrained reinforcement learning framework.
→ It controls how responses are distributed across groups defined by reasoning length.
→ The method formulates the problem as maximizing utility under an inference budget constraint.
→ IBPO uses a weighted supervised fine-tuning update, similar to RAFT and RFT (see the weighted-SFT sketch after this list).
→ The weight for each response is determined by solving an optimization problem.
→ The optimization problem maximizes a reward margin while respecting density constraints on response groups.
→ The reward function, called reward margin, measures the advantage of one response group over others.
→ The implementation builds upon Constrained Generative Policy Optimization (CGPO).
→ CGPO is adapted to incorporate the inference budget constraint.
→ The optimization problem is solved using integer linear programming (a simplified selection sketch follows this list).
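
To make the selection step concrete, here is a minimal sketch in Python. It is not the authors' implementation: the names (`Candidate`, `select_responses`, `long_fraction`) are assumptions, and a greedy pass stands in for the paper's integer linear program.

```python
# Hypothetical sketch of the budget-constrained selection step; not the authors' code.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Candidate:
    prompt_id: int
    text: str
    reward: float  # e.g. 1.0 if the final answer is verified correct, else 0.0
    group: int     # 0 = short reasoning, 1 = long (extended) reasoning

def select_responses(cands: List[Candidate], long_fraction: float) -> List[int]:
    """Pick one response per prompt for the weighted SFT update.

    Start from the best short response for every prompt, then upgrade to the
    best long response on the prompts with the largest reward margin
    (long reward minus short reward), while keeping the share of long picks
    at or below `long_fraction`. The paper solves this group-density-constrained
    selection with integer linear programming; this greedy pass is a simplification
    and assumes every prompt has at least one candidate in each group.
    """
    best: Dict[int, Dict[int, int]] = {}  # prompt_id -> group -> index of best candidate
    for i, c in enumerate(cands):
        slot = best.setdefault(c.prompt_id, {})
        if c.group not in slot or cands[slot[c.group]].reward < c.reward:
            slot[c.group] = i

    picks = {pid: groups[0] for pid, groups in best.items()}  # default: short responses
    margins: List[Tuple[float, int]] = [
        (cands[groups[1]].reward - cands[groups[0]].reward, pid)
        for pid, groups in best.items()
    ]
    budget = int(long_fraction * len(best))  # max number of long-reasoning picks
    for margin, pid in sorted(margins, reverse=True)[:budget]:
        if margin <= 0:
            break  # extra reasoning does not help on this prompt; keep it short
        picks[pid] = best[pid][1]
    return sorted(picks.values())
```

The design choice this illustrates: extra inference budget is spent only on prompts where the reward margin indicates longer reasoning actually helps.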
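
The selected responses then feed a weighted SFT-style update. The sketch below assumes a Hugging Face-style causal LM and 0/1 weights from the selection step; it illustrates the idea rather than reproducing the paper's training code.

```python
# Hypothetical sketch of the weighted SFT-style update (in the spirit of RAFT/RFT);
# `model`, `tokenizer`, and the batch layout are assumptions, not the paper's code.
import torch

def weighted_sft_step(model, tokenizer, batch, optimizer):
    """One gradient step on (prompt, response, weight) triples.

    Weights come from the constrained selection step (0/1 in the simplest case),
    so the update reduces to supervised fine-tuning on the selected responses.
    Assumes a Hugging Face-style causal LM whose forward pass returns `.loss`.
    """
    model.train()
    losses = []
    for prompt, response, weight in batch:
        if weight == 0:
            continue  # unselected responses contribute nothing to the update
        enc = tokenizer(prompt + response, return_tensors="pt").to(model.device)
        prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
        labels = enc.input_ids.clone()
        labels[:, :prompt_len] = -100  # supervise only the response tokens
        out = model(**enc, labels=labels)
        losses.append(weight * out.loss)
    if losses:
        torch.stack(losses).mean().backward()
        optimizer.step()
        optimizer.zero_grad()
```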
-----
Key Insights 💡:
→ Scaling up reasoning length pushes LLMs toward uni-modal behavior, always producing long responses, which is wasteful on simple queries.
→ A constrained RL approach can enable multi-modal behavior, adapting reasoning length to query difficulty.
→ IBPO allows models to learn the difficulty of queries.
→ Models fine-tuned with IBPO can allocate inference budgets adaptively.
→ This adaptive allocation improves performance-budget efficiency.
→ The method manages inference budgets by constraining how responses are distributed across length groups (sketched schematically after this list).
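
Schematically, the constrained objective behind this insight can be written as follows (illustrative notation, not the paper's exact formulation): maximize expected reward subject to a cap on the density of long-reasoning responses.

```latex
% Illustrative schematic, not the paper's exact formulation.
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
\quad \text{s.t.} \quad
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\big[ \mathbf{1}\{ y \in \mathcal{G}_{\text{long}} \} \big] \;\le\; \rho
```

Here $r$ is the utility (e.g. correctness), $\mathcal{G}_{\text{long}}$ is the long-reasoning response group, and $\rho$ is the allowed fraction of long responses, i.e. the inference budget.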
-----
Results 📊:
→ Achieves up to a 5.74% absolute improvement on the MATH500 benchmark.
→ Achieves up to an 11.2% relative improvement on the MATH500 benchmark.
→ These gains are measured against LLaMA3.1 8B Instruct while using 2.16× to 4.32× its inference budget.
→ Under the same budgets, the improvements are roughly 2× those of self-consistency.