Junan Chen

2026

Scaling test-time compute through extended chains of thought has become a dominant paradigm for improving large language model reasoning. However, existing research implicitly assumes that longer thinking always yields better results. This assumption remains largely unexamined. We systematically investigate how the marginal utility of additional reasoning tokens changes as compute budgets increase. We find that marginal returns diminish substantially at higher budgets and that models exhibit overthinking, where extended reasoning is associated with abandoning previously correct answers. Furthermore, we show that optimal thinking length varies across problem difficulty, suggesting that uniform compute allocation is suboptimal. Our cost-aware evaluation framework reveals that stopping at moderate budgets can reduce computation significantly while maintaining comparable accuracy.
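As an illustration of the marginal-utility analysis described above, the following minimal sketch computes the accuracy gained per additional 1,000 reasoning tokens between consecutive budgets, and the smallest budget whose accuracy stays within a tolerance of the best observed. The function names, the tolerance rule, and the numbers are hypothetical placeholders, not the paper's framework or results.

```python
from typing import List, Optional, Tuple


def marginal_gains(budgets: List[int], accuracies: List[float]) -> List[Tuple[int, float]]:
    """Accuracy gained per extra 1,000 tokens between consecutive budgets.

    Returns (upper_budget, gain_per_1k_tokens) pairs.
    """
    gains = []
    for i in range(1, len(budgets)):
        extra_tokens = budgets[i] - budgets[i - 1]
        extra_accuracy = accuracies[i] - accuracies[i - 1]
        gains.append((budgets[i], extra_accuracy / extra_tokens * 1000))
    return gains


def cheapest_budget_within(
    budgets: List[int], accuracies: List[float], tolerance: float
) -> Optional[int]:
    """Smallest budget whose accuracy is within `tolerance` of the best one."""
    best = max(accuracies)
    for budget, accuracy in zip(budgets, accuracies):
        if best - accuracy <= tolerance:
            return budget
    return None


# Made-up numbers for illustration only (not taken from the paper).
budgets = [1000, 2000, 4000, 8000]
accuracies = [0.50, 0.60, 0.64, 0.65]

for budget, gain in marginal_gains(budgets, accuracies):
    print(f"up to {budget} tokens: {gain:+.4f} accuracy per 1k extra tokens")

print("cheapest budget within 0.02 of best:", cheapest_budget_within(budgets, accuracies, 0.02))
```

With these placeholder values the per-token gain shrinks at each step, which is the pattern the abstract calls diminishing marginal returns.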

Recent preference optimization algorithms such as Direct Preference Optimization (DPO) have become prevalent for aligning large language models (LLMs) with human preferences. FocalPO improves upon DPO by introducing a modulating factor that down-weights misranked preference pairs. However, using a fixed modulating factor throughout training is suboptimal, as the model’s learning capacity evolves during training. We introduce DynamicFocalPO, which employs a dynamic focusing strategy that adapts over the course of training. Inspired by curriculum learning, our method initially focuses on correctly ranked samples to establish a solid foundation, then gradually incorporates harder samples as training progresses. Experiments demonstrate that DynamicFocalPO surpasses both DPO and FocalPO on benchmarks including AlpacaEval 2.0 and Arena-Hard using Mistral-Base-7B and Llama-3-Instruct-8B. We further provide theoretical analysis showing that the dynamic schedule enables adaptive entropy regularization and selective gradient suppression.
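The abstract does not give the loss or the schedule, so the following is only a sketch of one way a dynamic modulating factor could behave. It assumes a per-pair loss of the form -p^γ · log p, where p is the DPO-style preference probability, and a focusing exponent γ that decays linearly to zero. The functional form, the linear schedule, and all names and default values are assumptions for illustration, not the authors' definition.

```python
import math


def preference_prob(margin: float, beta: float = 0.1) -> float:
    """DPO-style probability that the chosen response is ranked above the rejected one.

    `margin` is the difference of policy/reference log-ratios (chosen minus rejected).
    """
    return 1.0 / (1.0 + math.exp(-beta * margin))


def gamma_schedule(step: int, total_steps: int, gamma_start: float = 2.0) -> float:
    """Hypothetical linear decay of the focusing exponent from gamma_start to 0."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return gamma_start * (1.0 - frac)


def dynamic_focal_loss(
    margin: float, step: int, total_steps: int, beta: float = 0.1, gamma_start: float = 2.0
) -> float:
    """Assumed per-pair loss: -(p ** gamma) * log(p).

    A large gamma shrinks the loss of misranked pairs (p < 0.5);
    at gamma = 0 the expression reduces to the plain DPO loss -log(p).
    """
    p = preference_prob(margin, beta)
    gamma = gamma_schedule(step, total_steps, gamma_start)
    return -(p ** gamma) * math.log(p)


# A misranked pair (negative margin) contributes little early on and fully at the end.
for step in (0, 500, 1000):
    print(step, round(dynamic_focal_loss(-20.0, step, 1000), 4))
```

Under these assumptions, early training is dominated by pairs the model already ranks correctly, and harder pairs are weighted in progressively, which mirrors the curriculum described above.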

Venues

Findings (2)
