Ding Zou
2025
Curr-ReFT: Overcoming Training Bottlenecks in Small-scale Vision-Language Models via Curriculum Reinforcement Finetuning
Huilin Deng
|
Ding Zou
|
Xinghao Zhao
|
Rui Ma
|
Yanming Guo
|
Yang Cao
|
Yu Kang
Findings of the Association for Computational Linguistics: EMNLP 2025
State-of-the-art vision-language models (VLMs) require massive scaling that limits practical deployment. Small-scale VLMs offer a practical alternative but face out-of-domain (OOD) collapse when trained with traditional supervised fine-tuning (SFT). Through GeneralPoints experiments, we identify that OOD collapse is due to SFT’s tendency to induce visual hallucinations under distribution shifts, whereas Reinforcement Learning’s (RL) bidirectional reward-driven mechanism with iterative error correction refines visual perception. Although RL-based post-training effectively mitigates OOD degradation, it faces a critical sparse reward dilemma in complex visual reasoning tasks. To this end, we propose Curriculum Reinforcement Finetuning (Curr-ReFT), comprising two sequential stages: (1) Structured Curriculum Reinforcement Learning, which progressively evolves task formats and reward functions to match models’ growing capabilities; and (2) Rejected Sampling-based Self-improvement, which maintains the fundamental capabilities of VLMs through selective learning from high-quality examples. Extensive experiments demonstrate that Curr-ReFT achieves state-of-the-art performance across various visual tasks in both in- and out-of-domain settings and benchmarks.