CLARity: Reasoning Consistency Alone Can Teach Reinforced Experts

Jiuheng Lin, Cong Jiang, Zirui Wu, Jiarui Sun, Yansong Feng


Abstract
Training expert LLMs in domains with scarce fine-grained annotated data is challenging and often relies on multiple-choice questions (MCQs). However, standard outcome-based reinforcement learning (RL) on MCQs is risky: while it may improve accuracy, it frequently compromises the reasoning process, yielding internally inconsistent rationales that diverge from the final predictions. Existing solutions for supervising the reasoning process, such as large-scale Process Reward Models (PRMs), are prohibitively expensive. To address this, we propose CLARity, a cost-effective RL framework that enhances reasoning quality using only a small, general-purpose LLM. CLARity integrates a consistency-aware reward mechanism with a two-stage refine-then-monitor training pipeline to enhance reasoning consistency, and a dynamic data reformulation strategy to better exploit the available annotated data. Experiments demonstrate that CLARity improves response consistency by 16.5% over standard outcome-based RL and improves final accuracy by 7.5%. Human evaluations further confirm substantial gains in factual correctness and reasoning coherence, leading to more trustworthy model outputs. CLARity thus offers a generalizable solution that enables smaller models to effectively guide expert LLM training by monitoring reasoning consistency.
Anthology ID:
2026.acl-long.1358
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
29460–29480
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1358/
Cite (ACL):
Jiuheng Lin, Cong Jiang, Zirui Wu, Jiarui Sun, and Yansong Feng. 2026. CLARity: Reasoning Consistency Alone Can Teach Reinforced Experts. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 29460–29480, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
CLARity: Reasoning Consistency Alone Can Teach Reinforced Experts (Lin et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1358.pdf
Checklist:
 2026.acl-long.1358.checklist.pdf