Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with Constraints
Zhenyun Yin, Shujie Wang, Xuhong Wang, Xingjun Ma, Yingchun Wang
Abstract
Large language models with search capabilities frequently exhibit miscalibrated confidence, producing incorrect answers with high certainty. We present Deliberative Searcher, a reasoning-primary framework that integrates search operations into chain-of-thought generation while maintaining explicit confidence calibration. Our method employs constrained reinforcement learning with adaptive Lagrangian multipliers to jointly optimize correctness and reliability. Experiments across five benchmarks demonstrate substantial improvements: our 7B model reduces average false-certain rates from 54% in baselines to 2%, while our 72B variant achieves competitive accuracy with closed-source models and reduces false-certain rates to 9%. The well-calibrated confidence scores also enable more efficient test-time compute: instead of standard majority voting, we use confidence-weighted aggregation and match the performance of 16-sample majority voting with only 4 samples, a 4× reduction in inference compute. These results establish calibrated confidence as a foundation for both trustworthy outputs and adaptive test-time compute, demonstrating the value of the proposed constrained RL framework in search-augmented language models.
- Anthology ID: 2026.acl-long.199
- Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month: July
- Year: 2026
- Address: San Diego, California, United States
- Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 4340–4354
- URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.199/
- Cite (ACL): Zhenyun Yin, Shujie Wang, Xuhong Wang, Xingjun Ma, and Yingchun Wang. 2026. Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with Constraints. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4340–4354, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal): Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with Constraints (Yin et al., ACL 2026)
- PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.199.pdf
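
The abstract contrasts standard majority voting with confidence-weighted aggregation over sampled answers. The sketch below illustrates the general idea only; it is not the authors' implementation. The function names, the (answer, confidence) pair format, and the rule of summing confidences per answer are all assumptions made for illustration, and the sample values are invented toy data rather than results from the paper.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return the most frequent answer; every sample counts equally."""
    return Counter(answers).most_common(1)[0][0]


def confidence_weighted_vote(samples):
    """Return the answer with the largest total confidence.

    `samples` is a list of (answer, confidence) pairs. Summing the
    confidences per answer is an assumed aggregation rule, chosen here
    because it is the simplest weighted variant of majority voting.
    """
    totals = defaultdict(float)
    for answer, confidence in samples:
        totals[answer] += confidence
    return max(totals, key=totals.get)


# Toy data (hypothetical): three low-confidence samples disagree with
# two high-confidence samples.
samples = [
    ("Paris", 0.9),
    ("Lyon", 0.2),
    ("Lyon", 0.3),
    ("Paris", 0.8),
    ("Lyon", 0.1),
]

majority_answer = majority_vote([answer for answer, _ in samples])
weighted_answer = confidence_weighted_vote(samples)
```

On this toy input the two rules disagree: majority voting counts three samples for "Lyon", while the summed confidences favor "Paris". Weighting of this kind only helps when the confidence scores are well calibrated, which is the property the abstract attributes to the trained model.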