Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with Constraints

Zhenyun Yin; Shujie Wang; Xuhong Wang; Xingjun Ma; Yingchun Wang

Abstract

Large language models with search capabilities frequently exhibit miscalibrated confidence, producing incorrect answers with high certainty. We present Deliberative Searcher, a reasoning-primary framework that integrates search operations into chain-of-thought generation while maintaining explicit confidence calibration. Our method employs constrained reinforcement learning with adaptive Lagrangian multipliers to jointly optimize correctness and reliability. Experiments across five benchmarks demonstrate substantial improvements: our 7B model reduces average false-certain rates from 54% in baselines to 2%, while our 72B variant achieves competitive accuracy with closed-source models and reduces false-certain rates to 9%. The well-calibrated confidence scores also enable more efficient test-time compute: instead of standard majority voting, we use confidence-weighted aggregation and match the performance of 16-sample majority voting with only 4 samples, a 4× reduction in inference compute. These results establish calibrated confidence as a foundation for both trustworthy outputs and adaptive test-time compute, demonstrating the value of the proposed constrained RL framework in search-augmented language models.
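The contrast between standard majority voting and confidence-weighted aggregation can be illustrated with a minimal sketch. The abstract does not spell out the aggregation rule, so the version below assumes the simplest form: each sampled answer contributes its self-reported confidence to that answer's score, and the highest-scoring answer wins. The function names and the `(answer, confidence)` sample format are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def majority_vote(samples):
    """Return the most frequent answer; confidences are ignored."""
    counts = defaultdict(int)
    for answer, _confidence in samples:
        counts[answer] += 1
    return max(counts, key=counts.get)


def confidence_weighted_vote(samples):
    """Return the answer with the largest summed confidence.

    Assumption: the score of an answer is the plain sum of the
    confidences of the samples that produced it. The paper's actual
    weighting scheme may differ.
    """
    scores = defaultdict(float)
    for answer, confidence in samples:
        scores[answer] += confidence
    return max(scores, key=scores.get)


# Hypothetical samples: (answer, confidence in [0, 1]).
samples = [("Lyon", 0.2), ("Lyon", 0.3), ("Paris", 0.9)]

print(majority_vote(samples))             # Lyon
print(confidence_weighted_vote(samples))  # Paris
```

In this toy case two low-confidence samples outvote one high-confidence sample under majority voting, while the weighted rule favors the confident answer. That only helps when the confidences are well calibrated, which is the property the constrained RL training is meant to provide.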

Anthology ID: 2026.acl-long.199
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 4340–4354
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.199/
Cite (ACL): Zhenyun Yin, Shujie Wang, Xuhong Wang, Xingjun Ma, and Yingchun Wang. 2026. Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with Constraints. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4340–4354, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with Constraints (Yin et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.199.pdf
Checklist: 2026.acl-long.199.checklist.pdf
