LCO: LLM-based Constraint Optimization for Safer Agentic LLMs in Real-world Tasks

Jiayong Wan, Jiawei Chen, Zhaoxia Yin, Liu Shuyuan, Hang Su


Abstract
Large Language Models (LLMs) are increasingly acting as autonomous agents, but their continuous interaction with the environment can lead to in-context reward hacking (ICRH), a phenomenon in which LLMs iteratively optimize their behavior to maximize proxy objectives and inadvertently produce harmful side effects. Existing defense methods are insufficient to address this risk, because ICRH arises not from adversarial inputs but from the model’s own over-optimization. To mitigate this issue, we propose LLM-based Constraint Optimization (LCO), a framework that effectively reduces ICRH without model fine-tuning. LCO consists of two modules: a self-thought module, which guides the LLM to proactively deliberate on and integrate potential safety constraints before execution; and a guided evolutionary exploration module, which employs LLM-based crossover and mutation to constrain the model’s actions within a safe solution space while maintaining task performance. Experimental results demonstrate that LCO substantially alleviates ICRH in both output-refine and policy-refine scenarios. In particular, on the tweet engagement optimization task, LCO achieves a 39% reduction in the Toxicity Growth Rate (TGR) on GPT-4, while on the policy optimization benchmark, it reduces the ICRH Occurrence Rate by 15.23%, demonstrating safety improvement without sacrificing task performance. Our code is available at: https://github.com/Califoni/LCO_for_ICRH.
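The two-module idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every function name here (`propose_constraints`, `is_safe`, `engagement`, `crossover`, `mutate`, `evolve`) is hypothetical, the LLM calls are replaced by simple string operations, and the banned-word check and word-count score are stand-ins for real safety and engagement measures. It only shows the general shape of a constraint-filtered evolutionary loop: constraints are fixed before any optimization, and unsafe candidates are discarded before selection on the proxy objective.

```python
import random
from typing import List, Set


def propose_constraints(task: str) -> Set[str]:
    # Hypothetical stand-in for a "self-thought" step: decide on safety
    # constraints before optimizing. Here it is a fixed banned-word set.
    return {"idiot", "stupid"}


def is_safe(candidate: str, banned: Set[str]) -> bool:
    # Toy safety check: no banned word appears as a whole token.
    return not any(word in banned for word in candidate.lower().split())


def engagement(candidate: str) -> int:
    # Toy proxy objective: longer, more emphatic text scores higher.
    return len(candidate.split()) + candidate.count("!")


def crossover(a: str, b: str, rng: random.Random) -> str:
    # Stand-in for LLM-based crossover: splice two candidates at a cut point.
    wa, wb = a.split(), b.split()
    cut = rng.randint(0, min(len(wa), len(wb)))
    return " ".join(wa[:cut] + wb[cut:])


def mutate(candidate: str, rng: random.Random) -> str:
    # Stand-in for LLM-based mutation: append one word, which may be unsafe.
    extras = ["wow", "idiot", "great", "stupid", "news!"]
    return candidate + " " + rng.choice(extras)


def evolve(seed: List[str], task: str, generations: int,
           rng: random.Random) -> List[str]:
    banned = propose_constraints(task)
    population = [c for c in seed if is_safe(c, banned)]
    size = len(population)  # assumes at least two safe seed candidates
    for _ in range(generations):
        children = []
        for _ in range(size * 2):
            a, b = rng.sample(population, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        # Unsafe children are dropped before the proxy objective is consulted.
        pool = population + [c for c in children if is_safe(c, banned)]
        pool.sort(key=engagement, reverse=True)
        population = pool[:size]
    return population
```

In this sketch the safety filter runs before ranking, so a candidate that scores well on the proxy objective but violates a constraint never enters the next generation. Parents are carried over into the selection pool, so the best proxy score never decreases across generations.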
Anthology ID:
2026.findings-acl.1390
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
27910–27934
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1390/
Cite (ACL):
Jiayong Wan, Jiawei Chen, Zhaoxia Yin, Liu Shuyuan, and Hang Su. 2026. LCO: LLM-based Constraint Optimization for Safer Agentic LLMs in Real-world Tasks. In Findings of the Association for Computational Linguistics: ACL 2026, pages 27910–27934, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
LCO: LLM-based Constraint Optimization for Safer Agentic LLMs in Real-world Tasks (Wan et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1390.pdf
Checklist:
2026.findings-acl.1390.checklist.pdf