SOAR: Supervision from Observation for Agentic Reinforcement Learning

Meng Li (李梦); Lei Li; Xiting Wang; Yi Yuan; Zheng Wei; Brucebian; Zang Li

Abstract

Agentic reinforcement learning enables large language models to solve long-horizon tasks by interacting with the environment and internalizing tool-use behavior into their reasoning. Prior work assigns supervision primarily based on outcome rewards or external reward models, but largely ignores environment observations, a critical source of learning. Consequently, agents may identify successful actions without understanding how the environment responds, producing suboptimal policies. To address this, we propose SOAR (Supervision from Observation for Agentic Reinforcement Learning), which assigns positive advantages to observation tokens proportional to the negative entropy of preceding actions. This encourages the agent to learn from outcomes of confident actions, grounding policy updates in environment dynamics and improving anticipation of tool-call consequences. Empirical results across three domains and 14 benchmarks show that SOAR improves performance, yielding gains of up to 7.0% on general reasoning tasks and 16.9% on deep research tasks, while reducing erroneous and inefficient tool usage.
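The weighting idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract says only that observation tokens receive positive advantages proportional to the negative entropy of the preceding action. The function names, the averaging over action tokens, the `scale` parameter, and the `exp(-H)` mapping used to keep the value positive are all assumptions made for illustration.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def observation_advantages(action_token_probs, num_observation_tokens, scale=1.0):
    """Hypothetical helper: one advantage value per observation token.

    action_token_probs is a list of per-token probability distributions for
    the action that immediately preceded the observation.
    """
    if not action_token_probs:
        # No preceding action to measure confidence on: assign no signal.
        return [0.0] * num_observation_tokens

    # Assumption: summarize the action's uncertainty by its mean token entropy.
    mean_entropy = sum(token_entropy(p) for p in action_token_probs) / len(
        action_token_probs
    )

    # Assumption: exp(-H) turns negative entropy into a weight in (0, 1],
    # so confident (low-entropy) actions give larger positive advantages.
    weight = scale * math.exp(-mean_entropy)
    return [weight] * num_observation_tokens


confident = observation_advantages([[1.0, 0.0]], num_observation_tokens=3)
uncertain = observation_advantages([[0.5, 0.5]], num_observation_tokens=3)
```

Under these assumptions, a fully confident action yields a weight of 1.0 on each observation token, while an action chosen from a uniform two-way distribution yields about 0.5, so observations that follow confident actions contribute more to the update.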

Anthology ID: 2026.acl-long.1624
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 35175–35197
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1624/
Cite (ACL): Meng Li, Lei Li, Xiting Wang, Yi Yuan, Zheng Wei, Brucebian, and Zang Li. 2026. SOAR: Supervision from Observation for Agentic Reinforcement Learning. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 35175–35197, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): SOAR: Supervision from Observation for Agentic Reinforcement Learning (Li et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1624.pdf
Checklist: 2026.acl-long.1624.checklist.pdf
