AAPO: Enhancing the Reasoning Capabilities of LLMs with Advantage Margin

Jian Xiong; Jingbo Zhou; Jingyong Ye; Qiang Huang; Dejing Dou

Abstract

Reinforcement learning (RL) has emerged as an effective approach for enhancing the reasoning capabilities of large language models (LLMs), especially in scenarios where supervised fine-tuning (SFT) falls short due to limited chain-of-thought (CoT) data. Among RL-based post-training methods, group relative advantage estimation, as exemplified by Group Relative Policy Optimization (GRPO), has attracted considerable attention for eliminating the dependency on a value model, thereby simplifying training compared to traditional approaches like Proximal Policy Optimization (PPO). However, existing group relative advantage estimation methods still suffer from training inefficiencies, particularly when the estimated advantage approaches zero. To address this limitation, we propose Advantage-Augmented Policy Optimization (AAPO), a novel RL algorithm that optimizes the cross-entropy (CE) loss using advantages enhanced through a margin-based estimation scheme. This approach effectively mitigates the inefficiencies associated with group relative advantage estimation. Experimental results on multiple mathematical reasoning benchmarks and model series demonstrate the superior performance of AAPO. Code is available at https://github.com/JianxXiong/AAPO.
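
To make the zero-advantage issue concrete, the following is a minimal sketch of group relative advantage estimation in the GRPO style, where each sampled response's reward is standardized against the other responses in its group. The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` stabilizer are illustrative assumptions rather than details taken from the paper, and the sketch does not implement AAPO's margin-based scheme, which the abstract does not specify.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own sampling group.

    Assumption: population standard deviation plus a small eps in the
    denominator; actual implementations may differ in both choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields nonzero advantages of both signs.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response earns the same reward yields all-zero
# advantages, so these samples contribute no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

In the `uniform` case every advantage is exactly zero, which illustrates the kind of training inefficiency the abstract refers to when the estimated advantage approaches zero.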

Anthology ID: 2026.acl-long.1131
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 24663–24680
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1131/
Cite (ACL): Jian Xiong, Jingbo Zhou, Jingyong Ye, Qiang Huang, and Dejing Dou. 2026. AAPO: Enhancing the Reasoning Capabilities of LLMs with Advantage Margin. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24663–24680, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): AAPO: Enhancing the Reasoning Capabilities of LLMs with Advantage Margin (Xiong et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1131.pdf
Checklist: 2026.acl-long.1131.checklist.pdf
