Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax
Zeli Su, Ziyin Zhang, Zhou Liu, Xuexian Song, Zhankai Xu, Longfei Zheng, Xiaolu Zhang, Rong Fu, Guixian Xu, Wentao Zhang
Abstract
Extending large language models (LLMs) to low-resource languages often incurs an “alignment tax”: improvements in the target language come at the cost of catastrophic forgetting in general capabilities. We argue that this trade-off arises from the rigidity of supervised fine-tuning (SFT), which enforces token-level surface imitation on narrow and biased data distributions. To address this limitation, we propose a semantic-space alignment paradigm powered by Group Relative Policy Optimization (GRPO), where the model is optimized using embedding-level semantic rewards rather than likelihood maximization. This objective encourages meaning preservation through flexible realizations, enabling controlled updates that reduce destructive interference with pre-trained knowledge. We evaluate our approach on Tibetan–Chinese machine translation and Tibetan headline generation. Experiments show that our method acquires low-resource capabilities while markedly mitigating alignment tax, preserving general competence more effectively than SFT. Despite producing less rigid surface overlap, semantic RL yields higher semantic quality and preference in open-ended generation, and few-shot transfer results indicate that it learns more transferable and robust representations under limited supervision. Overall, our study demonstrates that reinforcement learning with semantic rewards provides a safer and more reliable pathway for inclusive low-resource language expansion.
- Anthology ID:
- 2026.findings-acl.880
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 17772–17786
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.880/
- Cite (ACL):
- Zeli Su, Ziyin Zhang, Zhou Liu, Xuexian Song, Zhankai Xu, Longfei Zheng, Xiaolu Zhang, Rong Fu, Guixian Xu, and Wentao Zhang. 2026. Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax. In Findings of the Association for Computational Linguistics: ACL 2026, pages 17772–17786, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax (Su et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.880.pdf
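
The abstract describes optimizing with embedding-level semantic rewards under GRPO rather than maximizing token likelihood. The following is only a minimal sketch of that idea, not the authors' implementation: it assumes the reward is the cosine similarity between a candidate's embedding and a reference embedding, and that advantages are the rewards standardized within one sampled group. The toy vectors stand in for the output of a sentence-embedding model, and the function names `semantic_rewards` and `group_relative_advantages` are hypothetical.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_rewards(candidate_embeddings, reference_embedding):
    """Assumed reward: similarity of each sampled output to the reference, in embedding space."""
    return [cosine_similarity(c, reference_embedding) for c in candidate_embeddings]


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples drawn for the same prompt.

    Population standard deviation is used here; that is a choice made for this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy embeddings (hypothetical): one reference and three sampled candidates.
reference = [1.0, 0.0, 0.0]
candidates = [
    [1.0, 0.0, 0.0],  # same direction as the reference
    [0.0, 1.0, 0.0],  # orthogonal to the reference
    [1.0, 1.0, 0.0],  # partially aligned
]

rewards = semantic_rewards(candidates, reference)
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.4f}  advantage={a:+.4f}")
```

Because the reward depends only on embedding similarity, two candidates with different surface wording but similar meaning would receive similar rewards under this sketch, which is the flexibility the abstract contrasts with token-level imitation.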