Sitong Wang
2026
SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation
Wenjie Yang | Mao Zheng | Mingyang Song | Zheng Li | Sitong Wang
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) have recently demonstrated remarkable capabilities in machine translation (MT). However, most advanced MT-specific LLMs rely heavily on external supervision during training, such as human-annotated reference data or trained reward models (RMs), which are expensive to obtain and difficult to scale. To address this limitation, we propose **Simple Self-Rewarding (SSR)**, a reinforcement learning (RL) framework for MT that is reference-free and relies solely on self-judging rewards. Using only 13K monolingual examples and Qwen2.5-7B as the backbone, SSR-Zero-7B outperforms existing MT-specific LLMs as well as larger general LLMs such as Qwen2.5-32B-Instruct on English ↔ Chinese translation benchmarks including WMT23, WMT24, and FLORES200. It further demonstrates strong generalization to low-resource language pairs. In addition, when augmented with external supervision from COMET, our strongest model, SSR-X-Zero-7B, surpasses all existing open-source models under 72B parameters and performs competitively with leading closed-source systems in English ↔ Chinese translation. Our analysis highlights the effectiveness and generalizability of the self-rewarding mechanism relative to external LLM-as-a-judge approaches and demonstrates its complementary benefits when combined with trained RMs. We will publicly release our code, data, and models.