Wenjie Yang


2026

SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation
Wenjie Yang | Mao Zheng | Mingyang Song | Zheng Li | Sitong Wang
Findings of the Association for Computational Linguistics: ACL 2026

Large language models (LLMs) have recently demonstrated remarkable capabilities in machine translation (MT). However, most advanced MT-specific LLMs rely heavily on external supervision during training, such as human-annotated reference data or trained reward models (RMs), which are expensive to obtain and difficult to scale. To address this limitation, we propose **Simple Self-Rewarding (SSR)**, a reinforcement learning (RL) framework for MT that is reference-free and relies solely on self-judging rewards. Using only 13K monolingual examples and Qwen-2.5-7B as the backbone, SSR-Zero-7B outperforms existing MT-specific LLMs as well as larger general LLMs such as Qwen2.5-32B-Instruct on English ↔ Chinese translation benchmarks including WMT23, WMT24, and FLORES200. It further demonstrates strong generalization to low-resource language pairs. In addition, when augmented with external supervision from COMET, our strongest model, SSR-X-Zero-7B, surpasses all existing open-source models under 72B parameters and performs competitively with leading closed-source systems in English ↔ Chinese translation. Our analysis highlights the effectiveness and generalizability of the self-rewarding mechanism relative to external LLM-as-a-judge approaches and demonstrates its complementary benefits when combined with trained RMs. We will publicly release our code, data, and models.

2025


FastCuRL: Curriculum Reinforcement Learning with Stage-wise Context Scaling for Efficient Training R1-like Reasoning Models
Mingyang Song | Mao Zheng | Zheng Li | Wenjie Yang | Xuan Luo
Findings of the Association for Computational Linguistics: EMNLP 2025

Improving training efficiency continues to be one of the primary challenges in large-scale Reinforcement Learning (RL). In this paper, we investigate how context length and the complexity of training data influence the RL scaling training process of R1-distilled reasoning models, e.g., DeepSeek-R1-Distill-Qwen-1.5B. Our experimental results reveal that: (1) simply controlling the context length and selecting the training data based on the input prompt length can effectively improve the training efficiency of RL scaling, achieving better performance with more concise CoT; (2) properly scaling the context length helps mitigate entropy collapse; and (3) carefully choosing the context length facilitates achieving efficient LLM training and reasoning. Inspired by these insights, we propose FastCuRL, a curriculum RL framework with stage-wise context scaling to achieve efficient LLM training and reasoning. Extensive experimental results demonstrate that FastCuRL-1.5B-V3 significantly outperforms state-of-the-art reasoning models on five competition-level benchmarks and achieves 49.6% accuracy on AIME 2024. Furthermore, FastCuRL-1.5B-Preview surpasses DeepScaleR-1.5B-Preview on five benchmarks while only using a single node with 8 GPUs and a total of 50% of training steps.

