Tianyu Dong
2026
SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment
Tianyu Dong | Yangyang Liu | Jiang Zhou | Xinwei Wu | Xiaohu Zhao | Hao Wang | Heng Liu | Linlong Xu | Longyue Wang | Weihua Luo | Shaolin Zhu | Deyi Xiong
Findings of the Association for Computational Linguistics: ACL 2026
Sparse Mixture-of-Experts (MoE) architectures have emerged as an increasingly influential paradigm as they offer a strategic balance between parameter scalability and computational efficiency. However, low-resource language tokens are often routed to different experts than those predominantly activated by high-resource inputs, which limits cross-lingual expert sharing. This cross-lingual routing divergence consequently hinders their efficacy in multilingual contexts. To address this issue, we propose SARA (Semantically Anchored Routing Alignment), a framework designed to transfer specialized capabilities from high-resource languages as anchors to low-resource languages. SARA explicitly aligns the routing distribution of multilingual inputs with high-resource semantic anchors using a symmetric Jensen-Shannon (JS) divergence constraint. Unlike traditional distillation methods that operate on output logits, SARA directly aligns the internal routing distributions of MoE layers, encouraging mechanistic consistency in expert selection across languages. We conduct experiments on 2 LLMs across 5 low-resource languages and 3 benchmarks. Experimental results demonstrate that SARA outperforms standard instruction tuning (e.g., +0.8% on Qwen3-30B-A3B and +1.2% on Phi-3.5-MoE-instruct on the Global-MMLU benchmark). Further analyses show that SARA effectively addresses performance bottlenecks in low-resource languages, providing a scalable pathway to enhance multilingual capabilities in sparse architectures.
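The symmetric Jensen-Shannon divergence named in the abstract above can be illustrated with a minimal sketch. The abstract does not say how routing distributions are obtained or how the constraint is weighted, so the softmax over router logits, the four-expert example values, and the function names below are all assumptions for illustration, not the paper's implementation.

```python
import math


def softmax(logits):
    """Turn raw router logits into a probability distribution over experts."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence between two distributions."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical router logits over 4 experts for one layer (assumed values):
# an anchor input in a high-resource language and its low-resource counterpart.
anchor_routing = softmax([2.0, 0.5, -1.0, 0.1])
low_resource_routing = softmax([0.2, 1.8, 0.0, -0.5])

alignment_penalty = js_divergence(low_resource_routing, anchor_routing)
```

Because the measure is symmetric and bounded above by ln 2, a penalty of this kind treats both routing distributions evenly and stays finite even when the two inputs favor disjoint sets of experts.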
From Curated Data to Scalable Models: Continual Pre-training of Dense and MoE Large Language Models for Tibetan
Lei Yang | Leiyu Pan | Bojian Xiong | Renren Jin | Shaowei Zhang | Yue Chen | Ling Shi | Jiang Zhou | Junru Wu | Zhen Wang | Jianxiang Peng | Juesi Xiao | Tianyu Dong | Zhuowen Han | Zhuo Chen | Yuqi Ren | Deyi Xiong
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, yet their performance remains heavily biased toward high-resource languages. Tibetan, despite its cultural significance and large speaker population, is still substantially underrepresented. In this work, we present a comprehensive pipeline for advancing Tibetan language modeling through large-scale data curation and continual pre-training. We construct a 72 GB high-quality Tibetan corpus, the largest to date, and adapt Qwen2.5-7B through balanced multilingual continual pre-training with Tibetan, Chinese, and English, followed by multilingual instruction tuning. To further scale capacity efficiently, we extend the dense model to a 50B-A10B Mixture-of-Experts architecture. Due to the absence of standardized Tibetan benchmarks, we build multiple evaluation datasets via high-quality translation and human verification. Experimental results show that both dense and MoE models consistently outperform existing open-source and Tibetan-focused models of similar scale across diverse tasks. Our work advances Tibetan-centric LLM research and provides transferable insights for extending LLMs to other low-resource languages. We will release the model weights, evaluation benchmarks, and detailed data processing documentation in a follow-up release.
Incentivizing Parametric Knowledge via Reinforcement Learning with Verifiable Rewards for Cross-Cultural Entity Translation
Jiang Zhou | Xiaohu Zhao | Xinwei Wu | Tianyu Dong | Hao Wang | Yangyang Liu | Heng Liu | Linlong Xu | Longyue Wang | Weihua Luo | Deyi Xiong
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Cross-cultural entity translation remains challenging for large language models (LLMs) because they usually produce literal or phonetic renderings instead of culturally appropriate translations in context. However, relevant knowledge may already be encoded in model parameters during large-scale pre-training. To incentivize the effective use of parametric knowledge, we propose EA-RLVR (Entity-Anchored Reinforcement Learning with Verifiable Rewards), a training framework that optimizes cross-cultural entity translation without relying on external knowledge bases. EA-RLVR anchors supervision on a verifiable, entity-level reward signal and incorporates lightweight structural gates to stabilize optimization. This design steers the model toward learning a robust reasoning process rather than merely imitating reference translations. We evaluate EA-RLVR on XC-Translate and observe consistent improvements in both entity translation accuracy and out-of-domain generalization. Specifically, training on merely 7k samples boosts Qwen3-14B’s entity translation accuracy from 23.66% to 31.87% on a 50k test set comprising entirely unseen entities. The learned entity translation ability also transfers to general translation, yielding +1.35 XCOMET on WMT24pp, which scales to +1.59 with extended optimization. Extensive analyses of pass@k dynamics and reward formulations attribute these gains to superior sampling efficiency and a stable optimization landscape.
M2PO: Multi-Perspective Multi-Pair Preference Optimization for Machine Translation
Hao Wang | Linlong Xu | Heng Liu | Yangyang Liu | Xiaohu Zhao | Bo Zeng | Liangying Shao | Yichen Dong | Xinwei Wu | Jiang Zhou | Tianyu Dong | Xiangxiang Zeng | Longyue Wang | Weihua Luo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Aligning Large Language Models (LLMs) to human preferences is pivotal for Machine Translation (MT), yet current approaches are often hindered by misleading reward signals. Our analysis reveals that prevailing Quality Estimation (QE) models exhibit a systematic blind spot towards **partial errors**—specifically partial hallucinations and omissions—often favoring superficially fluent but unfaithful translations. To address this, we propose **M2PO** (**M**ulti-Perspective **M**ulti-Pair **P**reference **O**ptimization), a data-centric framework for preference optimization in machine translation. First, to correct the bias towards fluency, M2PO uses a multi-perspective alignment mechanism that decouples semantic fidelity from fluency, prioritizing faithfulness via a curriculum strategy. Second, with the bias corrected, partial errors fall between perfect and severely incorrect translations, making them inefficient to learn via standard best-versus-worst comparisons. We thus introduce a multi-pair objective that leverages the full candidate list to capture these fine-grained error signals. Experiments on WMT23, WMT24, and FLORES-200 show that M2PO enables a 9B model to outperform leading open-source baselines and achieve parity with proprietary models like GPT-4o and Gemini-2.0-Flash, demonstrating significant potential for efficient, high-fidelity LLM-based translation.
2025
MLAS-LoRA: Language-Aware Parameters Detection and LoRA-Based Knowledge Transfer for Multilingual Machine Translation
Tianyu Dong | Bo Li | Jinsong Liu | Shaolin Zhu | Deyi Xiong
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have achieved remarkable progress in multilingual machine translation (MT), demonstrating strong performance even with limited parallel data. However, effectively fine-tuning LLMs for MT is challenging due to parameter interference, which arises from the conflicting demands of different language pairs and the risk of overwriting pre-trained knowledge. To address this issue, we propose MLAS-LoRA, a novel multiple language-aware LoRA knowledge transfer framework. MLAS-LoRA efficiently adapts LLMs to MT by selectively transferring knowledge from a large teacher to a small student model. Our approach first evaluates the awareness of neurons and extracts linguistic knowledge in the teacher model to both the general MT task and specific language pairs. We then propose a multiple language-specific LoRA architecture to inject the extracted knowledge into the student model. During fine-tuning, only the parameters of the relevant language-general and language-specific LoRA modules are updated. Experimental results on diverse multilingual language pairs demonstrate that MLAS-LoRA significantly outperforms strong baselines by +1.7 BLEU on average, including standard fine-tuning and other parameter-efficient methods.
Co-authors
- Deyi Xiong (德意 熊) 4
- Jiang Zhou 4
- Yangyang Liu 3
- Heng Liu 3
- Weihua Luo 3
- Hao Wang 3
- Longyue Wang 3
- Xinwei Wu 3
- Linlong Xu 3
- Xiaohu Zhao 3
- Shaolin Zhu 2
- Yue Chen 1
- Zhuo Chen 1
- Yichen Dong 1
- Zhuowen Han 1
- Renren Jin 1
- Bo Li 1
- Jinsong Liu 1
- Leiyu Pan 1
- Jianxiang Peng 1
- Yuqi Ren 1
- Liangying Shao 1
- Ling Shi 1
- Zhen Wang 1
- Junru Wu 1
- Juesi Xiao 1
- Bojian Xiong 1
- Lei Yang 1
- Bo Zeng 1
- Xiangxiang Zeng 1
- Shaowei Zhang 1