Xiaohu Zhao
2026
SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment
Tianyu Dong | Yangyang Liu | Jiang Zhou | Xinwei Wu | Xiaohu Zhao | Hao Wang | Heng Liu | Linlong Xu | Longyue Wang | Weihua Luo | Shaolin Zhu | Deyi Xiong
Findings of the Association for Computational Linguistics: ACL 2026
Tianyu Dong | Yangyang Liu | Jiang Zhou | Xinwei Wu | Xiaohu Zhao | Hao Wang | Heng Liu | Linlong Xu | Longyue Wang | Weihua Luo | Shaolin Zhu | Deyi Xiong
Findings of the Association for Computational Linguistics: ACL 2026
Sparse Mixture-of-Experts (MoE) architectures have emerged as an increasingly influential paradigm as they offer a strategic balance between parameter scalability and computational efficiency. However, low-resource language tokens are often routed to different experts than those predominantly activated by high-resource inputs, which limits cross-lingual expert sharing. This cross-lingual routing divergence consequently hinders their efficacy in multilingual contexts. To address this issue, we propose SARA (Semantically Anchored Routing Alignment), a framework designed to transfer specialized capabilities from high-resource languages as anchors to low-resource languages. SARA explicitly aligns the routing distribution of multilingual inputs with high-resource semantic anchors using a symmetric Jensen-Shannon (JS) divergence constraint. Unlike traditional distillation methods that operate on output logits, SARA directly aligns the internal routing distributions of MoE layers, encouraging mechanistic consistency in expert selection across languages. We conduct experiments on 2 LLMs across 5 low-resource languages and 3 benchmarks. Experiment results demonstrate that SARA outperforms standard instruction tuning (e.g., +0.8% on Qwen3-30B-A3B and +1.2% on Phi-3.5-MoE-instruct on Global-MMLU benchmark). Further analyses show that SARA effectively addresses performance bottlenecks in low-resource languages, providing a scalable pathway to enhance multilingual capabilities in sparse architectures.
From Insight to Action: A Novel Framework for Interpretability-Guided Data Selection in Large Language Models
Ling Shi | Xinwei Wu | Xiaohu Zhao | Hao Wang | Heng Liu | Yangyang Liu | Linlong Xu | Longyue Wang | Deyi Xiong | Weihua Luo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Ling Shi | Xinwei Wu | Xiaohu Zhao | Hao Wang | Heng Liu | Yangyang Liu | Linlong Xu | Longyue Wang | Deyi Xiong | Weihua Luo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While mechanistic interpretability tools like Sparse Autoencoders (SAEs) can uncover meaningful features within Large Language Models (LLMs), a critical gap remains in transforming these insights into practical actions for model optimization. We bridge this gap with the hypothesis that data selection guided by a model’s internal task features is a effective training strategy. Inspired by this, we propose Interpretability-Guided Data Selection (IGDS), a framework that first identifies these causal task features through frequency recall and interventional filtering, then selects “Feature-Resonant Data” that maximally activates task features for fine-tuning. We validate IGDS on mathematical reasoning, summarization, and translation tasks within Gemma-2, LLaMA-3.1, and Qwen3 models. Our experiments demonstrate exceptional data efficiency: on the Math task, IGDS surpasses full-dataset fine-tuning by a remarkable **17.4%** on Gemma-2-2B while using only 50% of the data, and outperforms established baselines focused on data quality and diversity. Analysis confirms a strong positive correlation between feature amplification and task performance improvement. IGDS thus provides a direct and effective framework to enhance LLMs by leveraging their internal mechanisms, validating our core hypothesis.
Incentivizing Parametric Knowledge via Reinforcement Learning with Verifiable Rewards for Cross-Cultural Entity Translation
Jiang Zhou | Xiaohu Zhao | Xinwei Wu | Tianyu Dong | Hao Wang | Yangyang Liu | Heng Liu | Linlong Xu | Longyue Wang | Weihua Luo | Deyi Xiong
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Jiang Zhou | Xiaohu Zhao | Xinwei Wu | Tianyu Dong | Hao Wang | Yangyang Liu | Heng Liu | Linlong Xu | Longyue Wang | Weihua Luo | Deyi Xiong
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Cross-cultural entity translation remains challenging for large language models (LLMs) as literal or phonetic renderings are usually yielded instead of culturally appropriate translations in context. However, relevant knowledge may already be encoded in model parameters during large-scale pre-training. To incentivize the effective use of parametric knowledge, we propose EA-RLVR (Entity-Anchored Reinforcement Learning with Verifiable Rewards), a training framework that optimizes cross-cultural entity translation without relying on external knowledge bases. EA-RLVR anchors supervision on a verifiable, entity-level reward signal and incorporates lightweight structural gates to stabilize optimization. This design steers the model toward learning a robust reasoning process rather than merely imitating reference translations. We evaluate EA-RLVR on XC-Translate and observe consistent improvements in both entity translation accuracy and out-of-domain generalization. Specifically, training on merely 7k samples boosts Qwen3-14B’s entity translation accuracy from 23.66% to 31.87% on a 50k test set comprising entirely unseen entities. The learned entity translation ability also transfers to general translation, yielding +1.35 XCOMET on WMT24pp, which scales to +1.59 with extended optimization. Extensive analyses of pass@k dynamics and reward formulations attribute these gains to superior sampling efficiency and a stable optimization landscape.
M2PO: Multi-Perspective Multi-Pair Preference Optimization for Machine Translation
Hao Wang | Linlong Xu | Heng Liu | Yangyang Liu | Xiaohu Zhao | Bo Zeng | Liangying Shao | Yichen Dong | Xinwei Wu | Jiang Zhou | Tianyu Dong | Xiangxiang Zeng | Longyue Wang | Weihua Luo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Hao Wang | Linlong Xu | Heng Liu | Yangyang Liu | Xiaohu Zhao | Bo Zeng | Liangying Shao | Yichen Dong | Xinwei Wu | Jiang Zhou | Tianyu Dong | Xiangxiang Zeng | Longyue Wang | Weihua Luo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Aligning Large Language Models (LLMs) to human preferences is pivotal for Machine Translation (MT), yet current approaches are often hindered by misleading reward signals. Our analysis reveals that prevailing Quality Estimation (QE) models exhibit a systematic blind spot towards **partial errors**—specifically partial hallucinations and omissions—often favoring superficially fluent but unfaithful translations. To address this, we propose **M2PO** (**M**ulti-Perspective **M**ulti-Pair **P**reference **O**ptimization), a data-centric framework for preference optimization in machine translation. First, to correct the bias towards fluency, M2PO uses a multi-perspective alignment mechanism that decouples semantic fidelity from fluency, prioritizing faithfulness via a curriculum strategy. Second, with the bias corrected, partial errors fall between perfect and severely incorrect translations, making them inefficient to learn via standard best-versus-worst comparisons. We thus introduce a multi-pair objective that leverages the full candidate list to capture these fine-grained error signals. Experiments on WMT23, WMT24, and FLORES-200 show that M2PO enables a 9B model to outperform leading open-source baselines and achieve parity with proprietary models like GPT-4o and Gemini-2.0-Flash, demonstrating significant potential for efficient, high-fidelity LLM-based translation.
2025
Marco Large Translation Model at WMT2025: Transforming Translation Capability in LLMs via Quality-Aware Training and Decoding
Hao Wang | Linlong Xu | Heng Liu | Yangyang Liu | Xiaohu Zhao | Bo Zeng | Longyue Wang | Weihua Luo | Kaifu Zhang
Proceedings of the Tenth Conference on Machine Translation
Hao Wang | Linlong Xu | Heng Liu | Yangyang Liu | Xiaohu Zhao | Bo Zeng | Longyue Wang | Weihua Luo | Kaifu Zhang
Proceedings of the Tenth Conference on Machine Translation
This paper presents the Marco-MT-Algharb system, our submission to the WMT2025 General Machine Translation Shared Task from Alibaba International Digital Commerce (AIDC). Built on a large language model (LLM) foundation, the system’s strong performance stems from novel quality-aware training and decoding techniques: (1) a two-step supervised fine-tuning (SFT) process incorporating data distillation, (2) a two-step reinforcement learning (RL) framework for preference alignment, and (3) a hybrid decoding strategy that integrates word alignment with Minimum Bayes Risk (MBR) re-ranking to improve faithfulness. These approaches jointly ensure high accuracy and robustness across diverse languages and domains. In the official human evaluation, our system secured five first‐place finishes, one second, and four third‐place results in the constrained category across the 13 directions we participated in. Notably, for the English-Chinese, our results surpassed all open/closed‐source systems.
2023
CKDST: Comprehensively and Effectively Distill Knowledge from Machine Translation to End-to-End Speech Translation
Yikun Lei | Zhengshan Xue | Xiaohu Zhao | Haoran Sun | Shaolin Zhu | Xiaodong Lin | Deyi Xiong
Findings of the Association for Computational Linguistics: ACL 2023
Yikun Lei | Zhengshan Xue | Xiaohu Zhao | Haoran Sun | Shaolin Zhu | Xiaodong Lin | Deyi Xiong
Findings of the Association for Computational Linguistics: ACL 2023
Distilling knowledge from a high-resource task, e.g., machine translation, is an effective way to alleviate the data scarcity problem of end-to-end speech translation. However, previous works simply use the classical knowledge distillation that does not allow for adequate transfer of knowledge from machine translation. In this paper, we propose a comprehensive knowledge distillation framework for speech translation, CKDST, which is capable of comprehensively and effectively distilling knowledge from machine translation to speech translation from two perspectives: cross-modal contrastive representation distillation and simultaneous decoupled knowledge distillation. In the former, we leverage a contrastive learning objective to optmize the mutual information between speech and text representations for representation distillation in the encoder. In the later, we decouple the non-target class knowledge from target class knowledge for logits distillation in the decoder. Experiments on the MuST-C benchmark dataset demonstrate that our CKDST substantially improves the baseline by 1.2 BLEU on average in all translation directions, and outperforms previous state-of-the-art end-to-end and cascaded speech translation models.
CCSRD: Content-Centric Speech Representation Disentanglement Learning for End-to-End Speech Translation
Xiaohu Zhao | Haoran Sun | Yikun Lei | Shaolin Zhu | Deyi Xiong
Findings of the Association for Computational Linguistics: EMNLP 2023
Xiaohu Zhao | Haoran Sun | Yikun Lei | Shaolin Zhu | Deyi Xiong
Findings of the Association for Computational Linguistics: EMNLP 2023
Deep neural networks have demonstrated their capacity in extracting features from speech inputs. However, these features may include non-linguistic speech factors such as timbre and speaker identity, which are not directly related to translation. In this paper, we propose a content-centric speech representation disentanglement learning framework for speech translation, CCSRD, which decomposes speech representations into content representations and non-linguistic representations via representation disentanglement learning. CCSRD consists of a content encoder that encodes linguistic content information from the speech input, a non-content encoder that models non-linguistic speech features, and a disentanglement module that learns disentangled representations with a cyclic reconstructor, feature reconstructor and speaker classifier trained in a multi-task learning way. Experiments on the MuST-C benchmark dataset demonstrate that CCSRD achieves an average improvement of +0.9 BLEU in two settings across five translation directions over the baseline, outperforming state-of-the-art end-to-end speech translation models and cascaded models.
Towards a Deep Understanding of Multilingual End-to-End Speech Translation
Haoran Sun | Xiaohu Zhao | Yikun Lei | Shaolin Zhu | Deyi Xiong
Findings of the Association for Computational Linguistics: EMNLP 2023
Haoran Sun | Xiaohu Zhao | Yikun Lei | Shaolin Zhu | Deyi Xiong
Findings of the Association for Computational Linguistics: EMNLP 2023
In this paper, we employ Singular Value Canonical Correlation Analysis (SVCCA) to analyze representations learnt in a multilingual end-to-end speech translation model trained over 22 languages. SVCCA enables us to estimate representational similarity across languages and layers, enhancing our understanding of the functionality of multilingual speech translation and its potential connection to multilingual neural machine translation. The multilingual speech translation model is trained on the CoVoST 2 dataset in all possible directions, and we utilize LASER to extract parallel bitext data for SVCCA analysis. We derive three major findings from our analysis: (I) Linguistic similarity loses its efficacy in multilingual speech translation when the training data for a specific language is limited. (II) Enhanced encoder representations and well-aligned audio-text data significantly improve translation quality, surpassing the bilingual counterparts when the training data is not compromised. (III) The encoder representations of multilingual speech translation demonstrate superior performance in predicting phonetic features in linguistic typology prediction. With these findings, we propose that releasing the constraint of limited data for low-resource languages and subsequently combining them with linguistically related high-resource languages could offer a more effective approach for multilingual end-to-end speech translation.