Bo Zeng
2026
Beyond Black-Box Interventions: Latent Probing for Faithful Retrieval-Augmented Generation
Linfeng Gao | Qinggang Zhang | Baolong Bi | Bo Zeng | Zheng Yuan | Zerui Chen | Zhimin Wei | Shenghua Liu | Linlong Xu | Longyue Wang | Weihua Luo | Jinsong Su
Findings of the Association for Computational Linguistics: ACL 2026
Retrieval-Augmented Generation (RAG) systems often fail to maintain contextual faithfulness, generating responses that conflict with the provided context. Existing methods attempt to improve faithfulness through external interventions, such as specialized prompting, decoding-based calibration, or preference optimization. However, since these approaches treat the LLM as a black box, they lack a reliable mechanism to assess how these conflicts occur. Consequently, they tend to be brittle, data-intensive, and agnostic to the model’s internal reasoning process. In this paper, we move beyond black-box interventions to analyze the model’s internal reasoning process. We discover that conflicting and aligned knowledge states are linearly separable in the model’s latent space, and contextual noise systematically increases the entropy of these representations. Based on these findings, we propose ProbeRAG, a novel framework for faithful RAG that operates in three stages: (i) fine-grained knowledge pruning to filter irrelevant context, (ii) latent conflict probing to identify hard conflicts in the model’s latent space, and (iii) conflict-aware attention to modulate attention heads toward faithful context integration. Extensive experiments demonstrate that ProbeRAG substantially improves both accuracy and contextual faithfulness. The related resources are available at https://github.com/XMUDeepLIT/ProbeRAG.
M2PO: Multi-Perspective Multi-Pair Preference Optimization for Machine Translation
Hao Wang | Linlong Xu | Heng Liu | Yangyang Liu | Xiaohu Zhao | Bo Zeng | Liangying Shao | Yichen Dong | Xinwei Wu | Jiang Zhou | Tianyu Dong | Xiangxiang Zeng | Longyue Wang | Weihua Luo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Aligning Large Language Models (LLMs) to human preferences is pivotal for Machine Translation (MT), yet current approaches are often hindered by misleading reward signals. Our analysis reveals that prevailing Quality Estimation (QE) models exhibit a systematic blind spot towards **partial errors**—specifically partial hallucinations and omissions—often favoring superficially fluent but unfaithful translations. To address this, we propose **M2PO** (**M**ulti-Perspective **M**ulti-Pair **P**reference **O**ptimization), a data-centric framework for preference optimization in machine translation. First, to correct the bias towards fluency, M2PO uses a multi-perspective alignment mechanism that decouples semantic fidelity from fluency, prioritizing faithfulness via a curriculum strategy. Second, with the bias corrected, partial errors fall between perfect and severely incorrect translations, making them inefficient to learn via standard best-versus-worst comparisons. We thus introduce a multi-pair objective that leverages the full candidate list to capture these fine-grained error signals. Experiments on WMT23, WMT24, and FLORES-200 show that M2PO enables a 9B model to outperform leading open-source baselines and achieve parity with proprietary models like GPT-4o and Gemini-2.0-Flash, demonstrating significant potential for efficient, high-fidelity LLM-based translation.
2025
Marco-Bench-MIF: On Multilingual Instruction-Following Capability of Large Language
Bo Zeng | Chenyang Lyu | Sinuo Liu | Mingyan Zeng | Minghao Wu | Xuanfan Ni | Tianqi Shi | Yu Zhao | Yefeng Liu | Chenyu Zhu | Ruizhe Li | Jiahui Geng | Qing Li | Yu Tong | Longyue Wang | Weihua Luo | Kaifu Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Instruction following has become a core capability on which Large Language Models are evaluated. However, existing datasets, such as IFEval, are either predominantly monolingual and centered on English or simply machine-translated into other languages, limiting their applicability in multilingual contexts. In this paper, we present a carefully curated extension of IFEval to a localized multilingual version named Marco-Bench-MIF, covering 30 languages with varying levels of localization. Our benchmark addresses linguistic constraints (e.g., modifying capitalization requirements for Chinese) and cultural references (e.g., substituting region-specific company names in prompts) via a hybrid pipeline combining translation with verification. Through a comprehensive evaluation of 20+ LLMs on Marco-Bench-MIF, we found that: (1) there is a 25-35% accuracy gap between high- and low-resource languages, (2) model scale largely impacts performance, by 45-60%, yet script-specific challenges persist, and (3) machine-translated data underestimates accuracy by 7-22% versus localized data. Our analysis identifies challenges in multilingual instruction following, including keyword consistency preservation and compositional constraint adherence across languages. Marco-Bench-MIF will be made publicly available to the community.
Marco-o1 v2: Towards Widening The Distillation Bottleneck for Reasoning Models
Huifeng Yin | Yu Zhao | Minghao Wu | Xuanfan Ni | Bo Zeng | Hao Wang | Tianqi Shi | Liangying Shao | Chenyang Lyu | Longyue Wang | Weihua Luo | Kaifu Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown remarkable reasoning capabilities by scaling test-time compute and generating long Chain-of-Thought (CoT). Distillation post-training on LRM-generated data is a straightforward yet effective method to enhance the reasoning abilities of smaller models, but it faces a critical bottleneck: we found that distilled long CoT data is difficult for small models to learn from and leads to the inheritance of biases (i.e., formalistic long-time thinking) when using Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) methods. To alleviate this bottleneck, we propose constructing data from scratch using Monte Carlo Tree Search (MCTS). We then exploit a set of CoT-aware approaches, including Thoughts Length Balance, Fine-grained DPO, and Joint Post-training Objective, to enhance SFT and RL on the MCTS data. We evaluated on various benchmarks covering math (GSM8K, MATH, AIME), instruction following (Multi-IF), and planning (Blocksworld). The results demonstrate that our CoT-aware approaches substantially improve the reasoning performance of distilled models compared to standard distilled models by reducing hallucinations in long-time thinking.
Marco Large Translation Model at WMT2025: Transforming Translation Capability in LLMs via Quality-Aware Training and Decoding
Hao Wang | Linlong Xu | Heng Liu | Yangyang Liu | Xiaohu Zhao | Bo Zeng | Longyue Wang | Weihua Luo | Kaifu Zhang
Proceedings of the Tenth Conference on Machine Translation
This paper presents the Marco-MT-Algharb system, our submission to the WMT2025 General Machine Translation Shared Task from Alibaba International Digital Commerce (AIDC). Built on a large language model (LLM) foundation, the system’s strong performance stems from novel quality-aware training and decoding techniques: (1) a two-step supervised fine-tuning (SFT) process incorporating data distillation, (2) a two-step reinforcement learning (RL) framework for preference alignment, and (3) a hybrid decoding strategy that integrates word alignment with Minimum Bayes Risk (MBR) re-ranking to improve faithfulness. These approaches jointly ensure high accuracy and robustness across diverse languages and domains. In the official human evaluation, our system secured five first-place finishes, one second-place finish, and four third-place finishes in the constrained category across the 13 directions we participated in. Notably, for the English-Chinese direction, our results surpassed all open- and closed-source systems.
2021
PINGAN Omini-Sinitic at SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning
Ye Wang | Yanmeng Wang | Haijun Zhu | Bo Zeng | Zhenghong Hao | Shaojun Wang | Jing Xiao
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)
This paper describes the winning system for subtask 2 and the second-placed system for subtask 1 in SemEval 2021 Task 4: Reading Comprehension of Abstract Meaning. We propose using a pre-trained Electra discriminator to choose the best abstract word from five candidates. An upper attention and auto-denoising mechanism is introduced to process long sequences. The experimental results demonstrate that this contribution greatly facilitates contextual language modeling in the reading comprehension task. An ablation study is also conducted to show the validity of our proposed methods.
Co-authors
- Weihua Luo 5
- Longyue Wang 5
- Linlong Xu 3
- Kaifu Zhang 3
- Heng Liu 2
- Yangyang Liu 2
- Chenyang Lyu 2
- Xuanfan Ni (倪宣凡) 2
- Liangying Shao 2
- Tianqi Shi 2
- Hao Wang 2
- Minghao Wu 2
- Yu Zhao 2
- Xiaohu Zhao 2
- Baolong Bi 1
- Zerui Chen 1
- Yichen Dong 1
- Tianyu Dong 1
- Linfeng Gao 1
- Jiahui Geng 1
- Zhenghong Hao 1
- Ruizhe Li 1
- Qing Li 1
- Sinuo Liu 1
- Yefeng Liu 1
- Shenghua Liu 1
- Jinsong Su 1
- Yu Tong 1
- Hao Wang 1
- Ye Wang 1
- Yanmeng Wang 1
- Shaojun Wang 1
- Zhimin Wei 1
- Xinwei Wu 1
- Jing Xiao 1
- Huifeng Yin 1
- Zheng Yuan 1
- Mingyan Zeng 1
- Xiangxiang Zeng 1
- Qinggang Zhang 1
- Jiang Zhou 1
- Chenyu Zhu 1
- Haijun Zhu 1