Kaifu Zhang
2026
USB: A COMPREHENSIVE AND UNIFIED SAFETY EVALUATION BENCHMARK FOR MULTIMODAL LARGE LANGUAGE MODELS
Baolin Zheng | Guanlin Chen | Qingyang Teng | Hongqiong Zhong | Yingshui Tan | Zhendong Liu | Weixun Wang | Jiaheng Liu | Jian Yang | Huiyun Jing | Jincheng Wei | Wenbo Su | Xiaoyong Zhu | Bo Zheng | Kaifu Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite their rapid advancement, Multimodal Large Language Models (MLLMs) remain vulnerable to diverse safety risks. Current benchmarks fail to provide reliable assessments due to limited risk coverage, insufficient scale, and the oversight of complex modality combinations (e.g., cross-modal risks). To address this, we introduce the Unified Safety Benchmark (USB), a comprehensive framework covering 61 risk categories across four distinct modality interactions. We first demonstrate that existing benchmarks—even when aggregated—leave significant coverage gaps. To bridge this, we design a sophisticated data synthesis pipeline that generates complementary data, ensuring balanced coverage across all risk dimensions. Furthermore, beyond evaluating vulnerability to harmful queries, USB incorporates the simultaneous assessment of model over-refusal on benign inputs as an integrated diagnostic suite. Experimental results, evaluating 22 MLLMs across 244 risk-modality intersections, demonstrate that existing MLLMs still struggle with the trade-off between avoiding vulnerabilities and over-refusal. Models are particularly vulnerable to image-only or cross-modal risky inputs, highlighting the persistent need for refined safety mechanisms. Warning: This paper contains unfiltered and potentially harmful content that may be offensive.
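The evaluation grid described in the abstract (61 risk categories times 4 modality interactions, i.e. 244 cells) can be pictured with a short tallying sketch. This is only an illustration, not the benchmark's actual code: the record fields (`category`, `modality`, `harmful`, `refused`) and the metric names `vulnerability` and `over_refusal` are hypothetical, and the abstract does not state how the paper defines its scores.

```python
from collections import defaultdict


def rates_by_cell(records):
    """Aggregate per (category, modality) cell.

    Each record is a dict with hypothetical keys:
      category (str), modality (str), harmful (bool), refused (bool).
    """
    tally = defaultdict(lambda: {"harmful": 0, "complied": 0, "benign": 0, "refused": 0})
    for r in records:
        cell = tally[(r["category"], r["modality"])]
        if r["harmful"]:
            # A harmful query that was NOT refused counts as a vulnerability.
            cell["harmful"] += 1
            cell["complied"] += int(not r["refused"])
        else:
            # A benign query that WAS refused counts as over-refusal.
            cell["benign"] += 1
            cell["refused"] += int(r["refused"])
    result = {}
    for key, c in tally.items():
        result[key] = {
            "vulnerability": c["complied"] / c["harmful"] if c["harmful"] else None,
            "over_refusal": c["refused"] / c["benign"] if c["benign"] else None,
        }
    return result


# Toy records for a single cell; labels are made up for illustration.
toy = [
    {"category": "c1", "modality": "image-only", "harmful": True, "refused": True},
    {"category": "c1", "modality": "image-only", "harmful": True, "refused": False},
    {"category": "c1", "modality": "image-only", "harmful": False, "refused": True},
    {"category": "c1", "modality": "image-only", "harmful": False, "refused": False},
]
cell_rates = rates_by_cell(toy)
```

Reporting both rates per cell makes the trade-off visible: a model that refuses everything scores zero on vulnerability but one on over-refusal.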
Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding
Sensen Gao | Shanshan Zhao | Xu Jiang | Lunhao Duan | Yong Xien Chng | Qing-Guo Chen | Weihua Luo | Kaifu Zhang | Jia-Wang Bian | Mingming Gong
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations: the former loses structural detail, while the latter struggles with context modeling. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents’ multimodal nature, i.e., combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG. This approach enables holistic retrieval and reasoning across all modalities, unlocking comprehensive document intelligence. Recognizing its importance, this paper presents a systematic survey of Multimodal RAG for document understanding. We propose a taxonomy based on domain, retrieval modality, and granularity, and review advances involving graph structures and agentic frameworks. We also summarize key datasets, benchmarks, and applications, and highlight open challenges in efficiency, fine-grained representation, and robustness, providing a roadmap for future progress in document AI.
MirrorCAPTCHA: Wild CAPTCHA, Wild Distribution, Wild Web-based Platform Meet Multimodal LLM Agents
Xiangyu Wu | Yuwei Hu | Tianyu Cui | Yueying Tian | Qing-Guo Chen | Zhao Xu | Weihua Luo | Kaifu Zhang | Yang Yang | Jianfeng Lu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The path to fully autonomous web agents is currently hindered by a critical bottleneck: their limited ability to handle CAPTCHA. Existing agent benchmarks largely ignore this practical challenge, failing to evaluate an agent’s real-world capacity to solve CAPTCHA. To bridge this gap, we conduct a comprehensive analysis of real-world CAPTCHA distributions and introduce MirrorCAPTCHA, a benchmark annotated with Weighted Pass Rate and a newly proposed metric, Completion Degree. MirrorCAPTCHA is designed to serve as a “mirror” that faithfully reflects the automation capabilities of agents in real scenarios. We filter 2095 websites from Common Crawl, identify the CAPTCHA deployed on these sites, and cluster them into 18 distinct categories using the K-means algorithm. To ensure practicality, we extract a web subgraph from Common Crawl covering these websites and use random walks to simulate real-world CAPTCHA encounter frequencies, yielding a realistic measure of agents’ ability. Additionally, we develop a lightweight synthetic data pipeline to train Ovis2-Agent-CAPTCHA-8B, which significantly outperforms current state-of-the-art closed-source models on MirrorCAPTCHA, achieving a 9.4% higher average Weighted Pass Rate and a 2.13% higher average Completion Degree than the runner-up, Gemini-2.5-Pro.
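A rough sketch of the random-walk idea follows: walk a link graph, count which CAPTCHA category each visited site deploys, and use the normalized counts to weight per-category pass rates. Everything here is an assumption for illustration: the toy graph, the category labels, the restart rule at dead ends, and the reading of Weighted Pass Rate as a frequency-weighted average are not taken from the paper, whose exact definitions the abstract does not give.

```python
import random
from collections import Counter


def estimate_encounter_weights(graph, site_captcha, steps=10_000, seed=0):
    """Estimate how often each CAPTCHA category is met on a random walk.

    graph: dict mapping a site to the list of sites it links to (hypothetical).
    site_captcha: dict mapping a site to its CAPTCHA category, if any.
    """
    rng = random.Random(seed)
    nodes = sorted(graph)
    current = rng.choice(nodes)
    counts = Counter()
    for _ in range(steps):
        captcha = site_captcha.get(current)
        if captcha is not None:
            counts[captcha] += 1
        neighbors = graph.get(current, [])
        # Restart from a random site at dead ends so the walk never stalls.
        current = rng.choice(neighbors) if neighbors else rng.choice(nodes)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


def weighted_pass_rate(weights, pass_rates):
    """Frequency-weighted average of per-category pass rates (assumed definition)."""
    return sum(w * pass_rates.get(c, 0.0) for c, w in weights.items())


# Toy web subgraph; site "c" deploys no CAPTCHA.
toy_graph = {"a": ["b", "c"], "b": ["a"], "c": ["a", "b"]}
toy_captcha = {"a": "text", "b": "slider"}
weights = estimate_encounter_weights(toy_graph, toy_captcha, steps=2_000)
score = weighted_pass_rate(weights, {"text": 0.9, "slider": 0.4})
```

Weighting by encounter frequency means a category that agents meet often contributes more to the score than a rare one, which is the sense in which the benchmark aims to mirror real browsing.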
2025
Marco Large Translation Model at WMT2025: Transforming Translation Capability in LLMs via Quality-Aware Training and Decoding
Hao Wang | Linlong Xu | Heng Liu | Yangyang Liu | Xiaohu Zhao | Bo Zeng | Longyue Wang | Weihua Luo | Kaifu Zhang
Proceedings of the Tenth Conference on Machine Translation
This paper presents the Marco-MT-Algharb system, our submission to the WMT2025 General Machine Translation Shared Task from Alibaba International Digital Commerce (AIDC). Built on a large language model (LLM) foundation, the system’s strong performance stems from novel quality-aware training and decoding techniques: (1) a two-step supervised fine-tuning (SFT) process incorporating data distillation, (2) a two-step reinforcement learning (RL) framework for preference alignment, and (3) a hybrid decoding strategy that integrates word alignment with Minimum Bayes Risk (MBR) re-ranking to improve faithfulness. These approaches jointly ensure high accuracy and robustness across diverse languages and domains. In the official human evaluation, our system secured five first‐place finishes, one second, and four third‐place results in the constrained category across the 13 directions we participated in. Notably, for English-Chinese, our results surpassed all open/closed‐source systems.
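For readers unfamiliar with MBR re-ranking, a minimal sketch is given below: among several candidate translations, pick the one with the highest average utility against the other candidates. The utility here is a simple token-overlap F1 chosen only to keep the example self-contained; the abstract does not say which utility the system uses, and the word-alignment component of the hybrid strategy is not modeled. The function names are hypothetical.

```python
from collections import Counter


def overlap_f1(hyp, ref):
    """Token-overlap F1 between two whitespace-tokenized strings (stand-in utility)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    common = sum((h & r).values())
    if common == 0:
        return 0.0
    precision = common / sum(h.values())
    recall = common / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility=overlap_f1):
    """Return the candidate with the highest mean utility against the others."""
    if not candidates:
        raise ValueError("need at least one candidate")

    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            return 0.0
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


candidates = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog ran away",
]
choice = mbr_select(candidates)
```

The selected candidate is the one that agrees most with the rest of the pool, so an outlier hypothesis is unlikely to be chosen even if the model scored it highly.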
(Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts
Minghao Wu | Jiahao Xu | Yulin Yuan | Gholamreza Haffari | Longyue Wan | Weihua Luo | Kaifu Zhang
Transactions of the Association for Computational Linguistics, Volume 13
Literary translation remains one of the most challenging frontiers in machine translation due to the complexity of capturing figurative language, cultural nuances, and unique stylistic elements. In this work, we introduce TransAgents, a novel multi-agent framework that simulates the roles and collaborative practices of a human translation company, including a CEO, Senior Editor, Junior Editor, Translator, Localization Specialist, and Proofreader. The translation process is divided into two stages: a preparation stage where the team is assembled and comprehensive translation guidelines are drafted, and an execution stage that involves sequential translation, localization, proofreading, and a final quality check. Furthermore, we propose two innovative evaluation strategies: Monolingual Human Preference (MHP), which evaluates translations based solely on target language quality and cultural appropriateness, and Bilingual LLM Preference (BLP), which leverages large language models like gpt-4 for direct text comparison. Although TransAgents achieves lower d-BLEU scores, due to the limited diversity of references, its translations are significantly better than those of other baselines and are preferred by both human evaluators and LLMs over traditional human references and gpt-4 translations. Our findings highlight the potential of multi-agent collaboration in enhancing translation quality, particularly for longer texts.
LayAlign: Enhancing Multilingual Reasoning in Large Language Models via Layer-Wise Adaptive Fusion and Alignment Strategy
Zhiwen Ruan | Yixia Li | He Zhu | Longyue Wang | Weihua Luo | Kaifu Zhang | Yun Chen | Guanhua Chen
Findings of the Association for Computational Linguistics: NAACL 2025
Despite being pretrained on multilingual corpora, large language models (LLMs) exhibit suboptimal performance on low-resource languages. Recent approaches have leveraged multilingual encoders alongside LLMs by introducing trainable parameters connecting the two models. However, these methods typically focus on the encoder’s output, overlooking valuable information from other layers. We propose Layer-Wise Adaptive Fusion and Alignment Strategy (LayAlign), a framework that integrates representations from all encoder layers, coupled with the adaptive fusion-enhanced attention mechanism to enable layer-wise interaction between the LLM and the multilingual encoder. Extensive experiments on multilingual reasoning tasks, along with analyses of learned representations, show that our approach consistently outperforms existing baselines.
Chinese SafetyQA: A Safety Short-form Factuality Benchmark for Large Language Models
Yingshui Tan | Boren Zheng | Baihui Zheng | Kerui Cao | Huiyun Jing | Jincheng Wei | Jiaheng Liu | Yancheng He | Wenbo Su | Xiaoyong Zhu | Bo Zheng | Kaifu Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
With the rapid advancement of Large Language Models (LLMs), significant safety concerns have emerged. Fundamentally, the safety of large language models is closely linked to the accuracy, comprehensiveness, and clarity of their understanding of safety knowledge, particularly in domains such as law, policy, and ethics. This factuality ability is crucial in determining whether these models can be deployed and applied safely and compliantly within specific regions. To address these challenges and better evaluate the factuality ability of LLMs to answer short questions, we introduce the Chinese SafetyQA benchmark. Chinese SafetyQA has several properties (i.e., Chinese, Diverse, High-quality, Static, Easy-to-evaluate, Safety-related, Harmless). Based on Chinese SafetyQA, we perform a comprehensive evaluation of the factuality abilities of existing LLMs and analyze how these capabilities relate to other LLM abilities, e.g., RAG ability and robustness against attacks.
A Unified Agentic Framework for Evaluating Conditional Image Generation
Jifang Wang | Xue Yang | Longyue Wang | Zhenran Xu | Yiyu Wang | Yaowei Wang | Weihua Luo | Kaifu Zhang | Baotian Hu | Min Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Conditional image generation has gained significant attention for its ability to personalize content. However, the field faces challenges in developing task-agnostic, reliable, and explainable evaluation metrics. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks. CIGEval utilizes large multimodal models (LMMs) as its core, integrating a multi-functional toolbox and establishing a fine-grained evaluation framework. Additionally, we synthesize evaluation trajectories for fine-tuning, empowering smaller LMMs to autonomously select appropriate tools and conduct nuanced analyses based on tool outputs. Experiments across seven prominent conditional image generation tasks demonstrate that CIGEval (GPT-4o version) achieves a high correlation of 0.4625 with human assessments, closely matching the inter-annotator correlation of 0.47. Notably, when implemented with 7B open-source LMMs using only 2.3K training trajectories, CIGEval surpasses the previous GPT-4o-based state-of-the-art method. These findings indicate that CIGEval holds great potential for automating evaluation of image generation tasks while maintaining human-level reliability.
Marco-Bench-MIF: On Multilingual Instruction-Following Capability of Large Language
Bo Zeng | Chenyang Lyu | Sinuo Liu | Mingyan Zeng | Minghao Wu | Xuanfan Ni | Tianqi Shi | Yu Zhao | Yefeng Liu | Chenyu Zhu | Ruizhe Li | Jiahui Geng | Qing Li | Yu Tong | Longyue Wang | Weihua Luo | Kaifu Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Instruction-following capability has become a major ability to be evaluated for Large Language Models. However, existing datasets, such as IFEval, are either predominantly monolingual and centered on English or simply machine translated to other languages, limiting their applicability in multilingual contexts. In this paper, we present a carefully curated extension of IFEval to a localized multilingual version named Marco-Bench-MIF, covering 30 languages with varying levels of localization. Our benchmark addresses linguistic constraints (e.g., modifying capitalization requirements for Chinese) and cultural references (e.g., substituting region-specific company names in prompts) via a hybrid pipeline combining translation with verification. Through comprehensive evaluation of 20+ LLMs on Marco-Bench-MIF, we find that: (1) there is a 25-35% accuracy gap between high- and low-resource languages, (2) model scale largely impacts performance, by 45-60%, yet script-specific challenges persist, and (3) machine-translated data underestimates accuracy by 7-22% versus localized data. Our analysis identifies challenges in multilingual instruction following, including keyword consistency preservation and compositional constraint adherence across languages. Our Marco-Bench-MIF will be made publicly available to the community.
Marco-o1 v2: Towards Widening The Distillation Bottleneck for Reasoning Models
Huifeng Yin | Yu Zhao | Minghao Wu | Xuanfan Ni | Bo Zeng | Hao Wang | Tianqi Shi | Liangying Shao | Chenyang Lyu | Longyue Wang | Weihua Luo | Kaifu Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown remarkable reasoning capabilities by scaling test-time compute and generating long Chain-of-Thought (CoT). Distillation post-training on LRM-generated data is a straightforward yet effective method to enhance the reasoning abilities of smaller models, but faces a critical bottleneck: we found that distilled long CoT data poses learning difficulty for small models and leads to the inheritance of biases (i.e., formalistic long-time thinking) when using Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) methods. To alleviate this bottleneck, we propose constructing data from scratch using Monte Carlo Tree Search (MCTS). We then exploit a set of CoT-aware approaches, including Thoughts Length Balance, Fine-grained DPO, and Joint Post-training Objective, to enhance SFT and RL on the MCTS data. We conduct evaluations on various benchmarks covering math (GSM8K, MATH, AIME), instruction following (Multi-IF), and planning (Blocksworld). Results demonstrate that our CoT-aware approaches substantially improve the reasoning performance of distilled models compared to standard distilled models by reducing hallucinations in long-time thinking.
ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development
Zhenran Xu | Xue Yang | Yiyu Wang | Qingli Hu | Zijiao Wu | Longyue Wang | Weihua Luo | Kaifu Zhang | Baotian Hu | Min Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
We introduce **ComfyUI-Copilot**, a large language model-powered plugin designed to enhance the usability and efficiency of ComfyUI, an open-source platform for AI-driven art creation. Despite its flexibility and user-friendly interface, ComfyUI can present challenges to newcomers, including limited documentation, model misconfigurations, and the complexity of workflow design. ComfyUI-Copilot addresses these challenges by offering intelligent node and model recommendations, along with automated one-click workflow construction. At its core, the system employs a hierarchical multi-agent framework comprising a central assistant agent for task delegation and specialized worker agents for different usages, supported by our curated ComfyUI knowledge bases to streamline debugging and deployment. We validate the effectiveness of ComfyUI-Copilot through both offline quantitative evaluations and online user feedback, showing that it accurately recommends nodes and accelerates workflow development. Additionally, use cases illustrate that ComfyUI-Copilot lowers entry barriers for beginners and enhances workflow efficiency for experienced users. The ComfyUI-Copilot installation package and a demo video are available at https://github.com/AIDC-AI/ComfyUI-Copilot.
Co-authors
- Weihua Luo 9
- Longyue Wang 6
- Minghao Wu 3
- Bo Zeng 3
- Qing-Guo Chen 2
- Baotian Hu 2
- Huiyun Jing 2
- Jiaheng Liu 2
- Chenyang Lyu 2
- Xuanfan Ni 2
- Tianqi Shi 2
- Wenbo Su 2
- Yingshui Tan 2
- Hao Wang 2
- Yiyu Wang 2
- Jincheng Wei 2
- Zhenran Xu 2
- Xue Yang 2
- Yu Zhao 2
- Bo Zheng 2
- Xiaoyong Zhu 2
- Jia-Wang Bian 1
- Kerui Cao 1
- Guanhua Chen 1
- Guanlin Chen 1
- Yun Chen 1
- Yong Xien Chng 1
- Tianyu Cui 1
- Lunhao Duan 1
- Sensen Gao 1
- Jiahui Geng 1
- Mingming Gong 1
- Gholamreza Haffari 1
- Yancheng He 1
- Qingli Hu 1
- Yuwei Hu 1
- Xu Jiang 1
- Qing Li 1
- Ruizhe Li 1
- Yixia Li 1
- Heng Liu 1
- Sinuo Liu 1
- Yangyang Liu 1
- Yefeng Liu 1
- Zhendong Liu 1
- Jianfeng Lu 1
- Zhiwen Ruan 1
- Liangying Shao 1
- Qingyang Teng 1
- Yueying Tian 1
- Yu Tong 1
- Longyue Wan 1
- Jifang Wang 1
- Weixun Wang 1
- Yaowei Wang 1
- Xiangyu Wu 1
- Zijiao Wu 1
- Jiahao Xu 1
- Linlong Xu 1
- Zhao Xu 1
- Jian Yang 1
- Yang Yang 1
- Huifeng Yin 1
- Yulin Yuan 1
- Mingyan Zeng 1
- Min Zhang 1
- Min Zhang 1
- Shanshan Zhao 1
- Xiaohu Zhao 1
- Baihui Zheng 1
- Baolin Zheng 1
- Boren Zheng 1
- Hongqiong Zhong 1
- Chenyu Zhu 1
- He Zhu 1