Mingxu Chai
2026
Unveiling the Deficiencies of Pre-trained Text-and-Layout Models in Real-world Visually-rich Document Information Extraction
Chong Zhang | Yixi Zhao | Yulu Xie | Chenshu Yuan | Yi Tu | Ya Guo | Mingxu Chai | Ziyu Shen | Yue Zhang | Qi Zhang
Findings of the Association for Computational Linguistics: EACL 2026
Recently developed pre-trained text-and-layout models (PTLMs) have shown remarkable success in multiple information extraction tasks on visually-rich documents (VrDs). However, despite achieving extremely high performance on benchmarks, their real-world performance falls short of expectations. Motivated by this gap, we investigate the prevailing evaluation pipeline and find that: (1) inadequate annotations within benchmark datasets introduce spurious correlations between task inputs and labels, which lead to overly optimistic estimates of model performance; (2) the evaluation relies solely on benchmark performance and is insufficient to comprehensively explore the capabilities of methods in real-world scenarios. These problems prevent the prevailing evaluation pipeline from reflecting the real-world performance of methods and mislead design choices in method optimization. In this work, we introduce EC-FUNSD, an entity-centric dataset crafted for benchmarking information extraction from visually-rich documents. This dataset contains diverse layouts and high-quality annotations. Additionally, this dataset disentangles the falsely coupled segment and entity annotations that arise from the block-level annotation of FUNSD. Using the proposed dataset, we evaluate the real-world information extraction capabilities of PTLMs from multiple aspects, including their absolute performance as well as their generalization, robustness, and fairness. The results indicate that prevalent PTLMs do not perform as well as anticipated in real-world information extraction scenarios. We hope that our study can inspire reflection on the directions of PTLM development.
AgentV-RL: Scaling Reward Modeling with Agentic Verifier
Jiazheng Zhang | Ziche Fu | Zhiheng Xi | Wenqing Jing | Mingxu Chai | Wei He | Guoqiang Zhang | Chenghao Fan | Chenxin An | Wenxiang Chen | Zhicheng Liu | Haojie Pan | Dingwei Zhu | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2026
Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet, they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while a lack of external grounding makes verifiers unreliable on computation- or knowledge-intensive tasks. To address these challenges, we propose Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augmented deliberative process. We introduce complementary forward and backward agents: one traces solutions from premises to conclusions, while the other re-checks conclusions against their underlying premises. This bidirectional process enables a comprehensive, reliable, and interpretable assessment of solutions. To facilitate practical deployment, we propose AgentV-RL. Through proactive exploration and reinforcement learning, the verifier autonomously interleaves tool use with internal reasoning. Extensive experiments show that Agentic Verifier yields consistent performance gains under both parallel and sequential TTS. Notably, our 4B variant surpasses state-of-the-art ORMs by 25.2%, positioning it as a promising paradigm for agentic reward modeling.
LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
Ming Zhang | Yujiong Shen | Jingyi Deng | Yuhui Wang | Huayu Sha | Kexin Tan | Qiyuan Peng | Yue Zhang | Junzhe Wang | Shichun Liu | Yueyuan Huang | Jingqi Tong | Changhao Jiang | Yilong Wu | Zhihao Zhang | Mingqi Wu | Mingxu Chai | Zhiheng Xi | Shihan Dou | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-Fair, a framework for dynamic evaluation of LLMs. LLMEval-Fair is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. A 30-month longitudinal study of nearly 60 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-Fair offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.
VRPO: Rethinking Value Modeling for Robust RL under Noisy Supervision in LLM Post-Training
Dingwei Zhu | Shihan Dou | Zhiheng Xi | Senjie Jin | Guoqiang Zhang | Jiazheng Zhang | Junjie Ye | Mingxu Chai | Enyu Zhou | Ming Zhang | Yuhui Wang | Caishuang Huang | Chenhao Huang | Yunke Zhang | Yuran Wang | Tao Gui | Qi Zhang | Xipeng Qiu | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement Learning (RL) in real-world environments often suffers from ambiguous or incomplete reward supervision, which undermines policy stability and generalization. Such noise may cause models to ignore key information or even collapse in advantage estimation. We find that a strong value model is essential for absorbing unstable signals and producing reliable advantages, offering denser and more robust supervision than the reward model. To better optimize under noisy supervision, we propose VRPO, a framework that enhances value modeling for robust RL in LLM post-training. VRPO integrates (1) auxiliary losses guided by entropy and perplexity from a frozen language model, and (2) a variational information bottleneck, enabling the value model to filter noise and capture key words. This design allows the value model to correct noisy rewards and generate more reliable advantage estimates, transforming it from a passive predictor into an active noise regulator. Experiments on multi-turn dialogue, math reasoning, and science QA with both rule-based and model-based rewards show that VRPO consistently outperforms baselines such as PPO and GRPO. Our work highlights the central role of the value model in robust RL and provides a principled and practical approach to policy optimization under noisy supervision.
2025
Governance in Motion: Co-evolution of Constitutions and AI models for Scalable Safety
Chenhao Huang | Ziyu Shen | Yicong Ren | Huiyuan Zheng | Jiazheng Zhang | Mingxu Chai | Ming Zhang | Shihan Dou | Fan Mo | Jie Shi | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Aligning large language models (LLMs) with human preferences is a central challenge for building reliable AI systems. Most existing alignment approaches rely on static signals, such as predefined principles or offline human annotations, to guide model behavior toward a fixed approximation of human preferences. However, LLMs can exhibit distributional drift during training, and static alignment mechanisms lack the capacity to adaptively correct misaligned behaviors as they emerge. To address this limitation, we develop a two-stage framework that enables dynamic and continuous alignment. In the first stage, a constitution is continually revised based on observed model behaviors, and models are trained to comply with these evolving principles. In the second stage, this learned constitution is used to guide reinforcement learning, encouraging the model to align with the updated normative signals. We refer to this framework as COCOA: Co-evolution of Constitutions and AI Models. We show that COCOA enables a 7B model to greatly improve safety without human annotations, raising its StrongReject score from 0.741 to 0.935 and its Safe-RLHF accuracy from 77.76% to 90.64%, and reaching performance close to much larger state-of-the-art models.
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation
Ming Zhang | Yujiong Shen | Zelin Li | Huayu Sha | Binze Hu | Yuhui Wang | Chenhao Huang | Shichun Liu | Jingqi Tong | Changhao Jiang | Mingxu Chai | Zhiheng Xi | Shihan Dou | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2025
Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. Current medical benchmarks fall into three main types: medical exam-based, comprehensive medical, and specialized assessments. However, these benchmarks have limitations in question design (mostly multiple-choice), data sources (often not derived from real clinical scenarios), and evaluation methods (poor assessment of complex reasoning). To address these issues, we present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. We also design an automated evaluation pipeline, incorporating expert-developed checklists into our LLM-as-Judge framework. Furthermore, our methodology validates machine scoring through human-machine agreement analysis, dynamically refining checklists and prompts based on expert feedback to ensure reliability. We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective deployment of LLMs in medical domains.
DocFusion: A Unified Framework for Document Parsing Tasks
Mingxu Chai | Ziyu Shen | Chong Zhang | Yue Zhang | Xiao Wang | Shihan Dou | Jihua Kang | Jiazheng Zhang | Qi Zhang
Findings of the Association for Computational Linguistics: ACL 2025
Document parsing involves layout element detection and recognition, essential for extracting information. However, existing methods often employ multiple models for these tasks, leading to increased system complexity and maintenance overhead. While some models attempt to unify detection and recognition, they often fail to address the intrinsic differences in data representations, thereby limiting performance in document processing. Our research reveals that recognition relies on discrete tokens, whereas detection relies on continuous coordinates, leading to challenges in gradient updates and optimization. To bridge this gap, we propose the Gaussian-Kernel Cross-Entropy Loss (GK-CEL), enabling generative frameworks to handle both tasks simultaneously. Building upon GK-CEL, we propose DocFusion, a unified document parsing model with only 0.28B parameters. Additionally, we construct the DocLatex-1.6M dataset to provide high-quality training support. Experimental results show that DocFusion, equipped with GK-CEL, performs competitively across four core document parsing tasks, validating the effectiveness of our unified approach.
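The abstract above names GK-CEL but does not give its formula. As a rough illustration only of the general idea of bridging discrete tokens and continuous coordinates, the sketch below quantizes a coordinate into bins and replaces the one-hot target with a Gaussian-smoothed one, so that predicting a neighboring bin costs less than predicting a distant one. The function names, the bin count, and `sigma` are all assumptions made for this example and are not taken from the paper.

```python
import math

def gaussian_soft_targets(true_bin, num_bins, sigma=1.0):
    """Soft label distribution over coordinate bins, peaked at true_bin.

    Hypothetical helper: weights follow an unnormalized Gaussian kernel
    centered on the true bin and are then normalized to sum to 1.
    """
    weights = [math.exp(-((k - true_bin) ** 2) / (2 * sigma ** 2))
               for k in range(num_bins)]
    total = sum(weights)
    return [w / total for w in weights]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def soft_cross_entropy(logits, targets):
    """Cross-entropy between a soft target distribution and softmax(logits)."""
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(targets, log_probs))

# Toy comparison: the true coordinate falls in bin 5 of 32.
NUM_BINS = 32
targets = gaussian_soft_targets(true_bin=5, num_bins=NUM_BINS, sigma=1.0)

near_miss = [0.0] * NUM_BINS
near_miss[6] = 4.0   # model is confident in the adjacent bin
far_miss = [0.0] * NUM_BINS
far_miss[20] = 4.0   # model is confident in a distant bin

loss_near = soft_cross_entropy(near_miss, targets)
loss_far = soft_cross_entropy(far_miss, targets)
```

With a plain one-hot target both predictions would receive the same loss, since neither places extra mass on bin 5; with the smoothed target the near miss is penalized less, which is the kind of distance-aware signal a coordinate-prediction head would want.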
2024
Modeling Layout Reading Order as Ordering Relations for Visually-rich Document Understanding
Chong Zhang | Yi Tu | Yixi Zhao | Chenshu Yuan | Huan Chen | Yue Zhang | Mingxu Chai | Ya Guo | Huijia Zhu | Qi Zhang | Tao Gui
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Modeling and leveraging layout reading order in visually-rich documents (VrDs) is critical in document intelligence as it captures the rich structure semantics within documents. Previous works typically formulated layout reading order as a permutation of layout elements, i.e., a sequence containing all the layout elements. However, we argue that this formulation does not adequately convey the complete reading order information in the layout, which may lead to a performance decline in downstream tasks. To address this issue, we propose to model the layout reading order as ordering relations over the set of layout elements, which have sufficient expressive capability for the complete reading order information. To enable empirical evaluation of methods for the improved form of reading order prediction (ROP), we establish a comprehensive benchmark dataset including the reading order annotation as relations over layout elements, together with a relation-extraction-based method that outperforms previous models. Moreover, we propose a reading-order-relation-enhancing pipeline to improve model performance on arbitrary VrD tasks by introducing additional reading order relation inputs. We conduct comprehensive experiments to demonstrate that the pipeline generally benefits downstream VrD tasks: (1) by utilizing the reading order relation information, the enhanced downstream models achieve SOTA results on both task settings of the targeted dataset; (2) by utilizing the pseudo reading order information generated by the proposed ROP model, the performance of the enhanced models improves across all three models and eight cross-domain VrD-IE/QA task settings without targeted optimization.
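The contrast between a permutation and a set of ordering relations can be made concrete with a small sketch. The abstract does not specify how the relations are formalized, so the example below assumes simple (before, after) pairs, and the element names are invented for illustration. It shows that a relation set can leave two side-by-side elements unordered, while any single permutation is forced to rank one ahead of the other.

```python
from itertools import permutations

def relations_from_permutation(order):
    """All (before, after) pairs implied by one total order of elements."""
    return {(a, b) for i, a in enumerate(order) for b in order[i + 1:]}

def consistent_orders(elements, relations):
    """Every permutation of elements that respects all (before, after) pairs."""
    result = []
    for perm in permutations(elements):
        pos = {e: i for i, e in enumerate(perm)}
        if all(pos[a] < pos[b] for a, b in relations):
            result.append(perm)
    return result

# Hypothetical layout: a title, two independent side-by-side blocks, a footer.
elements = ["title", "col_a", "col_b", "footer"]
relations = {
    ("title", "col_a"),
    ("title", "col_b"),
    ("col_a", "footer"),
    ("col_b", "footer"),
}

# Two reading sequences satisfy the relations; neither is "the" order.
orders = consistent_orders(elements, relations)

# A single permutation additionally asserts col_a before col_b,
# a constraint the relation set never stated.
implied = relations_from_permutation(["title", "col_a", "col_b", "footer"])
extra = implied - relations
```

Here `orders` contains both `title, col_a, col_b, footer` and `title, col_b, col_a, footer`, and `extra` holds the pairs that the permutation adds on top of the stated relations, including `("col_a", "col_b")`.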
Co-authors
- Tao Gui 6
- Shihan Dou 5
- Xuan-Jing Huang (黄萱菁) 5
- Qi Zhang 5
- Zhiheng Xi 4
- Jiazheng Zhang 4
- Chenhao Huang 3
- Ziyu Shen 3
- Ming Zhang 3
- Qi Zhang 3
- Ya Guo 2
- Changhao Jiang 2
- Shichun Liu 2
- Huayu Sha 2
- Yujiong Shen 2
- Jingqi Tong 2
- Yi Tu 2
- Yuhui Wang 2
- Chenshu Yuan 2
- Chong Zhang 2
- Yue Zhang 2
- Guoqiang Zhang 2
- Yue Zhang 2
- Yixi Zhao 2
- Dingwei Zhu 2
- Chenxin An 1
- Wenxiang Chen 1
- Huan Chen 1
- Jingyi Deng 1
- Chenghao Fan 1
- Ziche Fu 1
- Wei He 1
- Binze Hu 1
- Yueyuan Huang 1
- Caishuang Huang 1
- Senjie Jin 1
- Wenqing Jing 1
- Jihua Kang 1
- Zelin Li 1
- Zhicheng Liu 1
- Fan Mo 1
- Haojie Pan 1
- Qiyuan Peng 1
- Xipeng Qiu (邱锡鹏) 1
- Yicong Ren 1
- Jie Shi 1
- Kexin Tan 1
- Yuhui Wang 1
- Junzhe Wang 1
- Yuran Wang 1
- Xiao Wang 1
- Yilong Wu 1
- Mingqi Wu 1
- Yulu Xie 1
- Junjie Ye (叶俊杰) 1
- Ming Zhang 1
- Zhihao Zhang 1
- Chong Zhang 1
- Yunke Zhang 1
- Huiyuan Zheng 1
- Enyu Zhou 1
- Huijia Zhu 1