Qiming Peng
2026
Beyond Ranking: Fine-Grained Diagnostics and Self-Improvement for MLLMs
Mingze Xu | Zijing Zhao | Qiming Peng | Houwen Peng | Han Hu | Zhanhui Kang | Yuxing Han
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While Multimodal Large Language Models (MLLMs) are advancing rapidly, accurately evaluating their capabilities remains challenging. Current paradigms primarily rely on holistic scoring and static leaderboards, which fail to disentangle fine-grained competencies. Specifically, they suffer from “Outcome Bias” by validating only final answers and ignoring intermediate reasoning. To address these limitations, we introduce ATOM (AnaTomy Of MLLM), a novel MLLM-as-a-judge framework designed to shift the focus from ranking to fine-grained diagnosis. ATOM decomposes complex reasoning into atomic criteria anchored in visual elements, enforcing verification against explicit visual facts. Validated on a newly constructed benchmark with rigorous human rankings, ATOM achieves state-of-the-art accuracy, surpassing the strongest baseline by up to 7.92%. Moving beyond ranking, ATOM bridges the gap between assessment and alignment: by pinpointing atomic-level failures, it establishes a closed-loop mechanism for targeted self-correction. This approach enables models to identify and rectify errors autonomously, successfully resolving up to 39.95% of previously failed queries without human intervention.
2022
ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding
Qiming Peng | Yinxu Pan | Wenjin Wang | Bin Luo | Zhenyu Zhang | Zhengjie Huang | Yuhui Cao | Weichong Yin | Yongfeng Chen | Yin Zhang | Shikun Feng | Yu Sun | Hao Tian | Hua Wu | Haifeng Wang
Findings of the Association for Computational Linguistics: EMNLP 2022
Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performance. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting a new state of the art on key information extraction, document image classification, and document question answering datasets. The code and models are publicly available at PaddleNLP.