Xiang Bai
2026
Doc-V*: Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA
Yuanlei Zheng | Pei Fu | Hang Li | Ziyang Wang | Yuyi Zhang | Wenyu Ruan | Xiaojin Zhang | Zhongyu Wei | Zhenbo Luo | Jian Luan | Wei Chen | Xiang Bai
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multi-page Document Visual Question Answering requires reasoning over semantics, layouts, and visual elements in long, visually dense documents. Existing OCR-free methods face a trade-off between capacity and precision: end-to-end models scale poorly with document length, while visual retrieval-based pipelines are brittle and passive. We propose Doc-V*, an OCR-free agentic framework that casts multi-page DocVQA as sequential evidence aggregation. Doc-V* begins with a thumbnail overview, then actively navigates via semantic retrieval and targeted page fetching, and aggregates evidence in a structured working memory for grounded reasoning. Trained by imitation learning from expert trajectories and further optimized with Group Relative Policy Optimization, Doc-V* balances answer accuracy with evidence-seeking efficiency. Across five benchmarks, Doc-V* outperforms open-source baselines and approaches proprietary models, improving out-of-domain performance by up to 47.9% over the RAG baseline. Further results indicate that the gains come from effective evidence aggregation with selective attention, not from increasing the number of input pages.
I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing
Jinghan Yu | Junhao Xiao | Chenyu Zhu | Jiaming Li | Jia Li | HanMing Deng | Xirui Wang | Guoli Jia | Jianjun Li | Xiang Bai | Bowen Zhou | Zhiyuan Ma
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Existing text-guided image editing methods primarily rely on an end-to-end pixel-level inpainting paradigm. Despite its success in simple scenarios, this paradigm still struggles significantly with compositional editing tasks that require precise local control and complex multi-object spatial reasoning. It is severely limited by 1) the implicit coupling of planning and execution, 2) the lack of object-level control granularity, and 3) the reliance on unstructured, pixel-centric modeling. To address these limitations, we propose I2E, a novel “Decompose-then-Action” paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers and then introduces a physics-aware Vision-Language-Action Agent to parse complex instructions into a series of atomic actions via Chain-of-Thought reasoning. We also construct I2E-Bench, a benchmark designed for multi-instance spatial reasoning and high-precision editing. Experimental results on I2E-Bench and multiple public benchmarks demonstrate that I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
2025
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering
Jingqun Tang | Qi Liu | Yongjie Ye | Jinghui Lu | Shu Wei | An-Lan Wang | Chunhui Lin | Hao Feng | Zhen Zhao | Yanjie Wang | Yuliang Liu | Hao Liu | Xiang Bai | Can Huang
Findings of the Association for Computational Linguistics: ACL 2025
Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy to evaluate AI models in the domain of text-centric scene understanding. Nonetheless, most existing TEC-VQA benchmarks focus on high-resource languages like English and Chinese. Despite pioneering works expanding multilingual QA pairs in non-text-centric VQA datasets through translation engines, the translation-based protocol encounters a substantial “visual-textual misalignment” problem when applied to TEC-VQA. Specifically, it prioritizes the text in question-answer pairs while disregarding the visual text present in images. Moreover, it fails to address complexities related to nuanced meaning, contextual distortion, language bias, and question-type diversity. In this work, we tackle multilingual TEC-VQA by introducing MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages, consisting of 6,778 question-answer pairs across 2,116 images. Further, a comprehensive evaluation of numerous state-of-the-art Multimodal Large Language Models (MLLMs), including Qwen2.5-VL, InternVL-2.5, GPT-4o, GPT-4V, Claude3, and Gemini, on the MTVQA benchmark shows that there is still substantial room for performance improvement (InternVL-2.5 scoring 32.2 versus 79.7 for human performance), underscoring the value of MTVQA. By providing a dataset with nuanced multilingual annotations, MTVQA aims to set a new standard for benchmarks, fostering advancements in multilingual visual text comprehension.
Theorem-Validated Reverse Chain-of-Thought Problem Generation for Geometric Reasoning
Deng Linger | Linghao Zhu | Yuliang Liu | Yu Wang | Qunyi Xie | Jingjing Wu | Gang Zhang | Yingying Zhu | Xiang Bai
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Multimodal Models (LMMs) face limitations in geometric reasoning due to insufficient Chain of Thought (CoT) image-text training data. While existing approaches leverage template-based or LLM-assisted methods for geometric CoT data creation, they often face challenges in achieving both diversity and precision. To bridge this gap, we introduce a two-stage Theorem-Validated Reverse Chain-of-Thought Reasoning Synthesis (TR-CoT) framework. The first stage, TR-Engine, synthesizes theorem-grounded geometric diagrams with structured descriptions and properties. The second stage, TR-Reasoner, employs reverse reasoning to iteratively refine question-answer pairs by cross-validating geometric properties and description fragments. Our approach expands theorem-type coverage, corrects long-standing misunderstandings, and enhances geometric reasoning. Fine-grained CoT improves theorem understanding and increases logical consistency by 24.5%. Our best models surpass the baselines on MathVista and GeoQA by 10.1% and 4.7%, respectively, outperforming advanced closed-source models like GPT-4o.
WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?
An-Lan Wang | Jingqun Tang | Lei Liao | Hao Feng | Qi Liu | Xiang Fei | Jinghui Lu | Han Wang | Hao Liu | Yuliang Liu | Xiang Bai | Can Huang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
The rapid advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced capabilities in Document Understanding. However, prevailing benchmarks like DocVQA and ChartQA predominantly comprise scanned or digital documents, inadequately reflecting the intricate challenges posed by diverse real-world scenarios such as variable illumination and physical distortions. This paper introduces WildDoc, the inaugural benchmark designed specifically for assessing document understanding in natural environments. WildDoc incorporates a diverse set of manually captured document images reflecting real-world conditions and leverages document sources from established benchmarks to facilitate comprehensive comparisons with digital or scanned documents. Further, to rigorously evaluate model robustness, each document is captured four times under different conditions. Evaluations of state-of-the-art MLLMs on WildDoc expose substantial performance declines and underscore the models’ inadequate robustness compared to traditional benchmarks, highlighting the unique challenges posed by real-world document understanding.
2024
Deciphering Oracle Bone Language with Diffusion Models
Haisu Guan | Huanxin Yang | Xinyu Wang | Shengwei Han | Yongge Liu | Lianwen Jin | Xiang Bai | Yuliang Liu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Originating from China’s Shang Dynasty approximately 3,000 years ago, the Oracle Bone Script (OBS) is a cornerstone in the annals of linguistic history, predating many established writing systems. Despite the discovery of thousands of inscriptions, a vast expanse of OBS remains undeciphered, casting a veil of mystery over this ancient language. The emergence of modern AI technologies presents a novel frontier for OBS decipherment, challenging traditional NLP methods that rely heavily on large textual corpora, a luxury not afforded by historical languages. This paper introduces a novel approach by adopting image generation techniques, specifically through the development of Oracle Bone Script Decipher (OBSD). Utilizing a conditional diffusion-based strategy, OBSD generates vital clues for decipherment, charting a new course for AI-assisted analysis of ancient languages. To validate its efficacy, extensive experiments were conducted on an oracle bone script dataset, with quantitative results demonstrating the effectiveness of OBSD.
Co-authors
- Yuliang Liu 4
- Hao Feng 2
- Can Huang 2
- Hao Liu 2
- Qi Liu 2
- Jinghui Lu 2
- Jingqun Tang 2
- An-Lan Wang 2
- Wei Chen 1
- HanMing Deng 1
- Xiang Fei 1
- Pei Fu 1
- Haisu Guan 1
- Shengwei Han 1
- Guoli Jia 1
- Lianwen Jin 1
- Hang Li 1
- Jia Li 1
- Jiaming Li 1
- Jianjun Li 1
- Lei Liao 1
- Chunhui Lin 1
- Deng Linger 1
- Yongge Liu 1
- Jian Luan 1
- Zhenbo Luo 1
- Zhiyuan Ma 1
- Wenyu Ruan 1
- Han Wang (王涵) 1
- Xinyu Wang 1
- Xirui Wang 1
- Yanjie Wang 1
- Yu Wang 1
- Ziyang Wang 1
- Shu Wei 1
- Zhongyu Wei (魏忠钰) 1
- Jingjing Wu 1
- Junhao Xiao 1
- Qunyi Xie 1
- Huanxin Yang 1
- Yongjie Ye 1
- Jinghan Yu 1
- Gang Zhang 1
- Xiaojin Zhang 1
- Yuyi Zhang 1
- Zhen Zhao 1
- Yuanlei Zheng 1
- Bowen Zhou 1
- Chenyu Zhu 1
- Linghao Zhu 1
- Yingying Zhu 1