Gang Chen
2026
FinMRAGBench: A Realistic and Complex Benchmark for Multi-Modal RAG in Financial Document Analysis
Shouqing Yang | Qi Zhang | Yuhang Yang | Ruikang Xu | Yuwei Hou | Zhulin Jia | Lirong Gao | Haobo Wang | Jinglei Chen | Jiexiang Wang | Sheng Guo | Bo Zheng | Gang Chen
Findings of the Association for Computational Linguistics: ACL 2026
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for realistic financial analysis over financial documents. However, existing benchmarks fail to capture realistic financial analysis settings that involve cross-document retrieval, multi-page evidence integration, and diverse analytical tasks. To address this gap, we introduce FinMRAGBench, a comprehensive multi-modal financial RAG benchmark in which most questions require retrieving evidence scattered across multiple pages and documents, constructed from large-scale real-world annual reports and comprising 887 expert-verified QA pairs spanning five representative financial analysis tasks. Moreover, we introduce FinMRAGAgent, an agent trained on high-quality agentic trajectories following the reasoning-and-acting (ReAct) paradigm, capable of dynamic tool invocation and multi-step financial analysis. Our extensive experiments show that current multi-modal RAG systems still struggle with incomplete retrieval and complex financial reasoning. In contrast, FinMRAGAgent achieves the strongest overall performance across all models, demonstrating that our structured reasoning approach significantly enhances multi-modal RAG in realistic financial scenarios. The code and data are available at https://github.com/sqyangit/FinMRAGBench.
See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs
Yicheng Ji | Jun Zhang | Jinpeng Chen | Cong Wang | Lidan Shou | Gang Chen | Huan Li
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Video Large Language Models (Video-LLMs) excel in video understanding but suffer from high inference latency due to autoregressive generation. Speculative Decoding (SD) mitigates this by applying a draft-and-verify paradigm, yet existing methods are constrained by rigid exact-match rules, severely limiting the acceleration potential. To bridge this gap, we propose LVSpec, the first training-free loosely SD framework tailored for Video-LLMs. Grounded in the insight that generation is governed by sparse visual-relevant anchors (mandating strictness) amidst abundant visual-irrelevant fillers (permitting loose verification), LVSpec employs a lightweight visual-relevant token identification scheme to accurately pinpoint the former. To further maximize acceptance, we augment this with a position-shift tolerant mechanism that effectively salvages positionally mismatched but semantically equivalent tokens. Experiments demonstrate that LVSpec is high-fidelity and rapid: it preserves >99.8% of target performance while accelerating Qwen2.5-VL-32B by 2.70× and LLaVA-OneVision-72B by 2.94×. Notably, it boosts the mean accepted length and speedup ratio by 136% and 35% compared to SOTA training-free SD methods for Video-LLMs. Code is provided in the submitted software.
Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects
Jun Zhang | Yicheng Ji | Feiyang Ren | Yihang Li | Bowen Zeng | Zonghao Chen | Ke Chen | Lidan Shou | Gang Chen | Huan Li
Findings of the Association for Computational Linguistics: ACL 2026
Large Vision-Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi-regime interplay between high-resolution feature extraction, quadratic attention scaling, and memory bandwidth constraints. We present a systematic taxonomy of efficiency techniques structured around the inference lifecycle, consisting of encoding, prefilling, and decoding. Unlike prior reviews focused on isolated optimizations, we analyze the end-to-end pipeline to reveal how upstream decisions dictate downstream bottlenecks, covering compute-bound visual encoding, the intensive prefilling of massive contexts, and the “visual memory wall” in bandwidth-bound decoding. By decoupling the efficiency landscape into the axes of shaping information density, managing long-context attention, and overcoming memory limits, this work provides a structured analysis of how isolated optimizations compose to navigate the trade-off between visual fidelity and system efficiency. The survey concludes by outlining four future frontiers supported by pilot empirical insights, including hybrid compression based on functional unit sensitivity, modality-aware decoding with relaxed verification, progressive state management for streaming continuity, and stage-disaggregated serving through hardware-algorithm co-design. The submitted software contains a snapshot of our literature repository, which is designed to be maintained as a living resource for the community.
Towards Interpretable Tabular Reasoning: Enhancing LLM Reasoning on Tabular Data with Pre-Constructed Logic Graph
Lirong Gao | Zewei Yu | Zhongrui Yin | Qi Zhang | Yuke Zhu | Bo Zheng | Haobo Wang | Junbo Zhao | Gang Chen | Sheng Guo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Tabular data is widely used in fields such as finance and healthcare. Traditional tree-based models are prevalent for tabular prediction tasks due to their ability to handle heterogeneous features. However, their heavy reliance on feature engineering limits both their generalizability and their human-readable interpretability. On the other hand, Large Language Models (LLMs) naturally provide intermediate reasoning steps, thus offering greater transparency in decision-making. Nevertheless, LLMs often fail to match the predictive performance of tree-based models on tabular data. To address these challenges, we propose a novel Logic-Graph-Enhanced LLM Reasoning (LogGER) framework that integrates the strengths of tree-based models and LLMs. Specifically, we reformulate the traditional decision tree as a human-readable logic graph, which explicitly models the causal relationships between features and targets. This logic graph is automatically constructed using LLMs based on data priors and serves as the foundation for LogGER. To fully leverage the logic graph, we further introduce a logic-graph-guided process supervision approach, which evaluates and enhances the quality of the LLM’s intermediate reasoning steps using a logic-graph-aided process reward. Extensive experiments demonstrate that LogGER consistently outperforms both tree-based models and state-of-the-art LLM methods on a variety of tabular prediction tasks, achieving superior accuracy and interpretability.
QBridge: Bridging Natural Language and SQL via Gold Query Rewriting with Agentic Refinement
Zhensheng Luo | Sai Wu | Yuan Qiu | Chang Yao | Gang Chen | Xiu Tang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Natural language to SQL (NL2SQL) provides an intuitive interface for querying structured data, yet real user questions are often noisy, ambiguous, and weakly grounded to database semantics. As a result, token-level schema linking and single-pass SQL decoding can be brittle: small misunderstandings in language or schema grounding may propagate into incorrect generation. We present QBridge, an agentic, feedback-driven NL2SQL framework based on a Refined Gold Query Paradigm, which bridges natural language and SQL via Gold Query, a structured, SQL-aligned intermediate representation. A core insight of QBridge is Distilled Back-Translation (DBT) for SL-independent rewriting. DBT converts SQL-grounded supervision into execution-verified Gold-Query-style rewrites from a teacher model, and distills a lightweight, plug-and-play rewriter that generates schema-aware rewrites without requiring explicit schema linking at inference. QBridge then (i) verifies and conservatively refines the rewrite into a high-fidelity Refined Gold Query, and (ii) refines the generated SQL with dual feedback from execution validity and semantic consistency, enabling interpretable self-correction while remaining compatible with diverse SQL backbones. Extensive experiments on Spider, BIRD, and three robustness variants demonstrate that QBridge consistently improves zero-shot NL2SQL, outperforming strong prompting and agentic baselines while showing strong robustness and generalization. Code and data are available at https://github.com/WannaBSteve/QBridge.
2025
SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
Yicheng Ji | Jun Zhang | Heming Xia | Jinpeng Chen | Lidan Shou | Gang Chen | Huan Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model’s speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens to enable efficient speculation without sacrificing accuracy. To achieve this, we perform a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68× decoding speedup for LLaVA-OneVision-72B and 2.11× speedup for Qwen2.5-VL-32B. Code is available at https://github.com/zju-jiyicheng/SpecVLM.
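The two-stage pruning described in this abstract can be illustrated with a minimal sketch. It is not the SpecVLM implementation: the function name `two_stage_prune`, the parameters `keep_ratio` and `stage1_share`, the flat 1-D token layout, and the even-stride sampling used for the "spatially uniform" stage are all assumptions made for illustration.

```python
def two_stage_prune(attn_scores, keep_ratio=0.1, stage1_share=0.5):
    """Return sorted indices of the video tokens kept after two pruning stages.

    attn_scores  -- hypothetical per-token attention scores from the verifier
    keep_ratio   -- assumed fraction of tokens to keep overall (0.1 = prune 90%)
    stage1_share -- assumed fraction of the keep budget given to Stage I
    """
    n = len(attn_scores)
    budget = max(1, round(n * keep_ratio))
    k1 = min(budget, max(1, round(budget * stage1_share)))

    # Stage I: keep the k1 tokens with the highest verifier attention.
    ranked = sorted(range(n), key=lambda i: attn_scores[i], reverse=True)
    stage1 = set(ranked[:k1])

    # Stage II: thin out the remaining tokens at an even stride, so the
    # survivors are spread uniformly over the sequence (an assumed stand-in
    # for spatially uniform pruning on a 2-D frame grid).
    remaining = [i for i in range(n) if i not in stage1]
    k2 = min(budget - k1, len(remaining))
    stage2 = []
    if k2 > 0:
        step = len(remaining) / k2
        stage2 = [remaining[int(j * step)] for j in range(k2)]

    return sorted(stage1 | set(stage2))


# Usage: 20 tokens, two of which receive high attention.
scores = [0.0] * 20
scores[3], scores[17] = 0.9, 0.8
kept = two_stage_prune(scores, keep_ratio=0.2)
```

In this toy setup the two high-attention tokens survive Stage I, and Stage II fills the rest of the budget with evenly spaced tokens from what is left.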
T2DR: A Two-Tier Deficiency-Resistant Framework for Incomplete Multimodal Learning
Han Lin | Xiu Tang | Huan Li | Wenxue Cao | Sai Wu | Chang Yao | Lidan Shou | Gang Chen
Findings of the Association for Computational Linguistics: ACL 2025
Multimodal learning is garnering significant attention for its capacity to represent diverse human perceptions (e.g., linguistic, acoustic, and visual signals), achieving more natural and intuitive interactions with technology. However, the frequent occurrence of incomplete data, either within a single modality (intra-modality) or across different modalities (inter-modality), presents substantial challenges in reliable semantic interpretation and model reasoning. Furthermore, there is currently no robust representation learning mechanism capable of managing both intra-modality and inter-modality real-data deficiencies. To address this challenge, we present T2DR, a two-tier deficiency-resistant framework for incomplete multimodal learning, which comprises two main modules: (1) Intra-Modal Deficiency-Resistant module (IADR): To address fine-grained deficiencies, we introduce Intra-Attn to focus on the available data while avoiding excessive suppression of the missing regions. (2) Inter-Modal Deficiency-Resistant module (IEDR): To handle coarse-grained deficiencies, we propose the shared feature prediction (SFP) to leverage cross-modal shared features for preliminary data imputation. Subsequently, we apply Inter-Attn to allocate appropriate attention to each modality based on the results from the capability-aware scorer (CAS). Extensive experiments are performed on two well-known multimodal benchmarks, CMU-MOSI and CMU-MOSEI, across various missing scenarios for sentiment analysis. Experimental results show that T2DR significantly outperforms the SOTA models. Code is available at https://github.com/LH019/T2DR.
CYCLE-INSTRUCT: Fully Seed-Free Instruction Tuning via Dual Self-Training and Cycle Consistency
Zhanming Shen | Hao Chen | Yulei Tang | Shaolin Zhu | Wentao Ye | Xiaomeng Hu | Haobo Wang | Gang Chen | Junbo Zhao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Instruction tuning is vital for aligning large language models (LLMs) with human intent, but current methods typically rely on costly human-annotated seed data or powerful external teacher models. While instruction back-translation techniques reduce this dependency, they remain fundamentally tethered to an initial seed set, which limits full automation, introduces biases, and can lead to inefficient use of unlabeled corpora. In this paper, we propose Cycle-Instruct, a novel framework that achieves fully seed-free instruction tuning. Inspired by cycle consistency, Cycle-Instruct employs a dual self-training loop where two models—an answer generator and a question generator—are bootstrapped solely from raw, unlabeled text. These models mutually supervise each other by reconstructing original text segments from their counterpart’s generated pseudo-labels, effectively learning from the intrinsic structure of the data without any human-provided seeds. We demonstrate Cycle-Instruct’s efficacy across four diverse data tracks, including general instruction-following, domain-specific tasks, dialogue logs, and plain text. Our extensive experiments show that Cycle-Instruct not only outperforms seed-driven back-translation baselines but also achieves performance comparable to strongly supervised methods.
Ensembling Prompting Strategies for Zero-Shot Hierarchical Text Classification with Large Language Models
Mingxuan Xia | Zhijie Jiang | Haobo Wang | Junbo Zhao | Tianlei Hu | Gang Chen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Hierarchical text classification aims to classify documents into multiple labels within a hierarchical taxonomy, making it an essential yet challenging task in natural language processing. Recently, using Large Language Models (LLMs) to tackle hierarchical text classification in a zero-shot manner has attracted increasing attention due to their cost-efficiency and flexibility. Given the challenges of understanding the hierarchy, various HTC prompting strategies have been explored to elicit the best performance from LLMs. However, our empirical study reveals that LLMs are highly sensitive to these prompting strategies: (i) within a task, different strategies yield substantially different results, and (ii) across various tasks, the relative effectiveness of a given strategy varies significantly. To address this, we propose a novel ensemble method, HiEPS, which integrates the results of diverse prompting strategies to promote LLMs’ reliability. We also introduce a path-valid voting mechanism for ensembling, which selects a valid result with the highest path frequency score. Extensive experiments on three benchmark datasets show that HiEPS boosts the performance of single prompting strategies and achieves SOTA results. The source code is available at https://github.com/MingxuanXia/HiEPS.
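A rough sketch of what path-valid voting could look like follows. The abstract does not define the path frequency score, so the scoring rule here (summing, over the levels of a path, how many strategies predicted that label at that level) is an assumption, as are the function names `is_valid_path` and `path_valid_vote` and the parent-map encoding of the taxonomy. It should not be read as the HiEPS implementation.

```python
from collections import Counter


def is_valid_path(path, parent):
    """True if the path starts at a root label and every later label is a
    child of the label before it. `parent` maps label -> parent (None for roots)."""
    if not path or path[0] not in parent or parent[path[0]] is not None:
        return False
    return all(parent.get(child) == par for par, child in zip(path, path[1:]))


def path_valid_vote(predictions, parent):
    """Pick the valid path with the highest (assumed) path frequency score.

    predictions -- one predicted label path per prompting strategy
    Returns a tuple of labels, or None if no strategy produced a valid path.
    """
    # How often each label was predicted at each depth, across all strategies.
    level_counts = Counter(
        (depth, label) for path in predictions for depth, label in enumerate(path)
    )
    valid = [tuple(p) for p in predictions if is_valid_path(p, parent)]
    if not valid:
        return None
    # Assumed score: sum of per-level vote counts along the path.
    return max(valid, key=lambda p: sum(level_counts[(d, l)] for d, l in enumerate(p)))


# Usage with a toy two-level taxonomy and four hypothetical strategy outputs.
taxonomy = {"Science": None, "Sports": None,
            "Physics": "Science", "Biology": "Science", "Soccer": "Sports"}
outputs = [["Science", "Physics"], ["Science", "Physics"],
           ["Sports", "Physics"], ["Science", "Biology"]]
winner = path_valid_vote(outputs, taxonomy)
```

The third output is discarded because "Physics" is not a child of "Sports" in the toy taxonomy; its votes still count toward the per-level frequencies of the surviving paths.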
LongTableBench: Benchmarking Long-Context Table Reasoning across Real-World Formats and Domains
Liyao Li | Jiaming Tian | Hao Chen | Wentao Ye | Chao Ye | Haobo Wang | Ningtao Wang | Xing Fu | Gang Chen | Junbo Zhao
Findings of the Association for Computational Linguistics: EMNLP 2025
We introduce LongTableBench, a benchmark for evaluating long-context reasoning over semi-structured tables across diverse formats, tasks, and domains. It comprises 5,950 QA instances spanning 7 table formats (e.g., Markdown, HTML, SQL), 18 domains, and input lengths up to 128K tokens, including multi-turn and multi-table settings. To ensure data quality, we combine symbolic supervision, cross-model validation, and human review. Evaluating 52 LLMs—including general-purpose, table-specific, and reasoning-enhanced models—reveals that only the strongest models maintain robust performance under increasing context lengths and format diversity. We further show that end-to-end models outperform compression-based approaches, especially on tasks requiring semantic integration. LongTableBench provides a rigorous, scalable testbed for advancing long-context tabular understanding and highlights key limitations in current LLMs’ structural and reasoning capabilities. The code and data are available at https://github.com/liyaooi/LongTableBench.
Co-authors
- Haobo Wang 5
- Huan Li 4
- Lidan Shou 4
- Junbo Zhao 4
- Yicheng Ji 3
- Jun Zhang 3
- Jinpeng Chen 2
- Hao Chen 2
- Lirong Gao 2
- Sheng Guo 2
- Xiu Tang 2
- Sai Wu 2
- Chang Yao 2
- Wentao Ye 2
- Qi Zhang 2
- Bo Zheng 2
- Wenxue Cao 1
- Jinglei Chen 1
- Zonghao Chen 1
- Ke Chen 1
- Xing Fu 1
- Yuwei Hou 1
- Xiaomeng Hu 1
- Tianlei Hu 1
- Zhulin Jia 1
- Zhijie Jiang 1
- Yihang Li 1
- Liyao Li 1
- Han Lin 1
- Zhensheng Luo 1
- Yuan Qiu 1
- Feiyang Ren 1
- Zhanming Shen 1
- Yulei Tang 1
- Jiaming Tian 1
- Jiexiang Wang 1
- Cong Wang 1
- Ningtao Wang 1
- Heming Xia 1
- Mingxuan Xia 1
- Ruikang Xu 1
- Shouqing Yang 1
- Yuhang Yang 1
- Chao Ye 1
- Zhongrui Yin 1
- Zewei Yu 1
- Bowen Zeng 1
- Shaolin Zhu 1
- Yuke Zhu 1