Lirong Gao
2026
Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination
Lirong Gao | Zeqing Wang | Yuyan Cai | Jiayi Deng | Yanmei Gu | Yiming Zhang | Jia Zhou | Yanfei Zhang | Junbo Zhao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While Large Language Models (LLMs) have increasingly assisted in historical tasks such as text processing, their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks primarily assess basic knowledge breadth or lexical understanding, failing to capture the higher-order skills—such as evidentiary reasoning—that are central to historical research. To fill this gap, we introduce ProHist-Bench, a novel benchmark anchored in the Chinese Imperial Examination (Keju) system—a comprehensive microcosm of East Asian political, social, and intellectual history spanning over 1,300 years. Developed through deep interdisciplinary collaboration, ProHist-Bench features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Through a rigorous evaluation of 18 LLMs, we reveal a significant proficiency gap: even state-of-the-art LLMs struggle with complex historical research questions. We hope ProHist-Bench will facilitate the development of domain-specific reasoning LLMs, advance computational historical research, and further uncover the untapped potential of LLMs. We release ProHist-Bench at https://github.com/inclusionAI/ABench/tree/main/ProHist-Bench.
FinMRAGBench: A Realistic and Complex Benchmark for Multi-Modal RAG in Financial Document Analysis
Shouqing Yang | Qi Zhang | Yuhang Yang | Ruikang Xu | Yuwei Hou | Zhulin Jia | Lirong Gao | Haobo Wang | Jinglei Chen | Jiexiang Wang | Sheng Guo | Bo Zheng | Gang Chen
Findings of the Association for Computational Linguistics: ACL 2026
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for realistic financial analysis over financial documents. However, existing benchmarks fail to capture realistic financial analysis settings that involve cross-document retrieval, multi-page evidence integration, and diverse analytical tasks. To address this gap, we introduce FinMRAGBench, a comprehensive multi-modal financial RAG benchmark in which most questions require retrieving evidence scattered across multiple pages and documents, constructed from large-scale real-world annual reports and comprising 887 expert-verified QA pairs spanning five representative financial analysis tasks. Moreover, we introduce FinMRAGAgent, an agent trained on high-quality agentic trajectories following the reasoning-and-acting (ReAct) paradigm, capable of dynamic tool invocation and multi-step financial analysis. Our extensive experiments show that current multi-modal RAG systems still struggle with incomplete retrieval and complex financial reasoning. In contrast, FinMRAGAgent achieves the strongest overall performance across all models, demonstrating that our structured reasoning approach significantly enhances multi-modal RAG in realistic financial scenarios. The code and data are available at https://github.com/sqyangit/FinMRAGBench.
Towards Interpretable Tabular Reasoning: Enhancing LLM Reasoning on Tabular Data with Pre-Constructed Logic Graph
Lirong Gao | Zewei Yu | Zhongrui Yin | Qi Zhang | Yuke Zhu | Bo Zheng | Haobo Wang | Junbo Zhao | Gang Chen | Sheng Guo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Tabular data is widely used in fields such as finance and healthcare. Traditional tree-based models are prevalent for tabular prediction tasks due to their ability to handle heterogeneous features. However, their heavy reliance on feature engineering limits both their generalizability and their human-readable interpretability. On the other hand, Large Language Models (LLMs) naturally provide intermediate reasoning steps, thus offering greater transparency in decision-making. Nevertheless, LLMs often fail to match the predictive performance of tree-based models on tabular data. To address these challenges, we propose a novel Logic-Graph-Enhanced LLM Reasoning (LogGER) framework that integrates the strengths of tree-based models and LLMs. Specifically, we reformulate the traditional decision tree as a human-readable logic graph, which explicitly models the causal relationships between features and targets. This logic graph is automatically constructed using LLMs based on data priors and serves as the foundation for LogGER. To fully leverage the logic graph, we further introduce a logic-graph-guided process supervision approach, which evaluates and enhances the quality of LLM’s intermediate reasoning steps using logic-graph-aided process reward. Extensive experiments demonstrate that LogGER consistently outperforms both tree-based models and state-of-the-art LLM methods on a variety of tabular prediction tasks, achieving superior accuracy and interpretability.
2025
LeTS: Learning to Think-and-Search via Process-and-Outcome Reward Hybridization
Qi Zhang | Shouqing Yang | Lirong Gao | Hao Chen | Xiaomeng Hu | Jinglei Chen | Jiexiang Wang | Sheng Guo | Bo Zheng | Haobo Wang | Junbo Zhao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) have demonstrated impressive capabilities in reasoning with the emergence of reasoning models like OpenAI-o1 and DeepSeek-R1. Recent research focuses on integrating reasoning capabilities into the realm of retrieval-augmented generation (RAG) via outcome-supervised reinforcement learning (RL) approaches, while the correctness of intermediate think-and-search steps is usually neglected. To address this issue, we design a process-level reward module to mitigate the unawareness of intermediate reasoning steps in outcome-level supervision without additional annotation. Grounded on this, we propose **Le**arning to **T**hink-and-**S**earch (**LeTS**), a novel framework that hybridizes stepwise process reward and outcome-based reward to current RL methods for RAG. Extensive experiments demonstrate the generalization and inference efficiency of **LeTS** across various RAG benchmarks. In addition, these results reveal the potential of process- and outcome-level reward hybridization in boosting LLMs’ reasoning ability via RL under other scenarios.
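As an illustration of what hybridizing a stepwise process reward with an outcome reward can look like, here is a minimal sketch. It is not the LeTS formulation: the function name `hybrid_reward`, the weight `alpha`, and the choice of averaging step rewards are all assumptions made for this example, since the abstract does not specify how the two signals are combined.

```python
from typing import Sequence


def hybrid_reward(step_rewards: Sequence[float],
                  outcome_reward: float,
                  alpha: float = 0.5) -> float:
    """Blend a mean per-step (process) reward with a final (outcome) reward.

    `alpha` is a hypothetical mixing weight: 0.0 uses only the outcome
    reward, 1.0 uses only the process reward.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # With no intermediate steps, the process term contributes nothing.
    process = sum(step_rewards) / len(step_rewards) if step_rewards else 0.0
    return alpha * process + (1.0 - alpha) * outcome_reward


# One good and one bad think-and-search step, followed by a correct answer.
reward = hybrid_reward([1.0, 0.0], outcome_reward=1.0, alpha=0.5)
```

Under this assumed weighting, a rollout that reaches the right answer through weak intermediate steps scores lower than one whose steps were all rewarded, which is the gap that outcome-only supervision leaves open.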
D.Va: Validate Your Demonstration First Before You Use It
Qi Zhang | Zhiqing Xiao | Ruixuan Xiao | Lirong Gao | Junbo Zhao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In-context learning (ICL) has demonstrated significant potential in enhancing the capabilities of large language models (LLMs) during inference. It’s well-established that ICL heavily relies on selecting effective demonstrations to achieve outputs that better align with the expected results. As for demonstration selection, previous approaches have typically relied on intuitive metrics to evaluate the effectiveness of demonstrations, which often results in limited robustness and poor cross-model generalization capabilities. To tackle these challenges, we propose a novel method, **D**emonstration **Va**lidation (**D.Va**), which integrates a demonstration validation perspective into this field. By introducing the demonstration validation mechanism, our method effectively identifies demonstrations that are both effective and highly generalizable. **D.Va** surpasses all existing retrieval-based in-context learning techniques across both natural language understanding (NLU) and natural language generation (NLG) tasks. Additionally, we demonstrate the robustness and generalizability of our approach across various language models and retrieval models.
ALPS: Attention Localization and Pruning Strategy for Efficient Adaptation of Large Language Models
Hao Chen | Haoze Li | Zhiqing Xiao | Lirong Gao | Qi Zhang | Xiaomeng Hu | Ningtao Wang | Xing Fu | Junbo Zhao
Findings of the Association for Computational Linguistics: ACL 2025
Aligning general-purpose large language models (LLMs) to downstream tasks often incurs significant training adjustment costs. Prior research has explored various avenues to enhance alignment efficiency, primarily through minimal-data training or data-driven activations to identify key attention heads. However, these approaches inherently introduce data dependency, which hinders generalization and reusability. To address this issue and enhance model alignment efficiency, we propose the Attention Localization and Pruning Strategy (ALPS), an efficient algorithm that localizes the most task-sensitive attention heads and prunes by restricting attention training updates to these heads, thereby reducing alignment costs. Experimental results demonstrate that our method activates only 10% of attention parameters during fine-tuning while achieving a 2% performance improvement over baselines on three tasks. Moreover, the identified task-specific heads are transferable across datasets and mitigate knowledge forgetting. Our work and findings provide a novel perspective on efficient LLM alignment.
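The idea of restricting updates to a small fraction of attention heads can be sketched as a selection step. This is a hedged illustration only: the abstract does not say how task sensitivity is measured, so the `scores` dictionary, the `(layer, head)` keys, and the function name `select_top_heads` are hypothetical; only the 10% figure comes from the abstract.

```python
from typing import Dict, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index), an assumed keying scheme


def select_top_heads(scores: Dict[Head, float],
                     fraction: float = 0.1) -> Set[Head]:
    """Return the heads with the highest sensitivity scores.

    `scores` is a placeholder for whatever sensitivity measure is used;
    heads outside the returned set would be kept frozen during fine-tuning.
    """
    if not scores:
        return set()
    # Always keep at least one head, even for very small models.
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


# Ten heads in one layer with made-up scores; 10% keeps the single best head.
toy_scores = {(0, h): float(h) for h in range(10)}
trainable = select_top_heads(toy_scores, fraction=0.1)
```

In a training loop, the returned set would typically be turned into a gradient mask so that only the selected heads receive updates.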
2024
DORY: Deliberative Prompt Recovery for LLM
Lirong Gao | Ru Peng | Yiming Zhang | Junbo Zhao
Findings of the Association for Computational Linguistics: ACL 2024
Prompt recovery in large language models (LLMs) is crucial for understanding how LLMs work and addressing concerns regarding privacy, copyright, etc. The trend towards inference-only APIs complicates this task by restricting access to essential outputs for recovery. To tackle this challenge, we extract prompt-related information from limited outputs and identify a strong (negative) correlation between output probability-based uncertainty and the success of prompt recovery. This finding led to the development of Deliberative PrOmpt RecoverY (DORY), our novel approach that leverages uncertainty to recover prompts accurately. DORY involves reconstructing drafts from outputs, refining these with hints, and filtering out noise based on uncertainty. Our evaluation shows that DORY outperforms existing baselines across diverse LLMs and prompt benchmarks, improving performance by approximately 10.82% and establishing a new state-of-the-art record in prompt recovery tasks. Significantly, DORY operates using a single LLM without any external resources or model, offering a cost-effective, user-friendly prompt recovery solution.
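One common way to turn output probabilities into an uncertainty score is the mean negative log-probability of the generated tokens. The sketch below uses that measure to drop high-uncertainty candidates. It is an assumption-laden illustration, not DORY's actual procedure: the abstract does not define its uncertainty measure, and `mean_token_uncertainty`, `filter_by_uncertainty`, and `threshold` are names invented here.

```python
import math
from typing import List, Sequence, Tuple


def mean_token_uncertainty(token_probs: Sequence[float]) -> float:
    """Mean negative log-probability of the tokens (higher = less certain)."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def filter_by_uncertainty(candidates: List[Tuple[str, Sequence[float]]],
                          threshold: float) -> List[str]:
    """Keep candidate texts whose uncertainty is at or below `threshold`.

    Keeping the low-uncertainty side follows the negative correlation
    reported in the abstract; the threshold itself is hypothetical.
    """
    return [text for text, probs in candidates
            if mean_token_uncertainty(probs) <= threshold]


# Two made-up candidates: one generated confidently, one not.
kept = filter_by_uncertainty(
    [("confident draft", [0.9, 0.8]), ("noisy draft", [0.1, 0.2])],
    threshold=1.0,
)
```

Here the confident draft has a mean negative log-probability of roughly 0.16 and is kept, while the noisy draft scores close to 2.0 and is filtered out.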
Co-authors
- Junbo Zhao 6
- Sheng Guo 3
- Haobo Wang 3
- Qi Zhang 3
- Jinglei Chen 2
- Gang Chen 2
- Xiaomeng Hu 2
- Jiexiang Wang 2
- Zhiqing Xiao 2
- Shouqing Yang 2
- Qi Zhang 2
- Bo Zheng 2
- Yuyan Cai 1
- Hao Chen 1
- Hao Chen 1
- Jiayi Deng 1
- Xing Fu 1
- Yanmei Gu 1
- Yuwei Hou 1
- Zhulin Jia 1
- Haoze Li 1
- Ru Peng 1
- Zeqing Wang 1
- Ningtao Wang 1
- Ruixuan Xiao 1
- Ruikang Xu 1
- Yuhang Yang 1
- Zhongrui Yin 1
- Zewei Yu 1
- Yiming Zhang 1
- Yanfei Zhang 1
- Yiming Zhang 1
- Bo Zheng 1
- Jia Zhou 1
- Yuke Zhu 1