2025
pdf
bib
abs
FineReason: Evaluating and Improving LLMs’ Deliberate Reasoning through Reflective Puzzle Solving
Guizhen Chen
|
Weiwen Xu
|
Hao Zhang
|
Hou Pong Chan
|
Chaoqun Liu
|
Lidong Bing
|
Deli Zhao
|
Anh Tuan Luu
|
Yu Rong
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Many challenging reasoning tasks require not just rapid, intuitive responses, but a more deliberate, multi-step approach. Recent progress in large language models (LLMs) highlights an important shift from the “System 1” way of quick reactions to the “System 2” style of reflection-and-correction problem solving. However, current benchmarks rely heavily on final-answer accuracy, leaving a model’s intermediate reasoning steps largely unexamined. This fails to assess the model’s ability to reflect on and rectify mistakes within the reasoning process. To bridge this gap, we introduce FINEREASON, a logic-puzzle benchmark for systematic evaluation of LLMs’ reasoning capabilities. Each puzzle can be decomposed into atomic steps, making it ideal for rigorous validation of intermediate correctness. Building on this, we introduce two tasks, state checking and state transition, for a comprehensive evaluation of how models assess the current situation and plan the next move. To support broader research, we also provide a puzzle training set aimed at enhancing general reasoning. We show that models trained on our state checking and transition data achieve gains in mathematical reasoning of up to 5.1%.
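To make the two task types concrete, here is a minimal sketch that frames state checking and state transition on a 4x4 Sudoku; the puzzle choice, grid size, and function names are illustrative assumptions, not the FINEREASON data format.

```python
# Illustrative sketch only: 4x4 Sudoku used as a stand-in puzzle.

def state_check(grid):
    """State checking: is this intermediate state still consistent?
    Here we only verify that no row, column, or 2x2 box repeats a filled digit."""
    def ok(cells):
        filled = [c for c in cells if c != 0]
        return len(filled) == len(set(filled))
    rows = grid
    cols = [[grid[r][c] for r in range(4)] for c in range(4)]
    boxes = [[grid[r][c] for r in (br, br + 1) for c in (bc, bc + 1)]
             for br in (0, 2) for bc in (0, 2)]
    return all(ok(g) for g in rows + cols + boxes)

def state_transitions(grid):
    """State transition: enumerate the legal next moves from the current state."""
    moves = []
    for r in range(4):
        for c in range(4):
            if grid[r][c] == 0:
                for v in range(1, 5):
                    grid[r][c] = v
                    if state_check(grid):
                        moves.append((r, c, v))
                grid[r][c] = 0
    return moves

puzzle = [[1, 0, 0, 0],
          [0, 0, 3, 0],
          [0, 4, 0, 0],
          [0, 0, 0, 2]]
print(state_check(puzzle))            # True: the partial state is consistent
print(state_transitions(puzzle)[:5])  # a few candidate next moves
```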
pdf
bib
abs
Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger
Wenjun Li
|
Dexun Li
|
Kuicai Dong
|
Cong Zhang
|
Hao Zhang
|
Weiwen Liu
|
Yasheng Wang
|
Ruiming Tang
|
Yong Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have shown remarkable emergent capabilities, transforming the execution of functional tasks by leveraging external tools for complex problems that require specialized processing or up-to-date data. While existing research expands LLMs’ access to diverse tools (e.g., program interpreters, search engines, calculators), the necessity of using these tools is often overlooked, leading to indiscriminate tool invocation. This naive approach raises two key issues: increased latency due to unnecessary tool calls, and potential errors resulting from faulty interactions with external tools. In this paper, we introduce meta-cognition as a proxy for LLMs’ self-assessment of their capabilities, reflecting the model’s awareness of its own limitations. Based on this, we propose MeCo, an adaptive decision-making strategy for external tool use. MeCo quantifies metacognitive scores by capturing high-level cognitive signals in the representation space, guiding when to invoke tools. Notably, MeCo is fine-tuning-free and incurs minimal cost. Experiments across multiple backbone models and benchmarks show that MeCo reliably detects LLMs’ internal cognitive signals and significantly improves tool-use decision-making.
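A minimal numpy sketch of the general idea of gating tool calls on a representation-space self-assessment score; the difference-of-means probe, threshold, and synthetic data below are assumptions, not MeCo's released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
# Toy "hidden states" for prompts the model answered correctly (known) vs. not (unknown).
known = rng.normal(0.5, 1.0, size=(200, d))
unknown = rng.normal(-0.5, 1.0, size=(200, d))

# Difference-of-means direction as a cheap, fine-tuning-free probe.
direction = known.mean(axis=0) - unknown.mean(axis=0)
direction /= np.linalg.norm(direction)
threshold = 0.5 * (known @ direction).mean() + 0.5 * (unknown @ direction).mean()

def should_call_tool(hidden_state: np.ndarray) -> bool:
    """Invoke the external tool only when the metacognitive score is low."""
    score = float(hidden_state @ direction)
    return score < threshold

print(should_call_tool(rng.normal(0.5, 1.0, size=d)))   # likely False: answer directly
print(should_call_tool(rng.normal(-0.5, 1.0, size=d)))  # likely True: call the tool
```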
pdf
bib
abs
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models
Xiangyang Li
|
Kuicai Dong
|
Yi Quan Lee
|
Wei Xia
|
Hao Zhang
|
Xinyi Dai
|
Yasheng Wang
|
Ruiming Tang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite the substantial success of Information Retrieval (IR) in various NLP tasks, most IR systems predominantly handle queries and corpora in natural language, neglecting the domain of code retrieval. Code retrieval is critically important yet remains under-explored, with existing methods and benchmarks inadequately representing the diversity of code in various domains and tasks. Moreover, many models have begun to overfit existing leaderboards, limiting their generalizability and real-world applicability. Addressing this gap, we present CoIR (**Co**de **I**nformation **R**etrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities. CoIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains. We first discuss the construction of CoIR and its diverse dataset composition. Further, we evaluate ten widely used retrieval models using CoIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems. CoIR also introduces a simple yet effective Python framework, which additionally defines various advanced modes to facilitate researchers in evaluating their models. It shares the same data schema as other popular benchmarks like MTEB and BEIR, enabling seamless cross-benchmark evaluations. Through CoIR, we aim to invigorate research in the code retrieval domain, providing a versatile benchmarking tool that encourages further development and exploration of code retrieval systems.
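The BEIR/MTEB-style data schema mentioned above (corpus, queries, qrels) can be sketched as follows; the toy lexical retriever and nDCG loop are generic stand-ins, not the CoIR framework's API.

```python
import math

corpus = {
    "doc1": "def binary_search(arr, x): ...",
    "doc2": "class LinkedList: ...",
    "doc3": "def quicksort(arr): ...",
}
queries = {"q1": "binary search implementation"}
qrels = {"q1": {"doc1": 1}}  # graded relevance judgments

def score(query: str, doc: str) -> float:
    """Toy lexical retriever: token overlap (a real system would use a code embedding model)."""
    q = set(query.lower().replace("_", " ").split())
    d = set(doc.lower().replace("_", " ").replace("(", " ").split())
    return len(q & d)

def ndcg_at_k(ranked, rels, k=10):
    dcg = sum(rels.get(doc, 0) / math.log2(i + 2) for i, doc in enumerate(ranked[:k]))
    ideal = sorted(rels.values(), reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

for qid, query in queries.items():
    ranked = sorted(corpus, key=lambda did: score(query, corpus[did]), reverse=True)
    print(qid, ndcg_at_k(ranked, qrels[qid]))
```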
pdf
bib
abs
Analyzing LLMs’ Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations
Chenghao Xiao
|
Hou Pong Chan
|
Hao Zhang
|
Mahani Aljunied
|
Lidong Bing
|
Noura Al Moubayed
|
Yu Rong
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While understanding the knowledge boundaries of LLMs is crucial to prevent hallucination, research on the knowledge boundaries of LLMs has predominantly focused on English. In this work, we present the first study to analyze how LLMs recognize knowledge boundaries across different languages by probing their internal representations when processing known and unknown questions in multiple languages. Our empirical studies reveal three key findings: 1) LLMs’ perceptions of knowledge boundaries are encoded in the middle to middle-upper layers across different languages; 2) language differences in knowledge boundary perception follow a linear structure, which motivates our proposal of a training-free alignment method that effectively transfers knowledge boundary perception ability across languages, thereby helping reduce hallucination risk in low-resource languages; 3) fine-tuning on bilingual question pair translation further enhances LLMs’ recognition of knowledge boundaries across languages. Given the absence of standard testbeds for cross-lingual knowledge boundary analysis, we construct a multilingual evaluation suite comprising three representative types of knowledge boundary data. Our code and datasets are publicly available at
https://github.com/DAMO-NLP-SG/LLM-Multilingual-Knowledge-Boundaries.
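A toy numpy sketch of the training-free alignment idea: a knowledge-boundary probe fit on English mid-layer representations is reused for another language after a mean-shift. The probe form, the shift, and the synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32
offset = rng.normal(0, 2.0, size=d)  # pretend systematic representation shift of language B

en_known, en_unknown = rng.normal(1, 1, (300, d)), rng.normal(-1, 1, (300, d))
xx_known, xx_unknown = en_known + offset, en_unknown + offset  # language-B analogue

# Probe fit only on English: difference-of-means direction plus midpoint bias.
w = en_known.mean(0) - en_unknown.mean(0)
b = -0.5 * (en_known.mean(0) + en_unknown.mean(0)) @ w
predict = lambda X: (X @ w + b) > 0

# Training-free alignment: shift language-B representations by the difference of language means.
shift = np.vstack([en_known, en_unknown]).mean(0) - np.vstack([xx_known, xx_unknown]).mean(0)
aligned = np.vstack([xx_known, xx_unknown]) + shift

labels = np.r_[np.ones(300), np.zeros(300)].astype(bool)
print("no alignment:  ", (predict(np.vstack([xx_known, xx_unknown])) == labels).mean())
print("with alignment:", (predict(aligned) == labels).mean())
```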
pdf
bib
abs
ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
Yu Sun
|
Xingyu Qian
|
Weiwen Xu
|
Hao Zhang
|
Chenghao Xiao
|
Long Li
|
Deli Zhao
|
Wenbing Huang
|
Tingyang Xu
|
Qifeng Bai
|
Yu Rong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Reasoning-based large language models have excelled in mathematics and programming, yet their potential in knowledge-intensive medical question answering remains underexplored and insufficiently validated in clinical contexts. To bridge this gap, we introduce ReasonMed, the largest medical reasoning dataset to date, comprising 370k high-quality examples distilled from 1.75 million initial reasoning paths generated by complementary LLMs and curated through a cost-efficient easy-medium-difficult (EMD) pipeline. ReasonMed is built through a multi-agent generation, verification, and refinement process, in which an Error Refiner improves reasoning paths by correcting error-prone steps identified by a verifier. Using ReasonMed, we investigate effective strategies for training medical reasoning models and find that integrating detailed CoT reasoning with concise answer summaries yields the most robust fine-tuning results. Models trained on ReasonMed set a new benchmark: ReasonMed-7B surpasses the prior best sub-10B models by 4.17% and even exceeds LLaMA3.1-70B on PubMedQA by 4.60%. When scaled to ReasonMed-14B, it remains highly competitive, underscoring consistent scaling potential. The codes and datasets are available at https://github.com/YuSun-Work/ReasonMed.
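A toy sketch of an easy-medium-difficult routing step consistent with the pipeline described above; the thresholds, bucket actions, and sampled answers are assumptions, not the paper's exact recipe.

```python
from collections import Counter

def emd_bucket(sampled_answers, reference, easy=0.8, hard=0.3):
    """Bucket a question by how often sampled reasoning paths reach the reference answer."""
    agree = sum(a == reference for a in sampled_answers) / len(sampled_answers)
    if agree >= easy:
        return "easy", "keep one correct path as-is"
    if agree >= hard:
        return "medium", "verify each path and refine error-prone steps"
    return "difficult", "regenerate with a stronger model, then verify and refine"

samples = ["B", "B", "C", "B", "B", "A", "B", "B"]  # answers extracted from 8 sampled CoT paths
print(Counter(samples))                    # Counter({'B': 6, 'C': 1, 'A': 1})
print(emd_bucket(samples, reference="B"))  # ('medium', 'verify each path and refine error-prone steps')
```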
pdf
bib
abs
CtrlA: Adaptive Retrieval-Augmented Generation via Inherent Control
Liu Huanshuo
|
Hao Zhang
|
Zhijiang Guo
|
Jing Wang
|
Kuicai Dong
|
Xiangyang Li
|
Yi Quan Lee
|
Cong Zhang
|
Yong Liu
Findings of the Association for Computational Linguistics: ACL 2025
Retrieval-augmented generation (RAG) has emerged as a promising solution for mitigating hallucinations of large language models (LLMs) with retrieved external knowledge. Adaptive RAG enhances this approach by enabling dynamic retrieval during generation, activating retrieval only when the query exceeds the LLM’s internal knowledge. Existing methods primarily focus on detecting the LLM’s confidence via statistical uncertainty. Instead, we present the first attempt to solve adaptive RAG from a representation perspective and develop an inherent control-based framework, termed CtrlA. Specifically, we extract the features that represent the honesty and confidence directions of the LLM and adopt them to control LLM behavior and guide retrieval timing decisions. We also design a simple yet effective query formulation strategy to support adaptive retrieval. Experiments show that CtrlA is superior to existing adaptive RAG methods on a diverse set of tasks. Honesty steering can effectively make LLMs more honest, and confidence monitoring is a promising indicator for triggering retrieval.
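A minimal numpy sketch of the two mechanisms named above: steering along an honesty direction and triggering retrieval when a confidence projection is low. The random directions, scale, and threshold are placeholders rather than CtrlA's extracted features.

```python
import numpy as np

rng = np.random.default_rng(7)
d = 64
honesty_dir = rng.normal(size=d)
honesty_dir /= np.linalg.norm(honesty_dir)
confidence_dir = rng.normal(size=d)
confidence_dir /= np.linalg.norm(confidence_dir)

def steer(hidden: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """Activation addition: push the hidden state along the honesty direction."""
    return hidden + alpha * honesty_dir

def needs_retrieval(hidden: np.ndarray, threshold: float = 0.0) -> bool:
    """Trigger retrieval when the confidence projection falls below a threshold."""
    return float(hidden @ confidence_dir) < threshold

h = rng.normal(size=d)
print("honesty projection before/after steering:",
      float(h @ honesty_dir), float(steer(h) @ honesty_dir))
print("retrieve for this step?", needs_retrieval(h))
```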
pdf
bib
abs
RAPID: Efficient Retrieval-Augmented Long Text Generation with Writing Planning and Information Discovery
Hongchao Gu
|
Dexun Li
|
Kuicai Dong
|
Hao Zhang
|
Hang Lv
|
Hao Wang
|
Defu Lian
|
Yong Liu
|
Enhong Chen
Findings of the Association for Computational Linguistics: ACL 2025
Generating knowledge-intensive and comprehensive long texts, such as encyclopedia articles, remains a significant challenge for large language models. It requires not only the precise integration of facts but also the maintenance of thematic coherence throughout the article. Existing methods, such as multi-agent discussion, often struggle with issues like hallucinations, topic incoherence, and significant latency. To address these challenges, we propose RAPID, an efficient **R**etrieval-**A**ugmented long text generation framework with writing **P**lanning and **I**nformation **D**iscovery. RAPID consists of three main modules: (1) Retrieval-augmented preliminary outline generation to reduce hallucinations, (2) Attribute-constrained search for efficient information discovery, and (3) Plan-guided article generation for enhanced coherence. Extensive experiments on our newly compiled benchmark dataset, FreshWiki-2024, demonstrate that RAPID significantly outperforms state-of-the-art methods across a wide range of evaluation metrics (long-text generation, outline quality, latency, etc.). Our work provides a robust and efficient solution to the challenges of automated long-text generation.
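A skeleton of the three-module flow, with placeholder stubs in place of the LLM and search calls; the interfaces and names below are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Section:
    heading: str
    attributes: list  # e.g. ["history", "reception"] used to constrain search queries

def generate_outline(topic: str, retrieved_snippets: list) -> list:
    """Module 1: retrieval-augmented preliminary outline (stubbed)."""
    return [Section("Background", ["history"]), Section("Reception", ["reviews"])]

def attribute_constrained_search(topic: str, section: Section) -> list:
    """Module 2: one targeted query per attribute instead of open-ended discussion (stubbed)."""
    return [f"[snippet about {topic} / {attr}]" for attr in section.attributes]

def write_section(topic: str, section: Section, evidence: list) -> str:
    """Module 3: plan-guided generation conditioned on the outline and evidence (stubbed)."""
    return f"## {section.heading}\n" + " ".join(evidence)

topic = "Example topic"
outline = generate_outline(topic, retrieved_snippets=[])
article = "\n\n".join(
    write_section(topic, s, attribute_constrained_search(topic, s)) for s in outline
)
print(article)
```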
pdf
bib
abs
GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning
Guizhen Chen
|
Weiwen Xu
|
Hao Zhang
|
Hou Pong Chan
|
Deli Zhao
|
Anh Tuan Luu
|
Yu Rong
Findings of the Association for Computational Linguistics: EMNLP 2025
Recent advancements in reinforcement learning (RL) have enhanced the reasoning abilities of large language models (LLMs), yet the impact on multimodal LLMs (MLLMs) is limited. Particularly in vision-intensive tasks like geometric reasoning, MLLMs hallucinate frequently, leading to inaccurate reasoning. We attribute this to the perceptual bottleneck in MLLMs, which caps the benefits of reasoning training. To quantify this, we design a Geo-Perception Question-Answering (GeoPQA) benchmark, targeting basic geometric concepts and spatial relationships. Experiments on GeoPQA reveal significant shortcomings of MLLMs in visual perception, constraining RL reward signals for training. To address this bottleneck, we propose a two-stage RL training framework by first enhancing the visual perception of geometric structures, then fostering reasoning capabilities. Applied to Qwen2.5-VL-3B-Instruct, our two-stage training improves geometric reasoning by 9.7% and problem-solving by 9.1%, compared to the direct reasoning training approach. Our method also generalizes to other vision-intensive domains like figure understanding, highlighting the importance of perceptual grounding in effective MLLM reasoning.
pdf
bib
abs
Reverse Modeling in Large Language Models
Sicheng Yu
|
Xu Yuanchen
|
Cunxiao Du
|
Yanying Zhou
|
Minghui Qiu
|
Qianru Sun
|
Hao Zhang
|
Jiawei Wu
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
Humans are accustomed to reading and writing in a forward manner, and this natural bias extends to text understanding in auto-regressive large language models (LLMs). This paper investigates whether LLMs, like humans, struggle with reverse modeling, specifically with reversed text inputs. We found that publicly available pre-trained LLMs cannot understand such inputs. However, LLMs trained from scratch with both forward and reverse texts can understand them equally well during inference across multiple languages. Our case study shows that texts with different content incur different losses depending on the input direction: some obtain lower losses in the forward direction, while others do in reverse. This leads us to a simple and effective data selection method based on the loss differences between the forward and reverse directions. Using the selected data in continued pretraining boosts LLMs’ performance by a large margin across different language understanding benchmarks.
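A hedged sketch of the loss-difference selection idea using a Hugging Face causal LM. Note that the paper trains models on both text directions; the forward-only GPT-2 here is only a convenient stand-in, and the character-level reversal and selection rule are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in model, not the one used in the paper
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

@torch.no_grad()
def lm_loss(text: str) -> float:
    """Average token-level cross-entropy of the text under the causal LM."""
    ids = tok(text, return_tensors="pt").input_ids
    return model(ids, labels=ids).loss.item()

def direction_gap(text: str) -> float:
    """Positive when the (character-reversed) text is easier to model than the forward one."""
    return lm_loss(text) - lm_loss(text[::-1])

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "Colorless green ideas sleep furiously.",
]
# Toy selection rule: keep the half of the corpus with the largest forward-vs-reverse gap.
selected = sorted(corpus, key=direction_gap, reverse=True)[: len(corpus) // 2]
print(selected)
```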
2024
pdf
bib
abs
Parameter-Efficient Conversational Recommender System as a Language Processing Task
Mathieu Ravaut
|
Hao Zhang
|
Lu Xu
|
Aixin Sun
|
Yong Liu
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Conversational recommender systems (CRS) aim to recommend relevant items to users by eliciting user preference through natural language conversation. Prior work often utilizes external knowledge graphs for items’ semantic information, a language model for dialogue generation, and a recommendation module for ranking relevant items. This combination of multiple components suffers from a cumbersome training process, and leads to semantic misalignment issues between dialogue generation and item recommendation. In this paper, we represent items in natural language and formulate CRS as a natural language processing task. Accordingly, we leverage the power of pre-trained language models to encode items, understand user intent via conversation, perform item recommendation through semantic matching, and generate dialogues. As a unified model, our PECRS (Parameter-Efficient CRS), can be optimized in a single stage, without relying on non-textual metadata such as a knowledge graph. Experiments on two benchmark CRS datasets, ReDial and INSPIRED, demonstrate the effectiveness of PECRS on recommendation and conversation. Our code is available at: https://github.com/Ravoxsg/efficient_unified_crs.
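A much-simplified sketch of the "items as natural language plus semantic matching" idea; PECRS realizes this with a single parameter-efficiently tuned pretrained LM, whereas the hashing bag-of-words encoder below is only a dependency-free stand-in.

```python
import math
from collections import Counter

def embed(text: str, dim: int = 64):
    """Toy text encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for tok, cnt in Counter(text.lower().split()).items():
        vec[hash(tok) % dim] += cnt
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

items = {
    "The Matrix": "a science fiction action film about simulated reality",
    "Notting Hill": "a romantic comedy set in London",
}
dialogue = "User: I want something funny and romantic tonight."
query_vec = embed(dialogue)
ranked = sorted(items, key=lambda t: cosine(query_vec, embed(items[t])), reverse=True)
print(ranked[0])  # likely "Notting Hill": semantic matching between dialogue and item text
```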
pdf
bib
abs
DVD: Dynamic Contrastive Decoding for Knowledge Amplification in Multi-Document Question Answering
Jing Jin
|
Houfeng Wang
|
Hao Zhang
|
Xiaoguang Li
|
Zhijiang Guo
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) are widely used in question-answering (QA) systems but often generate information with hallucinations. Retrieval-augmented generation (RAG) offers a potential remedy, yet the uneven retrieval quality and irrelevant contents may distract LLMs. In this work, we address these issues at the generation phase by treating RAG as a multi-document QA task. We propose a novel decoding strategy, Dynamic Contrastive Decoding, which dynamically amplifies knowledge from selected documents during the generation phase. It involves constructing inputs batchwise, designing new selection criteria to identify documents worth amplifying, and applying contrastive decoding with a specialized weight calculation to adjust the final logits used for sampling answer tokens. Zero-shot experimental results on the ALCE-ASQA, NQ, TQA, and PopQA benchmarks show that our method outperforms other decoding strategies. Additionally, we conduct experiments to validate the effectiveness of our selection criteria, weight calculation, and general multi-document scenarios. Our method requires no training and can be integrated with other methods to improve RAG performance. Our codes will be publicly available at https://github.com/JulieJin-km/Dynamic_Contrastive_Decoding.
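A toy numpy sketch of the contrastive logit adjustment at the heart of this kind of method; the fixed weight here stands in for DVD's dynamic, specialized weight calculation.

```python
import numpy as np

def contrastive_logits(logits_with_doc, logits_without_doc, weight=1.0):
    """final = (1 + w) * with_doc - w * without_doc  (one common contrastive-decoding form)."""
    return (1 + weight) * logits_with_doc - weight * logits_without_doc

vocab = ["Paris", "Lyon", "London"]
with_doc = np.log(np.array([0.70, 0.20, 0.10]))     # next-token distribution given the selected document
without_doc = np.log(np.array([0.40, 0.25, 0.35]))  # weaker distribution without that document

probs = np.exp(contrastive_logits(with_doc, without_doc))
probs /= probs.sum()
print(dict(zip(vocab, probs.round(3))))  # probability mass shifts further toward "Paris"
```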
pdf
bib
abs
MC-indexing: Effective Long Document Retrieval via Multi-view Content-aware Indexing
Kuicai Dong
|
Derrick Goh Xin Deik
|
Yi Quan Lee
|
Hao Zhang
|
Xiangyang Li
|
Cong Zhang
|
Yong Liu
Findings of the Association for Computational Linguistics: EMNLP 2024
Long document question answering (DocQA) aims to answer questions from long documents of over 10k words. Such documents usually contain content structures such as sections, sub-sections, and paragraph demarcations. However, the indexing methods of long documents remain under-explored, and existing systems generally employ fixed-length chunking. As they do not consider content structures, the resultant chunks can exclude vital information or include irrelevant content. Motivated by this, we propose the **M**ulti-view **C**ontent-aware indexing (**MC-indexing**) for more effective long DocQA via (i) segmenting a structured document into content chunks, and (ii) representing each content chunk in raw-text, keywords, and summary views. We highlight that MC-indexing requires neither training nor fine-tuning. Having plug-and-play capability, it can be seamlessly integrated with any retriever to boost its performance. Besides, we propose a long DocQA dataset that includes not only question-answer pairs, but also document structure and answer scope. When compared to state-of-the-art chunking schemes, MC-indexing significantly increases recall by **42.8%**, **30.0%**, **23.9%**, and **16.3%** for top k = 1.5, 3, 5, and 10, respectively. These improvements are averaged over 8 widely used retrievers (2 sparse and 6 dense) in extensive experiments.
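A toy sketch of multi-view indexing, where each content-aware chunk is stored under raw-text, keyword, and summary views and scored by its best-matching view; the lexical scorer and chunk data are illustrative, with real retrievers plugged into this role in practice.

```python
def lexical_score(query: str, text: str) -> int:
    """Toy scorer: shared-token count between query and a view's text."""
    return len(set(query.lower().split()) & set(text.lower().split()))

chunks = [
    {"section": "2. Method",
     "views": {"raw": "We segment documents along section boundaries ...",
               "keywords": "segmentation section boundaries content chunks",
               "summary": "How documents are split into content-aware chunks"}},
    {"section": "4. Results",
     "views": {"raw": "Recall improves over fixed-length chunking ...",
               "keywords": "recall improvement retrievers baseline",
               "summary": "Retrieval gains over fixed-length chunking baselines"}},
]

def retrieve(query: str, top_k: int = 1):
    # Each chunk is scored by the best of its three views.
    scored = [(max(lexical_score(query, v) for v in c["views"].values()), c["section"])
              for c in chunks]
    return sorted(scored, reverse=True)[:top_k]

print(retrieve("how are documents split into chunks"))  # -> [(5, '2. Method')]
```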
pdf
bib
LONG2RAG: Evaluating Long-Context & Long-Form Retrieval-Augmented Generation with Key Point Recall
Zehan Qi
|
Rongwu Xu
|
Zhijiang Guo
|
Cunxiang Wang
|
Hao Zhang
|
Wei Xu
Findings of the Association for Computational Linguistics: EMNLP 2024
2023
pdf
bib
abs
MS-DETR: Natural Language Video Localization with Sampling Moment-Moment Interaction
Wang Jing
|
Aixin Sun
|
Hao Zhang
|
Xiaoli Li
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Given a text query, the task of Natural Language Video Localization (NLVL) is to localize a temporal moment in an untrimmed video that semantically matches the query. In this paper, we adopt a proposal-based solution that generates proposals (i.e., candidate moments) and then selects the best-matching proposal. On top of modeling the cross-modal interaction between candidate moments and the query, our proposed Moment Sampling DETR (MS-DETR) enables efficient moment-moment relation modeling. The core idea is to sample a subset of moments guided by the learnable templates with an adopted DETR framework. To achieve this, we design a multi-scale visual-linguistic encoder, and an anchor-guided moment decoder paired with a set of learnable templates. Experimental results on three public datasets demonstrate the superior performance of MS-DETR.
2022
pdf
bib
abs
Translate-Train Embracing Translationese Artifacts
Sicheng Yu
|
Qianru Sun
|
Hao Zhang
|
Jing Jiang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Translate-train is a general training approach to multilingual tasks. The key idea is to use the translator of the target language to generate training data to mitigate the gap between the source and target languages. However, its performance is often hampered by the artifacts in the translated texts (translationese). We discover that such artifacts have common patterns in different languages and can be modeled by deep learning, and subsequently propose an approach to conduct translate-train using Translationese Embracing the effect of Artifacts (TEA). TEA learns to mitigate this effect on the training data of a source language (whose original and translationese are both available), and applies the learned module to facilitate the inference on the target language. Extensive experiments on the multilingual QA dataset TyDiQA demonstrate that TEA outperforms strong baselines.