Haishuai Wang
2025
EXPLAIN: Enhancing Retrieval-Augmented Generation with Entity Summary
Yaozhen Liang | Xiao Liu | Jiajun Yu | Zhouhua Fang | Qunsheng Zou | Linghan Zheng | Yong Li | Zhiwei Liu | Haishuai Wang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Document question answering plays a crucial role in enhancing employee productivity by providing quick and accurate access to information. Two primary approaches have been developed: retrieval-augmented generation (RAG), which reduces input tokens and inference cost, and long-context question answering (LC), which processes entire documents for higher accuracy. We introduce EXPLAIN (EXtracting, Pre-summarizing, Linking and enhAncINg RAG), a novel retrieval-augmented generation method that automatically extracts useful entities from documents and generates summaries for them. EXPLAIN improves accuracy by retrieving more informative entity summaries, achieving precision comparable to LC while maintaining low token consumption. Experimental results on an internal dataset (ROUGE-L from 30.14% to 30.31%) and three public datasets (HotpotQA, 2WikiMQA, and QuALITY; average score from 62% to 64%) demonstrate the efficacy of EXPLAIN. Human evaluation in Ant Group's production deployment indicates that EXPLAIN surpasses baseline RAG in comprehensiveness.
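The abstract outlines the pipeline at a high level; the sketch below illustrates one plausible reading of it, with entities extracted and pre-summarized offline and their summaries linked into the retrieved context at query time. All function names, the capitalization-based entity extractor, and the keyword retrieval are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of an EXPLAIN-style flow: extract entities offline,
# pre-summarize each one, then link query terms to entity summaries and
# prepend them to the ordinarily retrieved chunks. Everything here is a
# toy stand-in for the paper's actual extractor, summarizer, and retriever.

def extract_entities(doc: str) -> set[str]:
    """Toy entity extractor: treat capitalized tokens as entities (assumption)."""
    return {tok.strip(".,?") for tok in doc.split() if tok[0].isupper()}

def summarize_entity(entity: str, docs: list[str]) -> str:
    """Offline pre-summarization: here, just collect sentences mentioning the entity."""
    sents = [s.strip() for d in docs for s in d.split(".") if entity in s]
    return f"{entity}: " + " ".join(sents[:2])

def build_entity_index(docs: list[str]) -> dict[str, str]:
    index = {}
    for doc in docs:
        for ent in extract_entities(doc):
            index.setdefault(ent, summarize_entity(ent, docs))
    return index

def answer_context(query: str, docs: list[str], index: dict[str, str]) -> str:
    """Link query terms to entity summaries and combine with plain retrieval."""
    linked = [summary for ent, summary in index.items() if ent in query]
    retrieved = [d for d in docs if any(w in d for w in query.split())][:2]
    return "\n".join(linked + retrieved)  # context fed to the generator LLM

docs = ["Ada Lovelace wrote the first program.",
        "Babbage designed the Analytical Engine."]
index = build_entity_index(docs)
print(answer_context("Who was Ada Lovelace?", docs, index))
```

The design point the abstract emphasizes is that entity summaries are computed once, offline, so query-time token consumption stays close to plain RAG while the context becomes more informative.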
Long-form Hallucination Detection with Self-elicitation
Zihang Liu | Jiawei Guo | Hao Zhang | Hongyang Chen | Jiajun Bu | Haishuai Wang
Findings of the Association for Computational Linguistics: ACL 2025
While Large Language Models (LLMs) have exhibited impressive performance in generating long-form content, they frequently risk producing factual inaccuracies, or hallucinations. An effective strategy to mitigate this risk is to leverage off-the-shelf LLMs to detect hallucinations after generation. The primary challenge lies in comprehensively eliciting the intrinsic knowledge acquired during their pre-training phase. However, existing methods that employ multi-step reasoning chains predominantly fall short of addressing this issue. Moreover, because existing hallucination-detection methods tend to decompose text into isolated statements, they are unable to capture the contextual semantic relations in long-form content. In this paper, we study a novel concept, self-elicitation, which leverages self-generated thoughts derived from prior statements as catalysts to elicit the expression of intrinsic knowledge and to understand contextual semantics. We present a framework, SelfElicit, that integrates self-elicitation with graph structures to effectively organize the elicited knowledge and facilitate factual evaluations. Extensive experiments on five datasets across various domains demonstrate the effectiveness of self-elicitation and the superiority of our proposed method.
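To make the self-elicitation idea concrete, here is a minimal sketch of one way to realize it: thoughts generated from prior statements are attached to later statements in a graph, so factual evaluation sees contextual relations rather than isolated claims. The `llm` stub, the prompts, and the graph layout are illustrative assumptions, not the SelfElicit implementation.

```python
# A toy self-elicitation graph: each statement node receives a "thought"
# node derived from the preceding statement, and evaluation reads a
# statement together with its graph neighborhood as context.
import networkx as nx

def llm(prompt: str) -> str:
    """Stub for an off-the-shelf LLM call (assumption)."""
    return f"[model output for: {prompt[:40]}...]"

def self_elicit(statements: list[str]) -> nx.DiGraph:
    """Build a graph whose statement nodes receive self-elicited thought nodes."""
    g = nx.DiGraph()
    for i, stmt in enumerate(statements):
        g.add_node(i, text=stmt, kind="statement")
        if i > 0:
            g.add_edge(i - 1, i)  # preserve discourse order as context
            # Self-elicitation: a thought derived from the prior statement
            # is attached as extra evidence for judging the current one.
            thought = llm(f"Recall facts related to: {statements[i - 1]}")
            g.add_node(f"t{i}", text=thought, kind="thought")
            g.add_edge(f"t{i}", i)
    return g

def evaluate(g: nx.DiGraph) -> dict:
    """Judge each statement with its incoming thoughts and predecessors as context."""
    verdicts = {}
    for n, data in g.nodes(data=True):
        if data["kind"] == "statement":
            ctx = " ".join(g.nodes[p]["text"] for p in g.predecessors(n))
            verdicts[n] = llm(f"Context: {ctx}\nIs this factual? {data['text']}")
    return verdicts

answer = ["Paris is the capital of France.", "It hosted the 2024 Olympics."]
print(evaluate(self_elicit(answer)))
```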
2024
Matching Varying-Length Texts via Topic-Informed and Decoupled Sentence Embeddings
Xixi Zhou | Chunbin Gu | Xin Jie | Jiajun Bu | Haishuai Wang
Findings of the Association for Computational Linguistics: NAACL 2024
Measuring semantic similarity between texts is a crucial task in natural language processing. While existing work on semantic text matching focuses on pairs of similar-length sequences, matching texts of non-comparable lengths has broader applications in specific domains, such as comparing professional document summaries with their full content. Current approaches struggle with text pairs of non-comparable lengths due to truncation issues. To address this, we split texts into natural sentences and decouple sentence representations using supervised contrastive learning (SCL). Meanwhile, we adopt the embedded topic model (ETM) for domain-specific data. Our experiments on three well-studied datasets demonstrate the effectiveness of our model, based on decoupled and topic-informed sentence embeddings, in matching texts of significantly different lengths.
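The core move the abstract describes, sentence-level matching instead of truncation, can be sketched as follows. The hashing "embedder" stands in for the paper's decoupled, topic-informed encoder (SCL + ETM), which is not reproduced here; the aggregation rule is likewise an illustrative assumption.

```python
# A minimal sketch of matching varying-length texts: split both texts into
# sentences, embed each sentence independently, and aggregate pairwise
# similarities (best match per short-text sentence, averaged), so the long
# text is never truncated.
import numpy as np

def embed(sentence: str, dim: int = 64) -> np.ndarray:
    """Toy embedding seeded from the sentence hash (assumption); a real
    system would use the decoupled SCL/ETM encoder. Deterministic within a run."""
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def match_score(short_text: str, long_text: str) -> float:
    """Mean over short-text sentences of their best match in the long text."""
    shorts = [embed(s) for s in short_text.split(".") if s.strip()]
    longs = [embed(s) for s in long_text.split(".") if s.strip()]
    sims = np.array([[s @ l for l in longs] for s in shorts])
    return float(sims.max(axis=1).mean())

summary = "The model splits texts into sentences. It avoids truncation."
document = ("The model splits texts into sentences. Long inputs are handled. "
            "It avoids truncation. More detail follows.")
print(f"match score: {match_score(summary, document):.3f}")
```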