Kun Zhu


2024

pdf
An Information Bottleneck Perspective for Effective Noise Filtering on Retrieval-Augmented Generation
Kun Zhu | Xiaocheng Feng | Xiyuan Du | Yuxuan Gu | Weijiang Yu | Haotian Wang | Qianglong Chen | Zheng Chu | Jingchang Chen | Bing Qin
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Retrieval-augmented generation integrates the capabilities of large language models with relevant information retrieved from an extensive corpus, yet encounters challenges when confronted with real-world noisy data. One recent solution is to train a filter module to find relevant content but only achieve suboptimal noise compression. In this paper, we propose to introduce the information bottleneck theory into retrieval-augmented generation. Our approach involves the filtration of noise by simultaneously maximizing the mutual information between compression and ground output, while minimizing the mutual information between compression and retrieved passage. In addition, we derive the formula of information bottleneck to facilitate its application in novel comprehensive evaluations, the selection of supervised fine-tuning data, and the construction of reinforcement learning rewards. Experimental results demonstrate that our approach achieves significant improvements across various question answering datasets, not only in terms of the correctness of answer generation but also in the conciseness with 2.5% compression rate.

pdf
BeamAggR: Beam Aggregation Reasoning over Multi-source Knowledge for Multi-hop Question Answering
Zheng Chu | Jingchang Chen | Qianglong Chen | Haotian Wang | Kun Zhu | Xiyuan Du | Weijiang Yu | Ming Liu | Bing Qin
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) have demonstrated strong reasoning capabilities.Nevertheless, they still suffer from factual errors when tackling knowledge-intensive tasks.Retrieval-augmented reasoning represents a promising approach.However, significant challenges still persist, including inaccurate and insufficient retrieval for complex questions, as well as difficulty in integrating multi-source knowledge.To address this, we propose Beam Aggregation Reasoning (BeamAggR), a reasoning framework for knowledge-intensive multi-hop QA.BeamAggR explores and prioritizes promising answers at each hop of question.Concretely, we parse the complex questions into trees, which include atom and composite questions, followed by bottom-up reasoning.For atomic questions, the LLM conducts reasoning on multi-source knowledge to get answer candidates.For composite questions, the LLM combines beam candidates, explores multiple reasoning paths through probabilistic aggregation, and prioritizes the most promising trajectory.Extensive experiments on four open-domain multi-hop reasoning datasets show that our method significantly outperforms SOTA methods by 8.5%.Furthermore, our analysis reveals that BeamAggR elicits better knowledge collaboration and answer aggregation.

2023

pdf
Hierarchical Catalogue Generation for Literature Review: A Benchmark
Kun Zhu | Xiaocheng Feng | Xiachong Feng | Yingsheng Wu | Bing Qin
Findings of the Association for Computational Linguistics: EMNLP 2023

Scientific literature review generation aims to extract and organize important information from an abundant collection of reference papers and produces corresponding reviews while lacking a clear and logical hierarchy. We observe that a high-quality catalogue-guided generation process can effectively alleviate this problem. Therefore, we present an atomic and challenging task named Hierarchical Catalogue Generation for Literature Review as the first step for review generation, which aims to produce a hierarchical catalogue of a review paper given various references. We construct a novel English Hierarchical Catalogues of Literature Reviews Dataset with 7.6k literature review catalogues and 389k reference papers. To accurately assess the model performance, we design two evaluation metrics for informativeness and similarity to ground truth from semantics and structure. Our extensive analyses verify the high quality of our dataset and the effectiveness of our evaluation metrics. We further benchmark diverse experiments on state-of-the-art summarization models like BART and large language models like ChatGPT to evaluate their capabilities. We further discuss potential directions for this task to motivate future research.