2024
pdf
abs
MIMIR: A Customizable Agent Tuning Platform for Enhanced Scientific Applications
Xiangru Tang
|
Chunyuan Deng
|
Hanminwang Hanminwang
|
Haoran Wang
|
Yilun Zhao
|
Wenqi Shi
|
Yi Fung
|
Wangchunshu Zhou
|
Jiannan Cao
|
Heng Ji
|
Arman Cohan
|
Mark Gerstein
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Recently, large language models (LLMs) have evolved into interactive agents, proficient in planning, tool use, and task execution across various tasks. However, without agent-tuning, open-source models like LLaMA2 currently struggle to match the efficiency of larger models such as GPT-4 in scientific applications due to a lack of agent tuning datasets. In response, we introduce MIMIR, a streamlined platform that leverages large LLMs to generate agent-tuning data for fine-tuning smaller, specialized models. By employing a role-playing methodology, MIMIR enables larger models to simulate various roles and create interaction data, which can then be used to fine-tune open-source models like LLaMA2. This approach ensures that even smaller models can effectively serve as agents in scientific tasks. Integrating these features into an end-to-end platform, MIMIR facilitates everything from the uploading of scientific data to one-click agent fine-tuning. MIMIR is publicly released and actively maintained at https://github. com/gersteinlab/MIMIR, along with a demo video for quick-start, calling for broader development.
pdf
abs
RA-ISF: Learning to Answer and Understand from Retrieval Augmentation via Iterative Self-Feedback
Yanming Liu
|
Xinyue Peng
|
Xuhong Zhang
|
Weihao Liu
|
Jianwei Yin
|
Jiannan Cao
|
Tianyu Du
Findings of the Association for Computational Linguistics: ACL 2024
Large language models (LLMs) demonstrate exceptional performance in numerous tasks but still heavily rely on knowledge stored in their parameters. Moreover, updating this knowledge incurs high training costs. Retrieval-augmented generation (RAG) methods address this issue by integrating external knowledge. The model can answer questions it couldn’t previously by retrieving knowledge relevant to the query. This approach improves performance in certain scenarios for specific tasks. However, if irrelevant texts are retrieved, it may impair model performance. In this paper, we propose Retrieval Augmented Iterative Self-Feedback (RA-ISF), a framework that iteratively decomposes tasks and processes them in three submodules to enhance the model’s problem-solving capabilities. Experiments show that our method outperforms existing benchmarks, performing well on models like GPT3.5, Llama2, significantly enhancing factual reasoning capabilities and reducing hallucinations.
pdf
abs
Unveiling the Spectrum of Data Contamination in Language Model: A Survey from Detection to Remediation
Chunyuan Deng
|
Yilun Zhao
|
Yuzhao Heng
|
Yitong Li
|
Jiannan Cao
|
Xiangru Tang
|
Arman Cohan
Findings of the Association for Computational Linguistics: ACL 2024
Data contamination has garnered increased attention in the era of Large language models (LLMs) due to the reliance on extensive internet-derived training corpora. The issue of training corpus overlap with evaluation benchmarks—referred to as contamination—has been the focus of significant recent research. This body of work aims to identify contamination, understand its impacts, and explore mitigation strategies from diverse perspectives. However, comprehensive studies that provide a clear pathway from foundational concepts to advanced insights are lacking in this nascent field. Therefore, we present the first survey in the field of data contamination. We begin by examining the effects of data contamination across various stages and forms. We then provide a detailed analysis of current contamination detection methods, categorizing them to highlight their focus, assumptions, strengths, and limitations. We also discuss mitigation strategies, offering a clear guide for future research. This survey serves as a succinct overview of the most recent advancements in data contamination research, providing a straightforward guide for the benefit of future research endeavors.