Xuan Gao
2025
Improving Continual Pre-training Through Seamless Data Packing
Ruicheng Yin | Xuan Gao | Changze Lv | Xiaohua Wang | Xiaoqing Zheng | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2025
Continual pre-training has demonstrated significant potential in enhancing model performance, particularly in domain-specific scenarios. The most common approach for packing data before continual pre-training concatenates input texts and splits them into fixed-length sequences. While straightforward and efficient, this method often leads to excessive truncation and context discontinuity, which can hinder model performance. To address these issues, we explore the potential of data engineering to enhance continual pre-training, particularly its impact on model performance and efficiency. We propose Seamless Packing (SP), a novel data packing strategy aimed at preserving contextual information and enhancing model performance. In the first stage, our approach employs a sliding window technique that synchronizes overlapping tokens across consecutive sequences, ensuring continuity and contextual coherence. In the second stage, we adopt a First-Fit-Decreasing algorithm to pack shorter texts into bins slightly larger than the target sequence length, thereby minimizing padding and truncation. Empirical evaluations across various model architectures and corpus domains demonstrate the effectiveness of our method, which outperforms baselines in 99% of settings. Code is available at https://github.com/Infernus-WIND/Seamless-Packing.
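As a rough illustration of the two-stage pipeline the abstract describes, the sketch below pairs a sliding window that shares overlapping tokens between consecutive sequences with First-Fit-Decreasing bin packing for shorter texts. It is a minimal sketch based only on the abstract; the function names and parameters (pack_with_overlap, ffd_pack, overlap, bin_capacity) are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of the two-stage Seamless Packing idea, based only on the
# abstract. Names and parameters below (pack_with_overlap, ffd_pack, overlap,
# bin_capacity) are illustrative assumptions, not the authors' released API.

from typing import List


def pack_with_overlap(tokens: List[int], seq_len: int, overlap: int) -> List[List[int]]:
    """Stage 1: slide a window over a long token stream so that consecutive
    sequences share `overlap` tokens, keeping context continuous across
    sequence boundaries instead of cutting it at arbitrary points."""
    stride = seq_len - overlap
    sequences = []
    for start in range(0, max(len(tokens) - overlap, 1), stride):
        window = tokens[start:start + seq_len]
        if len(window) == seq_len:
            sequences.append(window)
    # A trailing partial window would be handed to stage 2 rather than padded.
    return sequences


def ffd_pack(docs: List[List[int]], bin_capacity: int) -> List[List[int]]:
    """Stage 2: First-Fit-Decreasing. Sort short documents longest-first and
    place each into the first bin with enough room; bins slightly larger than
    the target sequence length keep padding and truncation to a minimum."""
    bins: List[List[int]] = []
    for doc in sorted(docs, key=len, reverse=True):
        for b in bins:
            if len(b) + len(doc) <= bin_capacity:
                b.extend(doc)
                break
        else:  # no existing bin fits: open a new one
            bins.append(list(doc))
    return bins
```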
2024
Searching for Best Practices in Retrieval-Augmented Generation
Xiaohua Wang | Zhenghua Wang | Xuan Gao | Feiran Zhang | Yixin Wu | Zhibo Xu | Tianyuan Shi | Zhengyuan Wang | Shizheng Li | Qi Qian | Ruicheng Yin | Changze Lv | Xiaoqing Zheng | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Retrieval-augmented generation (RAG) techniques have proven effective in integrating up-to-date information, mitigating hallucinations, and enhancing response quality, particularly in specialized domains. While many RAG approaches have been proposed to enhance large language models through query-dependent retrieval, they are often hampered by complex implementations and prolonged response times. A typical RAG workflow involves multiple processing steps, each of which can be executed in various ways. Here, we investigate existing RAG approaches and their potential combinations to identify optimal RAG practices. Through extensive experiments, we suggest several strategies for deploying RAG that balance performance and efficiency. Moreover, we demonstrate that multimodal retrieval techniques can significantly enhance question-answering capabilities on visual inputs and accelerate the generation of multimodal content through a “retrieval as generation” strategy.
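To make the abstract's notion of a multi-step, modular RAG workflow concrete, here is a minimal sketch in which each processing step (retrieval, reranking, generation) is a swappable component, so different choices per step can be compared. All names here (rag_answer, retrieve, rerank, generate) are hypothetical placeholders, not the paper's code.

```python
# A minimal sketch of a modular RAG workflow in the spirit of the abstract:
# each processing step is a swappable component so that alternative choices
# per step can be compared. All names are hypothetical placeholders.

from typing import Callable, List


def rag_answer(
    query: str,
    retrieve: Callable[[str, int], List[str]],      # query-dependent retrieval
    rerank: Callable[[str, List[str]], List[str]],  # reorder by relevance
    generate: Callable[[str], str],                 # LLM completion
    top_k: int = 5,
    keep: int = 3,
) -> str:
    """Run one query through a retrieve -> rerank -> generate pipeline."""
    candidates = retrieve(query, top_k)
    context = "\n".join(rerank(query, candidates)[:keep])
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)
```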
Co-authors
- Xuan-Jing Huang (黄萱菁) 2
- Changze Lv 2
- Xiaohua Wang 2
- Ruicheng Yin 2
- Xiaoqing Zheng 2