Qi Chen
2025
Velocitune: A Velocity-based Dynamic Domain Reweighting Method for Continual Pre-training
Zheheng Luo | Xin Zhang | Xiao Liu | Haoling Li | Yeyun Gong | Qi Chen | Peng Cheng
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
It is well known that a diverse training corpus, typically constructed as a mixture of various domains, is critical for large language models. In general, previous efforts resort to either sampling training data from different domains with static proportions or dynamically adjusting these proportions during training to optimise pretraining performance. However, few methods have addressed the complexities of domain-adaptive continual pre-training. To fill this gap, we propose Velocitune, a novel framework that dynamically assesses learning velocity and adjusts data proportions accordingly, favouring slower-learning domains while de-emphasising faster-learning ones. The adjustment is guided by a scaling law that estimates the desired learning goal for each domain at a lower associated cost. To evaluate the effectiveness of Velocitune, we conduct experiments on a dataset focused on reasoning tasks with CodeLlama, as well as on a corpus of system commands using Llama3 and Mistral. Velocitune achieves performance gains on both math and code reasoning tasks and on command-line generation benchmarks. Further analysis reveals that key factors driving Velocitune's effectiveness include target estimation and data ordering.
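The core loop the abstract describes, measuring how fast each domain's loss approaches a scaling-law target and shifting sampling mass toward the slow domains, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the progress formula, the softmax reweighting, and the temperature parameter are all introduced here for exposition.

```python
import numpy as np

def update_domain_weights(current_losses, initial_losses, target_losses, temperature=1.0):
    """Velocity-based domain reweighting (illustrative sketch).

    Each domain has a target loss estimated from a scaling law (assumed
    given).  Velocity is measured as the fraction of the initial-to-target
    gap already covered; domains still far from their target (slow
    learners) receive larger sampling proportions.
    """
    current = np.asarray(current_losses, dtype=float)
    initial = np.asarray(initial_losses, dtype=float)
    target = np.asarray(target_losses, dtype=float)

    # Progress toward each domain's target, clipped to [0, 1].
    progress = np.clip((initial - current) / np.maximum(initial - target, 1e-8), 0.0, 1.0)

    # Softmax over the remaining gap: slower domains get more weight;
    # temperature controls how aggressive the reweighting is.
    logits = (1.0 - progress) / temperature
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()

# Example: the third domain has made the least progress, so it is upweighted.
print(update_domain_weights(
    current_losses=[2.1, 1.4, 3.0],
    initial_losses=[3.0, 2.0, 3.2],
    target_losses=[1.5, 1.2, 2.0],
))
```

In a training loop, these weights would be recomputed at fixed intervals from per-domain validation losses and fed back into the data sampler.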
Alleviating Performance Degradation Caused by Out-of-Distribution Issues in Embedding-Based Retrieval
Haotong Bao | Jianjin Zhang | Qi Chen | Weihao Han | Zhengxin Zeng | Ruiheng Chang | Mingzheng Li | Hao Sun | Weiwei Deng | Feng Sun | Qi Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025
In Embedding-Based Retrieval (EBR), Approximate Nearest Neighbor (ANN) algorithms are widely adopted for efficient large-scale search. However, recent studies reveal a query out-of-distribution (OOD) issue, where query and base embeddings follow mismatched distributions, significantly degrading ANN performance. In this work, we empirically verify the generality of this phenomenon and provide a quantitative analysis. To mitigate the distributional gap, we introduce a distribution regularizer into the encoder training objective, encouraging alignment between query and base embeddings. Extensive experiments across multiple datasets, encoders, and ANN indices show that our method consistently improves retrieval performance.
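A minimal sketch of the idea in the abstract: add a penalty to the encoder's training objective that pulls the query and base (document) embedding distributions together. The moment-matching regularizer, the InfoNCE base loss, and the reg_weight and temperature values below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def distribution_regularizer(query_emb, base_emb):
    """Moment-matching penalty that encourages query and base embeddings
    to follow the same distribution: align batch means and per-dimension
    variances.  This particular regularizer is an illustrative choice."""
    mean_gap = (query_emb.mean(dim=0) - base_emb.mean(dim=0)).pow(2).sum()
    var_gap = (query_emb.var(dim=0) - base_emb.var(dim=0)).pow(2).sum()
    return mean_gap + var_gap

def training_loss(query_emb, doc_emb, temperature=0.05, reg_weight=0.1):
    """In-batch InfoNCE retrieval loss plus the distribution regularizer;
    reg_weight trades retrieval accuracy against query/base alignment."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.t() / temperature                   # in-batch negatives
    labels = torch.arange(q.size(0), device=q.device)  # positives on the diagonal
    contrastive = F.cross_entropy(logits, labels)
    return contrastive + reg_weight * distribution_regularizer(q, d)
```

In practice the regularizer weight would be tuned so that distributional alignment does not wash out the contrastive retrieval signal.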