2025
Velocitune: A Velocity-based Dynamic Domain Reweighting Method for Continual Pre-training
Zheheng Luo | Xin Zhang | Xiao Liu | Haoling Li | Yeyun Gong | Qi Chen | Peng Cheng
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
It is well known that a diverse corpus is critical for training large language models, which are typically trained on a mixture of various domains. In general, previous efforts either sample training data from different domains with static proportions or dynamically adjust these proportions during training to optimise pretraining performance. However, few methods have addressed the complexities of domain-adaptive continual pre-training. To fill this gap, we propose Velocitune, a novel framework that dynamically assesses learning velocity and adjusts data proportions accordingly, favouring slower-learning domains while de-emphasising faster-learning ones. Velocitune is guided by a scaling law that estimates the desired learning goal for each domain at a lower cost. To evaluate the effectiveness of Velocitune, we conduct experiments on a dataset focused on reasoning tasks with CodeLlama, as well as on a corpus of system commands using Llama3 and Mistral. Velocitune achieves performance gains in both math and code reasoning tasks and command-line generation benchmarks. Further analysis reveals that key factors driving Velocitune’s effectiveness include target estimation and data ordering.
2020
Chinese Grammatical Error Correction Based on Hybrid Models with Data Augmentation
Yi Wang | Ruibin Yuan | Yan'gen Luo | Yufang Qin | NianYong Zhu | Peng Cheng | Lihuan Wang
Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications
A better Chinese Grammatical Error Diagnosis (CGED) system for automatic Grammatical Error Correction (GEC) can benefit foreign learners of Chinese and lower the barriers to learning the language. In this paper, we describe our solution to the CGED2020 Shared Task on Grammatical Error Correction in detail. The task aims to detect and correct grammatical errors in essays written by foreign learners of Chinese. Our solution combines data augmentation, spelling-check methods, and generative grammatical correction, and achieved the best recall score in the Top 1 Correction track. Our final result ranked fourth among the participants.
2016
Deceptive Review Spam Detection via Exploiting Task Relatedness and Unlabeled Data
Zhen Hai | Peilin Zhao | Peng Cheng | Peng Yang | Xiao-Li Li | Guangxia Li
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing