2025
pdf
bib
abs
Towards Effective and Efficient Continual Pre-training of Large Language Models
Jie Chen
|
Zhipeng Chen
|
Jiapeng Wang
|
Kun Zhou
|
Yutao Zhu
|
Jinhao Jiang
|
Yingqian Min
|
Xin Zhao
|
Zhicheng Dou
|
Jiaxin Mao
|
Yankai Lin
|
Ruihua Song
|
Jun Xu
|
Xu Chen
|
Rui Yan
|
Zhewei Wei
|
Di Hu
|
Wenbing Huang
|
Ji-Rong Wen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Continual pre-training (CPT) has been an important approach for adapting language models to specific domains or tasks. In this paper, we comprehensively study its key designs to balance the new abilities while retaining the original abilities, and present an effective CPT method that can greatly improve the Chinese language ability and scientific reasoning ability of LLMs. To achieve it, we design specific data mixture and curriculum strategies based on existing datasets and synthetic high-quality data. Concretely, we synthesize multidisciplinary scientific QA pairs based on related web pages to guarantee the data quality, and also devise the performance tracking and data mixture adjustment strategy to ensure the training stability. For the detailed designs, we conduct preliminary studies on a relatively small model, and summarize the findings to help optimize our CPT method. Extensive experiments on a number of evaluation benchmarks show that our approach can largely improve the performance of Llama-3 (8B), including both the general abilities (+8.81 on C-Eval and +6.31 on CMMLU) and the scientific reasoning abilities (+12.00 on MATH and +4.13 on SciEval). Our model, data, and codes are available at https://github.com/RUC-GSAI/Llama-3-SynE.
pdf
bib
abs
StableToolBench-MirrorAPI: Modeling Tool Environments as Mirrors of 7,000+ Real-World APIs
Zhicheng Guo
|
Sijie Cheng
|
Yuchen Niu
|
Hao Wang
|
Sicheng Zhou
|
Wenbing Huang
|
Yang Liu
Findings of the Association for Computational Linguistics: ACL 2025
The rapid advancement of large language models (LLMs) has spurred significant interest in tool learning, where LLMs are augmented with external tools to tackle complex tasks. However, existing tool environments face challenges in balancing stability, scale, and realism, particularly for benchmarking purposes. To address this, we propose MirrorAPI, a novel framework that trains specialized LLMs to accurately simulate real API responses, effectively acting as “mirrors” to tool environments. Using a comprehensive dataset of request-response pairs from 7,000+ APIs, we employ supervised fine-tuning and chain-of-thought reasoning to enhance simulation fidelity. MirrorAPI achieves superior accuracy and stability compared to state-of-the-art methods, as demonstrated by its performance on the newly constructed MirrorAPI-Bench and its integration into StableToolBench.
2021
pdf
bib
abs
Knowledge Representation Learning with Contrastive Completion Coding
Bo Ouyang
|
Wenbing Huang
|
Runfa Chen
|
Zhixing Tan
|
Yang Liu
|
Maosong Sun
|
Jihong Zhu
Findings of the Association for Computational Linguistics: EMNLP 2021
Knowledge representation learning (KRL) has been used in plenty of knowledge-driven tasks. Despite fruitfully progress, existing methods still suffer from the immaturity on tackling potentially-imperfect knowledge graphs and highly-imbalanced positive-negative instances during training, both of which would hinder the performance of KRL. In this paper, we propose Contrastive Completion Coding (C3), a novel KRL framework that is composed of two functional components: 1. Hierarchical Architecture, which integrates both low-level standalone features and high-level topology-aware features to yield robust embedding for each entity/relation. 2. Normalized Contrasitive Training, which conducts normalized one-to-many contrasitive learning to emphasize different negatives with different weights, delivering better convergence compared to conventional training losses. Extensive experiments on several benchmarks verify the efficacy of the two proposed techniques and combing them together generally achieves superior performance against state-of-the-art approaches.