Jiaxin Mao


2025

Towards Effective and Efficient Continual Pre-training of Large Language Models
Jie Chen | Zhipeng Chen | Jiapeng Wang | Kun Zhou | Yutao Zhu | Jinhao Jiang | Yingqian Min | Xin Zhao | Zhicheng Dou | Jiaxin Mao | Yankai Lin | Ruihua Song | Jun Xu | Xu Chen | Rui Yan | Zhewei Wei | Di Hu | Wenbing Huang | Ji-Rong Wen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Continual pre-training (CPT) has become an important approach for adapting language models to specific domains or tasks. In this paper, we comprehensively study its key design choices for balancing the acquisition of new abilities with the retention of original ones, and present an effective CPT method that greatly improves the Chinese language ability and scientific reasoning ability of LLMs. To this end, we design specific data-mixture and curriculum strategies based on existing datasets and synthetic high-quality data. Concretely, we synthesize multidisciplinary scientific QA pairs from related web pages to guarantee data quality, and devise a performance-tracking and data-mixture adjustment strategy to ensure training stability. To settle the detailed designs, we conduct preliminary studies on a relatively small model and summarize the findings to optimize our CPT method. Extensive experiments on a number of evaluation benchmarks show that our approach largely improves the performance of Llama-3 (8B), in both general abilities (+8.81 on C-Eval and +6.31 on CMMLU) and scientific reasoning abilities (+12.00 on MATH and +4.13 on SciEval). Our model, data, and code are available at https://github.com/RUC-GSAI/Llama-3-SynE.
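
A minimal sketch of the performance-tracking and data-mixture adjustment loop the abstract describes, assuming a round-based CPT schedule; all names, targets, and the stub train/eval functions are hypothetical, not the code in the linked repository.

```python
import random

# Hypothetical sketch: track probe scores per data domain after each CPT
# round and up-weight lagging domains, so new abilities improve without
# the original ones regressing. Domains, targets, and stubs are illustrative.
DOMAINS = ["chinese_web", "synthetic_sci_qa", "english_replay"]
TARGETS = {"chinese_web": 0.6, "synthetic_sci_qa": 0.5, "english_replay": 0.7}

def train_steps(mixture, n):
    """Stub: sample n batches according to the current mixture and train."""
    for _ in range(n):
        domain = random.choices(DOMAINS, weights=[mixture[d] for d in DOMAINS])[0]
        _ = domain  # placeholder for one optimizer step on a batch from `domain`

def evaluate_ability(domain):
    """Stub: return a held-out probe score in [0, 1] for this domain."""
    return random.uniform(0.3, 0.8)

def adjust_mixture(mixture):
    """Up-weight domains whose probe score lags its target, then renormalize."""
    scores = {d: evaluate_ability(d) for d in DOMAINS}
    raw = {d: mixture[d] * (1.0 + max(0.0, TARGETS[d] - scores[d]))
           for d in DOMAINS}
    total = sum(raw.values())
    return {d: w / total for d, w in raw.items()}

mixture = {d: 1.0 / len(DOMAINS) for d in DOMAINS}  # start from a uniform mixture
for round_idx in range(3):                          # a few CPT rounds
    train_steps(mixture, n=100)
    mixture = adjust_mixture(mixture)
    print(round_idx, mixture)
```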

2024

Prompt Refinement with Image Pivot for Text-to-Image Generation
Jingtao Zhan | Qingyao Ai | Yiqun Liu | Yingwei Pan | Ting Yao | Jiaxin Mao | Shaoping Ma | Tao Mei
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

For text-to-image generation, automatically refining user-provided natural language prompts into the keyword-enriched prompts favored by systems is essential for the user experience. Such a prompt refinement process is analogous to translating the prompt from “user languages” into “system languages”. However, the scarcity of such parallel corpora makes it difficult to train a prompt refinement model. Inspired by zero-shot machine translation techniques, we introduce Prompt Refinement with Image Pivot (PRIP). PRIP innovatively uses the latent representation of a user-preferred image as an intermediary “pivot” between the user and system languages. It decomposes the refinement process into two data-rich tasks: inferring representations of user-preferred images from user languages and subsequently translating image representations into system languages. Thus, it can leverage abundant data for training. Extensive experiments show that PRIP substantially outperforms a wide range of baselines and effectively transfers to unseen systems in a zero-shot manner.
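
A minimal sketch of PRIP's two-stage decomposition as described in the abstract: a user prompt is first mapped to the latent of a user-preferred image (the pivot), which is then decoded toward a system-language prompt. The module names, dimensions, and training pairing below are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TextToImageLatent(nn.Module):
    """Stage 1 (hypothetical): infer a user-preferred image representation
    from user language; trainable on abundant (user prompt, image) pairs."""
    def __init__(self, text_dim=512, latent_dim=768):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(text_dim, latent_dim), nn.GELU(),
                                  nn.Linear(latent_dim, latent_dim))

    def forward(self, text_emb):
        return self.proj(text_emb)

class ImageLatentToPrompt(nn.Module):
    """Stage 2 (hypothetical): translate the image latent into logits over a
    system-prompt vocabulary; trainable on (image, system prompt) pairs."""
    def __init__(self, latent_dim=768, vocab_size=32000):
        super().__init__()
        self.decode = nn.Linear(latent_dim, vocab_size)

    def forward(self, image_latent):
        return self.decode(image_latent)

# Because each stage has its own data-rich supervision, no parallel
# user-prompt -> system-prompt corpus is needed.
user_text_emb = torch.randn(1, 512)         # e.g., output of a frozen text encoder
pivot = TextToImageLatent()(user_text_emb)  # image latent acts as the pivot
logits = ImageLatentToPrompt()(pivot)       # decode toward a system prompt
print(pivot.shape, logits.shape)
```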