Yichuan Ma

Also published as: 逸川


2025

FastMCTS: A Simple Sampling Strategy for Data Synthesis
Peiji Li | Kai Lv | Yunfan Shao | Yichuan Ma | Linyang Li | Xiaoqing Zheng | Xipeng Qiu | Qipeng Guo
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

High-quality synthetic multi-step reasoning data can significantly enhance the performance of large language models on various tasks. However, most existing methods rely on rejection sampling, which generates trajectories independently and suffers from inefficiency and imbalanced sampling across problems of varying difficulty. In this work, we introduce FastMCTS, an innovative data synthesis strategy inspired by Monte Carlo Tree Search. FastMCTS provides a more efficient sampling method for multi-step reasoning data, offering step-level evaluation signals and promoting balanced sampling across problems of different difficulty levels. Experiments on both English and Chinese reasoning datasets demonstrate that FastMCTS generates over 30% more correct reasoning paths than rejection sampling as the number of generated tokens scales up. Furthermore, under comparable synthetic data budgets, models trained on FastMCTS-generated data outperform those trained on rejection-sampling data by 3.9% across multiple benchmarks. As a lightweight sampling strategy, FastMCTS offers a practical and efficient alternative for synthesizing high-quality reasoning data.
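
The abstract contrasts tree-structured sampling with independent rejection sampling. The sketch below illustrates that general idea with a toy MCTS-style loop; `generate_step`, `is_complete`, and `verify_answer` are hypothetical stand-ins for an LLM step generator and an answer checker, and the code is not the paper's actual implementation.

```python
import math
import random
from dataclasses import dataclass, field


# Hypothetical stubs: generate_step would call an LLM to extend a partial
# reasoning trace, and verify_answer would check the final answer (e.g. against
# a reference solution). They are toy functions so the sketch runs on its own.
def generate_step(prefix: list) -> str:
    return f"step {len(prefix) + 1} (draft {random.randint(0, 9)})"


def is_complete(steps: list) -> bool:
    return len(steps) >= 4  # pretend every solution takes four steps


def verify_answer(steps: list) -> bool:
    return random.random() < 0.4  # pretend 40% of completions are correct


@dataclass
class Node:
    steps: list
    children: list = field(default_factory=list)
    visits: int = 0
    reward_sum: float = 0.0  # verified-correct completions seen under this node


def ucb(child: Node, parent_visits: int, c: float = 1.4) -> float:
    if child.visits == 0:
        return float("inf")
    exploit = child.reward_sum / child.visits
    explore = c * math.sqrt(math.log(parent_visits + 1) / child.visits)
    return exploit + explore


def tree_sample(problem: str, iterations: int = 32, branch: int = 3) -> list:
    """Collect correct reasoning paths for one problem by growing a shared tree,
    instead of resampling full trajectories independently as rejection sampling does."""
    root = Node(steps=[problem])
    correct_paths = []
    for _ in range(iterations):
        # Selection: descend by UCB until reaching a leaf.
        node, path = root, [root]
        while node.children:
            node = max(node.children, key=lambda ch: ucb(ch, node.visits))
            path.append(node)
        # Expansion: branch the leaf so later iterations can reuse this prefix.
        if not is_complete(node.steps):
            node.children = [
                Node(steps=node.steps + [generate_step(node.steps)]) for _ in range(branch)
            ]
        # Rollout: finish one trajectory from the leaf and verify it.
        steps = list(node.steps)
        while not is_complete(steps):
            steps.append(generate_step(steps))
        reward = 1.0 if verify_answer(steps) else 0.0
        if reward:
            correct_paths.append(steps)
        # Backpropagation: each node's success rate becomes a step-level signal.
        for n in path:
            n.visits += 1
            n.reward_sum += reward
    return correct_paths


print(len(tree_sample("Prove that the sum of two even numbers is even.")))
```

The point of the tree is that shared prefixes are reused and the expansion budget flows toward nodes whose completions verify more often, which is one way to obtain the balanced sampling and step-level evaluation signals described above.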

Case2Code: Scalable Synthetic Data for Code Generation
Yunfan Shao | Linyang Li | Yichuan Ma | Peiji Li | Demin Song | Qinyuan Cheng | Shimin Li | Xiaonan Li | Pengyu Wang | Qipeng Guo | Hang Yan | Xipeng Qiu | Xuanjing Huang | Dahua Lin
Proceedings of the 31st International Conference on Computational Linguistics

Large Language Models (LLMs) have shown outstanding breakthroughs in code generation. Recent work improves code LLMs by training on synthetic data generated by powerful LLMs, which can be challenging to scale due to the dependence on a teacher model and high generation costs. In this paper, we focus on synthesizing code data at scale and propose a Case2Code task by exploiting the expressiveness and correctness of programs. Case2Code is an inductive inference task that aims to infer underlying code implementations by observing input-output examples or program behaviors. By using LLMs to generate program inputs and executing the programs on these inputs to obtain the corresponding outputs, we can synthesize diverse and high-quality Case2Code data at scale for training and evaluating code LLMs. Experimental results show that case-to-code induction is challenging for current representative LLMs if they are untrained. Models trained with Case2Code improve performance not only on in-distribution case-to-code induction but also on various code-generation tasks, demonstrating the great potential of large-scale synthetic data and inductive learning.
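
The pipeline in the abstract (an LLM proposes program inputs, the program is executed to obtain outputs, and the resulting cases become training examples) can be sketched roughly as below. `propose_inputs` stands in for the LLM input generator, and the function names and record format are illustrative assumptions, not the paper's released code.

```python
import json


def propose_inputs(program_source: str, n: int = 4) -> list:
    """Stand-in for an LLM that reads the program and proposes plausible inputs.
    Here it just returns a small fixed pool of argument tuples."""
    return [(i,) for i in range(n)]


def synthesize_case2code(program_source: str, func_name: str) -> dict:
    """Execute the program on proposed inputs to collect input-output cases.
    The training task is then to induce the implementation from those cases."""
    namespace: dict = {}
    exec(program_source, namespace)  # run the candidate program definition
    func = namespace[func_name]
    cases = []
    for args in propose_inputs(program_source):
        try:
            cases.append({"input": list(args), "output": func(*args)})
        except Exception:
            continue  # drop inputs the program cannot handle
    return {"cases": cases, "target_code": program_source}


example = synthesize_case2code("def square(x):\n    return x * x\n", "square")
print(json.dumps(example, indent=2))
```

Because the outputs come from actually executing the program, the synthesized cases are correct by construction, which is what makes the task scalable without relying on a stronger teacher model.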

2024

大语言模型合成数据方法简述 (A Brief Introduction to Synthetic Data for Large Language Models)
Peiji Li (李培基) | Yichuan Ma (马逸川) | Hang Yan (颜航)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 2: Frontier Forum)

Large language models have attracted enormous attention over the past two years and have sparked broad discussion of artificial general intelligence. Synthetic data is widely regarded as a key ingredient on the path to general artificial intelligence. This paper groups current common data synthesis methods into three categories: distillation-based synthetic data, model self-evolution, and tool-based synthetic data. For each category, we briefly introduce several mainstream approaches, aiming to give an overview of their basic ideas as well as their similarities and differences. Most current synthetic data methods are distillation-based; although they achieve good results, they in essence distill a stronger large model into a smaller one. Such methods have practical value for reducing the inference cost of large models, but they contribute little to further raising the capability ceiling of large models. Research on model self-evolution and tool-based synthetic data is comparatively scarce; to keep improving model capabilities, these two directions call for more exploration.
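
Of the three categories above, distillation-based synthesis is the most common; a minimal sketch of that pattern follows, where `teacher_generate` is a hypothetical placeholder for querying a stronger model and the record format is illustrative, not taken from the paper.

```python
def teacher_generate(prompt: str) -> str:
    """Hypothetical placeholder for calling a stronger 'teacher' LLM."""
    return f"(teacher model's answer to: {prompt})"


def build_distillation_dataset(prompts: list) -> list:
    """Turn teacher responses into (instruction, response) pairs that a smaller
    'student' model is later fine-tuned on; this lowers inference cost but, as
    argued above, cannot push the student past the teacher's capability ceiling."""
    return [{"instruction": p, "response": teacher_generate(p)} for p in prompts]


print(build_distillation_dataset(["Explain synthetic data in one sentence."])[0])
```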