Chengbo Zhang
2025
STORM-BORN: A Challenging Mathematical Derivations Dataset Curated via a Human-in-the-Loop Multi-Agent Framework
Wenhao Liu
|
Zhenyi Lu
|
Xinyu Hu
|
Jerry Zhang
|
Dailin Li
|
Jiacheng Cen
|
Huilin Cao
|
Haiteng Wang
|
Yuhan Li
|
Xie Kun
|
Dandan Li
|
Pei Zhang
|
Chengbo Zhang
|
Yuxiang Ren
|
Xiaohong Huang
|
Yan Ma
Findings of the Association for Computational Linguistics: ACL 2025
High-quality math datasets are crucial for advancing the reasoning abilities of large language models (LLMs). However, existing datasets often suffer from three key issues: outdated and insufficient challenging content, neglecting human-like reasoning, and limited reliability due to single-LLM generation.To address these, we introduce STORM-BORN, an ultra-challenging dataset of mathematical derivations sourced from cutting-edge academic papers, which includes dense human-like approximations and heuristic cues.To ensure the reliability and quality, we propose a novel human-in-the-loop, multi-agent data generation framework, integrating reasoning-dense filters, multi-agent collaboration, and human mathematicians’ evaluations. We curated a set of 2,000 synthetic samples and deliberately selected the 100 most difficult problems.Even most advanced models like GPT-o1 solved fewer than 5% of them. Fine-tuning on STORM-BORN boosts accuracy by 7.84% (LLaMA3-8B) and 9.12% (Qwen2.5-7B).As AI approaches mathematician-level reasoning, STORM-BORN provides both a high-difficulty benchmark and a human-like reasoning training resource. Our code and dataset are publicly available at https://github.com/lwhere/STORM-BORN.