Yanjun Shao
2026
SPIO: Ensemble and Selective Strategies via LLM-Based Multi-Agent Planning in Automated Data Science
Wonduk Seo | Juhyeon Lee | Yanjun Shao | Qingshan Zhou | Seunghyun Lee | Yi Bu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Wonduk Seo | Juhyeon Lee | Yanjun Shao | Qingshan Zhou | Seunghyun Lee | Yi Bu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) have enabled dynamic reasoning in automated data analytics, yet recent multi-agent systems remain limited by rigid, single-path workflows that restrict strategic exploration and often lead to suboptimal outcomes. To overcome these limitations, we propose SPIO (Sequential Plan Integration and Optimization), a framework that replaces rigid workflows with adaptive, multi-path planning across four core modules: data preprocessing, feature engineering, model selection, and hyperparameter tuning. In each module, specialized agents generate diverse candidate strategies, which are cascaded and refined by an optimization agent. SPIO offers two operating modes: SPIO-S for selecting a single optimal pipeline, and SPIO-E for ensembling top-k pipelines to maximize robustness. Extensive evaluations on Kaggle and OpenML benchmarks show that SPIO consistently outperforms state-of-the-art baselines, achieving an average performance gain of 5.6%. By explicitly exploring and integrating multiple solution paths, SPIO delivers a more flexible, accurate, and reliable foundation for automated data science.
2024
PRESTO: Progressive Pretraining Enhances Synthetic Chemistry Outcomes
He Cao | Yanjun Shao | Zhiyuan Liu | Zijing Liu | Xiangru Tang | Yuan Yao | Yu Li
Findings of the Association for Computational Linguistics: EMNLP 2024
He Cao | Yanjun Shao | Zhiyuan Liu | Zijing Liu | Xiangru Tang | Yuan Yao | Yu Li
Findings of the Association for Computational Linguistics: EMNLP 2024
Multimodal Large Language Models (MLLMs) have seen growing adoption across various scientific disciplines. These advancements encourage the investigation of molecule-text modeling within synthetic chemistry, a field dedicated to designing and conducting chemical reactions to synthesize new compounds with desired properties and applications. Current approaches, however, often neglect the critical role of multi-molecule graph interaction in understanding chemical reactions, leading to suboptimal performance in synthetic chemistry tasks. This study introduces PRESTO (Progressive Pretraining Enhances Synthetic Chemistry Outcomes), a new framework that bridges the molecule-text modality gap by integrating a comprehensive benchmark of pretraining strategies and dataset configurations. It progressively improves multimodal LLMs through cross-modal alignment and multi-graph understanding. Our extensive experiments demonstrate that PRESTO offers competitive results in downstream synthetic chemistry tasks. The code can be found at https://github.com/IDEA-XL/PRESTO.