Shuyu Gan

2026

Large Language Model (LLM) agents can increasingly automate complex reasoning through Test-Time Scaling (TTS), an iterative refinement process guided by reward signals. However, many real-world tasks involve multi-stage pipelines whose final outcomes lack verifiable rewards or sufficient data to train robust reward models, making judge-based refinement prone to error accumulation across stages. We propose Selective TTS, a process-based refinement framework that scales inference across the stages of a multi-agent pipeline, instead of repeatedly refining a single output over time as in prior work. By distributing compute across stages and pruning low-quality branches early using process-specific judges, Selective TTS mitigates judge drift and stabilizes refinement. Grounded in a data science workflow, we build an end-to-end multi-agent pipeline for generating visually insightful reports from a given dataset, and design a reliable LLM-based judge model that aligns with human experts (Kendall’s 𝜏=0.55) to evaluate them. Selective TTS then improves insight quality under a fixed compute budget, increasing mean scores from 61.64 (baseline) to 65.86 while reducing variance. We hope our findings serve as the first step toward scaling complex, open-ended tasks with unverifiable rewards, such as scientific discovery. Our code and generated reports are publicly available at https://minnesotanlp.github.io/insight-scaling-webpage.

Co-authors

Qianwen Wang

Venues

Findings
