Yuling Shi

2026

Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining when a reasoning step requires the capacity of a large model or the efficiency of a small model. Existing routing strategies either rely on local token probabilities or post-hoc verification, introducing significant inference overhead. In this work, we propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the “Aha Moment” phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold. Experiments on multiple benchmarks demonstrate that our approach significantly reduces inference latency while preserving accuracy. For instance, GlimpRouter attains a substantial 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to a standalone large model on AIME25. These results suggest a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation.

pdf bib abs

Multimodal Large Language Models (MLLMs) have achieved remarkable performance in Visually Rich Document Understanding (VRDU) tasks, but their capabilities are mainly evaluated on pristine, well-structured document images. We consider document reconstruction from shredded fragments, a challenging VRDU setting that requires integrating visual pattern recognition with semantic reasoning under significant content discontinuities. To facilitate systematic evaluation of complex VRDU tasks, we introduce ShredBench, a benchmark supported by an automated generation pipeline that renders fragmented documents directly from Markdown. The proposed pipeline ensures evaluation validity by allowing the flexible integration of latest or unseen textual sources to prevent training data contamination. ShredBench assesses four scenarios (English, Chinese, Code, Table) with three fragmentation granularities (8, 12, 16 pieces). Empirical evaluations on state-of-the-art MLLMs reveal a significant performance gap: The method is effective on intact documents; however, once the document is shredded, restoration becomes a significant challenge, with NED dropping sharply as fragmentation increases. Our findings highlight that current MLLMs lack the fine-grained cross-modal reasoning required to bridge visual discontinuities, identifying a critical gap in robust VRDU research.

pdf bib abs

Multi-agent systems powered by large language models have achieved strong performance on complex tasks, yet naive collaboration topologies often cause high communication costs and redundant context. Existing methods usually use a fixed communication graph and manage collaboration structure and shared memory in separate modules. Our log analysis of several representative systems shows that this separation leads to multiple copies of the same key facts in dialogue, memory and model inputs. We address this issue with EvoHyper, a framework based on an evolving hypergraph topology for multi-agent collaboration. In EvoHyper, a single hypergraph represents agents and shared memory, and each hyperedge serves as a collaboration unit that binds a group of agents to that shared memory. During execution a controller edits the hypergraph through a small set of predefined evolution operations, so collaboration units can spawn, update and merge as tasks unfold. Experiments on four benchmarks covering mathematical reasoning and code generation show that EvoHyper is (I) high-performing, achieving 3.2% to 7.8% accuracy gains over state-of-the-art methods, (II) efficient, reducing token consumption by up to 23.5%, and (III) adaptive, adjusting topology complexity according to task requirements.

pdf bib abs

Understanding and reasoning about entire soft-ware repositories is an essential capability for intelligent software engineering tools. While existing benchmarks such as CoSQA and CodeQA have advanced the field, they predominantly focus on small, self-contained code snippets. These setups fail to capture the complexity of real-world repositories, where effective understanding and reasoning often require navigating multiple files, understanding software architecture, and grounding answers in long-range code dependencies. In this paper, we present SWE-QA, a repository-level code question answering (QA) benchmark designed to facilitate research on automated QA systems in realistic code environments. SWE-QA involves 720 high-quality question-answer pairs spanning diverse categories, including intention understanding, cross-file reasoning, and multi-hop dependency analysis. To construct SWE-QA, we first crawled 77,100 GitHub issues from 12 popular repositories. Based on an analysis of naturally occurring developer questions extracted from these issues, we developed a two-level taxonomy of repository-level questions and constructed a set of seed questions for each category. For each category, we manually curated and validated questions and collected their corresponding answers. We evaluate six advanced LLMs on SWE-QA under various context augmentation strategies. Experimental results highlight the promise of LLMs.

2025

pdf bib abs

LastingBench: Defend Benchmarks Against Knowledge Leakage
Yixiong Fang | Tianran Sun | Yuling Shi | Min Wang | Xiaodong Gu
Findings of the Association for Computational Linguistics: EMNLP 2025

The increasing size and complexity of large language models (LLMs) raise concerns about their ability to “cheat” on standard Question Answering (QA) benchmarks by memorizing task-specific data. This undermines the validity of benchmark evaluations, as they no longer reflect genuine model capabilities but instead the effects of data leakage. While existing methods detect such leakage, they fail to address the long-term challenge of mitigating it. In this paper, we introduce LastingBench, a novel approach to reinforce and safeguard existing benchmarks against knowledge leakage. Our method involves identifying leakage points through perturbation-based detection, followed by counterfactual rewriting to disrupt memorization while preserving the benchmark’s original evaluative intent. We demonstrate that our approach significantly reduces memorization effects in long-context QA benchmarks, providing a more accurate assessment of model reasoning and generalization abilities. Our experiments show that LastingBench not only uncovers substantial leakage in benchmarks like HotpotQA but also yields a more reliable evaluation of state-of-the-art models, ensuring that benchmarks remain effective and resilient over time.