Yuyao Zhang


2026

Large reasoning models (LRMs) have attracted much attention due to their exceptional performance. However, their performance mainly stems from thinking, a long Chain of Thought (CoT), which significantly increase computational overhead. To address this overthinking problem, existing work focuses on using reinforcement learning (RL) to train hybrid reasoning models that automatically decide whether to engage in thinking or not based on the complexity of the query. Unfortunately, using RL will suffer the the reward hacking problem, e.g., the model engages in thinking but is judged as not doing so, resulting in incorrect rewards.To mitigate this problem, existing works either employ supervised fine-tuning (SFT), which incurs high computational costs, or enforce uniform token limits on non-thinking responses, which yields limited mitigation of the problem.In this paper, we propose Thinking-Based Non-Thinking (TNT). It does not employ SFT, and sets different maximum token usage for responses not using thinking across various queries by leveraging information from the solution component of the responses using thinking. Experiments on five mathematical benchmarks demonstrate that TNT reduces token usage by around 50\\%$ compared to DeepSeek-R1-Distill-Qwen-1.5B/7B and DeepScaleR-1.5B, while significantly improving accuracy. In fact, TNT achieves the optimal trade-off between accuracy and efficiency among all tested methods. Additionally, the probability of reward hacking problem in TNT’s responses, which are classified as not using thinking, remains below $10\\%$ across all tested datasets.
Professional financial reports serve as the cornerstone of investment decisions, demanding deep reasoning and multimodal synthesis. While recent deep research systems excel in open-domain search, they struggle with financial reporting, specifically in handling financial data, ensuring analytical depth, and integrating professional visualizations. To address this, we introduce FinSight , the first multi-agent framework for automate end-to-end professional, multimodal financial report. At its core, we propose the Code Agent with Variable Memory architecture, which unifies data, tools, and agents into a programmable variable space, enabling flexible data manipulation and reasoning through executable code. To guarantee report quality, FinSight incorporates a Two-Stage Writing Framework with Generative Retrieval. This mechanism first distills raw data into structured Chain-of-Analysis segments, and then progressively synthesizes them into a coherent, citation-aware, and multimodal narrative. Additionally, an Iterative Vision-Enhanced Mechanism leverages visual feedback to refine code-generated charts to expert standards. Experiments on company and industry-level tasks demonstrate that FinSight significantly outperforms leading deep research systems in factual accuracy, analytical depth, and presentation quality, demonstrating a clear path toward generating professional financial reports. Our code is available at https://anonymous.4open.science/r/FinSight-5841.
Recent advances in large language models (LLMs) have enabled web agents to perform interactive tasks on real-world websites. However, existing agents still suffer from limited robustness, efficiency, and task success, largely due to their lack of structural understanding of websites and the absence of browsing priors in pre-trained models. To address these challenges, this paper proposes the Web Agent Sitemap Protocol (WASP), an agent-oriented sitemap that integrate structured website knowledge into web agents. WASP adopts a dual-granularity design, providing global site-level structure and local page-level semantic and interaction guidance. We also introduce a framework LightASM for constructing such sitemaps by identifying core pages and generating concise semantic summaries and block-level descriptions. Experiments on real-world browsing benchmarks demonstrate that WASP substantially improves the robustness, efficiency, and effectiveness of LLM-based web agents without extra training.

2025

Precise recognition of search intent in Retrieval-Augmented Generation (RAG) systems remains a challenging goal, especially under resource constraints and for complex queries with nested structures and dependencies. This paper presents **QCompiler**, a neuro-symbolic framework inspired by linguistic grammar rules and compiler design, to bridge this gap. It theoretically presents a minimal yet sufficient Backus-Naur Form (BNF) grammar G[q] to formalize complex queries. Unlike previous methods, this grammar maintains completeness while minimizing redundancy. Based on this, QCompiler includes a query expression translator, a Lexical syntax parser, and a Recursive Descent Processor to compile queries into Abstract Syntax Trees (ASTs) for execution. The atomicity of the sub-queries in the leaf nodes ensures more precise document retrieval and response generation, significantly improving the RAG system’s ability to address complex queries.
Large Language Models (LLMs), due to substantial computational requirements, are vulnerable to resource consumption attacks, which can severely degrade server performance or even cause crashes, as demonstrated by denial-of-service (DoS) attacks designed for LLMs. However, existing works lack mitigation strategies against such threats, resulting in unresolved security risks for real-world LLM deployments. To this end, we propose the Pluggable and Dynamic DoS-Defense Framework (PD3F), which employs a two-stage approach to defend against resource consumption attacks from both the input and output sides. On the input side, we propose the Resource Index to guide Dynamic Request Polling Scheduling, thereby reducing computing resource usage induced by malicious prompts under high-concurrency scenarios. On the output side, we introduce the Adaptive End-Based Suppression mechanism, which reduces excessive malicious generation. Experiments across six models demonstrate that PD3F significantly mitigates resource consumption attacks, improving users’ access capacity by up to 500% during adversarial load. PD3F represents a step toward the resilient and resource-aware deployment of LLMs against resource consumption attacks.
Real-world RAG applications often encounter long-context input scenarios, where redundant information and noise results in higher inference costs and reduced performance. To address these challenges, we propose LongRefiner, an efficient plug-and-play refiner that leverages the inherent structural characteristics of long documents. LongRefiner employs dual-level query analysis, hierarchical document structuring, and adaptive refinement through multi-task learning on a single foundation model. Experiments on seven QA datasets demonstrate that LongRefiner achieves competitive performance in various scenarios while using 10x fewer computational costs and latency compared to the best baseline. Further analysis validates that LongRefiner is scalable, efficient, and effective, providing practical insights for real-world long-text RAG applications. Our code is available at https://github.com/ignorejjj/LongRefiner.
Large reasoning models (LRMs) like OpenAI-o1 have demonstrated impressive long stepwise reasoning capabilities through large-scale reinforcement learning. However, their extended reasoning processes often suffer from knowledge insufficiency, leading to frequent uncertainties and potential errors. To address this limitation, we introduce **Search-o1**, a framework that enhances LRMs with an agentic retrieval-augmented generation (RAG) mechanism and a Reason-in-Documents module for refining retrieved documents. Search-o1 integrates an agentic search workflow into the reasoning process, enabling dynamic retrieval of external knowledge when LRMs encounter uncertain knowledge points. Additionally, due to the verbose nature of retrieved documents, we design a separate Reason-in-Documents module to deeply analyze the retrieved information before injecting it into the reasoning chain, minimizing noise and preserving coherent reasoning flow. Extensive experiments on complex reasoning tasks in science, mathematics, and coding, as well as six open-domain QA benchmarks, demonstrate the strong performance of Search-o1. This approach enhances the trustworthiness of LRMs in complex reasoning tasks, paving the way for advanced deep research systems. The code is available at https://github.com/RUC-NLPIR/Search-o1.

2024

Legal judgment prediction is essential for enhancing judicial efficiency. In this work, we identify that existing large language models (LLMs) underperform in this domain due to challenges in understanding case complexities and distinguishing between similar charges. To adapt LLMs for effective legal judgment prediction, we introduce the Ask-Discriminate-Predict (ADAPT) reasoning framework inspired by human judicial reasoning. ADAPT involves decomposing case facts, discriminating among potential charges, and predicting the final judgment. We further enhance LLMs through fine-tuning with multi-task synthetic trajectories to improve legal judgment prediction accuracy and efficiency under our ADAPT framework. Extensive experiments conducted on two widely-used datasets demonstrate the superior performance of our framework in legal judgment prediction, particularly when dealing with complex and confusing charges.