Jinyang Wu
2026
Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism
Yuhao Shen | Tianyu Liu | Junyi Shen | Jinyang Wu | Quan Kong | Huan Li | Cong Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yuhao Shen | Tianyu Liu | Junyi Shen | Jinyang Wu | Quan Kong | Huan Li | Cong Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Parallel Speculative Decoding (PSD) accelerates traditional Speculative Decoding (SD) by overlapping draft generation with verification. However, it remains hampered by two fundamental challenges: (1) a theoretical speedup ceiling dictated by the speed ratio between the draft and target models, and (2) high computational waste and pipeline stall due to mid-sequence token rejections of early errors. To address these limitations, we introduce Double (Double Retrieval Speculative Parallelism). By bridging the gap between SD and PSD, our framework resolves the Retrieval Precision-Efficiency Dilemma through a novel synchronous mechanism. Specifically, we enable the draft model to execute iterative retrieval speculations to break the theoretical speedup limits; to alleviate rejections without rollback, the target model performs authoritative retrieval to generate multi-token guidance. Double is entirely training-free and lossless. Extensive experiments demonstrate state-of-the-art speedup of 5.3× on LLaMA3.3-70B and 2.8× on Qwen3-32B, significantly outperforming the advanced method EAGLE-3 that requires extensive model training. Our code is available at https://github.com/Sylvan820/Double1.
Two-Stage Regularization-Based Structured Pruning for LLMs
Mingkuan Feng | Jinyang Wu | Siyuan Liu | Shuai Zhang | Hongjian Fang | Ruihan Jin | Feihu Che | Pengpeng Shao | Zhengqi Wen | Jianhua Tao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Mingkuan Feng | Jinyang Wu | Siyuan Liu | Shuai Zhang | Hongjian Fang | Ruihan Jin | Feihu Che | Pengpeng Shao | Zhengqi Wen | Jianhua Tao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The deployment of large language models (LLMs) is largely hindered by their large number of parameters. Structural pruning has emerged as a promising solution. Prior structured pruning methods directly remove unimportant parameters based on certain metrics, which often causes knowledge loss and necessitates extensive retraining. To overcome this, we introduce a novel pruning method **TRSP**: **T**wo-Stage **R**egularization-Based **S**tructured **P**runing for LLMs. Specifically, we multiply the output of each transformer layer by an initial learnable weight and iteratively learn these weights by adding their ℓ1-norm as a regularization term to the loss function, serving as the first-stage regularization. Subsequently, we apply additional regularization to the difference between the output and input of layers with smaller weights, encouraging the shift of knowledge to the preserved layers. This serves as the second-stage regularization. TRSP retains more knowledge and better preserves model performance than direct parameter elimination. Through extensive experimentation we show that TRSP outperforms strong layer-wise structured pruning methods without requiring retraining. As a layer-wise pruning method, it delivers notable end-to-end acceleration, making it a promising solution for efficient LLM deployment.
Beyond Examples: Towards Automated Thought-level In-Context Reasoning for Large Language Models
Jinyang Wu | Mingkuan Feng | Shuai Zhang | Feihu Che | Zhengqi Wen | Chonghua Liao | Ling Yang | Haoran Luo | Zheng Lian | Jianhua Tao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Jinyang Wu | Mingkuan Feng | Shuai Zhang | Feihu Che | Zhengqi Wen | Chonghua Liao | Ling Yang | Haoran Luo | Zheng Lian | Jianhua Tao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In-context learning (ICL) leverages demonstrations to enhance the performance of large language models (LLMs). However, traditional ICL struggles with complex reasoning mainly due to superficial, example-level implicit imitation. To address these limitations, we introduce **ThoughtICR**, an automated **Thought**-level **I**n-**C**ontext **R**easoning paradigm that shifts from surface-level examples to more guidance-oriented thought patterns. Specifically, we first define atomic reasoning actions and construct thought patterns on small-scale seed data using Monte Carlo Tree Search (MCTS). During inference, we dynamically select appropriate thought patterns based on target problem attributes, providing explicit guidance for model reasoning. Thanks to its automated and strategic design, our method enables seamless plug-and-play integration with various post-training techniques. Experimental results demonstrate that our method improves performance across different model sizes and generalizes effectively across reasoning domains. Using only small-scale seed data, we achieve 80.6% accuracy on MATH and 62.5% on AMC, surpassing GPT-4o’s 77.2% and 57.5%, respectively. Moreover, compared to test-time scaling methods, our approach reduces computational costs by over 10. Our code is available at https://github.com/jinyangwu/ThoughtICR.
SPARK: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning
Jinyang Wu | Shuo Yang | Yuhao Shen | Shuai Zhang | Zhengqi Wen | Jianhua Tao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Jinyang Wu | Shuo Yang | Yuhao Shen | Shuai Zhang | Zhengqi Wen | Jianhua Tao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement learning has empowered large language models to act as intelligent agents, yet training them for long-horizon tasks remains challenging due to the scarcity of high-quality trajectories, especially under limited resources. Existing methods typically scale up rollout sizes and indiscriminately allocate computational resources among intermediate steps. Such attempts inherently waste substantial computation budget on trivial steps while failing to guarantee sample quality. To address this, we propose **SPARK** (**S**trategic **P**olicy-**A**ware explo**R**ation via **K**ey-state dynamic branching), a novel framework that selectively branches at critical decision states for resource-efficient exploration. Our key insight is to activate adaptive branching exploration at critical decision points to probe promising trajectories, thereby achieving precise resource allocation that prioritizes sampling quality over blind coverage. This design leverages the agent’s intrinsic decision-making signals to reduce dependence on human priors, enabling the agent to autonomously expand exploration and achieve stronger generalization. Experiments across diverse tasks (e.g., embodied planning), demonstrate that **SPARK** achieves superior success rates with significantly fewer training samples, exhibiting robust generalization even in unseen scenarios. Our code and checkpoints are available at https://github.com/jinyangwu/SPARK.
2025
RadialRouter: Structured Representation for Efficient and Robust Large Language Models Routing
Ruihan Jin | Pengpeng Shao | Zhengqi Wen | Jinyang Wu | Mingkuan Feng | Shuai Zhang | Jianhua Tao
Findings of the Association for Computational Linguistics: EMNLP 2025
Ruihan Jin | Pengpeng Shao | Zhengqi Wen | Jinyang Wu | Mingkuan Feng | Shuai Zhang | Jianhua Tao
Findings of the Association for Computational Linguistics: EMNLP 2025
The rapid advancements in large language models (LLMs) have led to the emergence of routing techniques, which aim to efficiently select the optimal LLM from diverse candidates to tackle specific tasks, optimizing performance while reducing costs. Current LLM routing methods are limited in effectiveness due to insufficient exploration of the intrinsic connection between user queries and the characteristics of LLMs. To address this issue, in this paper, we present **RadialRouter**, a novel framework for LLM routing which employs a lightweight Transformer-based backbone with a radial structure named **RadialFormer** to articulate the query-LLMs relationship. The optimal LLM selection is performed based on the final states of RadialFormer. The pipeline is further refined by an objective function that combines Kullback-Leibler divergence with the query-query contrastive loss to enhance robustness. Experimental results on RouterBench show that RadialRouter significantly outperforms existing routing methods by 9.2% and 5.8% in the *Balance* and *Cost First* scenarios, respectively. Additionally, its adaptability toward different performance-cost trade-offs and the dynamic LLM pool demonstrates practical application potential.
Benchmarking Contextual and Paralinguistic Reasoning in Speech-LLMs: A Case Study with In-the-Wild Data
Qiongqiong Wang | Hardik Bhupendra Sailor | Tianchi Liu | Wenyu Zhang | Muhammad Huzaifah | Nattadaporn Lertcheva | Shuo Sun | Nancy F. Chen | Jinyang Wu | AiTi Aw
Findings of the Association for Computational Linguistics: EMNLP 2025
Qiongqiong Wang | Hardik Bhupendra Sailor | Tianchi Liu | Wenyu Zhang | Muhammad Huzaifah | Nattadaporn Lertcheva | Shuo Sun | Nancy F. Chen | Jinyang Wu | AiTi Aw
Findings of the Association for Computational Linguistics: EMNLP 2025
Recent speech-LLMs have shown impressive performance in tasks like transcription and translation, yet they remain limited in understanding the paralinguistic aspects of speech crucial for social and emotional intelligence. We propose CP-Bench, a benchmark for evaluating speech-LLMs on contextual paralinguistic reasoning the integration of verbal content with non-verbal cues like emotion and prosody. The benchmark includes two curated question answering (QA) datasets requiring both linguistic and empathetic understanding. We evaluate state-of-the-art speech-LLMs from both open and closed-source models and perform a comprehensive analysis across different question types. The top two models were further analyzed under temperature tuning to understand its effect on this task. Our benchmark reveals a key gap in existing evaluations and offers insights into building more context-aware and emotionally intelligent speech-capable LLMs.
Pandora’s Box or Aladdin’s Lamp: A Comprehensive Analysis Revealing the Role of RAG Noise in Large Language Models
Jinyang Wu | Shuai Zhang | Feihu Che | Mingkuan Feng | Pengpeng Shao | Jianhua Tao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Jinyang Wu | Shuai Zhang | Feihu Che | Mingkuan Feng | Pengpeng Shao | Jianhua Tao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Retrieval-Augmented Generation (RAG) has emerged as a crucial method for addressing hallucinations in large language models (LLMs). While recent research has extended RAG models to complex noisy scenarios, these explorations often confine themselves to limited noise types and presuppose that noise is inherently detrimental to LLMs, potentially deviating from real-world retrieval environments and restricting practical applicability. In this paper, we define seven distinct noise types from a linguistic perspective and establish a Noise RAG Benchmark (NoiserBench), a comprehensive evaluation framework encompassing multiple datasets and reasoning tasks. Through empirical evaluation of eight representative LLMs with diverse architectures and scales, we reveal that these noises can be further categorized into two practical groups: noise that is beneficial to LLMs (aka beneficial noise) and noise that is harmful to LLMs (aka harmful noise). While harmful noise generally impairs performance, beneficial noise may enhance several aspects of model capabilities and overall performance. Our analysis offers insights for developing robust RAG solutions and mitigating hallucinations across diverse retrieval scenarios. Code is available at https://github.com/jinyangwu/NoiserBench.
Search
Fix author
Co-authors
- Jianhua Tao 5
- Mingkuan Feng 4
- Zhengqi Wen 4
- Feihu Che 3
- Pengpeng Shao 3
- Shuai Zhang 3
- Ruihan Jin 2
- Yuhao Shen 2
- Shuai Zhang 2
- Aiti Aw 1
- Nancy Chen 1
- Hongjian Fang 1
- Muhammad Huzaifah 1
- Quan Kong 1
- Nattadaporn Lertcheva 1
- Huan Li 1
- Zheng Lian 1
- Chonghua Liao 1
- Siyuan Liu 1
- Tianchi Liu 1
- Tianyu Liu 1
- Haoran Luo 1
- Hardik Bhupendra Sailor 1
- Junyi Shen 1
- Shuo Sun 1
- Cong Wang 1
- Qiongqiong Wang 1
- Ling Yang 1
- Shuo Yang 1
- Wenyu Zhang 1