2025
pdf
bib
abs
Accelerated Test-Time Scaling with Model-Free Speculative Sampling
Woomin Song
|
Saket Dingliwal
|
Sai Muralidhar Jayanthi
|
Bhavana Ganesh
|
Jinwoo Shin
|
Aram Galstyan
|
Sravan Babu Bodapati
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that exploits the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis shows that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND consistently outperforms state-of-the-art speculative decoding methods across diverse inference patterns, including single-trajectory decoding, batch decoding, and test-time tree search. As a model-free approach, STAND can be applied to any existing language model without additional training, making it a powerful plug-and-play solution for accelerating language model reasoning.
pdf
bib
abs
Mamba Drafters for Speculative Decoding
Daewon Choi
|
Seunghyuk Oh
|
Saket Dingliwal
|
Jihoon Tack
|
Kyuyoung Kim
|
Woomin Song
|
Seojin Kim
|
Insu Han
|
Jinwoo Shin
|
Aram Galstyan
|
Shubham Katiyar
|
Sravan Babu Bodapati
Findings of the Association for Computational Linguistics: EMNLP 2025
Speculative decoding has emerged as a promising approach to accelerating large language model (LLM) generation using a fast drafter while maintaining alignment with the target model’s distribution. However, existing approaches face a trade-off: external drafters offer flexibility but can suffer from slower drafting, while self-speculation methods use drafters tailored to the target model but require re-training. In this paper, we introduce novel drafters based on Mamba, a state-of-the-art state space model (SSM), as a solution that combines the best aspects of both approaches. By leveraging the linear structure of SSMs, our approach avoids the quadratic complexity inherent in traditional Transformer-based methods, enabling faster drafting and lower memory usage while maintaining the flexibility to work across different target models. We further enhance efficiency with a novel test-time tree search algorithm for generating high-quality draft candidates. Our empirical evaluation demonstrates that Mamba-based drafters not only outperform existing external drafting methods but are also comparable to state-of-the-art self-speculation approaches while using less memory and maintaining their cross-model adaptability.
pdf
bib
abs
Think Clearly: Improving Reasoning via Redundant Token Pruning
Daewon Choi
|
Jimin Lee
|
Jihoon Tack
|
Woomin Song
|
Saket Dingliwal
|
Sai Muralidhar Jayanthi
|
Bhavana Ganesh
|
Jinwoo Shin
|
Aram Galstyan
|
Sravan Babu Bodapati
Findings of the Association for Computational Linguistics: EMNLP 2025
Recent large language models have shown promising capabilities in long-form reasoning, following structured chains of thought before arriving at a final answer. However, we observe that these reasoning paths tend to include substantial redundancy; analyzing attention patterns reveals that attention scores are widely scattered, particularly incorrect answers exhibit greater attention sparsity. In this paper, we demonstrate that deliberately removing this redundancy in the reasoning process significantly improves the performance through clear thinking (i.e., removing distraction). Specifically, we systematically identify such redundancy by measuring token-level attention scores to a special end-of-thinking token, which is appended to an explicit instruction inserted to conclude each intermediate reasoning step. Furthermore, we propose structure-aware pruning that prioritizes removing tokens in low-contributing reasoning chunks over individual tokens. After evicting redundant tokens, we remove the injected end-of-thinking instruction, then resume the reasoning generation. We demonstrate that our method significantly improves the over all accuracy across reasoning-intensive benchmarks without any training involved. In particular, our method shows strong performance on challenging mathematics competition benchmarks such as AIME and AMC, where reasoning redundancy is more prevalent.
2024
pdf
bib
abs
Salient Information Prompting to Steer Content in Prompt-based Abstractive Summarization
Lei Xu
|
Mohammed Asad Karim
|
Saket Dingliwal
|
Aparna Elangovan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large language models (LLMs) can generate fluent summaries across domains using prompting techniques, reducing the effort required for summarization applications. However, crafting effective prompts that guide LLMs to generate summaries with the appropriate level of detail and writing style remains a challenge. In this paper, we explore the use of salient information extracted from the source document to enhance summarization prompts. We show that adding keyphrases in prompts can improve ROUGE F1 and recall, making the generated summaries more similar to the reference and more complete. The number of keyphrases can control the precision-recall trade-off. Furthermore, our analysis reveals that incorporating phrase-level salient information is superior to word- or sentence-level. However, the impact on summary faithfulness is not universally positive across LLMs. To enable this approach, we introduce Keyphrase Signal Extractor (SigExt), a lightweight model that can be finetuned to extract salient keyphrases. By using SigExt, we achieve consistent ROUGE improvements across datasets and LLMs without any LLM customization. Our findings provide insights into leveraging salient information in building prompt-based summarization systems.
pdf
bib
abs
SpeechGuard: Exploring the Adversarial Robustness of Multi-modal Large Language Models
Raghuveer Peri
|
Sai Muralidhar Jayanthi
|
Srikanth Ronanki
|
Anshu Bhatia
|
Karel Mundnich
|
Saket Dingliwal
|
Nilaksh Das
|
Zejiang Hou
|
Goeric Huybrechts
|
Srikanth Vishnubhotla
|
Daniel Garcia-Romero
|
Sundararajan Srinivasan
|
Kyu Han
|
Katrin Kirchhoff
Findings of the Association for Computational Linguistics: ACL 2024
Integrated Speech and Large Language Models (SLMs) that can follow speech instructions and generate relevant text responses have gained popularity lately. However, the safety and robustness of these models remains largely unclear. In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking. Specifically, we design algorithms that can generate adversarial examples to jailbreak SLMs in both white-box and black-box attack settings without human involvement. Additionally, we propose countermeasures to thwart such jailbreaking attacks. Our models, trained on dialog data with speech instructions, achieve state-of-the-art performance on spoken question-answering task, scoring over 80% on both safety and helpfulness metrics. Despite safety guardrails, experiments on jailbreaking demonstrate the vulnerability of SLMs to adversarial perturbations and transfer attacks, with average attack success rates of 90% and 10% respectively when evaluated on a dataset of carefully designed harmful questions spanning 12 different toxic categories. However, we demonstrate that our proposed countermeasures reduce the attack success significantly.
2023
pdf
bib
abs
Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale
Hritik Bansal
|
Karthik Gopalakrishnan
|
Saket Dingliwal
|
Sravan Bodapati
|
Katrin Kirchhoff
|
Dan Roth
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language models have been shown to perform better with an increase in scale on a wide variety of tasks via the in-context learning paradigm. In this paper, we investigate the hypothesis that the ability of a large language model to in-context learn-perform a task is not uniformly spread across all of its underlying components. Using a 66 billion parameter language model (OPT-66B) across a diverse set of 14 downstream tasks, we find this is indeed the case: ~70% of the attention heads and ~20% of the feed forward networks can be removed with minimal decline in task performance. We find substantial overlap in the set of attention heads (un)important for in-context learning across tasks and number of in-context examples. We also address our hypothesis through a task-agnostic lens, finding that a small set of attention heads in OPT-66B score highly on their ability to perform primitive induction operations associated with in-context learning, namely, prefix matching and copying. These induction heads overlap with task-specific important heads, reinforcing arguments by Olsson et al. (2022) regarding induction head generality to more sophisticated behaviors associated with in-context learning. Overall, our study provides several insights that indicate large language models may be under-trained for in-context learning and opens up questions on how to pre-train language models to more effectively perform in-context learning.
pdf
bib
abs
Retrieve and Copy: Scaling ASR Personalization to Large Catalogs
Sai Muralidhar Jayanthi
|
Devang Kulshreshtha
|
Saket Dingliwal
|
Srikanth Ronanki
|
Sravan Bodapati
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track
Personalization of automatic speech recognition (ASR) models is a widely studied topic because of its many practical applications. Most recently, attention-based contextual biasing techniques are used to improve the recognition of rare words and/or domain specific entities. However, due to performance constraints, the biasing is often limited to a few thousand entities, restricting real-world usability. To address this, we first propose a “Retrieve and Copy” mechanism to improve latency while retaining the accuracy even when scaled to a large catalog. We also propose a training strategy to overcome the degradation in recall at such scale due to an increased number of confusing entities. Overall, our approach achieves up to 6% more Word Error Rate reduction (WERR) and 3.6% absolute improvement in F1 when compared to a strong baseline. Our method also allows for large catalog sizes of up to 20K without significantly affecting WER and F1-scores, while achieving at least 20% inference speedup per acoustic frame.
2021
pdf
bib
abs
Few Shot Dialogue State Tracking using Meta-learning
Saket Dingliwal
|
Shuyang Gao
|
Sanchit Agarwal
|
Chien-Wei Lin
|
Tagyoung Chung
|
Dilek Hakkani-Tur
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
Dialogue State Tracking (DST) forms a core component of automated chatbot based systems designed for specific goals like hotel, taxi reservation, tourist information etc. With the increasing need to deploy such systems in new domains, solving the problem of zero/few-shot DST has become necessary. There has been a rising trend for learning to transfer knowledge from resource-rich domains to unknown domains with minimal need for additional data. In this work, we explore the merits of meta-learning algorithms for this transfer and hence, propose a meta-learner D-REPTILE specific to the DST problem. With extensive experimentation, we provide clear evidence of benefits over conventional approaches across different domains, methods, base models and datasets with significant (5-25%) improvement over the baseline in a low-data setting. Our proposed meta-learner is agnostic of the underlying model and hence any existing state-of-the-art DST system can improve its performance on unknown domains using our training strategy.