Rupak Vignesh Swaminathan
2026
PlanRAG-Audio: Planning and Retrieval Augmented Generation for Long-form Audio Understanding
Masao Someki | Chien-yu Huang | Siddhant Arora | Samuele Cornell | Markus Müller | Nathan Susanj | Rupak Vignesh Swaminathan | Grant Strimel | Jing Liu | Shinji Watanabe
Findings of the Association for Computational Linguistics: ACL 2026
Long-form audio understanding poses significant challenges for large audio language models (LALMs) due to the extreme length of audio sequences and the need to reason over heterogeneous acoustic cues distributed over time, such as speech content, speaker identity, emotion, and sound events. To address these challenges, we propose PlanRAG-Audio, a planning-based retrieval-augmented generation framework for scalable long-form audio understanding. Rather than having LALMs process entire recordings directly, PlanRAG-Audio explicitly plans which modalities and temporal spans are required for a given query, and retrieves only query-relevant information from a structured text and audio database. This retrieval planning enables effective reasoning over complex, cross-domain audio queries while substantially reducing the input length passed to the large language models. Experiments across a wide range of speech/audio retrieval tasks demonstrate that PlanRAG-Audio improves reasoning accuracy and, by decoupling inference cost from raw audio length, stabilizes performance as audio duration increases.
2025
Saten: Sparse Augmented Tensor Networks for Post-Training Compression of Large Language Models
Ryan Solgi | Kai Zhen | Rupak Vignesh Swaminathan | Nathan Susanj | Athanasios Mouchtaris | Siegfried Kunzmann | Zheng Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025
The efficient implementation of large language models (LLMs) is crucial for deployment on resource-constrained devices. Low-rank tensor compression techniques, such as tensor-train (TT) networks, have been widely studied for over-parameterized neural networks. However, their application to compressing pre-trained LLMs for downstream tasks (post-training) remains challenging due to the high-rank nature of pre-trained LLMs and the lack of access to pretraining data. In this study, we investigate low-rank tensorized LLMs during fine-tuning and propose sparse augmented tensor networks (Saten) to enhance their performance. The proposed Saten framework enables full model compression. Experimental results demonstrate that Saten enhances both accuracy and compression efficiency in tensorized language models, achieving state-of-the-art performance.
Wanda++: Pruning Large Language Models via Regional Gradients
Yifan Yang | Kai Zhen | Bhavana Ganesh | Aram Galstyan | Goeric Huybrechts | Markus Müller | Jonas M. Kübler | Rupak Vignesh Swaminathan | Athanasios Mouchtaris | Sravan Babu Bodapati | Nathan Susanj | Zheng Zhang | Jack FitzGerald | Abhishek Kumar
Findings of the Association for Computational Linguistics: ACL 2025
Large language model (LLM) pruning seeks to remove unimportant weights for inference speedup with minimal accuracy impact. However, existing methods often suffer from accuracy degradation without full-model sparsity-aware fine-tuning. This paper presents Wanda++, a novel pruning framework that outperforms the state-of-the-art methods by utilizing decoder-block-level regional gradients. Specifically, Wanda++ improves the pruning score with regional gradients for the first time and proposes an efficient regional optimization method to minimize pruning-induced discrepancies between the dense and sparse decoder outputs. Notably, Wanda++ improves perplexity by up to 32% over Wanda in the language modeling task and generalizes effectively to downstream tasks. Moreover, although it updates weights through regional optimization, Wanda++ remains orthogonal to sparsity-aware fine-tuning: combining it with LoRA reduces perplexity substantially further. Our approach is lightweight, pruning a 7B LLaMA model in under 10 minutes on a single H100 GPU.
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning
Prabhat Pandey | Rupak Vignesh Swaminathan | K V Vijay Girish | Arunasish Sen | Jian. Xie | Grant Strimel | Andreas Schwarz
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce SIFT (Speech Instruction Fine-Tuning), a 50M-example dataset designed for instruction fine-tuning and pre-training of speech-text large language models (LLMs). SIFT-50M is built from publicly available speech corpora, which collectively contain 14K hours of speech, and leverages LLMs along with off-the-shelf expert models. The dataset spans five languages, encompassing a diverse range of speech understanding as well as controllable speech generation instructions. Using SIFT-50M, we train SIFT-LLM, which outperforms existing speech-text LLMs on instruction-following benchmarks while achieving competitive performance on foundational speech tasks. To support further research, we also introduce EvalSIFT, a benchmark dataset specifically designed to evaluate the instruction-following capabilities of speech-text LLMs.
Co-authors
- Nathan Susanj 3
- Athanasios Mouchtaris 2
- Markus Müller 2
- Grant Strimel 2
- Zheng Zhang 2
- Kai Zhen 2
- Siddhant Arora 1
- Sravan Babu Bodapati 1
- Samuele Cornell 1
- Jack FitzGerald 1
- Aram Galstyan 1
- Bhavana Ganesh 1
- K V Vijay Girish 1
- Chien-yu Huang 1
- Goeric Huybrechts 1
- Abhishek Kumar 1
- Siegfried Kunzmann 1
- Jonas M. Kübler 1
- Jing Liu (刘晶, 刘璟) 1
- Prabhat Pandey 1
- Andreas Schwarz 1
- Arunasish Sen 1
- Ryan Solgi 1
- Masao Someki 1
- Shinji Watanabe 1
- Jian. Xie 1
- Yifan Yang 1