Grant Strimel
2026
PlanRAG-Audio: Planning and Retrieval Augmented Generation for Long-form Audio Understanding
Masao Someki | Chien-yu Huang | Siddhant Arora | Samuele Cornell | Markus Müller | Nathan Susanj | Rupak Vignesh Swaminathan | Grant Strimel | Jing Liu | Shinji Watanabe
Findings of the Association for Computational Linguistics: ACL 2026
Long-form audio understanding poses significant challenges for large audio language models (LALMs) due to the extreme length of audio sequences and the need to reason over heterogeneous acoustic cues distributed over time, such as speech content, speaker identity, emotion, and sound events. To address these challenges, we propose PlanRAG-Audio, a planning-based retrieval-augmented generation framework for scalable long-form audio understanding. Rather than having LALMs process entire recordings directly, PlanRAG-Audio explicitly plans which modalities and temporal spans are required for a given query, and retrieves only query-relevant information from a structured text and audio database. This retrieval planning enables effective reasoning over complex, cross-domain audio queries while substantially reducing the input length passed to the large language model. Experiments across a wide range of speech and audio retrieval tasks demonstrate that PlanRAG-Audio improves reasoning accuracy and stabilizes performance as audio duration increases by decoupling inference cost from raw audio length.
2025
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning
Prabhat Pandey | Rupak Vignesh Swaminathan | K V Vijay Girish | Arunasish Sen | Jian Xie | Grant Strimel | Andreas Schwarz
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce SIFT (Speech Instruction Fine-Tuning), a 50M-example dataset designed for instruction fine-tuning and pre-training of speech-text large language models (LLMs). SIFT-50M is built from publicly available speech corpora, which collectively contain 14K hours of speech, and leverages LLMs along with off-the-shelf expert models. The dataset spans five languages, encompassing a diverse range of speech understanding as well as controllable speech generation instructions. Using SIFT-50M, we train SIFT-LLM, which outperforms existing speech-text LLMs on instruction-following benchmarks while achieving competitive performance on foundational speech tasks. To support further research, we also introduce EvalSIFT, a benchmark dataset specifically designed to evaluate the instruction-following capabilities of speech-text LLMs.
2024
Multi-Modal Retrieval For Large Language Model Based Speech Recognition
Aditya Gourav | Jari Kolehmainen | Prashanth Shivakumar | Yile Gu | Grant Strimel | Ankur Gandhe | Ariya Rastrow | Ivan Bulyko
Findings of the Association for Computational Linguistics: ACL 2024
Retrieval is a widely adopted approach for improving language models by leveraging external information. As the field moves towards multi-modal large language models, it is important to extend purely text-based methods to incorporate other modalities in retrieval as well, for applications across the wide spectrum of machine learning tasks and data types. In this work, we propose multi-modal retrieval with two approaches: kNN-LM and cross-attention techniques. We demonstrate the effectiveness of our retrieval approaches empirically by applying them to automatic speech recognition tasks with access to external information. Under this setting, we show that speech-based multi-modal retrieval outperforms text-based retrieval and improves word error rate over the multi-modal language model baseline. Furthermore, we achieve state-of-the-art recognition results on the Spoken-SQuAD question answering dataset.