Saransh Sharma
2025
Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection
Ankan Mullick | Saransh Sharma | Abhik Jana | Pawan Goyal
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
The rise of multimodal data, integrating text, audio, and visuals, has created new opportunities for studying multimodal tasks such as intent detection. This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multimodal models, on the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on MIntRec-1 and 4% on MIntRec2.0. This performance advantage stems from a strong textual bias in these datasets, where over 90% of the samples require textual input, either alone or in combination with other modalities, for correct classification. We further confirm this modality bias through human evaluation. Next, we propose a framework to debias the datasets; upon debiasing, more than 70% of the samples in MIntRec-1 and more than 50% in MIntRec2.0 are removed, resulting in significant performance degradation across all models, with smaller multimodal fusion models most affected, suffering accuracy drops of 50-60%. We further provide an empirical analysis of the context-specific relevance of different modalities. Our findings highlight the challenges posed by modality bias in multimodal intent datasets and emphasize the need for unbiased datasets to evaluate multimodal models effectively. We release both the code and the dataset used for this work at https://github.com/Text-Takes-Over-EMNLP-2025/MultiModal-Intent-EMNLP-2025.
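The abstract does not spell out the debiasing procedure, but one plausible reading is that samples whose intent is recoverable from text alone are filtered out. The sketch below illustrates that idea only; `text_only_predict` and the sample schema are hypothetical stand-ins, not the authors' released code.

```python
# Illustrative sketch of text-sufficiency debiasing (an assumption, not the
# paper's actual framework): drop samples a text-only classifier already
# labels correctly, keeping only those that need non-text modalities.
from typing import Callable, Iterable

def debias_by_text_sufficiency(
    dataset: Iterable[dict],
    text_only_predict: Callable[[str], str],  # hypothetical text-only model
) -> list[dict]:
    kept = []
    for sample in dataset:  # each sample: {"text": str, "label": str, ...}
        # A correct text-only prediction marks the sample as text-biased.
        if text_only_predict(sample["text"]) != sample["label"]:
            kept.append(sample)
    return kept
```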
FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction
Akriti Jain | Saransh Sharma | Koyel Mukherjee | Soumyabrata Pal
Findings of the Association for Computational Linguistics: EMNLP 2025
Auto-regressive Large Language Models (LLMs) demonstrate remarkable performance across domains such as vision and language. However, due to sequential processing through multiple transformer layers, autoregressive decoding faces significant computational challenges, particularly in resource-constrained environments like mobile and edge devices. Existing approaches in the literature that aim to improve latency via skipping layers come in two distinct flavors: (1) early exit, and (2) input-agnostic heuristics where tokens exit at pre-determined layers irrespective of the input sequence. Both strategies have limitations: the former cannot be applied in the presence of KV caching, which is essential for speed-ups in modern inference frameworks, and the latter fails to capture variation in layer importance across tasks or, more generally, across input sequences. To address these limitations, we propose FiRST, a model-agnostic framework that reduces inference latency by using layer-specific routers to adaptively skip transformer layers during decoding, based on routing decisions made from the input prompt in the prefill stage. FiRST remains fully compatible with KV caching, enabling faster decoding while maintaining quality. Our method reveals that input adaptivity is essential: different tasks rely on different subsets of layers to evolve meaningful representations. Extensive experiments show that FiRST significantly reduces latency while outperforming existing layer-selection strategies in quality, retaining performance comparable to the base model without skipping. FiRST is thus a promising and efficient solution for LLM deployment in low-resource environments.
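A minimal sketch of the routing idea as the abstract describes it: per-layer routers score a pooled prompt representation during prefill, and the resulting skip decisions are held fixed for all decoded tokens, which keeps the KV caches of the layers that do run consistent. Class names, the pooling choice, and the threshold are illustrative assumptions, not FiRST's actual implementation.

```python
# Hypothetical sketch of prefill-time layer routing (not the FiRST code).
import torch
import torch.nn as nn

class LayerRouter(nn.Module):
    """Scores whether one transformer layer should run for a given prompt."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, prompt_repr: torch.Tensor) -> torch.Tensor:
        # prompt_repr: (hidden,) pooled prompt representation from prefill
        return torch.sigmoid(self.score(prompt_repr))

def select_layers(routers, prompt_repr, threshold=0.5):
    # Decided once per input sequence, so the same layers are skipped for
    # every decoded token and per-layer KV caches stay aligned.
    return [r(prompt_repr).item() >= threshold for r in routers]

def decode_step(layers, keep, hidden):
    for layer, run in zip(layers, keep):
        if run:
            hidden = layer(hidden)  # a layer that runs updates its own KV cache
    return hidden
```

Fixing the skip pattern at prefill time, rather than per token, is what makes this style of skipping compatible with KV caching: a layer is either in or out for the whole generation, so its cache never sees gaps.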