Kalpa Gunaratna
2026
Switching Heads and Softening Tokens: Turnkey Solutions to Visually Grounded Document QA
Ximing Wen | Wenbo Li | Sudipta Paul | Yashas Malur Saidutta | Kalpa Gunaratna | Srinivas Chappidi
Findings of the Association for Computational Linguistics: ACL 2026
Visually Grounded Document Question Answering often lacks robust, end-to-end solutions capable of handling complex, multi-answer queries without reliance on ad-hoc processing. In this work, we propose two turnkey LLM architectures to address this gap. We first introduce a single-head architecture in which coordinates are represented as special tokens within the unified vocabulary. While structurally robust, this approach suffers from the limitations of discrete supervision; to address this, we propose a novel “softening token” method that enables a differentiable Mean-Squared-Error loss over token probabilities. Although this significantly improves visual grounding, spatial precision remains bounded by discretization. Consequently, we propose a second solution: a dual-head architecture that alternates between text generation and regression-based bounding box prediction. This method offers high spatial precision via a regression head, further stabilized by our introduction of an Intersection-over-Union loss. Finally, by combining the single-head model’s structural robustness with the high precision of the dual-head model, we propose an ensemble method that yields significant performance gains beyond each of the individual components.
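The two losses named in this abstract can be illustrated with a minimal sketch. It is one plausible reading of the abstract, not the authors' implementation. It assumes the "softening token" idea means taking the probability-weighted average of the values of the coordinate tokens and applying squared error to that expectation, and that the Intersection-over-Union loss is the common `1 - IoU` form. All names here (`soft_coordinate`, `soft_token_mse`, `iou_loss`, `bin_values`) are hypothetical, and plain Python stands in for a differentiable framework.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_coordinate(logits, bin_values):
    """Expected coordinate under the distribution over coordinate tokens.

    Assumption: each coordinate token k stands for the value bin_values[k].
    """
    probs = softmax(logits)
    return sum(p * v for p, v in zip(probs, bin_values))

def soft_token_mse(logits, bin_values, target):
    """Squared error between the expected coordinate and the true coordinate.

    Unlike picking the arg-max token, the expectation varies smoothly with
    the logits, which is what makes a regression-style loss usable.
    """
    return (soft_coordinate(logits, bin_values) - target) ** 2

def iou_loss(pred, gold):
    """1 - IoU for axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], gold[0]), max(pred[1], gold[1])
    ix2, iy2 = min(pred[2], gold[2]), min(pred[3], gold[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gold = (gold[2] - gold[0]) * (gold[3] - gold[1])
    union = area_pred + area_gold - inter
    # Degenerate boxes with zero union are treated as a complete miss.
    return 1.0 - inter / union if union > 0 else 1.0
```

In this sketch, a target that falls between two bin values can still be matched by splitting probability mass across the neighbouring tokens, whereas at decoding time a single discrete token must be chosen, which is consistent with the abstract's remark that precision stays limited by discretization.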
IMRNNs: An Efficient Method for Interpretable Dense Retrieval via Embedding Modulation
Yash Saxena | Ankur Padia | Kalpa Gunaratna | Manas Gaur
Findings of the Association for Computational Linguistics: EACL 2026
Interpretability in black-box dense retrievers remains a central challenge in Retrieval-Augmented Generation (RAG). Understanding how queries and documents semantically interact is critical for diagnosing retrieval behavior and improving model design. However, existing dense retrievers rely on static embeddings for both queries and documents, which obscures this bidirectional relationship. Post-hoc approaches such as re-rankers are computationally expensive, add inference latency, and still fail to reveal the underlying semantic alignment. To address these limitations, we propose Interpretable Modular Retrieval Neural Networks (IMRNNs), a lightweight framework that augments any dense retriever with dynamic, bidirectional modulation at inference time. IMRNNs employ two independent adapters: one conditions document embeddings on the current query, while the other refines the query embedding using corpus-level feedback from initially retrieved documents. This iterative modulation process enables the model to adapt representations dynamically and expose interpretable semantic dependencies between queries and documents. Empirically, IMRNNs not only enhance interpretability but also improve retrieval effectiveness. Across seven benchmark datasets, applying our method to standard dense retrievers yields average gains of +6.35% nDCG, +7.14% recall, and +7.04% MRR over state-of-the-art baselines. These results demonstrate that incorporating interpretability-driven modulation can both explain and enhance retrieval in RAG systems.
2022
Explainable Slot Type Attentions to Improve Joint Intent Detection and Slot Filling
Kalpa Gunaratna | Vijay Srinivasan | Akhila Yerukola | Hongxia Jin
Findings of the Association for Computational Linguistics: EMNLP 2022
Joint intent detection and slot filling is a key research topic in natural language understanding (NLU). Existing joint intent and slot filling systems analyze and compute features collectively for all slot types and, importantly, have no way to explain the slot filling model's decisions. In this work, we propose a novel approach that (i) learns to generate additional slot-type-specific features in order to improve accuracy and (ii) provides explanations for slot filling decisions for the first time in a joint NLU model. We apply additional constrained supervision, using a set of binary classifiers for the slot-type-specific feature learning, thus ensuring that appropriate attention weights are learned in the process to explain slot filling decisions for utterances. Our model is inherently explainable and does not need any post-hoc processing. We evaluate our approach on two widely used datasets and show accuracy improvements. We also provide a detailed analysis of the slot explainability that is exclusive to our model.