Pulkit Arya


2025

Monolingual Adapter Networks for Efficient Cross-Lingual Alignment
Pulkit Arya
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)

Multilingual alignment for low-resource languages is a challenge for embedding models. The scarcity of parallel datasets, together with the rich morphological diversity of these languages, adds to the complexity of training multilingual embedding models. To aid in the development of multilingual models for under-represented languages such as Sanskrit, we introduce GitaDB: a collection of 640 Sanskrit verses translated into 5 Indic languages and English. We benchmarked various state-of-the-art embedding models on our dataset in different bilingual and cross-lingual semantic retrieval tasks of increasing complexity and found a steep degradation in retrieval scores. We also found a wide margin in retrieval performance between English and Sanskrit targets. To bridge this gap, we introduce Monolingual Adapter Networks: a parameter-efficient method to bolster cross-lingual alignment of embedding models without the need for parallel corpora or full finetuning.

2023

Bootstrapping a Conversational Guide for Colonoscopy Prep
Pulkit Arya | Madeleine Bloomquist | Subhankar Chakraborty | Andrew Perrault | William Schuler | Eric Fosler-Lussier | Michael White
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Creating conversational systems for niche domains is a challenging task, further exacerbated by a lack of quality datasets. We explore the construction of safer conversational systems for guiding patients in preparing for colonoscopies. This required a data generation pipeline to produce a minimum viable dataset to bootstrap a semantic parser, augmented by automatic paraphrasing. Our study suggests that large language models (e.g., GPT-3.5 and GPT-4) are a viable alternative to crowdsourced paraphrasing, but conversational systems that rely on language models’ ability to do temporal reasoning struggle to provide accurate responses. A neural-symbolic system that performs temporal reasoning on an intermediate representation of user queries shows promising results compared to an end-to-end dialogue system, improving the number of correct responses while vastly reducing the number of incorrect or misleading ones.