2025
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Samuel Cahyawijaya
|
Holy Lovenia
|
Joel Ruben Antony Moniz
|
Tack Hwa Wong
|
Mohammad Rifqi Farhansyah
|
Thant Thiri Maung
|
Frederikus Hudi
|
David Anugraha
|
Muhammad Ravi Shulthan Habibi
|
Muhammad Reza Qorib
|
Amit Agarwal
|
Joseph Marvin Imperial
|
Hitesh Laxmichand Patel
|
Vicky Feliren
|
Bahrul Ilmi Nasution
|
Manuel Antonio Rufino
|
Genta Indra Winata
|
Rian Adam Rajagede
|
Carlos Rafael Catalan
|
Mohamed Fazli Mohamed Imam
|
Priyaranjan Pattnayak
|
Salsabila Zahirah Pranida
|
Kevin Pratama
|
Yeshil Bangera
|
Adisai Na-Thalang
|
Patricia Nicole Monderin
|
Yueqi Song
|
Christian Simon
|
Lynnette Hui Xian Ng
|
Richardy Lobo Sapan
|
Taki Hasan Rafi
|
Bin Wang
|
Supryadi
|
Kanyakorn Veerakanjana
|
Piyalitt Ittichaiwong
|
Matthew Theodore Roque
|
Karissa Vincentio
|
Takdanai Kreangphet
|
Phakphum Artkaew
|
Kadek Hendrawan Palgunadi
|
Yanzhi Yu
|
Rochana Prih Hastuti
|
William Nixon
|
Mithil Bangera
|
Adrian Xuan Wei Lim
|
Aye Hninn Khine
|
Hanif Muhammad Zhafran
|
Teddy Ferdinan
|
Audra Aurora Izzani
|
Ayushman Singh
|
Evan Evan
|
Jauza Akbar Krito
|
Michael Anugraha
|
Fenal Ashokbhai Ilasariya
|
Haochen Li
|
John Amadeo Daniswara
|
Filbert Aurelian Tjiaranata
|
Eryawan Presma Yulianrifat
|
Can Udomcharoenchaikit
|
Fadil Risdian Ansori
|
Mahardika Krisna Ihsani
|
Giang Nguyen
|
Anab Maulana Barik
|
Dan John Velasco
|
Rifo Ahmad Genadi
|
Saptarshi Saha
|
Chengwei Wei
|
Isaiah Edri W. Flores
|
Kenneth Chen Ko Han
|
Anjela Gail D. Santos
|
Wan Shen Lim
|
Kaung Si Phyo
|
Tim Santos
|
Meisyarah Dwiastuti
|
Jiayun Luo
|
Jan Christian Blaise Cruz
|
Ming Shan Hee
|
Ikhlasul Akmal Hanif
|
M.Alif Al Hakim
|
Muhammad Rizky Sya’ban
|
Kun Kerdthaisong
|
Lester James Validad Miranda
|
Fajri Koto
|
Tirana Noor Fatyanosa
|
Alham Fikri Aji
|
Jostin Jerico Rosal
|
Jun Kevin
|
Robert Wijaya
|
Onno P. Kampman
|
Ruochen Zhang
|
Börje F. Karlsson
|
Peerat Limkonchotiwat
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite Southeast Asia’s (SEA) extraordinary linguistic and cultural diversity, the region remains significantly underrepresented in vision-language (VL) research, resulting in AI models that inadequately capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing culturally relevant high-quality datasets for SEA languages. By involving contributors from SEA countries, SEA-VL ensures better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages and cultural depictions in VL research. Our methodology employed three approaches: community-driven crowdsourcing with SEA contributors, automated image crawling, and synthetic image generation. We evaluated each method’s effectiveness in capturing cultural relevance. We found that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing, whereas synthetic image generation failed to accurately reflect SEA cultural nuances and contexts. Collectively, we gathered 1.28 million SEA culturally relevant images, more than 50 times larger than other existing datasets. This work bridges the representation gap in SEA, establishes a foundation for developing culturally aware AI systems for this region, and provides a replicable framework for addressing representation gaps in other underrepresented regions.
SpeechWeave: Diverse Multilingual Synthetic Text & Audio Data Generation Pipeline for Training Text to Speech Models
Karan Dua
|
Puneet Mittal
|
Ranjeet Gupta
|
Hitesh Laxmichand Patel
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
High-quality Text-to-Speech (TTS) model training requires extensive and diverse text and speech data. It is challenging to procure such data from real sources due to issues of domain specificity, licensing, and scalability. Large language models (LLMs) can generate textual data, but they produce repetitive text when the prompt carries insufficient variation during the generation process. Another important aspect of TTS training data is text normalization. Tools for normalization might occasionally introduce anomalies or overlook valuable patterns, and thus impact data quality. Furthermore, it is impractical to rely on voice artists for large-scale speech recording in commercial TTS systems with standardized voices. To address these challenges, we propose SpeechWeave, a synthetic speech data generation pipeline that is capable of automating the generation of multilingual, domain-specific datasets for training TTS models. Our experiments reveal that our pipeline generates data that is 10–48% more diverse than the baseline across various linguistic and phonetic metrics, along with speaker-standardized speech audio, while generating approximately 97% correctly normalized text. Our approach enables scalable, high-quality data generation for TTS training, improving diversity, normalization, and voice consistency in the generated datasets.
Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems
Hansa Meghwani
|
Amit Agarwal
|
Priyaranjan Pattnayak
|
Hitesh Laxmichand Patel
|
Srikant Panda
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Enterprise search systems often struggle to retrieve accurate, domain-specific information due to semantic mismatches and overlapping terminologies. These issues can degrade the performance of downstream applications such as knowledge management, customer support, and retrieval-augmented generation agents. To address this challenge, we propose a scalable hard-negative mining framework tailored specifically for domain-specific enterprise data. Our approach dynamically selects semantically challenging but contextually irrelevant documents to enhance deployed re-ranking models. Our method integrates diverse embedding models, performs dimensionality reduction, and uniquely selects hard negatives, ensuring computational efficiency and semantic precision. Evaluation on our proprietary enterprise corpus (cloud services domain) demonstrates substantial improvements of 15% in MRR@3 and 19% in MRR@10 compared to state-of-the-art baselines and other negative sampling techniques. Further validation on public domain-specific datasets (FiQA, Climate Fever, TechQA) confirms our method’s generalizability and readiness for real-world applications.
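For readers unfamiliar with the MRR@3 and MRR@10 figures quoted in this abstract, the following is a minimal sketch of the standard mean reciprocal rank at cutoff k. It is not taken from the paper; the function name `mrr_at_k` and the toy document ids are hypothetical and chosen only for illustration.

```python
from typing import Sequence, Set


def mrr_at_k(rankings: Sequence[Sequence[str]],
             relevant: Sequence[Set[str]],
             k: int) -> float:
    """Mean reciprocal rank of the first relevant document within the top k.

    rankings[i] is the ranked list of document ids returned for query i,
    relevant[i] is the set of ids judged relevant for that query.
    A query with no relevant document in its top k contributes 0.
    """
    if len(rankings) != len(relevant):
        raise ValueError("rankings and relevant must have the same length")
    if not rankings:
        return 0.0
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        # Ranks are 1-based: a hit in first position scores 1.0.
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)


# Hypothetical toy data: three queries with one relevant document each.
rankings = [["a", "b", "c"], ["x", "y", "z"], ["p", "q", "r", "s"]]
relevant = [{"b"}, {"x"}, {"s"}]
score_at_3 = mrr_at_k(rankings, relevant, 3)
score_at_10 = mrr_at_k(rankings, relevant, 10)
```

In the toy data the third query's relevant document sits at rank 4, so it counts toward the score at cutoff 10 but not at cutoff 3, which is why the two cutoffs can move differently for the same re-ranker.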
AccessEval: Benchmarking Disability Bias in Large Language Models
Srikant Panda
|
Amit Agarwal
|
Hitesh Laxmichand Patel
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) are increasingly deployed across diverse domains but often exhibit disparities in how they handle real-life queries. To systematically investigate these effects across various disability contexts, we introduce AccessEval, a large-scale benchmark evaluating a total of 21 closed- and open-source LLMs across six real-world domains and nine disability types using paired Neutral and Disability-Aware Queries. We evaluated model outputs with metrics for factual accuracy, sentiment, and social perception. Our analysis reveals that responses to disability-aware queries tend to have higher factual error, a more negative tone, and increased stereotyping in social perception compared to neutral queries. These effects show notable variation by domain and disability type. Disabilities affecting hearing, speech, and mobility are disproportionately impacted. These disparities reveal persistent forms of ableism, highlighting the need for more comprehensive and nuanced assessment. We further argue that framing bias in terms of model performance within real-world decision making helps to better link model behaviors to the potential harms users may face. This approach guides the development of more effective and tailored fairness interventions. AccessEval, therefore, serves as a crucial tool for advancing equitable and inclusive language technologies.
Aligning LLMs for Multilingual Consistency in Enterprise Applications
Amit Agarwal
|
Hansa Meghwani
|
Hitesh Laxmichand Patel
|
Tao Sheng
|
Sujith Ravi
|
Dan Roth
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large language models (LLMs) remain unreliable for global enterprise applications due to substantial performance gaps between high-resource and mid/low-resource languages, driven by English-centric pretraining and internal reasoning biases. This inconsistency undermines customer experience and operational reliability in multilingual settings such as customer support, content moderation, and information retrieval. Even with advanced Retrieval-Augmented Generation (RAG) systems, we observe up to a 29% accuracy drop in non-English languages compared to English. We propose a practical, batch-wise alignment strategy for fine-tuning LLMs, leveraging semantically equivalent multilingual data in each training batch to directly align model outputs across languages. This approach improves non-English accuracy by up to 23.9% without compromising English performance, model reasoning, or retrieval quality. Our method is simple to implement, scalable, and integrates seamlessly with existing LLM training and deployment pipelines, enabling more robust and equitable multilingual AI solutions in industry.
RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks
Amit Agarwal
|
Hitesh Laxmichand Patel
|
Srikant Panda
|
Hansa Meghwani
|
Jyotika Singh
|
Karan Dua
|
Paul Li
|
Tao Sheng
|
Sujith Ravi
|
Dan Roth
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Multimodal Large Language Models (MLLMs) have achieved impressive results on vision-language benchmarks, yet it remains unclear whether these benchmarks assess genuine global reasoning or allow success via localized visual cues. Existing evaluation methods do not explicitly measure this distinction, hindering effective dataset curation and real-world focused model development. We introduce the Region Comprehension Index (RCI), the first model-based score to directly quantify a dataset’s reliance on global versus local visual information. RCI systematically compares reference-model performance on image patches versus full images, revealing whether tasks require holistic image understanding or can be solved with partial or localized visual cues. When applying RCI to 13 widely used multimodal benchmarks, we observed that most of them favor localized reasoning and exhibit significant spatial biases, indicating potential risks in real-world applications. RCI equips researchers and practitioners with an actionable tool for diagnosing and mitigating these biases, enabling the construction of datasets and benchmarks to foster the development of robust, enterprise-ready multimodal systems.
PCRI: Measuring Context Robustness in Multimodal Models for Enterprise Applications
Hitesh Laxmichand Patel
|
Amit Agarwal
|
Srikant Panda
|
Hansa Meghwani
|
Karan Dua
|
Paul Li
|
Tao Sheng
|
Sujith Ravi
|
Dan Roth
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
The reliability of Multimodal Large Language Models (MLLMs) in real-world settings is often undermined by sensitivity to irrelevant or distracting visual context, an aspect not captured by existing evaluation metrics. We introduce the Patch Context Robustness Index (PCRI), the first systematic and interpretable score for quantifying MLLM robustness to variations in visual context granularity, measuring performance changes between localized image patches and full-image input. Applying PCRI to 19 state-of-the-art MLLMs across 15 vision-language benchmarks, we find that most leading models remain brittle to background noise, with only a few, such as InternVL2-26B and Qwen2VL-72B, demonstrating consistent robustness across tasks. PCRI analysis also highlights how different model architectures handle and integrate visual context, offering actionable diagnostic insight for both researchers and practitioners. PCRI enables rigorous comparison of context robustness, supporting principled model selection and guiding the development of future architectures and training strategies for robust, real-world deployment.
FlexDoc: Parameterized Sampling for Diverse Multilingual Synthetic Documents for Training Document Understanding Models
Karan Dua
|
Hitesh Laxmichand Patel
|
Puneet Mittal
|
Ranjeet Gupta
|
Amit Agarwal
|
Praneet Pabolu
|
Srikant Panda
|
Hansa Meghwani
|
Graham Horwood
|
Fahad Shah
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Developing document understanding models at enterprise scale requires large, diverse, and well-annotated datasets spanning a wide range of document types. However, collecting such data is prohibitively expensive due to privacy constraints, legal restrictions, and the sheer volume of manual annotation needed, with costs that can scale into millions of dollars. We introduce FlexDoc, a scalable synthetic data generation framework that combines Stochastic Schemas and Parameterized Sampling to produce realistic, multilingual semi-structured documents with rich annotations. By probabilistically modeling layout patterns, visual structure, and content variability, FlexDoc enables the controlled generation of diverse document variants at scale. Experiments on Key Information Extraction (KIE) tasks demonstrate that FlexDoc-generated data improves the absolute F1 Score by up to 11% when used to augment real datasets, while reducing annotation effort by over 90% compared to traditional hard-template methods. The solution is in active deployment, where it has accelerated the development of enterprise-grade document understanding models while significantly reducing data acquisition and annotation costs.
MVTamperBench: Evaluating Robustness of Vision-Language Models
Amit Agarwal
|
Srikant Panda
|
Angeline Charles
|
Hitesh Laxmichand Patel
|
Bhargava Kumar
|
Priyaranjan Pattnayak
|
Taki Hasan Rafi
|
Tejaswini Kumar
|
Hansa Meghwani
|
Karan Gupta
|
Dong-Kyu Chae
Findings of the Association for Computational Linguistics: ACL 2025
Multimodal Large Language Models (MLLMs) are a recent advancement of Vision-Language Models (VLMs) and have driven major advances in video understanding. However, their vulnerability to adversarial tampering and manipulations remains underexplored. To address this gap, we introduce MVTamperBench, a benchmark that systematically evaluates MLLM robustness against five prevalent tampering techniques: rotation, masking, substitution, repetition, and dropping. These techniques are based on real-world visual tampering scenarios such as surveillance interference, social media content edits, and misinformation injection. MVTamperBench comprises ~3.4K original videos, expanded into over 17K tampered clips covering 19 distinct video manipulation tasks. This benchmark challenges models to detect manipulations in spatial and temporal coherence. We evaluate 45 recent MLLMs from 15+ model families. We reveal substantial variability in resilience across tampering types and show that larger parameter counts do not necessarily guarantee robustness. MVTamperBench sets a new benchmark for developing tamper-resilient MLLMs in safety-critical applications, including detecting clickbait, preventing harmful content distribution, and enforcing policies on media platforms. We release all code, data, and benchmark to foster open research in trustworthy video understanding.
Hybrid AI for Responsive Multi-Turn Online Conversations with Novel Dynamic Routing and Feedback Adaptation
Priyaranjan Pattnayak
|
Amit Agarwal
|
Hansa Meghwani
|
Hitesh Laxmichand Patel
|
Srikant Panda
Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing
Retrieval-Augmented Generation (RAG) systems and large language model (LLM)-powered chatbots have significantly advanced conversational AI by combining generative capabilities with external knowledge retrieval. Despite their success, enterprise-scale deployments face critical challenges, including diverse user queries, high latency, hallucinations, and difficulty integrating frequently updated domain-specific knowledge. This paper introduces a novel hybrid framework that integrates RAG with intent-based canned responses, leveraging predefined high-confidence responses for efficiency while dynamically routing complex or ambiguous queries to the RAG pipeline. Our framework employs a dialogue context manager to ensure coherence in multi-turn interactions and incorporates a feedback loop to refine intents, dynamically adjust confidence thresholds, and expand response coverage over time. Experimental results demonstrate that the proposed framework achieves a balance of high accuracy (95%) and low latency (180ms), outperforming RAG and intent-based systems across diverse query types, positioning it as a scalable and adaptive solution for enterprise conversational AI applications.
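The routing idea in this abstract (serve a canned response when intent confidence is high, otherwise fall back to the RAG pipeline, and adjust the threshold from feedback) can be illustrated with a small sketch. This is not the authors' implementation: the class `HybridRouter`, its fields, the default threshold and step values, and the stub classifier and RAG callables are all hypothetical, and the dialogue context manager mentioned in the abstract is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class HybridRouter:
    """Hypothetical router: canned reply if confident, else RAG fallback."""
    canned: Dict[str, str]                         # intent -> predefined response
    classify: Callable[[str], Tuple[str, float]]   # query -> (intent, confidence)
    rag: Callable[[str], str]                      # query -> generated answer
    threshold: float = 0.8                         # assumed starting threshold
    step: float = 0.02                             # assumed adjustment size

    def answer(self, query: str) -> Tuple[str, str]:
        """Return (response, route), where route is 'canned' or 'rag'."""
        intent, confidence = self.classify(query)
        if intent in self.canned and confidence >= self.threshold:
            return self.canned[intent], "canned"
        # Low confidence or unknown intent: defer to the RAG pipeline.
        return self.rag(query), "rag"

    def feedback(self, route: str, helpful: bool) -> None:
        """Loosen the threshold after helpful canned replies, tighten otherwise."""
        if route != "canned":
            return
        delta = -self.step if helpful else self.step
        self.threshold = min(0.99, max(0.5, self.threshold + delta))


# Stub components standing in for a real intent classifier and RAG system.
def toy_classify(query: str) -> Tuple[str, float]:
    if "password" in query:
        return "reset_password", 0.9
    return "unknown", 0.3


def toy_rag(query: str) -> str:
    return "RAG:" + query


router = HybridRouter(
    canned={"reset_password": "Use the reset link on the sign-in page."},
    classify=toy_classify,
    rag=toy_rag,
)
first = router.answer("I forgot my password")
second = router.answer("compare the pricing plans")
router.feedback(first[1], helpful=False)
```

Clamping the threshold between 0.5 and 0.99 is a design choice made only for this sketch, so that repeated feedback can neither disable canned responses entirely nor route everything to them.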
SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use
Hitesh Laxmichand Patel
|
Amit Agarwal
|
Arion Das
|
Bhargava Kumar
|
Srikant Panda
|
Priyaranjan Pattnayak
|
Taki Hasan Rafi
|
Tejaswini Kumar
|
Dong-Kyu Chae
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
Enterprise customers are increasingly adopting Large Language Models (LLMs) for critical communication tasks, such as drafting emails, crafting sales pitches, and composing casual messages. Deploying such models across different regions requires them to understand diverse cultural and linguistic contexts and generate safe and respectful responses. For enterprise applications, it is crucial to mitigate reputational risks, maintain trust, and ensure compliance by effectively identifying and handling unsafe or offensive language. To address this, we introduce SweEval, a benchmark simulating real-world scenarios with variations in tone (positive or negative) and context (formal or informal). The prompts explicitly instruct the model to include specific swear words while completing the task. This benchmark evaluates whether LLMs comply with or resist such inappropriate instructions and assesses their alignment with ethical frameworks, cultural nuances, and language comprehension capabilities. In order to advance research in building ethically aligned AI systems for enterprise use and beyond, we release the dataset and code: https://github.com/amitbcp/multilingual_profanity.