Amit Agarwal
2026
RECOR: Reasoning-focused Multi-turn Conversational Retrieval Benchmark
Mohammed Ali | Abdelrahman Abdallah | Amit Agarwal | Hitesh Laxmichand Patel | Adam Jatowt
Findings of the Association for Computational Linguistics: ACL 2026
Mohammed Ali | Abdelrahman Abdallah | Amit Agarwal | Hitesh Laxmichand Patel | Adam Jatowt
Findings of the Association for Computational Linguistics: ACL 2026
Existing benchmarks treat multi-turn conversation and reasoning-intensive retrieval separately, yet real-world information seeking requires both. To bridge this gap, we present a benchmark for reasoning-based conversational information retrieval comprising 707 conversations (2,971 turns) across eleven domains. To ensure quality, our Decomposition-and-Verification framework transforms complex queries into fact-grounded multi-turn dialogues through multi-level validation, where atomic facts are verified against sources and explicit retrieval reasoning is generated for each turn. Comprehensive evaluation reveals that combining conversation history with reasoning doubles retrieval performance (Baseline .236 → History+Reasoning .479 nDCG@10), while reasoning-specialized models substantially outperform dense encoders. Despite these gains, further analysis highlights that implicit reasoning remains challenging, particularly when logical connections are not explicitly stated in the text. [<https://github.com/RECOR-Benchmark/RECOR>]
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
Pedro Ortiz Suarez | Laurie Burchell | Catherine Arnett | Rafael Mosquera | Sara Hincapi\'e Monsalve | Thom Vaughan | Damian Stewart | Malte Ostendorff | Idris Abdulmumin | Vukosi Marivate | Shamsuddeen Hassan Muhammad | Atnafu Lambebo Tonja | Hend Al-Khalifa | Nadia Ghezaiel Hammouda | Verrah Akinyi Otiende | Tack Hwa Wong | Jakhongir Saydaliev | Melika Nobakhtian | Muhammad Ravi Shulthan Habibi | Chalamalasetti Kranti | Carol Muchemi | Khang Nguyen | Faisal Muhammad Adam | Luis Frentzen Salim | Reem Alqifari | Cynthia Jayne Amol | Joseph Marvin Imperial | Ilker Kesen | Ahmad Mustafid | Pavel Stepachev | Leshem Choshen | David Anugraha | Hamada Nayel | Seid Muhie Yimam | Vallerie Alexandra Putra | My Chiffon Nguyen | Azmine Toushik Wasi | Gouthami Vadithya | Rob Van Der Goot | Lanwenn ar C'horr | Karan Dua | Andrew Yates | Mithil Bangera | Yeshil Bangera | Hitesh Laxmichand Patel | Shu Okabe | Fenal Ashokbhai Ilasariya | Dmitry Gaynullin | Genta Indra Winata | Yiyuan Li | Juan Pablo Mart{\'\i}nez | Amit Agarwal | Ikhlasul Akmal Hanif | Raia Abu Ahmad | Esther Adenuga | Filbert Aurelian Tjiaranata | Weerayut Buaphet | Michael Anugraha | Sowmya Vajjala | Benjamin L Rice | Azril Hafizi Amirudin | Jesujoba Oluwadara Alabi | Srikant Panda | Yassine Toughrai | Bruhan Kyomuhendo | Daniel Ruffinelli | Akshata | Manuel Goul\~ao | Ej Zhou | Ingrid Gabriela Franco Ramirez | Cristina Aggazzotti | Konstantin Dobler | Jun Kevin | Quentin Pag\`es | Nicholas Andrews | Nuhu Ibrahim | Mattes Ruckdeschel | Amr Keleg | Mike Zhang | Casper Rufaro Muziri | Saron Samuel | Sotaro Takeshita | Kun Kerdthaisong | Luca Foppiano | Rasul Dent | Tommaso Green | Ahmad Mustapha Wali | Kamohelo Makaaka | Vicky Feliren | Inshirah Idris | Hande Celikkanat | Abdulhamid Abubakar | Jean Maillard | Beno{\^\i}t Sagot | Thibault Cl\'erice | Kenton Murray | Sarah K. K. Luger
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Pedro Ortiz Suarez | Laurie Burchell | Catherine Arnett | Rafael Mosquera | Sara Hincapi\'e Monsalve | Thom Vaughan | Damian Stewart | Malte Ostendorff | Idris Abdulmumin | Vukosi Marivate | Shamsuddeen Hassan Muhammad | Atnafu Lambebo Tonja | Hend Al-Khalifa | Nadia Ghezaiel Hammouda | Verrah Akinyi Otiende | Tack Hwa Wong | Jakhongir Saydaliev | Melika Nobakhtian | Muhammad Ravi Shulthan Habibi | Chalamalasetti Kranti | Carol Muchemi | Khang Nguyen | Faisal Muhammad Adam | Luis Frentzen Salim | Reem Alqifari | Cynthia Jayne Amol | Joseph Marvin Imperial | Ilker Kesen | Ahmad Mustafid | Pavel Stepachev | Leshem Choshen | David Anugraha | Hamada Nayel | Seid Muhie Yimam | Vallerie Alexandra Putra | My Chiffon Nguyen | Azmine Toushik Wasi | Gouthami Vadithya | Rob Van Der Goot | Lanwenn ar C'horr | Karan Dua | Andrew Yates | Mithil Bangera | Yeshil Bangera | Hitesh Laxmichand Patel | Shu Okabe | Fenal Ashokbhai Ilasariya | Dmitry Gaynullin | Genta Indra Winata | Yiyuan Li | Juan Pablo Mart{\'\i}nez | Amit Agarwal | Ikhlasul Akmal Hanif | Raia Abu Ahmad | Esther Adenuga | Filbert Aurelian Tjiaranata | Weerayut Buaphet | Michael Anugraha | Sowmya Vajjala | Benjamin L Rice | Azril Hafizi Amirudin | Jesujoba Oluwadara Alabi | Srikant Panda | Yassine Toughrai | Bruhan Kyomuhendo | Daniel Ruffinelli | Akshata | Manuel Goul\~ao | Ej Zhou | Ingrid Gabriela Franco Ramirez | Cristina Aggazzotti | Konstantin Dobler | Jun Kevin | Quentin Pag\`es | Nicholas Andrews | Nuhu Ibrahim | Mattes Ruckdeschel | Amr Keleg | Mike Zhang | Casper Rufaro Muziri | Saron Samuel | Sotaro Takeshita | Kun Kerdthaisong | Luca Foppiano | Rasul Dent | Tommaso Green | Ahmad Mustapha Wali | Kamohelo Makaaka | Vicky Feliren | Inshirah Idris | Hande Celikkanat | Abdulhamid Abubakar | Jean Maillard | Beno{\^\i}t Sagot | Thibault Cl\'erice | Kenton Murray | Sarah K. K. Luger
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID’s value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.
Do Image–Text Metrics Respect Semantic Invariances?
Amit Agarwal | Hitesh Laxmichand Patel | Meizhu Liu | Jyotika Singh | Karan Dua | Hansa Meghwani | Matthew Rowe | M. Avendi | Yassi Abbasi | Tao Sheng | Sujith Ravi | Dan Roth
Findings of the Association for Computational Linguistics: ACL 2026
Amit Agarwal | Hitesh Laxmichand Patel | Meizhu Liu | Jyotika Singh | Karan Dua | Hansa Meghwani | Matthew Rowe | M. Avendi | Yassi Abbasi | Tao Sheng | Sujith Ravi | Dan Roth
Findings of the Association for Computational Linguistics: ACL 2026
Reference-free image–to–text evaluators are now standard for scoring image–caption alignment, yet it is unclear whether they respect semantic invariances. We present an invariance probe on five popular evaluators (CLIPScore, PAC-S, UMIC, FLEUR, and a deterministic LLM judge) under semantics-preserving perturbations along three axes: spatial (flips, context-preserving repositioning, light rotations), object (scale, category), and socio-linguistic framing (cultural/economic adjectives with neutral and length-matched controls). Across curated slices of three detection datasets and three caption evaluation suites, we find consistent non-semantic sensitivities: benign spatial edits and simple phrasing changes shift scores by (≈)6–9% on average, and for systems separated by just 0.7% these shifts can cause ranking flips in upto (∼)37% of cases, particularly under spatial changes. A small human study also supports this finding and confirms that annotators generally judge perturbed pairs as equally correct, so these shifts reflect metric behavior rather than semantic change. We further propose invariance-calibrated scoring, a post-hoc adjustment that roughly halves median absolute sensitivity while retaining correlation with learned caption evaluators.
2025
RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks
Amit Agarwal | Hitesh Laxmichand Patel | Srikant Panda | Hansa Meghwani | Jyotika Singh | Karan Dua | Paul Li | Tao Sheng | Sujith Ravi | Dan Roth
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Amit Agarwal | Hitesh Laxmichand Patel | Srikant Panda | Hansa Meghwani | Jyotika Singh | Karan Dua | Paul Li | Tao Sheng | Sujith Ravi | Dan Roth
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Multimodal Large Language Models (MLLMs) have achieved impressive results on vision-language benchmarks, yet it remains unclear whether these benchmarks assess genuine global reasoning or allow success via localized visual cues. Existing evaluation methods do not explicitly measure this distinction, hindering effective dataset curation and real-world focused model development.We introduce Region Comprehension Index (RCI), the first model-based score to directly quantify a dataset’s reliance on global versus local visual information. RCI systematically compares reference-model performance on image patches versus full images, revealing if tasks require holistic image understanding or can be solved with partial or localized visual cues.When applying RCI to 13 widely used multimodal benchmarks, we observed that most of them favor localized reasoning and exhibit significant spatial biases, indicating potential risks in real-world applications. RCI equips researchers & practitioners with an actionable tool for diagnosing & mitigating these biases, enabling the construction of datasets and benchmarks to foster the development of robust, enterprise-ready multimodal systems.
Can LLMs Narrate Tabular Data? An Evaluation Framework for Natural Language Representations of Text-to-SQL System Outputs
Jyotika Singh | Weiyi Sun | Amit Agarwal | Viji Krishnamurthy | Yassine Benajiba | Sujith Ravi | Dan Roth
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Jyotika Singh | Weiyi Sun | Amit Agarwal | Viji Krishnamurthy | Yassine Benajiba | Sujith Ravi | Dan Roth
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
In modern industry systems like multi-turn chat agents, Text-to-SQL technology bridges natural language (NL) questions and database (DB) querying. The conversion of tabular DB results into NL representations (NLRs) enables the chat-based interaction. Currently, NLR generation is typically handled by large language models (LLMs), but information loss or errors in presenting tabular results in NL remains largely unexplored.This paper introduces a novel evaluation method - Combo-Eval - for judgment of LLM-generated NLRs that combines the benefits of multiple existing methods, optimizing evaluation fidelity and achieving a significant reduction in LLM calls by 25-61%. Accompanying our method is NLR-BIRD, the first dedicated dataset for NLR benchmarking. Through human evaluations, we demonstrate the superior alignment of Combo-Eval with human judgments, applicable across scenarios with and without ground truth references.
LLM-Guided Lifecycle-Aware Clustering of Multi-Turn Customer Support Conversations
Priyaranjan Pattnayak | Sanchari Chowdhuri | Amit Agarwal | Hitesh Laxmichand Patel
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Priyaranjan Pattnayak | Sanchari Chowdhuri | Amit Agarwal | Hitesh Laxmichand Patel
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Clustering customer chat data is vital for cloud providers handling multi-service queries. Traditional methods struggle with overlapping concerns and create broad, static clusters that degrade over time. Re-clustering disrupts continuity, making issue tracking difficult. We propose an adaptive system that segments multi-turn chats into service-specific concerns and incrementally refines clusters as new issues arise. Cluster quality is tracked via Davies–Bouldin Index (DBI) and Silhouette Scores, with LLM-based splitting applied only to degraded clusters. Our method improves Silhouette Scores by over 100% and reduces DBI by 65.6% compared to baselines, enabling scalable, real-time analytics without full re-clustering.
A Diagnostic Framework for Auditing Reference-Free Vision-Language Metrics
Angeline Charles | Srikant Panda | Amit Agarwal | Hitesh Laxmichand Patel | Priyaranjan Pattnayak | Bhargava Kumar | Tejaswini Kumar
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Angeline Charles | Srikant Panda | Amit Agarwal | Hitesh Laxmichand Patel | Priyaranjan Pattnayak | Bhargava Kumar | Tejaswini Kumar
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Reference-free metrics such as CLIPScore and PAC-S are increasingly used in vision-language tasks due to their scalability and independence from human-written references. However, their reliability under linguistic, visual, and cultural variation remains underexplored. In this work, we present a systematic audit of CLIPScore and PAC-S using an eight-factor diagnostic framework applied to MS-COCO validation images. Our analysis reveals consistent failure modes across dimensions including object size, content category, syntax, named entities, spatial relations and cultural context. Both metrics penalize captions referencing African (−5.5%, −4.8%) and Arabian (−4.9%, −5.3%) cultures, favor large-object and animal-centric scenes (by 20-30%) and show limited sensitivity to spatial negation and word order. CLIPScore correlates more strongly with syntactic complexity, while PAC-S demonstrates greater robustness to verbosity and named–entity variation highlighting complementary strengths rather than superiority. These findings expose cultural and content bias, weak semantic robustness, and limited compositional understanding. We conclude with design recommendations to improve fairness, scale invariance, and semantic grounding in future reference-free evaluation metrics.
FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich Document Understanding
Amit Agarwal | Srikant Panda | Kulbhushan Pachauri
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track
Amit Agarwal | Srikant Panda | Kulbhushan Pachauri
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track
In this work, we propose Few Shot Domain Adapting Graph (FS-DAG), a scalable and efficient model architecture for visually rich document understanding (VRDU) in few-shot settings. FS-DAG leverages domain-specific and language/vision specific backbones within a modular framework to adapt to diverse document types with minimal data. The model is robust to practical challenges such as handling OCR errors, misspellings, and domain shifts, which are critical in real-world deployments. FS-DAG is highly performant with less than 90M parameters, making it well-suited for complex real-world applications for Information Extraction (IE) tasks where computational resources are limited. We demonstrate FS-DAG’s capability through extensive experiments for information extraction task, showing significant improvements in convergence speed and performance compared to state-of-the-art methods. Additionally, this work highlights the ongoing progress in developing smaller, more efficient models that do not compromise on performance.
Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems
Hansa Meghwani | Amit Agarwal | Priyaranjan Pattnayak | Hitesh Laxmichand Patel | Srikant Panda
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Hansa Meghwani | Amit Agarwal | Priyaranjan Pattnayak | Hitesh Laxmichand Patel | Srikant Panda
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Enterprise search systems often struggle to retrieve accurate, domain-specific information due to semantic mismatches and overlapping terminologies. These issues can degrade the performance of downstream applications such as knowledge management, customer support, and retrieval-augmented generation agents. To address this challenge, we propose a scalable hard-negative mining framework tailored specifically for domain-specific enterprise data. Our approach dynamically selects semantically challenging but contextually irrelevant documents to enhance deployed re-ranking models.Our method integrates diverse embedding models, performs dimensionality reduction, and uniquely selects hard negatives, ensuring computational efficiency and semantic precision. Evaluation on our proprietary enterprise corpus (cloud services domain) demonstrates substantial improvements of 15% in MRR@3 and 19% in MRR@10 compared to state-of-the-art baselines and other negative sampling techniques. Further validation on public domain-specific datasets (FiQA, Climate Fever, TechQA) confirms our method’s generalizability and readiness for real-world applications.
AccessEval: Benchmarking Disability Bias in Large Language Models
Srikant Panda | Amit Agarwal | Hitesh Laxmichand Patel
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Srikant Panda | Amit Agarwal | Hitesh Laxmichand Patel
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) are increasingly deployed across diverse domains but often exhibit disparities in how they handle real life queries. To systematically investigate these effects with various disability context, we introduce AccessEval, a large-scale benchmark evaluating total 21 close & open source LLMs across six real-world domains and nine disability types using paired Neutral and Disability-Aware Queries. We evaluated model outputs with metrics for factual accuracy, sentiment, and social perception.Our analysis reveals that responses to disability-aware queries tend to have higher factual error, more negative tone, and increased stereotyping with social perception compared to neutral queries. These effects show notable variation by domain and disability type. Disabilities affecting hearing, speech and mobility are disproportionately impacted. These disparities reveal persistent forms of ableism, highlighting the need for more comprehensive and nuanced assessment.We further argue that framing bias in terms of model performance within real-world decision making helps to better link model behaviors to the potential harms users may face. This approach guides the development of more effective and tailored fairness interventions. AccessEval, therefore, serves as a crucial tool for advancing equitable and inclusive language technologies.
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Samuel Cahyawijaya | Holy Lovenia | Joel Ruben Antony Moniz | Tack Hwa Wong | Mohammad Rifqi Farhansyah | Thant Thiri Maung | Frederikus Hudi | David Anugraha | Muhammad Ravi Shulthan Habibi | Muhammad Reza Qorib | Amit Agarwal | Joseph Marvin Imperial | Hitesh Laxmichand Patel | Vicky Feliren | Bahrul Ilmi Nasution | Manuel Antonio Rufino | Genta Indra Winata | Rian Adam Rajagede | Carlos Rafael Catalan | Mohamed Fazli Mohamed Imam | Priyaranjan Pattnayak | Salsabila Zahirah Pranida | Kevin Pratama | Yeshil Bangera | Adisai Na-Thalang | Patricia Nicole Monderin | Yueqi Song | Christian Simon | Lynnette Hui Xian Ng | Richardy Lobo Sapan | Taki Hasan Rafi | Bin Wang | Supryadi | Kanyakorn Veerakanjana | Piyalitt Ittichaiwong | Matthew Theodore Roque | Karissa Vincentio | Takdanai Kreangphet | Phakphum Artkaew | Kadek Hendrawan Palgunadi | Yanzhi Yu | Rochana Prih Hastuti | William Nixon | Mithil Bangera | Adrian Xuan Wei Lim | Aye Hninn Khine | Hanif Muhammad Zhafran | Teddy Ferdinan | Audra Aurora Izzani | Ayushman Singh | Evan Evan | Jauza Akbar Krito | Michael Anugraha | Fenal Ashokbhai Ilasariya | Haochen Li | John Amadeo Daniswara | Filbert Aurelian Tjiaranata | Eryawan Presma Yulianrifat | Can Udomcharoenchaikit | Fadil Risdian Ansori | Mahardika Krisna Ihsani | Giang Nguyen | Anab Maulana Barik | Dan John Velasco | Rifo Ahmad Genadi | Saptarshi Saha | Chengwei Wei | Isaiah Edri W. Flores | Kenneth Chen Ko Han | Anjela Gail D. Santos | Wan Shen Lim | Kaung Si Phyo | Tim Santos | Meisyarah Dwiastuti | Jiayun Luo | Jan Christian Blaise Cruz | Ming Shan Hee | Ikhlasul Akmal Hanif | M.Alif Al Hakim | Muhammad Rizky Sya’ban | Kun Kerdthaisong | Lester James Validad Miranda | Fajri Koto | Tirana Noor Fatyanosa | Alham Fikri Aji | Jostin Jerico Rosal | Jun Kevin | Robert Wijaya | Onno P. Kampman | Ruochen Zhang | Börje F. Karlsson | Peerat Limkonchotiwat
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Samuel Cahyawijaya | Holy Lovenia | Joel Ruben Antony Moniz | Tack Hwa Wong | Mohammad Rifqi Farhansyah | Thant Thiri Maung | Frederikus Hudi | David Anugraha | Muhammad Ravi Shulthan Habibi | Muhammad Reza Qorib | Amit Agarwal | Joseph Marvin Imperial | Hitesh Laxmichand Patel | Vicky Feliren | Bahrul Ilmi Nasution | Manuel Antonio Rufino | Genta Indra Winata | Rian Adam Rajagede | Carlos Rafael Catalan | Mohamed Fazli Mohamed Imam | Priyaranjan Pattnayak | Salsabila Zahirah Pranida | Kevin Pratama | Yeshil Bangera | Adisai Na-Thalang | Patricia Nicole Monderin | Yueqi Song | Christian Simon | Lynnette Hui Xian Ng | Richardy Lobo Sapan | Taki Hasan Rafi | Bin Wang | Supryadi | Kanyakorn Veerakanjana | Piyalitt Ittichaiwong | Matthew Theodore Roque | Karissa Vincentio | Takdanai Kreangphet | Phakphum Artkaew | Kadek Hendrawan Palgunadi | Yanzhi Yu | Rochana Prih Hastuti | William Nixon | Mithil Bangera | Adrian Xuan Wei Lim | Aye Hninn Khine | Hanif Muhammad Zhafran | Teddy Ferdinan | Audra Aurora Izzani | Ayushman Singh | Evan Evan | Jauza Akbar Krito | Michael Anugraha | Fenal Ashokbhai Ilasariya | Haochen Li | John Amadeo Daniswara | Filbert Aurelian Tjiaranata | Eryawan Presma Yulianrifat | Can Udomcharoenchaikit | Fadil Risdian Ansori | Mahardika Krisna Ihsani | Giang Nguyen | Anab Maulana Barik | Dan John Velasco | Rifo Ahmad Genadi | Saptarshi Saha | Chengwei Wei | Isaiah Edri W. Flores | Kenneth Chen Ko Han | Anjela Gail D. Santos | Wan Shen Lim | Kaung Si Phyo | Tim Santos | Meisyarah Dwiastuti | Jiayun Luo | Jan Christian Blaise Cruz | Ming Shan Hee | Ikhlasul Akmal Hanif | M.Alif Al Hakim | Muhammad Rizky Sya’ban | Kun Kerdthaisong | Lester James Validad Miranda | Fajri Koto | Tirana Noor Fatyanosa | Alham Fikri Aji | Jostin Jerico Rosal | Jun Kevin | Robert Wijaya | Onno P. Kampman | Ruochen Zhang | Börje F. Karlsson | Peerat Limkonchotiwat
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite Southeast Asia’s (SEA) extraordinary linguistic and cultural diversity, the region remains significantly underrepresented in vision-language (VL) research, resulting in AI models that inadequately capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing culturally relevant high-quality datasets for SEA languages. By involving contributors from SEA countries, SEA-VL ensures better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages and cultural depictions in VL research. Our methodology employed three approaches: community-driven crowdsourcing with SEA contributors, automated image crawling, and synthetic image generation. We evaluated each method’s effectiveness in capturing cultural relevance. We found that image crawling achieves approximately ~85% cultural relevance while being more cost- and time-efficient than crowdsourcing, whereas synthetic image generation failed to accurately reflect SEA cultural nuances and contexts. Collectively, we gathered 1.28 million SEA culturally relevant images, more than 50 times larger than other existing datasets. This work bridges the representation gap in SEA, establishes a foundation for developing culturally aware AI systems for this region, and provides a replicable framework for addressing representation gaps in other underrepresented regions.
PCRI: Measuring Context Robustness in Multimodal Models for Enterprise Applications
Hitesh Laxmichand Patel | Amit Agarwal | Srikant Panda | Hansa Meghwani | Karan Dua | Paul Li | Tao Sheng | Sujith Ravi | Dan Roth
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Hitesh Laxmichand Patel | Amit Agarwal | Srikant Panda | Hansa Meghwani | Karan Dua | Paul Li | Tao Sheng | Sujith Ravi | Dan Roth
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
The reliability of Multimodal Large Language Models (MLLMs) in real-world settings is often undermined by sensitivity to irrelevant or distracting visual context, an aspect not captured by existing evaluation metrics. We introduce the Patch Context Robustness Index (PCRI), the first systematic and interpretable score for quantifying MLLM robustness to variations in visual context granularity, measuring performance changes between localized image patches and full-image input.Applying PCRI to 19 state-of-the-art MLLMs across 15 vision-language benchmarks, we find that most leading models remain brittle to background noise, with only a few, such as InternVL2-26B and Qwen2VL-72B, demonstrating consistent robustness across tasks. PCRI analysis also highlights how different model architectures handle and integrate visual context, offering actionable diagnostic insight for both researchers and practitioners.PCRI enables rigorous comparison of context robustness, supporting principled model selection and guiding the development of future architectures and training strategies for robust, real-world deployment.
SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use
Hitesh Laxmichand Patel | Amit Agarwal | Arion Das | Bhargava Kumar | Srikant Panda | Priyaranjan Pattnayak | Taki Hasan Rafi | Tejaswini Kumar | Dong-Kyu Chae
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
Hitesh Laxmichand Patel | Amit Agarwal | Arion Das | Bhargava Kumar | Srikant Panda | Priyaranjan Pattnayak | Taki Hasan Rafi | Tejaswini Kumar | Dong-Kyu Chae
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
Enterprise customers are increasingly adopting Large Language Models (LLMs) for critical communication tasks, such as drafting emails, crafting sales pitches, and composing casual messages. Deploying such models across different regions requires them to understand diverse cultural and linguistic contexts and generate safe and respectful responses. For enterprise applications, it is crucial to mitigate reputational risks, maintain trust, and ensure compliance by effectively identifying and handling unsafe or offensive language. To address this, we introduce SweEval, a benchmark simulating real-world scenarios with variations in tone (positive or negative) and context (formal or informal). The prompts explicitly instruct the model to include specific swear words while completing the task. This benchmark evaluates whether LLMs comply with or resist such inappropriate instructions and assesses their alignment with ethical frameworks, cultural nuances, and language comprehension capabilities. In order to advance research in building ethically aligned AI systems for enterprise use and beyond, we release the dataset and code: https://github.com/amitbcp/multilingual_profanity.
Aligning LLMs for Multilingual Consistency in Enterprise Applications
Amit Agarwal | Hansa Meghwani | Hitesh Laxmichand Patel | Tao Sheng | Sujith Ravi | Dan Roth
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Amit Agarwal | Hansa Meghwani | Hitesh Laxmichand Patel | Tao Sheng | Sujith Ravi | Dan Roth
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large language models (LLMs) remain unreliable for global enterprise applications due to substantial performance gaps between high-resource and mid/low-resource languages, driven by English-centric pretraining and internal reasoning biases. This inconsistency undermines customer experience and operational reliability in multilingual settings such as customer support, content moderation, and information retrieval. Even with advanced Retrieval-Augmented Generation (RAG) systems, we observe up to an 29% accuracy drop in non-English languages compared to English.We propose a practical, batch-wise alignment strategy for fine-tuning LLMs, leveraging semantically equivalent multilingual data in each training batch to directly align model outputs across languages. This approach improves non-English accuracy by up to 23.9% without compromising English performance, model reasoning, or retrieval quality. Our method is simple to implement, scalable, and integrates seamlessly with existing LLM training & deployment pipelines, enabling more robust and equitable multilingual AI solutions in industry.
Hybrid AI for Responsive Multi-Turn Online Conversations with Novel Dynamic Routing and Feedback Adaptation
Priyaranjan Pattnayak | Amit Agarwal | Hansa Meghwani | Hitesh Laxmichand Patel | Srikant Panda
Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing
Priyaranjan Pattnayak | Amit Agarwal | Hansa Meghwani | Hitesh Laxmichand Patel | Srikant Panda
Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing
Retrieval-Augmented Generation (RAG) systems and large language model (LLM)-powered chatbots have significantly advanced conversational AI by combining generative capabilities with external knowledge retrieval. Despite their success, enterprise-scale deployments face critical challenges, including diverse user queries, high latency, hallucinations, and difficulty integrating frequently updated domain-specific knowledge. This paper introduces a novel hybrid framework that integrates RAG with intent-based canned responses, leveraging predefined high-confidence responses for efficiency while dynamically routing complex or ambiguous queries to the RAG pipeline. Our framework employs a dialogue context manager to ensure coherence in multi-turn interactions and incorporates a feedback loop to refine intents, dynamically adjust confidence thresholds, and expand response coverage over time. Experimental results demonstrate that the proposed framework achieves a balance of high accuracy (95%) and low latency (180ms), outperforming RAG and intent-based systems across diverse query types, positioning it as a scalable and adaptive solution for enterprise conversational AI applications.
FlexDoc: Parameterized Sampling for Diverse Multilingual Synthetic Documents for Training Document Understanding Models
Karan Dua | Hitesh Laxmichand Patel | Puneet Mittal | Ranjeet Gupta | Amit Agarwal | Praneet Pabolu | Srikant Panda | Hansa Meghwani | Graham Horwood | Fahad Shah
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Karan Dua | Hitesh Laxmichand Patel | Puneet Mittal | Ranjeet Gupta | Amit Agarwal | Praneet Pabolu | Srikant Panda | Hansa Meghwani | Graham Horwood | Fahad Shah
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Developing document understanding models at enterprise scale requires large, diverse, and well-annotated datasets spanning a wide range of document types. However, collecting such data is prohibitively expensive due to privacy constraints, legal restrictions, and the sheer volume of manual annotation needed - costs that can scale into millions of dollars. We introduce FlexDoc, a scalable synthetic data generation framework that combines Stochastic Schemas and Parameterized Sampling to produce realistic, multilingual semi-structured documents with rich annotations. By probabilistically modeling layout patterns, visual structure, and content variability, FlexDoc enables the controlled generation of diverse document variants at scale. Experiments on Key Information Extraction (KIE) tasks demonstrate that FlexDoc-generated data improves the absolute F1 Score by up to 11% when used to augment real datasets, while reducing annotation effort by over 90% compared to traditional hard-template methods. The solution is in active deployment, where it has accelerated the development of enterprise-grade document understanding models while significantly reducing data acquisition and annotation costs.
Enhancing Causal Relationship Detection Using Prompt Engineering and Large Language Models
Pulkit Chatwal | Amit Agarwal | Ankush Mittal
Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal)
Pulkit Chatwal | Amit Agarwal | Ankush Mittal
Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal)
This paper explores the use of large language models (LLMs) and prompt engineering to detect causal relationships in financial disclosures. The task was part of the FinCausal 2025 shared competition, which focuses on identifying cause-and-effect relationships in financial texts across languages. The study demonstrates the effectiveness of LLMs, specifically LLaMA 3.2, in tackling causality detection in English and Spanish financial reports. The paper introduces various prompt engineering techniques, including zero-shot, few-shot, and chain-of-thought (CoT) prompting, to improve performance. For English, the best results were achieved using the Few-Shot + CoT approach, while for Spanish, the Few-Shot method provided strong semantic alignment despite lower exact match accuracy. The evaluation used two metrics: Exact Match (EM) and Semantic Alignment Score (SAS). The results showed high SAS scores for both languages, indicating good semantic understanding, with English performing particularly well. The study emphasizes the importance of tailored prompt engineering techniques to handle language-specific nuances in financial contexts and suggests future research directions, including fine-tuning LLaMA 3.2 and testing additional LLM architectures to enhance multilingual causality detection in financial texts.
MVTamperBench: Evaluating Robustness of Vision-Language Models
Amit Agarwal | Srikant Panda | Angeline Charles | Hitesh Laxmichand Patel | Bhargava Kumar | Priyaranjan Pattnayak | Taki Hasan Rafi | Tejaswini Kumar | Hansa Meghwani | Karan Gupta | Dong-Kyu Chae
Findings of the Association for Computational Linguistics: ACL 2025
Amit Agarwal | Srikant Panda | Angeline Charles | Hitesh Laxmichand Patel | Bhargava Kumar | Priyaranjan Pattnayak | Taki Hasan Rafi | Tejaswini Kumar | Hansa Meghwani | Karan Gupta | Dong-Kyu Chae
Findings of the Association for Computational Linguistics: ACL 2025
Multimodal Large Language Models (MLLMs), are recent advancement of Vision-Language Models (VLMs) that have driven major advances in video understanding. However, their vulnerability to adversarial tampering and manipulations remains underexplored. To address this gap, we introduce MVTamperBench, a benchmark that systematically evaluates MLLM robustness against five prevalent tampering techniques: rotation, masking, substitution, repetition, and dropping; based on real-world visual tampering scenarios such as surveillance interference, social media content edits, and misinformation injection. MVTamperBench comprises ~3.4K original videos, expanded into over ~17K tampered clips covering 19 distinct video manipulation tasks. This benchmark challenges models to detect manipulations in spatial and temporal coherence. We evaluate 45 recent MLLMs from 15+ model families. We reveal substantial variability in resilience across tampering types and show that larger parameter counts do not necessarily guarantee robustness. MVTamperBench sets a new benchmark for developing tamper-resilient MLLM in safety-critical applications, including detecting clickbait, preventing harmful content distribution, and enforcing policies on media platforms. We release all code, data, and benchmark to foster open research in trustworthy video understanding.
2024
IITRoorkee@SMM4H 2024 Cross-Platform Age Detection in Twitter and Reddit Using Transformer-Based Model
Thadavarthi Sankar | Dudekula Suraj | Mallamgari Reddy | Durga Toshniwal | Amit Agarwal
Proceedings of the 9th Social Media Mining for Health Research and Applications (SMM4H 2024) Workshop and Shared Tasks
Thadavarthi Sankar | Dudekula Suraj | Mallamgari Reddy | Durga Toshniwal | Amit Agarwal
Proceedings of the 9th Social Media Mining for Health Research and Applications (SMM4H 2024) Workshop and Shared Tasks
This paper outlines the methodology for the automatic extraction of self-reported ages from social media posts as part of the Social Media Mining for Health (SMM4H) 2024 Workshop Shared Tasks. The focus was on Task 6: “Self-reported exact age classification with cross-platform evaluation in English.” The goal was to accurately identify age-related information from user-generated content, which is crucial for applications in public health monitoring, targeted advertising, and demographic research. A number of transformer-based models were employed, including RoBERTa-Base, BERT-Base, BiLSTM, and Flan T5 Base, leveraging their advanced capabilities in natural language understanding. The training strategies included fine-tuning foundational pre-trained language models and evaluating model performance using standard metrics: F1-score, Precision, and Recall. The experimental results demonstrated that the RoBERTa-Base model significantly outperformed the other models in this classification task. The best results achieved with the RoBERTa-Base model were an F1-score of 0.878, a Precision of 0.899, and a Recall of 0.858.
AgriLLM:Harnessing Transformers for Framer Queries
Krish Didwania | Pratinav Seth | Aditya Kasliwal | Amit Agarwal
Proceedings of the Third Workshop on NLP for Positive Impact
Krish Didwania | Pratinav Seth | Aditya Kasliwal | Amit Agarwal
Proceedings of the Third Workshop on NLP for Positive Impact
Agriculture, vital for global sustenance, necessitates innovative solutions due to a lack of organized domain experts, particularly in developing countries where many farmers are impoverished and cannot afford expert consulting. Initiatives like Farmers Helpline play a crucial role in such countries, yet challenges such as high operational costs persist. Automating query resolution can alleviate the burden on traditional call centers, providing farmers with immediate and contextually relevant information.The integration of Agriculture and Artificial Intelligence (AI) offers a transformative opportunity to empower farmers and bridge information gaps.Language models like transformers, the rising stars of AI, possess remarkable language understanding capabilities, making them ideal for addressing information gaps in agriculture.This work explores and demonstrates the transformative potential of Large Language Models (LLMs) in automating query resolution for agricultural farmers, leveraging their expertise in deciphering natural language and understanding context. Using a subset of a vast dataset of real-world farmer queries collected in India, our study focuses on approximately 4 million queries from the state of Tamil Nadu, spanning various sectors, seasonal crops, and query types.
Search
Fix author
Co-authors
- Hitesh Laxmichand Patel 15
- Srikant Panda 11
- Hansa Meghwani 8
- Priyaranjan Pattnayak 7
- Karan Dua 5
- Sujith Ravi 5
- Dan Roth 5
- Tao Sheng 4
- Bhargava Kumar 3
- Tejaswini Kumar 3
- Taki Hasan Rafi 3
- Jyotika Singh 3
- David Anugraha 2
- Michael Anugraha 2
- Mithil Bangera 2
- Yeshil Bangera 2
- Dong-Kyu Chae 2
- Angeline Charles 2
- Vicky Feliren 2
- Muhammad Ravi Shulthan Habibi 2
- Ikhlasul Akmal Hanif 2
- Fenal Ashokbhai Ilasariya 2
- Joseph Marvin Imperial 2
- Kun Kerdthaisong 2
- Jun Kevin 2
- Paul Li 2
- Filbert Aurelian Tjiaranata 2
- Genta Indra Winata 2
- Tack Hwa Wong 2
- Yassi Abbasi 1
- Abdelrahman Abdallah 1
- Idris Abdulmumin 1
- Abdulhamid Abubakar 1
- Faisal Muhammad Adam 1
- Esther Adenuga 1
- Cristina Aggazzotti 1
- Raia Abu Ahmad 1
- Alham Fikri Aji 1
- Akshata 1
- Hend Al-Khalifa 1
- Jesujoba Alabi 1
- Mohammed Ali 1
- Reem Alqifari 1
- Azril Hafizi Amirudin 1
- Cynthia Jayne Amol 1
- Nicholas Andrews 1
- Fadil Risdian Ansori 1
- Catherine Arnett 1
- Phakphum Artkaew 1
- M. Avendi 1
- Anab Maulana Barik 1
- Yassine Benajiba 1
- Weerayut Buaphet 1
- Laurie Burchell 1
- Lanwenn ar C'horr 1
- Samuel Cahyawijaya 1
- Carlos Rafael Catalan 1
- Hande Celikkanat 1
- Kranti Chalamalasetti 1
- Pulkit Chatwal 1
- Leshem Choshen 1
- Sanchari Chowdhuri 1
- Thibault Cl\'erice 1
- Jan Christian Blaise Cruz 1
- John Amadeo Daniswara 1
- Arion Das 1
- Rasul Dent 1
- Krish Didwania 1
- Konstantin Dobler 1
- Meisyarah Dwiastuti 1
- Evan Evan 1
- Mohammad Rifqi Farhansyah 1
- Tirana Noor Fatyanosa 1
- Teddy Ferdinan 1
- Isaiah Edri W. Flores 1
- Luca Foppiano 1
- Dmitry Gaynullin 1
- Rifo Ahmad Genadi 1
- Manuel Goul\~ao 1
- Tommaso Green 1
- Ranjeet Gupta 1
- Karan Gupta 1
- M.Alif Al Hakim 1
- Nadia Ghezaiel Hammouda 1
- Kenneth Chen Ko Han 1
- Rochana Prih Hastuti 1
- Ming Shan Hee 1
- Graham Horwood 1
- Frederikus Hudi 1
- Nuhu Ibrahim 1
- Inshirah Idris 1
- Mahardika Krisna Ihsani 1
- Mohamed Fazli Mohamed Imam 1
- Piyalitt Ittichaiwong 1
- Audra Aurora Izzani 1
- Adam Jatowt 1
- Onno P. Kampman 1
- Börje F. Karlsson 1
- Aditya Kasliwal 1
- Amr Keleg 1
- Ilker Kesen 1
- Aye Hninn Khine 1
- Fajri Koto 1
- Takdanai Kreangphet 1
- Viji Krishnamurthy 1
- Jauza Akbar Krito 1
- Bruhan Kyomuhendo 1
- Yiyuan Li 1
- Haochen Li 1
- Adrian Xuan Wei Lim 1
- Wan Shen Lim 1
- Peerat Limkonchotiwat 1
- Meizhu Liu 1
- Holy Lovenia 1
- Sarah K. K. Luger 1
- Jiayun Luo 1
- Jean Maillard 1
- Kamohelo Makaaka 1
- Vukosi Marivate 1
- Juan Pablo Martínez 1
- Thant Thiri Maung 1
- Lester James Validad Miranda 1
- Puneet Mittal 1
- Ankush Mittal 1
- Patricia Nicole Monderin 1
- Joel Ruben Antony Moniz 1
- Sara Hincapi\'e Monsalve 1
- Rafael Mosquera 1
- Carol Muchemi 1
- Shamsuddeen Hassan Muhammad 1
- Kenton Murray 1
- Ahmad Mustafid 1
- Casper Rufaro Muziri 1
- Adisai Na-Thalang 1
- Bahrul Ilmi Nasution 1
- Hamada Nayel 1
- Lynnette Hui Xian Ng 1
- Khang Nguyen 1
- My Chiffon Nguyen 1
- Giang Nguyen 1
- William Nixon 1
- Melika Nobakhtian 1
- Shu Okabe 1
- Pedro Ortiz Suarez 1
- Malte Ostendorff 1
- Verrah Akinyi Otiende 1
- Praneet Pabolu 1
- Kulbhushan Pachauri 1
- Quentin Pag\`es 1
- Kadek Hendrawan Palgunadi 1
- Kaung Si Phyo 1
- Salsabila Zahirah Pranida 1
- Kevin Pratama 1
- Vallerie Alexandra Putra 1
- Muhammad Reza Qorib 1
- Rian Adam Rajagede 1
- Ingrid Gabriela Franco Ramirez 1
- Mallamgari Reddy 1
- Benjamin L Rice 1
- Matthew Theodore Roque 1
- Jostin Jerico Rosal 1
- Matthew Rowe 1
- Mattes Ruckdeschel 1
- Daniel Ruffinelli 1
- Manuel Antonio Rufino 1
- Benoît Sagot 1
- Saptarshi Saha 1
- Luis Frentzen Salim 1
- Saron Samuel 1
- Thadavarthi Sankar 1
- Anjela Gail D. Santos 1
- Tim Santos 1
- Richardy Lobo Sapan 1
- Jakhongir Saydaliev 1
- Pratinav Seth 1
- Fahad Shah 1
- Christian Simon 1
- Ayushman Singh 1
- Yueqi Song 1
- Pavel Stepachev 1
- Damian Stewart 1
- Weiyi Sun 1
- Supryadi 1
- Dudekula Suraj 1
- Muhammad Rizky Sya’ban 1
- Sotaro Takeshita 1
- Atnafu Lambebo Tonja 1
- Durga Toshniwal 1
- Yassine Toughrai 1
- Can Udomcharoenchaikit 1
- Gouthami Vadithya 1
- Sowmya Vajjala 1
- Rob Van Der Goot 1
- Thom Vaughan 1
- Kanyakorn Veerakanjana 1
- Dan John Velasco 1
- Karissa Vincentio 1
- Ahmad Mustapha Wali 1
- Bin Wang 1
- Azmine Toushik Wasi 1
- Chengwei Wei 1
- Robert Wijaya 1
- Andrew Yates 1
- Seid Muhie Yimam 1
- Yanzhi Yu 1
- Eryawan Presma Yulianrifat 1
- Hanif Muhammad Zhafran 1
- Mike Zhang 1
- Ruochen Zhang 1
- Ej Zhou 1