2025
A Variational Approach for Mitigating Entity Bias in Relation Extraction
Samuel Mensah | Elena Kochkina | Jabez Magomere | Joy Prakash Sain | Simerjot Kaur | Charese Smiley
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Mitigating entity bias is a critical challenge in Relation Extraction (RE), where models often rely excessively on entities, resulting in poor generalization. This paper presents a novel approach to address this issue by adapting a Variational Information Bottleneck (VIB) framework. Our method compresses entity-specific information while preserving task-relevant features. It achieves state-of-the-art performance on both general- and financial-domain RE datasets, excelling both in-domain (original test sets) and out-of-domain (modified test sets with type-constrained entity replacements). Our approach offers a robust, interpretable, and theoretically grounded methodology.
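As a pointer for readers unfamiliar with the framework, a standard VIB layer of the kind such a method adapts can be sketched as follows; the dimensions, placement in an RE encoder, and the beta-weighted loss are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class VIBLayer(nn.Module):
    """Minimal variational information bottleneck over a hidden vector h."""
    def __init__(self, in_dim: int, z_dim: int):
        super().__init__()
        self.mu = nn.Linear(in_dim, z_dim)       # posterior mean head
        self.logvar = nn.Linear(in_dim, z_dim)   # posterior log-variance head

    def forward(self, h: torch.Tensor):
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # KL(q(z|h) || N(0, I)): the compression term that squeezes out
        # input-specific (here, entity-specific) detail not needed for the task
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return z, kl

# Training objective (sketch): task cross-entropy on a classifier over z,
# plus a beta-weighted KL compression term:
#   loss = F.cross_entropy(classifier(z), labels) + beta * kl
```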
Advanced Messaging Platform (AMP): Pipeline for Automated Enterprise Email Processing
Simerjot Kaur | Charese Smiley | Keshav Ramani | Elena Kochkina | Mathieu Sibue | Samuel Mensah | Pietro Totis | Cecilia Tilli | Toyin Aguda | Daniel Borrajo | Manuela Veloso
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Understanding and effectively responding to email communication remains a critical yet complex challenge for current AI techniques, especially in corporate environments. The task is further complicated by the need for domain-specific knowledge, accurate entity recognition, and high precision to prevent costly errors. While recent advances in AI, specifically Large Language Models (LLMs), have made strides in natural language understanding, they often lack the business-specific expertise required in such settings. In this work, we present the Advanced Messaging Platform (AMP), a production-grade AI pipeline that automates email response generation at scale in real-world enterprise settings. AMP has been in production for more than a year, processing thousands of emails daily while maintaining high accuracy and adaptability to evolving business needs.
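AMP's internal architecture is not detailed in the abstract; purely to illustrate the shape of such a pipeline, a hypothetical skeleton might chain stages like the following (all stage names and interfaces are invented for illustration and do not reflect AMP's actual design).

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EmailContext:
    """Mutable context passed through the pipeline (hypothetical)."""
    raw_text: str
    intent: str | None = None
    entities: dict = field(default_factory=dict)
    draft_reply: str | None = None

def run_pipeline(email: EmailContext,
                 stages: list[Callable[[EmailContext], EmailContext]]) -> EmailContext:
    # Each stage enriches the context, e.g. classify_intent, extract_entities,
    # generate_reply, then a precision gate before anything is sent.
    for stage in stages:
        email = stage(email)
    return email
```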
Calibrating LLM Confidence by Probing Perturbed Representation Stability
Reza Khanmohammadi | Erfan Miahi | Mehrsa Mardikoraem | Simerjot Kaur | Ivan Brugere | Charese Smiley | Kundan S Thind | Mohammad M. Ghassemi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Miscalibration in Large Language Models (LLMs) undermines their reliability, highlighting the need for accurate confidence estimation. We introduce CCPS (Calibrating LLM Confidence by Probing Perturbed Representation Stability), a novel method analyzing internal representational stability in LLMs. CCPS applies targeted adversarial perturbations to final hidden states, extracts features reflecting the model’s response to these perturbations, and uses a lightweight classifier to predict answer correctness. CCPS was evaluated on LLMs from 8B to 32B parameters (covering Llama, Qwen, and Mistral architectures) using MMLU and MMLU-Pro benchmarks in both multiple-choice and open-ended formats. Our results show that CCPS significantly outperforms current approaches. Across four LLMs and three MMLU variants, CCPS reduces Expected Calibration Error by approximately 55% and Brier score by 21%, while increasing accuracy by 5 percentage points, Area Under the Precision-Recall Curve by 4 percentage points, and Area Under the Receiver Operating Characteristic Curve by 6 percentage points, all relative to the strongest prior method. CCPS delivers an efficient, broadly applicable, and more accurate solution for estimating LLM confidence, thereby improving their trustworthiness.
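A rough sketch of the recipe the abstract describes, under stated assumptions: an FGSM-style step on the final hidden state stands in for the paper's targeted perturbations, and the KL shift of the output distribution is one example of a stability feature fed to the lightweight correctness classifier.

```python
import torch

def stability_features(hidden, lm_head, answer_id, eps_scales=(0.01, 0.05, 0.1)):
    """hidden: (d,) final hidden state; lm_head: maps hidden -> vocab logits."""
    h = hidden.detach().requires_grad_(True)
    logp = torch.log_softmax(lm_head(h), dim=-1)
    (-logp[answer_id]).backward()          # direction that most hurts the answer
    direction = h.grad.sign()              # FGSM-style perturbation direction
    base = logp.detach()
    feats = []
    for eps in eps_scales:
        pert = torch.log_softmax(lm_head(h.detach() + eps * direction), dim=-1)
        # KL(base || perturbed): how much the answer distribution destabilizes
        feats.append(torch.sum(base.exp() * (base - pert)))
    return torch.stack(feats)              # input to a lightweight correctness probe
```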
Translating Domain-Specific Terminology in Typologically-Diverse Languages: A Study in Tax and Financial Education
Arturo Oncevay | Elena Kochkina | Keshav Ramani | Toyin Aguda | Simerjot Kaur | Charese Smiley
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Domain-specific multilingual terminology is essential for accurate machine translation (MT) and cross-lingual NLP applications. We present a gold-standard terminology resource for the tax and financial education domains, built from curated governmental publications and covering seven typologically diverse languages: English, Spanish, Russian, Vietnamese, Korean, Chinese (traditional and simplified), and Haitian Creole. Using this resource, we assess various MT systems and LLMs on translation quality and term accuracy. We annotate over 3,000 terms for domain-specificity, facilitating a comparison between domain-specific and general term translations, and observe models' challenges with specialized tax terms. We also analyze terminology-aided translation and the LLMs' performance in extracting the translated term given the context. Our results highlight model limitations and the value of high-quality terminologies for advancing MT research in specialized contexts.
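The abstract does not spell out the term-accuracy metric; one plausible scoring rule, assumed here purely for illustration, is to check whether the gold target-language term appears in the system translation after light normalization.

```python
import unicodedata

def normalize(s: str) -> str:
    # NFKC folding handles the script variety (Chinese, Korean, Cyrillic, etc.)
    return unicodedata.normalize("NFKC", s).casefold()

def term_accuracy(pairs: list[tuple[str, str]]) -> float:
    """pairs: (gold_target_term, system_translation) tuples."""
    hits = sum(normalize(term) in normalize(hyp) for term, hyp in pairs)
    return hits / max(len(pairs), 1)

# Example (illustrative data):
# term_accuracy([("impuesto sobre la renta",
#                 "Debe declarar el impuesto sobre la renta anual.")])  # -> 1.0
```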
FinNLI: Novel Dataset for Multi-Genre Financial Natural Language Inference Benchmarking
Jabez Magomere | Elena Kochkina | Samuel Mensah | Simerjot Kaur | Charese Smiley
Findings of the Association for Computational Linguistics: NAACL 2025
We introduce FinNLI, a benchmark dataset for Financial Natural Language Inference across diverse financial texts such as SEC Filings, Annual Reports, and Earnings Call transcripts. Our dataset framework ensures diverse premise-hypothesis pairs while minimizing spurious correlations. FinNLI comprises 21,304 pairs, including a high-quality test set of 3,304 instances annotated by finance experts. Evaluations show that domain shift significantly degrades general-domain NLI performance. The highest Macro F1 scores for pre-trained language model (PLM) and large language model (LLM) baselines are 74.57% and 78.62%, respectively, highlighting the dataset's difficulty. Surprisingly, instruction-tuned financial LLMs perform poorly, suggesting limited generalizability. FinNLI exposes weaknesses in current LLMs for financial reasoning, indicating room for improvement.
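For concreteness, the reported Macro F1 can be computed over the three standard NLI labels as below; the label names and scoring call reflect the conventional setup, not FinNLI's released evaluation code.

```python
from sklearn.metrics import f1_score

# Standard 3-way NLI label set (assumed; FinNLI's exact label strings may differ)
LABELS = ["entailment", "neutral", "contradiction"]

def macro_f1(gold: list[str], pred: list[str]) -> float:
    # Macro-averaging weights each class equally, so rare classes count fully
    return f1_score(gold, pred, labels=LABELS, average="macro")
```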
Conservative Bias in Large Language Models: Measuring Relation Predictions
Toyin Aguda | Erik Wilson | Allan Anzagira | Simerjot Kaur | Charese Smiley
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) exhibit a pronounced conservative bias in relation extraction tasks, frequently defaulting to the no_relation label when an appropriate option is unavailable. While this behavior helps prevent incorrect relation assignments, our analysis reveals that it also leads to significant information loss when reasoning is not explicitly included in the output. We systematically evaluate this trade-off across multiple prompts, datasets, and relation types, introducing the concept of Hobson's choice to capture scenarios where models opt for safe but uninformative labels over hallucinated ones. Our findings suggest that conservative bias occurs twice as often as hallucination. To quantify this effect, we use SBERT and LLM prompts to measure the semantic similarity between conservative-bias behaviors under constrained prompts and the labels generated under semi-constrained and open-ended prompts.
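A minimal sketch of the SBERT-based similarity check mentioned above, mapping an open-ended relation label onto the constrained label set; the checkpoint name and the threshold are assumptions, not values from the paper.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def closest_label(open_label: str, label_set: list[str], threshold: float = 0.5):
    emb = encoder.encode([open_label] + label_set, convert_to_tensor=True)
    sims = util.cos_sim(emb[0], emb[1:])[0]
    best = int(sims.argmax())
    # Below the threshold, treat the prediction as a genuinely new (possibly
    # hallucinated) relation rather than a paraphrase of a known label.
    if sims[best] >= threshold:
        return label_set[best], float(sims[best])
    return None, float(sims[best])
```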
2024
DocLLM: A Layout-Aware Generative Language Model for Multimodal Document Understanding
Dongsheng Wang | Natraj Raman | Mathieu Sibue | Zhiqiang Ma | Petr Babkin | Simerjot Kaur | Yulong Pei | Armineh Nourbakhsh | Xiaomo Liu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Enterprise documents such as forms, receipts, reports, and other such records often carry rich semantics at the intersection of textual and spatial modalities. The visual cues offered by their complex layouts play a crucial role in comprehending these documents effectively. In this paper, we present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents that takes into account both textual semantics and spatial layout. Our model differs from existing multimodal LLMs by avoiding expensive image encoders and focusing exclusively on bounding box information to incorporate the spatial layout structure. Specifically, the cross-alignment between text and spatial modalities is captured by decomposing the attention mechanism in classical transformers into a set of disentangled matrices. Furthermore, we devise a pre-training objective that learns to infill text segments. This approach allows us to address the irregular layouts and heterogeneous content frequently encountered in visual documents. The pre-trained model is fine-tuned using a large-scale instruction dataset covering four core document intelligence tasks. We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.
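The disentangled decomposition can be written as four interaction terms between text (t) and spatial/bounding-box (s) projections; the notation below paraphrases the mechanism the abstract describes, with scalar weights lambda balancing each cross-modal term.

```latex
% Attention score between positions i and j, decomposed over text (t) and
% spatial (s) query/key projections; the \lambda are scalar weights on the
% cross-modal interaction terms.
A_{ij} = Q^{t}_{i} \big(K^{t}_{j}\big)^{\top}
       + \lambda_{t,s}\, Q^{t}_{i} \big(K^{s}_{j}\big)^{\top}
       + \lambda_{s,t}\, Q^{s}_{i} \big(K^{t}_{j}\big)^{\top}
       + \lambda_{s,s}\, Q^{s}_{i} \big(K^{s}_{j}\big)^{\top}
```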
Large Language Models as Financial Data Annotators: A Study on Effectiveness and Efficiency
Toyin D. Aguda | Suchetha Siddagangappa | Elena Kochkina | Simerjot Kaur | Dongsheng Wang | Charese Smiley
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Collecting labeled datasets in finance is challenging due to the scarcity of domain experts and the high cost of employing them. While Large Language Models (LLMs) have demonstrated remarkable performance in data annotation tasks on general-domain datasets, their effectiveness on domain-specific datasets remains under-explored. To address this gap, we investigate the potential of LLMs as efficient data annotators for extracting relations in financial documents. We compare the annotations produced by three LLMs (GPT-4, PaLM 2, and MPT Instruct) against expert annotators and crowdworkers. We demonstrate that current state-of-the-art LLMs can be sufficient alternatives to non-expert crowdworkers. We analyze models using various prompts and parameter settings and find that customizing the prompts for each relation group by providing specific examples belonging to those groups is paramount. Furthermore, we introduce a reliability index (LLM-RelIndex) to identify outputs that may require expert attention. Finally, we perform an extensive time, cost, and error analysis and provide recommendations for the collection and usage of automated annotations in domain-specific settings.
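The LLM-RelIndex formula is not reproduced here; as a generic stand-in, annotator reliability against expert labels can be quantified with Cohen's kappa, as in this sketch.

```python
from sklearn.metrics import cohen_kappa_score

def agreement_with_experts(llm_labels: list[str], expert_labels: list[str]) -> float:
    # Chance-corrected agreement; 1.0 is perfect, 0.0 is chance level
    return cohen_kappa_score(llm_labels, expert_labels)
```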