2025
What is in a name? Mitigating Name Bias in Text Embedding Similarity via Anonymization
Sahil Manchanda | Pannaga Shivaswamy
Findings of the Association for Computational Linguistics: ACL 2025
Text-embedding models often exhibit biases arising from the data on which they are trained. In this paper, we examine a hitherto unexplored bias in text embeddings: bias arising from the presence of names, such as those of persons, locations, and organizations, in the text. Our study shows how name bias in text-embedding models can lead to erroneous conclusions in the assessment of thematic similarity. Text embeddings can mistakenly indicate similarity between texts because of the names they contain, even when their actual semantic content is unrelated, or indicate dissimilarity simply because of differing names, even when the texts match semantically. We first demonstrate the presence of name bias in different text-embedding models and then propose text anonymization during inference, which removes references to names while preserving the core theme of the text. The efficacy of the anonymization approach is demonstrated on three downstream NLP tasks involving embedding similarities, achieving significant performance gains. Our simple approach, which requires no training or optimization, offers a practical and easily implementable solution to mitigating name bias.
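As a concrete illustration of inference-time anonymization (a minimal sketch, not the paper's exact implementation), the code below uses spaCy NER to replace person, location, and organization mentions with a generic placeholder before encoding with a sentence-transformers model. The model names and the "[ENTITY]" placeholder are illustrative assumptions.

import spacy
from sentence_transformers import SentenceTransformer, util

# Illustrative choices, not the paper's: a small spaCy NER model
# (requires `python -m spacy download en_core_web_sm`) and a
# general-purpose sentence encoder.
nlp = spacy.load("en_core_web_sm")
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def anonymize(text: str) -> str:
    """Replace person/location/organization mentions with a placeholder."""
    doc = nlp(text)
    out = text
    # Replace entities right-to-left so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in {"PERSON", "GPE", "LOC", "ORG"}:
            out = out[:ent.start_char] + "[ENTITY]" + out[ent.end_char:]
    return out

a = "Alice Chen joined Acme Corp in Berlin to work on solar panels."
b = "Rahul Gupta joined Globex in Oslo to work on solar panels."
emb = encoder.encode([anonymize(a), anonymize(b)])
print(util.cos_sim(emb[0], emb[1]))  # similarity now reflects the shared theme

Using one untyped placeholder keeps sentence structure intact while removing the identity signal; whether to use typed placeholders (e.g., "[PERSON]") is a design choice the paper may handle differently.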
2022
Optum’s Submission to WMT22 Biomedical Translation Tasks
Sahil Manchanda | Saurabh Bhagwat
Proceedings of the Seventh Conference on Machine Translation (WMT)
This paper describes Optum’s submission to the Biomedical Translation task of the Seventh Conference on Machine Translation (WMT22). The task aims to promote the development and evaluation of machine translation systems on challenging domain-specific biomedical data. We made submissions to two sub-tracks of ClinSpEn 2022, namely ClinSpEn-CC (clinical cases) and ClinSpEn-OC (ontology concepts), both of which test translation from English to Spanish. Our approach involves fine-tuning a pre-trained transformer model on in-house clinical-domain data and the biomedical data provided by WMT. The fine-tuned model achieves a test BLEU score of 38.12 on the ClinSpEn-CC (clinical cases) sub-task, a gain of 1.23 BLEU over the pre-trained model.
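A minimal fine-tuning sketch along the lines described above, assuming a recent Hugging Face transformers version, the public Helsinki-NLP/opus-mt-en-es MarianMT checkpoint, and a toy parallel pair as stand-ins for the submission's actual model and clinical data:

from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

name = "Helsinki-NLP/opus-mt-en-es"  # illustrative public en->es checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

# Toy in-domain data; the real submission used in-house clinical data
# plus the WMT-provided biomedical corpora.
pairs = {"en": ["The patient presented with acute chest pain."],
         "es": ["El paciente presentó dolor torácico agudo."]}
ds = Dataset.from_dict(pairs)

def preprocess(batch):
    # Tokenize source and target sides together for seq2seq training.
    return tok(batch["en"], text_target=batch["es"],
               truncation=True, max_length=128)

ds = ds.map(preprocess, batched=True, remove_columns=["en", "es"])

args = Seq2SeqTrainingArguments(output_dir="opus-mt-en-es-biomed",
                                per_device_train_batch_size=8,
                                num_train_epochs=3)
trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=ds,
                         data_collator=DataCollatorForSeq2Seq(tok, model=model))
trainer.train()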
2021
Optum at MEDIQA 2021: Abstractive Summarization of Radiology Reports using simple BART Finetuning
Ravi Kondadadi | Sahil Manchanda | Jason Ngo | Ronan McCormack
Proceedings of the 20th Workshop on Biomedical Language Processing
This paper describes the experiments undertaken, and their results, as part of the BioNLP MEDIQA 2021 challenge. We participated in Task 3: Radiology Report Summarization. Multiple runs were submitted for evaluation, each leveraging transfer learning from pre-trained transformer models fine-tuned on a subset of MIMIC-CXR for abstractive report summarization. The task was evaluated using ROUGE, and our best-performing system obtained a ROUGE-2 score of 0.392.
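For illustration, the sketch below runs abstractive summarization with the public facebook/bart-base checkpoint as a stand-in for the MIMIC-CXR fine-tuned system described above; the findings text and generation settings are assumptions, not the submission's configuration.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "facebook/bart-base"  # stand-in; the submission fine-tuned on MIMIC-CXR
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

# A toy "findings" section; the task condenses findings into an impression.
findings = ("Heart size is normal. Lungs are clear without focal "
            "consolidation, pleural effusion, or pneumothorax.")
inputs = tok(findings, return_tensors="pt", truncation=True, max_length=512)
ids = model.generate(**inputs, num_beams=4, max_new_tokens=64)
print(tok.decode(ids[0], skip_special_tokens=True))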
2020
Domain Informed Neural Machine Translation: Developing Translation Services for Healthcare Enterprise
Sahil Manchanda | Galina Grunin
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
Neural Machine Translation (NMT) is a deep-learning-based approach that has recently achieved outstanding results in the translation community. The performance of NMT systems, however, depends on the availability of large amounts of in-domain parallel corpora. Business enterprises in domains such as legal and healthcare require specialized vocabulary, but translation systems trained for general purposes do not cater to these needs. Data in these domains is either hard to acquire or very small in comparison to public data sets. This is a detailed report of using an open-source library to implement a machine translation system and successfully customizing it for the needs of a particular client in the healthcare domain. The report details the chronological development of every component of this system: extraction of data from in-domain healthcare documents, a pre-processing pipeline for the data, data alignment and augmentation, training, and a fully automated and robust deployment pipeline. This work proposes an efficient way to continuously deploy newly trained deep learning models. The deployed translation models are optimized for both inference time and cost.
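As one concrete example of the kind of pre-processing step described above, the sketch below filters a parallel corpus by emptiness, segment length, and source/target length ratio before training; the thresholds are illustrative assumptions, not the paper's actual rules.

def clean_parallel(src_lines, tgt_lines, max_len=200, max_ratio=2.0):
    """Keep sentence pairs whose lengths look plausible for alignment."""
    kept = []
    for s, t in zip(src_lines, tgt_lines):
        s, t = s.strip(), t.strip()
        if not s or not t:
            continue  # drop empty or one-sided lines
        ls, lt = len(s.split()), len(t.split())
        if ls > max_len or lt > max_len:
            continue  # drop overly long segments
        if max(ls, lt) / min(ls, lt) > max_ratio:
            continue  # drop pairs with implausible length ratios
        kept.append((s, t))
    return kept

src = ["The patient was discharged.", ""]
tgt = ["El paciente fue dado de alta.", "texto sin fuente"]
print(clean_parallel(src, tgt))  # keeps only the aligned first pair

Filters like this are cheap to run on every corpus refresh, which suits the continuous-deployment setting the paper describes, since noisy pairs are removed before each retraining cycle.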