2025
ClinStructor: AI-Powered Structuring of Unstructured Clinical Texts
Karthikeyan K | Raghuveer Thirukovalluru | David Carlson
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Clinical notes contain valuable, context-rich information, but their unstructured format introduces several challenges, including unintended biases (e.g., gender or racial bias), poor generalization across clinical settings (e.g., models trained on one EHR system may perform poorly on another due to format differences), and poor interpretability. To address these issues, we present ClinStructor, a pipeline that leverages large language models (LLMs) to convert clinical free text into structured, task-specific question–answer pairs prior to predictive modeling. Our method substantially enhances transparency and controllability and leads to only a modest reduction in predictive performance (a 2–3% drop in AUC) compared to direct fine-tuning on the ICU mortality prediction task. ClinStructor lays a strong foundation for building reliable, interpretable, and generalizable machine learning models in clinical environments.
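To make the pipeline idea concrete, here is a minimal sketch of what the abstract describes: an LLM answers a fixed set of task-specific questions about each note, and the resulting question–answer pairs, rather than the raw text, are fed to a downstream classifier. The question list, the ask_llm() stub, and the bag-of-words classifier are illustrative assumptions, not the paper's actual implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical task-specific questions for ICU mortality prediction.
QUESTIONS = [
    "Is the patient on mechanical ventilation?",
    "Does the note mention sepsis or septic shock?",
    "Is the patient receiving vasopressors?",
]

def ask_llm(question: str, note: str) -> str:
    """Stand-in for an LLM call; replace with a real chat/completions client."""
    # Trivial keyword heuristic so the sketch runs end to end.
    key = question.lower().rstrip("?").split()[-1]
    return "yes" if key in note.lower() else "no"

def structure_note(note: str) -> str:
    """Convert free text into a flat 'question: answer' representation."""
    return " | ".join(f"{q} {ask_llm(q, note)}" for q in QUESTIONS)

def train_predictor(notes, labels):
    """Fit a simple classifier on the structured QA representation."""
    structured = [structure_note(n) for n in notes]
    clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(structured, labels)
    return clf
```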
A Study on Leveraging Search and Self-Feedback for Agent Reasoning
Karthikeyan K | Michelle Yuan | Elman Mansimov | Katerina Margatina | Anurag Pratik | Daniele Bonadiman | Monica Sunkara | Yi Zhang | Yassine Benajiba
Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025)
Recent works have demonstrated that incorporating search during inference can significantly improve the reasoning capabilities of language agents. Some approaches make use of ground-truth feedback, while others rely on the model's own generated feedback. The search algorithm uses this feedback to produce values that update its criterion for exploring and exploiting various reasoning paths. In this study, we investigate how search and the model's self-feedback can be leveraged for reasoning tasks. First, we explore differences between ground-truth feedback and self-feedback during search for math reasoning. Second, we observe limitations in applying search techniques to more complex tasks such as tool-calling and design domain-specific approaches to address these gaps. Our experiments reveal challenges related to generalization when relying solely on self-feedback during search. For search to work effectively, either access to the ground truth is needed or feedback mechanisms need to be carefully designed for the specific task.
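As a rough illustration of the feedback-guided search the abstract refers to, the sketch below runs a small UCB-style tree search over partial reasoning paths, where node values come from a feedback function (a self_feedback() stub standing in for either a ground-truth check or the model scoring its own reasoning). The data structures and the UCB criterion are generic assumptions, not the specific algorithm studied in the paper; `expand(path)` is a user-supplied function that proposes candidate next reasoning steps.

```python
import math
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    path: list                # partial chain of reasoning steps
    visits: int = 0
    value: float = 0.0        # running mean of feedback scores
    children: list = field(default_factory=list)

def self_feedback(path) -> float:
    """Stand-in for the model scoring its own partial reasoning
    (or for a ground-truth check when one is available)."""
    return random.random()

def ucb(child: Node, parent_visits: int, c: float = 1.4) -> float:
    if child.visits == 0:
        return float("inf")   # always try unvisited steps first
    return child.value + c * math.sqrt(math.log(parent_visits) / child.visits)

def search(root: Node, expand, iterations: int = 50) -> Node:
    """Feedback-guided tree search; expand(path) must return at least one step."""
    for _ in range(iterations):
        node, trail = root, [root]
        # Selection: follow the explore/exploit criterion down to a leaf.
        while node.children:
            node = max(node.children, key=lambda ch: ucb(ch, node.visits))
            trail.append(node)
        # Expansion: propose candidate next reasoning steps.
        node.children = [Node(node.path + [s]) for s in expand(node.path)]
        # Evaluation: feedback scores one newly expanded path.
        child = random.choice(node.children)
        trail.append(child)
        score = self_feedback(child.path)
        # Backup: feedback updates the values that drive exploration.
        for n in trail:
            n.visits += 1
            n.value += (score - n.value) / n.visits
    return max(root.children, key=lambda ch: ch.value)
```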
2023
Taxonomy Expansion for Named Entity Recognition
Karthikeyan K | Yogarshi Vyas | Jie Ma | Giovanni Paolini | Neha John | Shuai Wang | Yassine Benajiba | Vittorio Castelli | Dan Roth | Miguel Ballesteros
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Training a Named Entity Recognition (NER) model often involves fixing a taxonomy of entity types. However, requirements evolve and we might need the NER model to recognize additional entity types. A simple approach is to re-annotate the entire dataset with both existing and additional entity types and then train the model on the re-annotated dataset. However, this is an extremely laborious task. To remedy this, we propose a novel approach called the Partial Label Model (PLM) that uses only partially annotated datasets. We experiment with 6 diverse datasets and show that PLM consistently performs better than most other approaches (0.5 - 2.5 F1), including in novel settings for taxonomy expansion not considered in prior work. The gap between PLM and all other approaches is especially large in settings where there is limited data available for the additional entity types (as much as 11 F1), suggesting a more cost-effective approach to taxonomy expansion.
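For intuition, a common way to train on partially annotated NER data is to marginalize over every label that remains consistent with the annotation; the sketch below shows such a partial-label loss for a token classifier. This is a generic formulation for illustration and is not necessarily the exact PLM objective from the paper.

```python
import torch

def partial_label_loss(logits: torch.Tensor, allowed: torch.Tensor) -> torch.Tensor:
    """
    logits:  (num_tokens, num_labels) token classification scores.
    allowed: (num_tokens, num_labels) boolean mask of labels consistent with
             the partial annotation (a fully annotated token has one True;
             an 'O' token from the old dataset also allows the new types).
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    # Marginal log-likelihood of the allowed label set for each token.
    masked = log_probs.masked_fill(~allowed, float("-inf"))
    token_ll = torch.logsumexp(masked, dim=-1)
    return -token_ll.mean()

# Example with 3 labels (O, PER, new type LOC): token 0 is a known PER;
# token 1 was 'O' in the old annotation, so it may actually be O or LOC.
logits = torch.randn(2, 3, requires_grad=True)
allowed = torch.tensor([[False, True, False],
                        [True, False, True]])
loss = partial_label_loss(logits, allowed)
loss.backward()
```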
2022
Multilingual CheckList: Generation and Evaluation
Karthikeyan K | Shaily Bhatt | Pankaj Singh | Somak Aditya | Sandipan Dandapat | Sunayana Sitaram | Monojit Choudhury
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022
Multilingual evaluation benchmarks usually contain a limited set of high-resource languages and do not test models for specific linguistic capabilities. CheckList is a template-based evaluation approach that tests models for specific capabilities. The CheckList template creation process requires native speakers, posing a challenge in scaling to hundreds of languages. In this work, we explore multiple approaches to generating multilingual CheckLists. We devise an algorithm, the Template Extraction Algorithm (TEA), for automatically extracting target-language CheckList templates from machine-translated instances of source-language templates. We compare the TEA CheckLists with CheckLists created with different levels of human intervention. We further introduce metrics along the dimensions of cost, diversity, utility, and correctness to compare the CheckLists. We thoroughly analyze different approaches to creating CheckLists in Hindi and further experiment with 9 more languages. We find that TEA followed by human verification is ideal for scaling CheckList-based evaluation to multiple languages, while TEA alone gives a good estimate of model performance. We release the code of TEA and the CheckLists created at aka.ms/multilingualchecklist
2021
Analyzing the Effects of Reasoning Types on Cross-Lingual Transfer Performance
Karthikeyan K | Aalok Sathe | Somak Aditya | Monojit Choudhury
Proceedings of the 1st Workshop on Multilingual Representation Learning
Multilingual language models achieve impressive zero-shot accuracies in many languages on complex tasks such as Natural Language Inference (NLI). Examples in NLI (and equivalent complex tasks) often pertain to various types of sub-tasks, requiring different kinds of reasoning. Certain types of reasoning have proven to be more difficult to learn in a monolingual context, and in the cross-lingual context, similar observations may shed light on zero-shot transfer efficiency and few-shot sample selection. Hence, to investigate the effects of types of reasoning on transfer performance, we propose a category-annotated multilingual NLI dataset and discuss the challenges of scaling monolingual annotations to multiple languages. We statistically observe interesting effects that the confluence of reasoning types and language similarities have on transfer performance.
2020
Extending Multilingual BERT to Low-Resource Languages
Zihan Wang | Karthikeyan K | Stephen Mayhew | Dan Roth
Findings of the Association for Computational Linguistics: EMNLP 2020
Multilingual BERT (M-BERT) has been a huge success in both supervised and zero-shot cross-lingual transfer learning. However, this success has been focused only on the top 104 languages in Wikipedia that it was trained on. In this paper, we propose a simple but effective approach to extend M-BERT (E-MBERT) so that it can benefit any new language, and show that our approach aids languages that are already in M-BERT as well. We perform an extensive set of experiments with Named Entity Recognition (NER) on 27 languages, only 16 of which are in M-BERT, and show an average increase of about 6% F1 on M-BERT languages and a 23% F1 increase on new languages. We release models and code at http://cogcomp.org/page/publication_view/912.
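A generic version of the "extend, then continue pretraining" recipe can be sketched with Hugging Face transformers: add target-language wordpieces to the M-BERT tokenizer, resize the embedding matrix, and continue masked-LM training on target-language text. The token list below is hypothetical, and this sketch is not the authors' released E-MBERT code (see the link above for that).

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Hypothetical new wordpieces mined from target-language text.
new_tokens = ["##ghchi", "salam", "##lariga"]
num_added = tokenizer.add_tokens(new_tokens)

# New rows of the embedding matrix are randomly initialized and then learned
# during continued masked-LM pretraining on target-language monolingual data.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
```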