Maya Kruse


2025

Simple Yet Effective: An Information-Theoretic Approach to Multi-LLM Uncertainty Quantification
Maya Kruse | Majid Afshar | Saksham Khatwani | Anoop Mayampurath | Guanhua Chen | Yanjun Gao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) often behave inconsistently across inputs, indicating uncertainty and motivating the need for its quantification in high-stakes settings. Prior work on calibration and uncertainty quantification often focuses on individual models, overlooking the potential of model diversity. We hypothesize that LLMs make complementary predictions due to differences in training and the Zipfian nature of language, and that aggregating their outputs leads to more reliable uncertainty estimates. To leverage this, we propose MUSE (Multi-LLM Uncertainty via Subset Ensembles), a simple information-theoretic method that uses Jensen-Shannon Divergence to identify and aggregate well-calibrated subsets of LLMs. Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naive ensemble baselines. In addition, we explore using MUSE as a guidance signal in chain-of-thought distillation to fine-tune LLMs for calibration. MUSE is available at: https://github.com/LARK-NLP-Lab/MUSE.
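
The core quantity is easy to state: the Jensen-Shannon Divergence of the member models' output distributions measures how much they disagree, and well-calibrated, low-divergence subsets are averaged into the final estimate. A minimal sketch of that computation (illustrative only; the function names are hypothetical and the subset-selection step is omitted):

# Minimal sketch of multi-LLM uncertainty via Jensen-Shannon Divergence,
# assuming each model emits a probability distribution over the two classes
# of a binary prediction task. Names are illustrative, not MUSE's API.
import numpy as np

def entropy(p):
    """Shannon entropy in bits, skipping zero-probability terms."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def jensen_shannon_divergence(dists):
    """JSD of a set of distributions: H(mean of dists) - mean of H(dist)."""
    dists = np.asarray(dists, dtype=float)
    return entropy(dists.mean(axis=0)) - np.mean([entropy(d) for d in dists])

def aggregate(dists):
    """Average the member distributions of a chosen ensemble subset."""
    return np.asarray(dists, dtype=float).mean(axis=0)

# Three hypothetical LLMs' class probabilities for one binary instance.
model_probs = [[0.9, 0.1], [0.7, 0.3], [0.4, 0.6]]
print(jensen_shannon_divergence(model_probs))  # disagreement among models
print(aggregate(model_probs))                  # ensemble prediction

A JSD of zero means the models agree exactly; larger values flag instances where the ensemble's members diverge and the prediction deserves less confidence.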

OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages
Chester Palen-Michel | Maxwell Pickering | Maya Kruse | Jonne Sälevä | Constantine Lignos
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

We present OpenNER 1.0, a standardized collection of openly available named entity recognition (NER) datasets. OpenNER contains 36 NER corpora that span 52 languages, human-annotated in varying named entity ontologies. We correct annotation format issues, standardize the original datasets into a uniform representation with consistent entity type names across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. We provide baseline results using three pretrained multilingual language models and two large language models to compare the performance of recent models and facilitate future research in NER. We find that no single model is best in all languages and that significant work remains to obtain high performance from LLMs on the NER task. OpenNER is released at https://github.com/bltlab/open-ner.
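
A uniform token-per-line representation is what makes a collection like this directly usable; a minimal reader sketch, assuming the common CoNLL-style BIO layout (one "token<TAB>tag" pair per line, blank lines between sentences; the exact layout is an assumption here, not a specification from the paper):

# Hedged sketch of reading CoNLL-style BIO data, the kind of uniform
# representation a standardized NER collection might use.
def read_bio(path):
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                   # blank line = sentence boundary
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            token, tag = line.split("\t")  # e.g. "Boston\tB-LOC"
            tokens.append(token)
            tags.append(tag)
    if tokens:                             # file may not end with a blank line
        sentences.append((tokens, tags))
    return sentences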

Large Language Models with Temporal Reasoning for Longitudinal Clinical Summarization and Prediction
Maya Kruse | Shiyue Hu | Nicholas Derby | Yifu Wu | Samantha Stonbraker | Bingsheng Yao | Dakuo Wang | Elizabeth M. Goldberg | Yanjun Gao
Findings of the Association for Computational Linguistics: EMNLP 2025

Recent advances in large language models (LLMs) have shown potential in clinical text summarization, but their ability to handle long patient trajectories with multi-modal data spread across time remains underexplored. This study systematically evaluates several state-of-the-art open-source LLMs, their Retrieval Augmented Generation (RAG) variants, and chain-of-thought (CoT) prompting on long-context clinical summarization and prediction. We examine their ability to synthesize structured and unstructured Electronic Health Record (EHR) data while reasoning over temporal coherence, by re-engineering existing tasks, including discharge summarization and diagnosis prediction, from two publicly available EHR datasets. Our results indicate that long context windows improve input integration but do not consistently enhance clinical reasoning, and that LLMs still struggle with temporal progression and rare disease prediction. While RAG reduces hallucination in some cases, it does not fully address these limitations. Our work fills a gap in long clinical text summarization, establishing a foundation for evaluating LLMs with multi-modal data and temporal reasoning.
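
For intuition, temporal CoT prompting here amounts to ordering the record chronologically and asking the model to reason over that timeline before predicting; an illustrative prompt builder, where the wording and note structure are assumptions rather than the paper's exact prompts:

# Illustrative sketch of a chain-of-thought prompt over a chronologically
# ordered EHR trajectory; the prompt wording is hypothetical.
def build_cot_prompt(notes):
    """notes: list of (timestamp, text) pairs for one patient."""
    timeline = "\n".join(f"[{ts}] {text}" for ts, text in sorted(notes))
    return (
        "Below is a patient's record in chronological order.\n"
        f"{timeline}\n"
        "Reason step by step about how the patient's condition evolved over "
        "time, then predict the most likely discharge diagnosis."
    )

print(build_cot_prompt([("2024-01-02", "Admitted with chest pain."),
                        ("2024-01-03", "Troponin elevated; ECG changes.")]))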

2023

Improving NER Research Workflows with SeqScore
Constantine Lignos | Maya Kruse | Andrew Rueda
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)

We describe the features of SeqScore, an MIT-licensed Python toolkit for working with named entity recognition (NER) data. While SeqScore began as a tool for NER scoring, it has been expanded to help with the full lifecycle of working with NER data: validating annotation, providing at-a-glance and detailed summaries of the data, modifying annotation to support experiments, scoring system output, and aiding with error analysis. SeqScore is released via PyPI (https://pypi.org/project/seqscore/) and development occurs on GitHub (https://github.com/bltlab/seqscore).
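
To make the scoring step concrete: entity-level NER evaluation decodes BIO tags into labeled spans and compares predicted spans to reference spans exactly. A conceptual sketch of that computation (this is the underlying idea, not SeqScore's API):

# Conceptual sketch of entity-level NER scoring (not SeqScore's API):
# decode BIO tags into (type, start, end) spans, then compute F1 on
# exact span matches between reference and system output.
def bio_to_spans(tags):
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        if tag.startswith("B-") or tag == "O" or (
            tag.startswith("I-") and tag[2:] != etype
        ):
            if start is not None:
                spans.append((etype, start, i))
            start, etype = (i, tag[2:]) if tag.startswith(("B-", "I-")) else (None, None)
    return spans

def span_f1(reference, predicted):
    ref, pred = set(bio_to_spans(reference)), set(bio_to_spans(predicted))
    tp = len(ref & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

gold = ["B-LOC", "I-LOC", "O", "B-PER"]
sys_out = ["B-LOC", "I-LOC", "O", "O"]
print(span_f1(gold, sys_out))  # one of two entities found: F1 = 2/3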