Nathan Bodenstab
2026
SchemaRAG: Dynamic Large Schema Reduction for LLM-driven Structured Information Extraction
Sin Yu Bonnie Ho | Arlie Coles | Erik Larsson | Eric Marshall | Nathan Bodenstab | Paul Vozila
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Extracting structured data from unstructured text using large language models (LLMs) becomes challenging when the target schemas are large and complex. In such cases, including the full schema in the prompt increases cost and latency, risks lost-in-the-middle performance degradation, and can exceed context length limits. We propose SchemaRAG, a retrieval-augmented generation (RAG) framework that dynamically prunes the output schema space for schema-conditioned information extraction tasks by leveraging schema metadata and few-shot examples (when available). We evaluate SchemaRAG on real-world healthcare and e-commerce datasets. Our results show that SchemaRAG can achieve up to an 8.8% increase in micro-F1, a 47% reduction in latency, and a 48% reduction in token costs, demonstrating its practicality for large-schema extraction.
2025
Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in Two Real-World Clinical Applications
Jean-Philippe Corbeil | Asma Ben Abacha | George Michalopoulos | Phillip Swazinna | Miguel Del-Agua | Jerome Tremblay | Akila Jeeson Daniel | Cari Bader | Kevin Cho | Pooja Krishnan | Nathan Bodenstab | Thomas Lin | Wenxuan Teng | Francois Beaulieu | Paul Vozila
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large language models (LLMs) such as GPT-4o and o1 have demonstrated strong performance on clinical natural language processing (NLP) tasks across multiple medical benchmarks. Nonetheless, two high-impact NLP tasks — structured tabular reporting from nurse dictations and medical order extraction from doctor-patient consultations — remain underexplored due to data scarcity and sensitivity, despite active industry efforts. Practical solutions to these real-world clinical tasks can significantly reduce the documentation burden on healthcare providers, allowing greater focus on patient care. In this paper, we investigate these two challenging tasks using private and open-source clinical datasets, evaluating the performance of both open- and closed-weight LLMs, and analyzing their respective strengths and limitations. Furthermore, we propose an agentic pipeline for generating realistic, non-sensitive nurse dictations, enabling structured extraction of clinical observations. To support further research in both areas, we release SYNUR and SIMORD, the first open-source datasets for nurse observation extraction and medical order extraction.
2012
Finite-State Chart Constraints for Reduced Complexity Context-Free Parsing Pipelines
Brian Roark | Kristy Hollingshead | Nathan Bodenstab
Computational Linguistics, Volume 38, Issue 4 - December 2012
2011
Efficient Matrix-Encoded Grammars and Low Latency Parallelization Strategies for CYK
Aaron Dunlop | Nathan Bodenstab | Brian Roark
Proceedings of the 12th International Conference on Parsing Technologies