Pranav Mani
2025
From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes
Karen Zhou | John Michael Giorgi | Pranav Mani | Peng Xu | Davis Liang | Chenhao Tan
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
AI-generated clinical notes are increasingly used in healthcare, but evaluating their quality remains a challenge due to the high subjectivity and limited scalability of expert review. Existing automated metrics often fail to align with real-world physician preferences. To address this, we propose a pipeline that systematically distills real user feedback into structured checklists for note evaluation. These checklists are designed to be interpretable, grounded in human feedback, and enforceable by LLM-based evaluators. Using deidentified data from over 21,000 clinical encounters (prepared in accordance with the HIPAA Safe Harbor standard) from a deployed AI medical scribe system, we show that our feedback-derived checklist outperforms a baseline approach in coverage, diversity, and predictive power for human ratings in our offline evaluations. Extensive experiments confirm the checklist’s robustness to quality-degrading perturbations, significant alignment with clinician preferences, and practical value as an evaluation methodology. In offline research settings, our checklist offers a practical tool for flagging notes that may fall short of our defined quality standards.
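As a rough illustration of the checklist-based evaluation the abstract describes, the sketch below scores a note against a feedback-derived checklist with an LLM judge. The checklist items, prompt wording, pass/fail scoring rule, and the `llm_judge` callable are hypothetical stand-ins, not the paper's actual pipeline.

```python
# Minimal sketch: evaluate a clinical note against a feedback-derived checklist
# using an LLM judge. All items and prompts below are illustrative only.

CHECKLIST = [
    "The note does not state findings absent from the encounter transcript.",
    "Medication names and dosages match what was discussed.",
    "The assessment and plan address the chief complaint.",
]

def evaluate_note(note: str, checklist: list[str], llm_judge) -> dict:
    """Ask an LLM judge (any callable: prompt -> text) whether the note
    satisfies each checklist item, and flag the note if any item fails."""
    results = {}
    for item in checklist:
        prompt = (
            "You are reviewing an AI-generated clinical note.\n"
            f"Checklist item: {item}\n"
            f"Note:\n{note}\n"
            "Answer PASS or FAIL."
        )
        verdict = llm_judge(prompt).strip().upper()
        results[item] = verdict.startswith("PASS")
    # Flag the note for human review if it fails any checklist item.
    results["flagged"] = not all(results[item] for item in checklist)
    return results
```

In this sketch, a note is flagged as soon as a single item fails; the paper's offline evaluation setting would determine the actual aggregation and thresholds.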
2024
Fast Evidence Extraction for Grounded Language Model Outputs
Pranav Mani | Davis Liang | Zachary Chase Lipton
Proceedings of the Seventh Fact Extraction and VERification Workshop (FEVER)
Summarizing documents with Large Language Models (LLMs) warrants a rigorous inspection of the resulting outputs by humans. However, unaided verification of generated outputs is time-intensive and intractable at scale. For high-stakes applications like healthcare where verification is necessary, expediting this step can unlock massive gains in productivity. In this paper, we focus on the task of evidence extraction for abstractive summarization: for each summary line, extract the corresponding evidence spans from a source document. Viewing this evidence extraction problem through the lens of extractive question answering, we train a set of fast and scalable hierarchical architectures: EarlyFusion, MidFusion, and LateFusion. Our experiments show that (i) our method outperforms the state-of-the-art by 1.4% relative F1-Score; (ii) our model architecture reduces latency by 4x over a RoBERTa-Large baseline; and (iii) pretraining on an extractive QA corpus confers positive transfer to evidence extraction, especially in low-resource regimes.
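The extractive-QA framing of evidence extraction can be sketched as follows: each summary sentence is treated as the "question" and the source document as the context from which a supporting span is extracted. The sketch below uses a generic Hugging Face question-answering pipeline with an illustrative model name and confidence threshold, standing in for the paper's EarlyFusion, MidFusion, and LateFusion architectures.

```python
# Sketch of evidence extraction framed as extractive QA: each summary sentence
# plays the role of a question, and the extracted span from the source document
# is its candidate evidence. A generic QA model stands in for the paper's
# hierarchical fusion architectures.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

def extract_evidence(summary_sentences, source_document, min_score=0.1):
    """Return (summary sentence, evidence span or None, confidence) triples."""
    evidence = []
    for sentence in summary_sentences:
        result = qa(question=sentence, context=source_document)
        if result["score"] >= min_score:
            evidence.append((sentence, result["answer"], result["score"]))
        else:
            # Low confidence: treat the sentence as unsupported by the source.
            evidence.append((sentence, None, result["score"]))
    return evidence
```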
Co-authors
- Davis Liang 2
- John Michael Giorgi 1
- Zachary Chase Lipton 1
- Chenhao Tan 1
- Peng Xu 1