Karen Zhou
2025
From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes
Karen Zhou | John Michael Giorgi | Pranav Mani | Peng Xu | Davis Liang | Chenhao Tan
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
AI-generated clinical notes are increasingly used in healthcare, but evaluating their quality remains a challenge due to the high subjectivity and limited scalability of expert review. Existing automated metrics often fail to align with real-world physician preferences. To address this, we propose a pipeline that systematically distills real user feedback into structured checklists for note evaluation. These checklists are designed to be interpretable, grounded in human feedback, and enforceable by LLM-based evaluators. Using deidentified data from over 21,000 clinical encounters (prepared in accordance with the HIPAA safe harbor standard) from a deployed AI medical scribe system, we show that our feedback-derived checklist outperforms a baseline approach in our offline evaluations in terms of coverage, diversity, and predictive power for human ratings. Extensive experiments confirm the checklist’s robustness to quality-degrading perturbations, significant alignment with clinician preferences, and practical value as an evaluation methodology. In offline research settings, our checklist offers a practical tool for flagging notes that may fall short of our defined quality standards.
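As a rough sketch of how checklist items might be enforced by an LLM-based evaluator, consider the Python outline below. The checklist items and the judge callable are illustrative assumptions for this sketch, not the paper's actual feedback-derived items, prompts, or pipeline:

```python
from typing import Callable

# Illustrative checklist items (assumptions for this sketch, not the
# feedback-derived items from the paper).
CHECKLIST = [
    "The note contains no information absent from the encounter transcript.",
    "All medications discussed in the encounter appear in the note.",
    "The assessment and plan are written in clear, complete sentences.",
]

def evaluate_note(
    note: str,
    transcript: str,
    judge: Callable[[str, str, str], bool],
) -> float:
    """Score a note as the fraction of checklist items it satisfies.

    `judge` is a caller-supplied function (e.g., a wrapper around an
    LLM API) that answers True/False for one checklist item, given the
    note and the source transcript.
    """
    passed = sum(judge(item, note, transcript) for item in CHECKLIST)
    return passed / len(CHECKLIST)
```

A note scoring below some chosen threshold could then be flagged for review, mirroring the offline flagging use case described in the abstract.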
2023
Entity-Based Evaluation of Political Bias in Automatic Summarization
Karen Zhou | Chenhao Tan
Findings of the Association for Computational Linguistics: EMNLP 2023
Growing literature has shown that NLP systems may encode social biases; however, the *political* bias of summarization models remains relatively unknown. In this work, we use an entity replacement method to investigate the portrayal of politicians in automatically generated summaries of news articles. We develop an entity-based computational framework to assess the sensitivities of several extractive and abstractive summarizers to the politicians Donald Trump and Joe Biden. We find consistent differences in these summaries upon entity replacement, such as a reduced emphasis on Trump’s presence in the context of the same article and a more individualistic representation of Trump with respect to the collective US government (i.e., administration). These summary dissimilarities are most prominent when the entity is heavily featured in the source article. Our characterization provides a foundation for future studies of bias in summarization and for normative discussions on the ideal qualities of automatic summaries.
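The entity replacement probe can be pictured as a small text transformation: swap the two politicians' mentions in an article, summarize both versions, and compare how prominently each entity appears. The sketch below illustrates this under stated assumptions; the helper names are inventions for this example, not the paper's released code:

```python
import re

def swap_entities(text: str, a: str = "Donald Trump", b: str = "Joe Biden") -> str:
    """Swap all mentions of entity a with entity b, and vice versa."""
    placeholder = "\x00ENT\x00"           # temporary token to avoid clobbering
    text = text.replace(a, placeholder)   # a -> placeholder
    text = text.replace(b, a)             # b -> a
    return text.replace(placeholder, b)   # placeholder -> b
    # A fuller version would also handle surname-only mentions and pronouns.

def mention_count(summary: str, entity: str) -> int:
    """Count occurrences of an entity string in a summary."""
    return len(re.findall(re.escape(entity), summary))

# Usage idea: run any extractive or abstractive summarizer on both
# the original article and swap_entities(article), then compare
# mention_count for each politician across the two summaries.
```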
2021
Assessing Cognitive Linguistic Influences in the Assignment of Blame
Karen Zhou | Ana Smith | Lillian Lee
Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media
Lab studies in cognition and the psychology of morality have proposed some thematic and linguistic factors that influence moral reasoning. This paper assesses how well the findings of these studies generalize to a large corpus of over 22,000 descriptions of fraught situations posted to a dedicated forum. At this social-media site, users judge whether or not an author is in the wrong with respect to the event that the author described. We find that, consistent with lab studies, there are statistically significant differences in uses of first-person passive voice, as well as first-person agents and patients, between descriptions of situations that receive different blame judgments. These features also aid performance in the task of predicting the eventual collective verdicts.
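One of the linguistic features named above, first-person passive voice (e.g., "I was told..."), can be approximated with an off-the-shelf dependency parser. The snippet below uses spaCy as a rough approximation; it is not the paper's exact feature extraction:

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def first_person_passive_count(text: str) -> int:
    """Count clauses whose passive subject is a first-person pronoun."""
    doc = nlp(text)
    return sum(
        1
        for tok in doc
        if tok.dep_ == "nsubjpass" and tok.lower_ in {"i", "we"}
    )

print(first_person_passive_count("I was told to leave the party early."))  # 1
```

Counts like these, alongside agent/patient role features, could then serve as inputs to a classifier predicting the forum's collective verdict.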