David W Eyre
2025
Tree-of-Quote Prompting Improves Factuality and Attribution in Multi-Hop and Medical Reasoning
Justin Xu | Yiming Li | Zizheng Zhang | Augustine Yui Hei Luk | Mayank Jobanputra | Samarth Oza | Ashley Murray | Meghana Reddy Kasula | Andrew Parker | David W Eyre
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) can produce fluent but factually incorrect outputs and often have limited ability to attribute their claims to source material. This undermines their reliability, particularly in multi-hop and high-stakes domains such as medicine. We propose Tree-of-Quote (ToQ) prompting, a structured framework that decomposes complex questions into subquestions, generates quotes to support each step without retrieval, and selectively advances reasoning based on quote quality. We also introduce FQ-Score, a unified metric that captures answer correctness, attribution fidelity, and reasoning quality. Experiments on StrategyQA, 2WikiMultiHopQA, MuSiQue, MoreHopQA, and MedQA demonstrate that ToQ improves factuality and attribution over standard prompting baselines. To validate FQ-Score as a proxy for human judgment, we conduct two reader studies with clinicians on medical questions, and observe strong correlations. Both clinician scores and FQ-Scores also indicate a preference for ToQ over baselines due to a combination of greater correctness, completeness, and logical flow. Our results suggest ToQ is a promising approach for building more trustworthy and auditable LLM systems.
RadEval: A framework for radiology text evaluation
Justin Xu | Xi Zhang | Javid Abderezaei | Julie Bauml | Roger Boodoo | Fatemeh Haghighi | Ali Ganjizadeh | Eric Brattain | Dave Van Veen | Zaiqiao Meng | David W Eyre | Jean-Benoit Delbrouck
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
We introduce RadEval, a unified, open-source framework for evaluating radiology texts. RadEval consolidates a diverse range of metrics - from classic n-gram overlap (BLEU, ROUGE) and contextual measures (BERTScore) to clinical concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT, TemporalEntityF1) and advanced LLM-based evaluators (GREEN). We refine and standardize implementations, extend GREEN to support multiple imaging modalities with a more lightweight model, and pretrain a domain-specific radiology encoder - demonstrating strong zero-shot retrieval performance. We also release a richly annotated expert dataset with over 450 clinically significant error labels and show how different metrics correlate with radiologist judgment. Finally, RadEval provides statistical testing tools and baseline model evaluations across multiple publicly available datasets, facilitating reproducibility and robust benchmarking in radiology report generation.