Jeff Ma
2026
SAUCE: Summary Analysis Using Conversation Entailment
Man-Ling Sung | Hemanth Kandula | Jeff Ma | William Hartmann | Matthew Snover
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
With the growing need to evaluate Large Language Models (LLMs) and their applications to speech, challenges persist in summarizing and evaluating conversations that lack a clear end goal. We introduce SAUCE, a reference-free, fact-based evaluation pipeline for cross-lingual conversational speech summarization. It measures the accuracy and the fact coverage of a summary through the entailment between conversation and text. We compare SAUCE against several popular summarization metrics and demonstrate its effectiveness at capturing information loss due to transcription and translation errors and at identifying broken summaries. Crucially, unlike black-box LLM evaluators or dense embedding metrics, SAUCE is inherently explainable: it maps summary scores to discrete, verifiable facts, allowing users to pinpoint exact hallucinations or omissions. We illustrate how this interpretability helps developers systematically profile LLM behaviors and gives end-users an actionable tool to verify summary accuracy in noisy, real-world conditions. Preliminary investigations show that SAUCE aligns strongly with human judgment.
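
To make the idea of fact-level accuracy and coverage concrete, the following is a minimal sketch only and not the authors' pipeline. It assumes that facts have already been extracted as short sentences, and it replaces the entailment model with a toy word-overlap check. The names `toy_entails` and `fact_scores`, and the two score definitions, are hypothetical choices made for illustration.

```python
from typing import Callable, List, Tuple

# Assumption: an entailment judge takes (premise, hypothesis) and returns True
# when the premise supports the hypothesis. A real system would call an NLI model.
Entails = Callable[[str, str], bool]


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Toy stand-in for an entailment model: every word of the
    hypothesis must occur somewhere in the premise."""
    return set(hypothesis.lower().split()) <= set(premise.lower().split())


def fact_scores(
    conversation: str,
    summary: str,
    conversation_facts: List[str],
    summary_facts: List[str],
    entails: Entails = toy_entails,
) -> Tuple[float, float, List[str], List[str]]:
    """Return (accuracy, coverage, unsupported, missing).

    accuracy: share of summary facts supported by the conversation.
    coverage: share of conversation facts supported by the summary.
    unsupported: summary facts the conversation does not support
                 (candidate hallucinations).
    missing: conversation facts the summary does not support
             (candidate omissions).
    """
    unsupported = [f for f in summary_facts if not entails(conversation, f)]
    missing = [f for f in conversation_facts if not entails(summary, f)]

    accuracy = (
        1.0 - len(unsupported) / len(summary_facts) if summary_facts else 0.0
    )
    coverage = (
        1.0 - len(missing) / len(conversation_facts) if conversation_facts else 0.0
    )
    return accuracy, coverage, unsupported, missing


if __name__ == "__main__":
    conversation = "alice will book the flight on monday and bob will call the hotel"
    summary = "alice will book the flight and bob will cancel the hotel"
    result = fact_scores(
        conversation,
        summary,
        conversation_facts=["alice will book the flight", "bob will call the hotel"],
        summary_facts=["alice will book the flight", "bob will cancel the hotel"],
    )
    print(result)
```

Because both scores are built from lists of individual facts, the `unsupported` and `missing` lists show which statements lowered each score, which is the kind of traceability the abstract describes.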