Eric Fosler-Lussier

2026

HydraQE: OSU’s Submission for the IWSLT 2026 Speech Translation Metrics Shared Task
Kevin Krahn | Eric Fosler-Lussier
Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026)

We present HydraQE, our contribution to the IWSLT 2026 Speech Translation Metrics shared task. HydraQE is an end-to-end, reference-free quality estimation (QE) system for speech translation built on a Qwen3-ASR backbone, which accepts source audio and a translation hypothesis as joint input. Hidden states from all backbone layers are combined via a sparsemax scalar mix, then re-encoded by a bidirectional Transformer for full cross-modal interaction. To address the scarcity of human-annotated speech translation data, three independent prediction heads are trained on complementary supervision signals: human direct assessment (DA) annotations, MetricX-24 pseudo-labels, and xCOMET pseudo-labels. We train on a combination of synthetically corrupted examples and silver pseudo-labeled machine translation outputs, using a curriculum that begins on synthetic and silver data and gradually shifts toward human-annotated examples. HydraQE outperforms cascaded text-based baselines and prior direct speech QE systems, demonstrating that end-to-end speech translation QE is competitive with cascaded approaches.
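The layer-combination step in the abstract can be illustrated with a small sketch. It assumes the standard sparsemax projection (Martins and Astudillo, 2016) applied to one learnable score per layer, followed by a weighted sum of per-layer hidden states. The function names, the `gamma` scale, and the use of a single vector per layer are illustrative assumptions, not details taken from the HydraQE paper.

```python
from typing import List, Sequence


def sparsemax(scores: Sequence[float]) -> List[float]:
    """Euclidean projection of `scores` onto the probability simplex.

    Unlike softmax, the result can contain exact zeros, so some
    layers may be dropped from the mix entirely.
    """
    sorted_scores = sorted(scores, reverse=True)
    cumulative = 0.0
    support_sum = 0.0
    support_size = 0
    for k, z in enumerate(sorted_scores, start=1):
        cumulative += z
        # Entry k stays in the support while 1 + k * z_(k) > sum of top-k scores.
        if 1.0 + k * z > cumulative:
            support_size = k
            support_sum = cumulative
    tau = (support_sum - 1.0) / support_size  # threshold subtracted from every score
    return [max(z - tau, 0.0) for z in scores]


def scalar_mix(
    layer_states: Sequence[Sequence[float]],
    layer_scores: Sequence[float],
    gamma: float = 1.0,
) -> List[float]:
    """Weighted sum of per-layer vectors with sparsemax weights.

    `layer_states` holds one vector per layer (hypothetical toy input);
    `gamma` is an assumed global scale, as in ELMo-style scalar mixes.
    """
    weights = sparsemax(layer_scores)
    mixed = [0.0] * len(layer_states[0])
    for weight, state in zip(weights, layer_states):
        if weight == 0.0:
            continue  # layer pruned by sparsemax
        for i, value in enumerate(state):
            mixed[i] += weight * value
    return [gamma * value for value in mixed]


if __name__ == "__main__":
    # Three toy "layers" with two-dimensional hidden states.
    states = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
    print(sparsemax([1.0, 0.8, 0.1]))
    print(scalar_mix(states, [1.0, 0.8, 0.1]))
```

In this toy input the lowest-scoring layer receives a weight of exactly zero, which is the practical difference from a softmax-weighted mix.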

Beyond Length: Context-Aware Expansion and Independence as Developmentally Sensitive Evaluation in Child Utterances
Jiyun Chun | Eric Fosler-Lussier | Michael White | Andrew Perrault
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Evaluating the quality of children’s utterances in adult-child dialogue remains challenging due to insufficient context-sensitive metrics. Common proxies such as Mean Length of Utterance (MLU), lexical diversity (vocd-D), and readability indices (Flesch-Kincaid Grade Level, Gunning Fog Index) are dominated by length and ignore conversational context, missing aspects of response quality such as reasoning depth, topic maintenance, and discourse planning. We introduce an LLM-as-a-judge framework that first classifies the Previous Adult Utterance Type and then scores the child’s response along two axes: Expansion (contextual elaboration and inferential depth) and Independence (the child’s contribution to advancing the discourse). These axes reflect fundamental dimensions in child language development: Expansion captures elaboration, clause combining, and causal and contrastive connectives, while Independence captures initiative, topic control, decreasing reliance on adult scaffolding through growing self-regulation, and audience design. We establish developmental validity by showing age-related patterns and demonstrate predictive value by improving age estimation over common baselines. We further confirm semantic sensitivity by detecting differences tied to discourse relations. Our metrics align with human judgments, enabling large-scale evaluation. This shifts child utterance assessment from simply measuring length to evaluating how meaningfully the child’s speech contributes to and advances the conversation within its context.
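To make the length dependence of the baseline proxies concrete, here is a minimal sketch of two of them. It assumes a word-based approximation of MLU (the standard definition counts morphemes, which needs a morphological analyzer) and the published Flesch-Kincaid Grade Level formula with word, sentence, and syllable counts supplied by the caller. The function names are illustrative and do not come from the paper.

```python
from typing import Sequence


def mlu_words(utterances: Sequence[str]) -> float:
    """Mean length of utterance in words (MLU-w).

    Assumption: whitespace tokens stand in for morphemes, so this
    only approximates morpheme-based MLU.
    """
    if not utterances:
        raise ValueError("need at least one utterance")
    return sum(len(u.split()) for u in utterances) / len(utterances)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level from raw counts.

    Both terms are length ratios (words per sentence, syllables per
    word); nothing in the formula looks at the preceding adult turn.
    """
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


if __name__ == "__main__":
    child_turns = ["I want juice", "no"]
    print(mlu_words(child_turns))
    print(flesch_kincaid_grade(words=100, sentences=10, syllables=150))
```

Both functions depend only on counts within the child's own utterances, which is the limitation the Expansion and Independence axes are meant to address.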

VISTA: Verification In Sequential Turn-based Assessment
Ashley Lewis | Andrew Perrault | Eric Fosler-Lussier | Michael White
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Hallucination, defined here as generated statements unsupported or contradicted by available evidence or conversational context, remains a major obstacle to using conversational AI systems in settings that demand factual reliability. Existing metrics evaluate isolated responses or treat unverifiable content as errors, limiting their use for multi-turn dialogue. We introduce VISTA (Verification In Sequential Turn-based Assessment), a framework for evaluating conversational factuality via claim-level verification and sequential consistency tracking. VISTA decomposes each turn into atomic claims, verifies them against trusted sources and dialogue history, and categorizes unverifiable statements (subjective, contradicted, lacking evidence, or abstaining). Across eight large language models and four dialogue factuality benchmarks (AIS, BEGIN, FaithDial, and FADE), VISTA substantially improves hallucination detection over FActScore and LLM-as-Judge baselines. Human evaluation confirms that VISTA’s decomposition improves annotator agreement and reveals inconsistencies in existing benchmarks. By modeling factuality as a dynamic property of conversation, VISTA offers a more transparent, human-aligned measure of truthfulness in dialogue systems.
