2026
A Dataset for Evaluating ASR on Specialized Vocabulary
Emily Haubert Klering | Eduardo Gabriel Cortes | Tatjana Chernenko | Mariana Vargas Trarbach | Gabriel de Oliveira Ramos | Sandro José Rigo | Maitê Dupont | Ana Luiza Treichel Vianna | Gabriela Krause dos Santos | Vinicius Meirelles Pereira | Denis Andrei de Araujo | Rafael Kunst
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Evaluating the ability of Automatic Speech Recognition (ASR) models to transcribe specialized vocabulary remains a persistent challenge, as standard datasets predominantly feature common words and thus obscure weaknesses on rare or out-of-vocabulary (OOV) terms. To address this limitation, we introduce a linguistically curated bilingual dataset (English and Portuguese) comprising 13,846 utterances (18.7 hours) distributed across synthetic and literature-derived subsets, with OOV rates reaching up to 100%. We further propose a diagnostic evaluation framework that partitions recognition performance into Biased Word Error Rate (B-WER), targeting domain-specific jargon, and Unbiased Word Error Rate (U-WER), focusing on general vocabulary. Baseline evaluations using Whisper models (medium, large-v3, and large-v3-turbo) confirm the necessity of this framework. On the most challenging datasets, B-WER reaches 0.88–0.90, whereas U-WER remains as low as 0.06–0.19, demonstrating that conventional WER masks critical failure modes in jargon recognition. Additionally, an oracle upper bound experiment shows that providing correct jargon via prompting reduces B-WER by 0.50–0.70 absolute, quantifying the considerable potential for contextual biasing. We release the datasets and evaluation scripts as a reproducible benchmark to foster research on domain-aware contextual biasing and OOV handling in ASR systems.
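The abstract describes partitioning recognition errors into a Biased WER over domain-specific jargon and an Unbiased WER over general vocabulary. A minimal sketch of how such a partitioned metric can be computed is shown below; the function names (`align`, `partitioned_wer`) and the convention of attributing insertions to the unbiased pool are our illustrative assumptions, not the paper's released evaluation scripts.

```python
def align(ref, hyp):
    """Levenshtein-align two word sequences.

    Returns a list of (ref_word_or_None, hyp_word_or_None) pairs, where
    None on the reference side marks an insertion and None on the
    hypothesis side marks a deletion.
    """
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrace the optimal path, preferring diagonal (match/substitution) moves.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))  # deletion
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))  # insertion
            j -= 1
    return pairs[::-1]


def partitioned_wer(ref, hyp, bias_words):
    """Return (B-WER, U-WER): error rates over biased vs. unbiased reference words.

    `bias_words` is the set of domain-specific terms; each reference word is
    scored against the biased or unbiased pool depending on membership.
    Insertions are attributed to the unbiased pool here (a simplifying
    convention for this sketch).
    """
    b_err = u_err = b_tot = u_tot = 0
    for r, h in align(ref, hyp):
        if r is None:
            u_err += 1  # insertion: charged to the unbiased pool (assumption)
            continue
        if r in bias_words:
            b_tot += 1
            b_err += (h != r)
        else:
            u_tot += 1
            u_err += (h != r)
    return b_err / max(b_tot, 1), u_err / max(u_tot, 1)
```

With a single jargon substitution (e.g. a misrecognized drug name) and the rest of the utterance intact, this yields a B-WER of 1.0 and a U-WER of 0.0, illustrating how aggregate WER can hide a total failure on the biased terms.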