Thomas Thebaud
2026
Beyond Transcripts: Iterative Peer-Editing with Audio Unlocks High-Quality Human Summaries of Conversational Speech
Kaavya Chaparala | Thomas Thebaud | Jesus Villalba Lopez | Laureano Moro-Velazquez | Peter Viechnicki | Najim Dehak
Proceedings of the Fifteenth Language Resources and Evaluation Conference
There are not enough established benchmarks for the task of speech summarization. Creating new benchmarks demands human annotation, as LLMs could embed systemic errors and bias into datasets. We test ten annotation workflows varying input modality (audio, transcript, or both) and the inclusion of editing (self or peer-editing) to investigate potential quality tradeoffs from using human annotators to summarize audio. We compare human audio-based summaries to human transcript-based summaries to track the impact of the different information modalities on summary quality. We also compare the human outputs against four LLM benchmarks (three text, one audio) to examine whether human-written summaries are less informative than highly fluent automated outputs. We find that audio-based summaries are less informative and more compressed than transcript summaries. However, iterative peer-editing with audio mitigates this difference, enabling audio-based summaries to be as informative as their transcript counterparts and LLM summaries. These findings validate iterative peer-editing among human annotators for the creation of benchmarks informed by both lexical and prosodic information. This enables crucial dataset collection even in settings where transcripts are unavailable.
Towards Fair Speech Recognition: Mitigating Demographic Bias in End-to-End ASR Systems
Maliha Jahan | Thomas Thebaud | Zsuzsanna Fagyal | Jesus Villalba | Mark Hasegawa-Johnson | Laureano Moro Velazquez | Najim Dehak
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Demographic bias in the performance of speech and language technology has been an active area of recent research. Many studies have shown the existence of demographic biases in Automatic Speech Recognition (ASR) systems. In this work, we propose a novel model-agnostic and demographic label-agnostic approach, called DARe, to mitigate any existing bias in an ASR system towards certain speaker groups. We built a debiasing module that sits between the feature extractor of an ASR system and the rest of that system. The module includes content-group disentanglers to separate content and group, a demographic classifier, and adversarial reweighting. To eliminate the need for demographic labels, we generated pseudo-group labels by extracting speaker embeddings and clustering them. We worked with three ASR systems: Wav2Vec2 base, SEW tiny, and Whisper small. We used the FAI dataset, which contains naturalistic conversations with speakers who self-identify their demographic attributes. We used Word Error Rate (WER) as a metric of ASR performance and a Poisson regression-based approach to evaluate the racial fairness of the models. We compared the racial bias of the models before and after applying our proposed approach and observed a significant improvement in fairness.
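The abstract above uses Word Error Rate (WER) as its ASR performance metric. As a reminder of how that metric is usually defined, here is a minimal sketch that computes word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the paper's own scoring setup, including any text normalization, is not described in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenizes on whitespace and applies no text
    normalization, which real ASR scoring pipelines usually do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.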
2025
Paired by the Teacher: Turning Unpaired Data into High-Fidelity Pairs for Low-Resource Text Generation
Yen-Ju Lu | Thomas Thebaud | Laureano Moro-Velazquez | Najim Dehak | Jesus Villalba
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
We present Paired by the Teacher (PbT), a two-stage teacher–student pipeline that synthesizes accurate input–output pairs without human labels or parallel data. In many low-resource natural language generation (NLG) scenarios, practitioners may have only raw outputs, like highlights, recaps, or questions, or only raw inputs, such as articles, dialogues, or paragraphs, but seldom both. This mismatch forces small models to learn from very few examples or rely on costly, broad-scope synthetic examples produced by large LLMs. PbT addresses this by asking a teacher LLM to compress each unpaired example into a concise intermediate representation (IR), and training a student to reconstruct inputs from IRs. This enables outputs to be paired with student-generated inputs, yielding high-quality synthetic data. We evaluate PbT on five benchmarks, covering document summarization (XSum, CNNDM), dialogue summarization (SAMSum, DialogSum), and question generation (SQuAD), as well as an unpaired setting on SwitchBoard (paired with DialogSum summaries). An 8B student trained only on PbT data outperforms models trained on 70B teacher-generated corpora and other unsupervised baselines, coming within 1.2 ROUGE-L of human-annotated pairs and closing 82% of the oracle gap at one-third the annotation cost of direct synthesis. Human evaluation on SwitchBoard further confirms that only PbT produces concise, faithful summaries aligned with the target style, highlighting its advantage of generating in-domain sources that avoid the mismatch that limits direct synthesis.
2024
Finding Spoken Identifications: Using GPT-4 Annotation for an Efficient and Fast Dataset Creation Pipeline
Maliha Jahan | Helin Wang | Thomas Thebaud | Yinglun Sun | Giang Ha Le | Zsuzsanna Fagyal | Odette Scharenborg | Mark Hasegawa-Johnson | Laureano Moro Velazquez | Najim Dehak
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
The growing emphasis on fairness in speech-processing tasks requires datasets with speakers from diverse subgroups that allow training and evaluating fair speech technology systems. However, creating such datasets through manual annotation can be costly. To address this challenge, we present a semi-automated dataset creation pipeline that leverages large language models. We use this pipeline to generate a dataset of speakers identifying themselves or another speaker as belonging to a particular race, ethnicity, or national origin group. We use OpenAI’s GPT-4 to perform two complex annotation tasks: separating files relevant to our intended dataset from the irrelevant ones (filtering), and finding and extracting information on identifications within a transcript (tagging). By evaluating GPT-4’s performance using human annotations as ground truths, we show that it can reduce resources required by dataset annotation while barely losing any important information. For the filtering task, GPT-4 had a very low miss rate of 6.93%. GPT-4’s tagging performance showed a trade-off between precision and recall, where the latter got as high as 97%, but precision never exceeded 45%. Our approach reduces the time required for the filtering and tagging tasks by 95% and 80%, respectively. We also present an in-depth error analysis of GPT-4’s performance.
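The abstract above reports a miss rate for filtering and a precision/recall trade-off for tagging. The sketch below shows how these three quantities are conventionally derived from counts of true positives, false positives, and false negatives when human annotations serve as ground truth. The function name `precision_recall_miss` is hypothetical, and the standard definitions are assumed here; the abstract does not spell out how the paper computes its figures.

```python
def precision_recall_miss(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, miss_rate) from confusion counts.

    tp: items the model flagged that the human annotation also marks relevant
    fp: items the model flagged that the human annotation marks irrelevant
    fn: relevant items the model failed to flag
    """
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("precision or recall is undefined for these counts")
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    miss_rate = fn / (tp + fn)  # equivalently 1 - recall
    return precision, recall, miss_rate
```

Under these definitions, a low miss rate and a high recall describe the same property: few relevant items are lost, even if many irrelevant ones are flagged along the way.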
2023
JHU IWSLT 2023 Dialect Speech Translation System Description
Amir Hussein | Cihan Xiao | Neha Verma | Thomas Thebaud | Matthew Wiesner | Sanjeev Khudanpur
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
This paper presents JHU’s submissions to the IWSLT 2023 dialectal and low-resource track of Tunisian Arabic to English speech translation. The Tunisian dialect lacks formal orthography and abundant training data, making it challenging to develop effective speech translation (ST) systems. To address these challenges, we explore the integration of large pre-trained machine translation (MT) models, such as mBART and NLLB-200, in both end-to-end (E2E) and cascaded ST systems. We also improve the performance of automatic speech recognition (ASR) through the use of pseudo-labeling data augmentation and channel matching on telephone data. Finally, we combine our E2E and cascaded ST systems with Minimum Bayes-Risk decoding. Our combined system achieves BLEU scores of 21.6 and 19.1 on test2 and test3, respectively.
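The abstract above combines systems with Minimum Bayes-Risk (MBR) decoding. A minimal sketch of the general idea follows: given candidate translations pooled from several systems, pick the one with the highest total utility when scored against all the other candidates. Everything specific here is an assumption made for illustration: the names `unigram_f1` and `mbr_select` are hypothetical, candidates are weighted uniformly, and a simple unigram F1 stands in for the utility function, which the abstract does not specify.

```python
from collections import Counter


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy utility: F1 over whitespace-token overlap (a stand-in metric)."""
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates: list[str]) -> str:
    """Return the candidate with the highest summed utility against the others.

    With uniform weights, maximizing summed utility is the same as
    minimizing expected risk over the candidate pool.
    """
    if not candidates:
        raise ValueError("need at least one candidate")

    def total_utility(i: int) -> float:
        return sum(
            unigram_f1(candidates[i], other)
            for j, other in enumerate(candidates)
            if j != i
        )

    best_index = max(range(len(candidates)), key=total_utility)
    return candidates[best_index]
```

The selected output is the candidate that agrees most with the rest of the pool, so an outlier produced by a single system is unlikely to be chosen.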