Thomas Thebaud


2025

pdf bib
Paired by the Teacher: Turning Unpaired Data into High-Fidelity Pairs for Low-Resource Text Generation
Yen-Ju Lu | Thomas Thebaud | Laureano Moro-Velazquez | Najim Dehak | Jesus Villalba
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

We present Paired by the Teacher (PbT), a two-stage teacher–student pipeline that synthesizes accurate input–output pairs without human labels or parallel data. In many low-resource natural language generation (NLG) scenarios, practitioners may have only raw outputs, like highlights, recaps, or questions, or only raw inputs, such as articles, dialogues, or paragraphs, but seldom both. This mismatch forces small models to learn from very few examples or rely on costly, broad-scope synthetic examples produced by large LLMs. PbT addresses this by asking a teacher LLM to compress each unpaired example into a concise intermediate representation (IR), and training a student to reconstruct inputs from IRs. This enables outputs to be paired with student-generated inputs, yielding high-quality synthetic data. We evaluate PbT on five benchmarks—document summarization (XSum, CNNDM), dialogue summarization (SAMSum, DialogSum), and question generation (SQuAD)—as well as an unpaired setting on SwitchBoard (paired with DialogSum summaries). An 8B student trained only on PbT data outperforms models trained on 70 B teacher-generated corpora and other unsupervised baselines, coming within 1.2 ROUGE-L of human-annotated pairs and closing 82% of the oracle gap at one-third the annotation cost of direct synthesis. Human evaluation on SwitchBoard further confirms that only PbT produces concise, faithful summaries aligned with the target style, highlighting its advantage of generating in-domain sources that avoid the mismatch, limiting direct synthesis.

2024

pdf bib
Finding Spoken Identifications: Using GPT-4 Annotation for an Efficient and Fast Dataset Creation Pipeline
Maliha Jahan | Helin Wang | Thomas Thebaud | Yinglun Sun | Giang Ha Le | Zsuzsanna Fagyal | Odette Scharenborg | Mark Hasegawa-Johnson | Laureano Moro Velazquez | Najim Dehak
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The growing emphasis on fairness in speech-processing tasks requires datasets with speakers from diverse subgroups that allow training and evaluating fair speech technology systems. However, creating such datasets through manual annotation can be costly. To address this challenge, we present a semi-automated dataset creation pipeline that leverages large language models. We use this pipeline to generate a dataset of speakers identifying themself or another speaker as belonging to a particular race, ethnicity, or national origin group. We use OpenaAI’s GPT-4 to perform two complex annotation tasks- separating files relevant to our intended dataset from the irrelevant ones (filtering) and finding and extracting information on identifications within a transcript (tagging). By evaluating GPT-4’s performance using human annotations as ground truths, we show that it can reduce resources required by dataset annotation while barely losing any important information. For the filtering task, GPT-4 had a very low miss rate of 6.93%. GPT-4’s tagging performance showed a trade-off between precision and recall, where the latter got as high as 97%, but precision never exceeded 45%. Our approach reduces the time required for the filtering and tagging tasks by 95% and 80%, respectively. We also present an in-depth error analysis of GPT-4’s performance.

2023

pdf bib
JHU IWSLT 2023 Dialect Speech Translation System Description
Amir Hussein | Cihan Xiao | Neha Verma | Thomas Thebaud | Matthew Wiesner | Sanjeev Khudanpur
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

This paper presents JHU’s submissions to the IWSLT 2023 dialectal and low-resource track of Tunisian Arabic to English speech translation. The Tunisian dialect lacks formal orthography and abundant training data, making it challenging to develop effective speech translation (ST) systems. To address these challenges, we explore the integration of large pre-trained machine translation (MT) models, such as mBART and NLLB-200 in both end-to-end (E2E) and cascaded speech translation (ST) systems. We also improve the performance of automatic speech recognition (ASR) through the use of pseudo-labeling data augmentation and channel matching on telephone data. Finally, we combine our E2E and cascaded ST systems with Minimum Bayes-Risk decoding. Our combined system achieves a BLEU score of 21.6 and 19.1 on test2 and test3, respectively.