Jesus Villalba


2026

Demographic bias in the performance of speech and language technology has been an active area of recent research, and numerous studies have demonstrated demographic biases in Automatic Speech Recognition (ASR) systems. In this work, we propose DARe, a novel model-agnostic and demographic-label-agnostic approach to mitigating bias in an ASR system toward certain speaker groups. We insert a debiasing module between the ASR's feature extractor and the rest of the model. The module comprises content-group disentanglers that separate content information from group information, a demographic classifier, and adversarial reweighting. To eliminate the need for demographic labels, we generate pseudo-group labels by extracting speaker embeddings and clustering them. We evaluate on three ASR systems (Wav2Vec2 base, SEW tiny, and Whisper small) using the FAI dataset, which contains naturalistic conversations with speakers who self-identify their demographic attributes. We measure ASR performance with Word Error Rate (WER) and assess racial fairness with a Poisson regression-based approach. Comparing the models' racial bias before and after applying our approach, we observe a significant improvement in fairness.
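
The abstract outlines the architecture without code, so here is a minimal PyTorch sketch of one way such a debiasing module could look. Everything in it (class names, layer sizes, and the use of a gradient-reversal layer as the adversarial mechanism for the demographic classifier) is an illustrative assumption, not the paper's implementation; the adversarial-reweighting component, which would rescale per-utterance losses, is omitted for brevity.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.neg() * ctx.lam, None


class DebiasModule(nn.Module):
    """Hypothetical module sitting between the feature extractor and the rest of the ASR."""

    def __init__(self, dim: int, n_groups: int, lam: float = 1.0):
        super().__init__()
        # two disentangler branches: one for content, one for group
        self.content = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.group = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.group_clf = nn.Linear(dim, n_groups)  # (pseudo-)demographic classifier
        self.lam = lam

    def forward(self, feats):  # feats: (batch, time, dim) from the feature extractor
        c = self.content(feats)  # content features, passed on to the rest of the ASR
        g = self.group(feats)    # group features
        # the classifier is trained normally on the group branch...
        group_logits = self.group_clf(g.mean(dim=1))
        # ...and sees the content branch through gradient reversal, pushing
        # the content features to carry no group information
        adv_logits = self.group_clf(GradReverse.apply(c.mean(dim=1), self.lam))
        return c, group_logits, adv_logits


# Pseudo-group labels: cluster utterance-level speaker embeddings so that
# no self-reported demographic labels are needed at training time.
def pseudo_group_labels(speaker_embeddings, n_groups: int):
    return KMeans(n_clusters=n_groups, n_init=10).fit_predict(speaker_embeddings)
```

In this sketch the cluster indices from `pseudo_group_labels` would serve as targets for `group_logits`, while the reversed-gradient `adv_logits` term trains the content branch adversarially.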

2025

We present Paired by the Teacher (PbT), a two-stage teacher–student pipeline that synthesizes accurate input–output pairs without human labels or parallel data. In many low-resource natural language generation (NLG) scenarios, practitioners have only raw outputs, such as highlights, recaps, or questions, or only raw inputs, such as articles, dialogues, or paragraphs, but seldom both. This mismatch forces small models to learn from very few examples or to rely on costly, broad-scope synthetic examples produced by large LLMs. PbT addresses this by asking a teacher LLM to compress each unpaired example into a concise intermediate representation (IR) and training a student to reconstruct inputs from IRs. Each output can then be paired with a student-generated input, yielding high-quality synthetic training data. We evaluate PbT on five benchmarks (document summarization on XSum and CNNDM, dialogue summarization on SAMSum and DialogSum, and question generation on SQuAD) as well as an unpaired setting on SwitchBoard (paired with DialogSum summaries). An 8B student trained only on PbT data outperforms models trained on corpora generated by a 70B teacher and other unsupervised baselines, coming within 1.2 ROUGE-L of human-annotated pairs and closing 82% of the oracle gap at one-third the annotation cost of direct synthesis. Human evaluation on SwitchBoard further confirms that only PbT produces concise, faithful summaries aligned with the target style, highlighting its advantage in generating in-domain sources that avoid the mismatch limiting direct synthesis.
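
As a reading aid, here is a minimal Python sketch of the two-stage pipeline as the abstract describes it. The `teacher` and `student` objects, their `generate` methods, and the compression prompt are hypothetical placeholders, not the paper's code: stage 1 uses the teacher to compress unpaired examples into IRs (and, from raw inputs, to build IR-to-input training data for the student), and stage 2 uses the trained student to reconstruct a synthetic input for each real output.

```python
# All model handles, prompts, and function names below are hypothetical
# assumptions for illustration; the paper's prompts, models, and IR format
# may differ.

def compress_to_ir(teacher, text: str) -> str:
    """Stage 1: the teacher LLM compresses one unpaired example into a concise IR."""
    prompt = (
        "Compress the following text into a short list of key facts, "
        "enough to reconstruct a plausible source document:\n\n" + text
    )
    return teacher.generate(prompt)


def build_student_training_data(teacher, raw_inputs):
    """Teach the student the IR -> input mapping from unpaired raw inputs."""
    return [(compress_to_ir(teacher, x), x) for x in raw_inputs]


def synthesize_pairs(teacher, student, raw_outputs):
    """Stage 2: pair each real output with a student-reconstructed input."""
    pairs = []
    for out in raw_outputs:
        ir = compress_to_ir(teacher, out)  # output -> IR
        inp = student.generate(ir)         # IR -> synthetic in-domain input
        pairs.append((inp, out))           # (synthetic input, real output)
    return pairs
```

The resulting (synthetic input, real output) pairs are what the small downstream model is then fine-tuned on, which is how the synthetic inputs stay in-domain rather than drifting toward the teacher's own style.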