Johanna Cronenberg

2026

The Added Value of Metadata and Annotations: Evidence from Two Large-Scale, Naturalistic Corpus Studies
Anisia Popescu | Johanna Cronenberg | Ioana Vasilescu | Ioana Chitoran | Lori Lamel | Martine Adda-Decker
Proceedings of the Fifteenth Language Resources and Evaluation Conference

This paper presents two case studies that highlight both the challenges and benefits of working with large-scale, naturalistic phonetic data. Our aim is to encourage researchers not to shy away from phonetic data found “in the wild”, even when such data are messy, noisy, or incomplete – because they can yield robust, novel insights beyond the reach of controlled laboratory studies. We focus on challenges that are endemic to large corpora, including degraded audio quality, sparse or inconsistent annotations, and missing speaker metadata. By comparing two corpus-based studies that diverge in methodology and statistical design, we show how different approaches can mitigate these limitations while still extracting meaningful patterns.

Venues

LREC (1)
