Phonetic Segmentation of the UCLA Phonetics Lab Archive

Eleanor Chodroff, Blaž Pažon, Annie Baker, Steven Moran


Abstract
Research in speech technologies and comparative linguistics depends on access to diverse and accessible speech data. The UCLA Phonetics Lab Archive is one of the earliest multilingual speech corpora, with long-form audio recordings and phonetic transcriptions for 314 languages (Ladefoged et al., 2009). Recently, 95 of these languages were time-aligned with word-level phonetic transcriptions (Li et al., 2021). Here we present VoxAngeles, a corpus of audited phonetic transcriptions and phone-level alignments of the UCLA Phonetics Lab Archive, which uses the 95-language CMU re-release as our starting point. VoxAngeles also includes word- and phone-level segmentations from the original UCLA corpus, as well as phonetic measurements of word and phone durations, vowel formants, and vowel f0. This corpus enhances the usability of the original data, particularly for quantitative phonetic typology, as demonstrated through a case study of vowel intrinsic f0. We also discuss the utility of the VoxAngeles corpus for general research and pedagogy in crosslinguistic phonetics, as well as for low-resource and multilingual speech technologies. VoxAngeles is free to download and use under a CC-BY-NC 4.0 license.
Anthology ID:
2024.lrec-main.1114
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
SIG:
Publisher:
ELRA and ICCL
Note:
Pages:
12724–12733
Language:
URL:
https://aclanthology.org/2024.lrec-main.1114
DOI:
Bibkey:
Cite (ACL):
Eleanor Chodroff, Blaž Pažon, Annie Baker, and Steven Moran. 2024. Phonetic Segmentation of the UCLA Phonetics Lab Archive. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 12724–12733, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Phonetic Segmentation of the UCLA Phonetics Lab Archive (Chodroff et al., LREC-COLING 2024)
Copy Citation:
PDF:
https://preview.aclanthology.org/landing_page/2024.lrec-main.1114.pdf