A Comparative Approach for Auditing Multilingual Phonetic Transcript Archives

Farhan Samir, Emily P. Ahn, Shreya Prakash, Márton Soskuthy, Vered Shwartz, Jian Zhu


Abstract
Curating datasets that span multiple languages is challenging. To make collection more scalable, researchers often incorporate one or more imperfect classifiers into the process, such as language identification models. These models, however, are prone to failure, leaving some language partitions unreliable for downstream tasks. We introduce a statistical test, the Preference Proportion Test, for identifying such unreliable partitions. By annotating only 20 samples per language partition, we identify systematic transcription errors in 10 language partitions of a recent large multilingual transcribed audio archive, X-IPAPack (Zhu et al., 2024). We find that filtering out these low-quality partitions when training models for the downstream task of phonetic transcription brings substantial benefits, most notably a 25.7% relative improvement on transcribing recordings in out-of-distribution languages. Our work contributes an effective method for auditing multilingual audio archives.
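The abstract does not spell out the internals of the Preference Proportion Test, but a test of this kind can be sketched as an exact one-sided binomial test: an annotator compares the archive's transcription against an alternative for each of the 20 samples, and the partition is flagged when the proportion preferring the alternative is significantly above chance. The function name, the chance baseline of 0.5, and the significance threshold below are illustrative assumptions, not the paper's specification.

```python
from math import comb

def preference_proportion_test(preferences, p0=0.5, alpha=0.05):
    """Hypothetical sketch of a preference-proportion audit.

    `preferences` is a list of booleans: True when the annotator preferred
    the alternative transcription over the archive's transcription for a
    sample. Returns the one-sided exact binomial p-value and a flag that
    is True when the partition looks unreliable at level `alpha`.
    """
    n = len(preferences)
    k = sum(preferences)
    # P(X >= k) under X ~ Binomial(n, p0), computed exactly.
    p_value = sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))
    return p_value, p_value < alpha

# Example: the annotator preferred the alternative in 16 of 20 samples,
# which is unlikely under chance, so the partition would be flagged.
p, unreliable = preference_proportion_test([True] * 16 + [False] * 4)
```

With 16 of 20 preferences against the archive, the exact p-value is about 0.006, well under a 0.05 threshold, so this partition would be filtered out before training.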
Anthology ID:
2025.tacl-1.29
Volume:
Transactions of the Association for Computational Linguistics, Volume 13
Year:
2025
Address:
Cambridge, MA
Venue:
TACL
Publisher:
MIT Press
Pages:
595–612
URL:
https://preview.aclanthology.org/corrections-2025-07/2025.tacl-1.29/
DOI:
10.1162/tacl_a_00759
Cite (ACL):
Farhan Samir, Emily P. Ahn, Shreya Prakash, Márton Soskuthy, Vered Shwartz, and Jian Zhu. 2025. A Comparative Approach for Auditing Multilingual Phonetic Transcript Archives. Transactions of the Association for Computational Linguistics, 13:595–612.
Cite (Informal):
A Comparative Approach for Auditing Multilingual Phonetic Transcript Archives (Samir et al., TACL 2025)
PDF:
https://preview.aclanthology.org/corrections-2025-07/2025.tacl-1.29.pdf