MekongPhon: A Large-Scale Parallel IPA Corpus for Lao and Khmer

Ammon Shurtz, Christian Richardson, Stephen D. Richardson


Abstract
High-quality International Phonetic Alphabet (IPA) transcriptions are a foundational resource for speech and language technologies, yet existing tools for many low-resource languages remain limited in accuracy and scope. In this work, we present MekongPhon, a large-scale parallel IPA corpus for Lao and Khmer. The corpus contains 1.3 million Khmer and 367 thousand Lao orthographic–IPA pairs, each aligned and verified. Transformer-based sequence-to-sequence models trained on MekongPhon achieve under 2% Character Error Rate (CER) on held-out test sets. We further introduce linguistically informed Lao and Khmer transliteration tools for high-speed IPA conversion. Although these tools trade some accuracy for efficiency, they outperform Epitran by 6 to 71 CER points. All data, code, and pretrained models are publicly released to support future research and development in low-resource language technologies.
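For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between a predicted transcription and its reference, divided by the reference length. The sketch below is illustrative only and is not the released MekongPhon evaluation code. It assumes that a "character" is a single Unicode code point (so combining IPA diacritics count separately) and that the score is pooled over the whole test set rather than averaged per item; the function names `edit_distance` and `cer` are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Pooled CER: total edits divided by total reference characters.

    Assumption: pooling over the test set; a per-item average is an
    equally common convention and gives different numbers.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars
```

Under this definition, a reported CER below 2% corresponds to fewer than two character edits per hundred reference characters, and a gap of "6 to 71 CER points" is a difference in percentage points between two such scores.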
Anthology ID:
2026.lrec-main.129
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
1650–1658
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.129/
Cite (ACL):
Ammon Shurtz, Christian Richardson, and Stephen D. Richardson. 2026. MekongPhon: A Large-Scale Parallel IPA Corpus for Lao and Khmer. International Conference on Language Resources and Evaluation, main:1650–1658.
Cite (Informal):
MekongPhon: A Large-Scale Parallel IPA Corpus for Lao and Khmer (Shurtz et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.129.pdf