Fast, Not Fancy: Rethinking G2P with Rich Data and Statistical Models

Mahta Fetrat Qharabagh, Zahra Dehghanian, Hamid R. Rabiee


Abstract
Homograph disambiguation remains a significant challenge in grapheme-to-phoneme (G2P) conversion, especially for low-resource languages. This challenge is twofold: (1) creating balanced and comprehensive homograph datasets is labor-intensive and costly, and (2) existing disambiguation strategies introduce additional latency, making them unsuitable for real-time applications such as screen readers and other accessibility tools. In this paper, we address both issues. First, we propose a semi-automated pipeline for constructing homograph-focused datasets, introduce the HomoRich dataset generated through this pipeline, and demonstrate its effectiveness by using it to enhance a state-of-the-art deep learning-based G2P system for Persian. Second, we advocate for a paradigm shift: using rich offline datasets to inform the development of fast, statistical methods suitable for latency-sensitive accessibility applications such as screen readers. To this end, we extend one of the best-known rule-based G2P systems, eSpeak, into a fast, homograph-aware version, HomoFast eSpeak. Our results show improvements of approximately 30 percentage points in homograph disambiguation accuracy for both the deep learning-based and eSpeak systems.
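To make the "rich offline data, fast statistical inference" idea concrete, the sketch below shows one way a lightweight, context-count homograph disambiguator could be trained offline on a labeled corpus (e.g., a HomoRich-style dataset) and queried cheaply at synthesis time before handing the chosen pronunciation to a rule-based engine such as eSpeak. This is a minimal illustration under our own assumptions: the class, the scoring scheme, and the English example with its phoneme strings are hypothetical and do not reproduce the paper's HomoFast eSpeak implementation.

```python
from collections import Counter, defaultdict


class HomographDisambiguator:
    """Toy context-count disambiguator (hypothetical; not the paper's model).

    Offline: for each homograph, count which context words co-occur with each
    attested pronunciation in a labeled corpus. Online: pick the pronunciation
    whose context counts best match the words around the target token, with a
    small frequency prior as a tie-breaker and fallback.
    """

    def __init__(self, window: int = 3):
        self.window = window
        # homograph -> pronunciation -> Counter of context words
        self.context_counts = defaultdict(lambda: defaultdict(Counter))
        # homograph -> Counter of pronunciation frequencies
        self.prior = defaultdict(Counter)

    def train(self, examples):
        """examples: iterable of (tokens, index_of_homograph, pronunciation)."""
        for tokens, idx, pron in examples:
            word = tokens[idx]
            lo, hi = max(0, idx - self.window), idx + self.window + 1
            context = tokens[lo:idx] + tokens[idx + 1:hi]
            self.context_counts[word][pron].update(context)
            self.prior[word][pron] += 1

    def disambiguate(self, tokens, idx):
        """Return the best-scoring pronunciation, or None for unseen words."""
        word = tokens[idx]
        if word not in self.prior:
            return None
        lo, hi = max(0, idx - self.window), idx + self.window + 1
        context = tokens[lo:idx] + tokens[idx + 1:hi]
        best, best_score = None, float("-inf")
        for pron, counts in self.context_counts[word].items():
            # additive score: context-word overlap plus a small frequency prior
            score = sum(counts[w] for w in context) + 0.1 * self.prior[word][pron]
            if score > best_score:
                best, best_score = pron, score
        return best


# Usage (illustrative English homograph "read"; phoneme strings are made up):
g2p = HomographDisambiguator()
g2p.train([
    (["he", "will", "read", "the", "book"], 2, "R IY D"),
    (["she", "read", "the", "book", "yesterday"], 1, "R EH D"),
])
print(g2p.disambiguate(["he", "will", "read", "it"], 2))
# -> "R IY D" (context overlaps with the first training sentence)
```

Because inference is a handful of dictionary lookups and counter sums, such a layer adds negligible latency on top of a rule-based G2P pass, which is the property the abstract argues matters for screen readers.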
Anthology ID:
2025.findings-emnlp.1218
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
22382–22408
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1218/
DOI:
10.18653/v1/2025.findings-emnlp.1218
Cite (ACL):
Mahta Fetrat Qharabagh, Zahra Dehghanian, and Hamid R. Rabiee. 2025. Fast, Not Fancy: Rethinking G2P with Rich Data and Statistical Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 22382–22408, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Fast, Not Fancy: Rethinking G2P with Rich Data and Statistical Models (Qharabagh et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1218.pdf
Checklist:
2025.findings-emnlp.1218.checklist.pdf