Mining Naturally Romanized Seed Corpora without Romanizations

Adrian Benton, Alexander Gutkin, Christo Kirov, Brian Roark


Abstract
While the Latin script is used informally by speakers of many languages with different native scripts, high-quality Latin-script corpora for such languages that reflect actual natural romanizations are scarce and often difficult to collect. In this work, we propose a method for mining romanized corpora in languages for which we do not have any pre-existing samples of naturally romanized text, focusing on Tigrinya as a test case. First, we examine the efficacy of learning romanizations for a language based on observed romanizations in other languages that use the same native script. We then extrinsically assess such methods by using a romanization model trained on Amharic data to bootstrap coverage of romanized Tigrinya in a language identification system. Manual evaluation by two L1 speakers and one L2 speaker of Tigrinya suggests that our method extracts romanized Tigrinya text with acceptably high precision.
Anthology ID:
2026.lrec-main.234
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
2996–3012
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.234/
Cite (ACL):
Adrian Benton, Alexander Gutkin, Christo Kirov, and Brian Roark. 2026. Mining Naturally Romanized Seed Corpora without Romanizations. International Conference on Language Resources and Evaluation, main:2996–3012.
Cite (Informal):
Mining Naturally Romanized Seed Corpora without Romanizations (Benton et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.234.pdf