Distinguishing Romanized Hindi from Romanized Urdu

Elizabeth Nielsen, Christo Kirov, Brian Roark


Abstract
We examine the task of distinguishing between Hindi and Urdu when those languages are romanized, i.e., written in the Latin script. Both languages are widely informally romanized, and to the extent that they are identified in the Latin script by language identification systems, they are typically conflated. In the absence of large labeled collections of such text, we consider methods for generating training data. Beginning with a small set of seed words, each of which is strongly indicative of one of the languages versus the other, we prompt a pretrained large language model (LLM) to generate romanized text. Treating text generated from an Urdu prompt as one class and text generated from a Hindi prompt as the other class, we build a binary language identification (LangID) classifier. We demonstrate that the resulting classifier distinguishes manually romanized Urdu Wikipedia text from manually romanized Hindi Wikipedia text far better than chance. We use this classifier to estimate the prevalence of Urdu in a large collection of text labeled as romanized Hindi that has been used to train large language models. These techniques can be applied to bootstrap classifiers in other cases where a dataset is known to contain multiple distinct but related classes, such as different dialects of the same language, but for which labels cannot easily be obtained.
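As a rough illustration of the bootstrapping idea in the abstract, the sketch below trains a tiny binary classifier on two sets of romanized sentences, one standing in for text generated from Hindi prompts and one for text generated from Urdu prompts. The abstract does not specify the classifier architecture, the features, or the seed words, so everything here is an assumption: the character-trigram naive Bayes model, the class name NgramNaiveBayes, the labels "hi" and "ur", and the six toy sentences (hand-written placeholders, not LLM output and not data from the paper).

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams with add-one smoothing.

    Hypothetical stand-in for the paper's LangID classifier, whose actual
    design is not described in the abstract.
    """

    def __init__(self, n=3):
        self.n = n
        self.counts = {}             # label -> Counter of n-gram frequencies
        self.doc_counts = Counter()  # label -> number of training texts
        self.vocab = set()           # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def log_scores(self, text):
        """Return the unnormalized log posterior of each label for `text`."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, counter in self.counts.items():
            total = sum(counter.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in char_ngrams(text, self.n):
                score += math.log((counter[gram] + 1) / (total + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy placeholders for "text generated from a Hindi prompt" and
# "text generated from an Urdu prompt". These are illustrative only.
hindi_like = ["aapka bahut dhanyavaad", "mera ek prashn hai", "iska uttar kya hai"]
urdu_like = ["aapka bahut shukriya", "mera ek sawal hai", "iska jawab kya hai"]

clf = NgramNaiveBayes(n=3).fit(
    hindi_like + urdu_like,
    ["hi"] * len(hindi_like) + ["ur"] * len(urdu_like),
)
```

Calling clf.predict on a new romanized string returns whichever label has the higher score. In a realistic setting the two training lists would be replaced by much larger collections of generated text, and the classifier would be evaluated on held-out, manually romanized text as the abstract describes.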
Anthology ID:
2023.cawl-1.5
Volume:
Proceedings of the Workshop on Computation and Written Language (CAWL 2023)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Kyle Gorman, Richard Sproat, Brian Roark
Venue:
CAWL
Publisher:
Association for Computational Linguistics
Pages:
33–42
URL:
https://aclanthology.org/2023.cawl-1.5
DOI:
10.18653/v1/2023.cawl-1.5
Cite (ACL):
Elizabeth Nielsen, Christo Kirov, and Brian Roark. 2023. Distinguishing Romanized Hindi from Romanized Urdu. In Proceedings of the Workshop on Computation and Written Language (CAWL 2023), pages 33–42, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Distinguishing Romanized Hindi from Romanized Urdu (Nielsen et al., CAWL 2023)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2023.cawl-1.5.pdf