Can Large Language Models Translate Spoken-Only Languages through International Phonetic Transcription?

Jiale Chen, Xuelian Dong, Qihao Yang, Wenxiu Xie, Tianyong Hao


Abstract
Spoken-only languages are languages without a writing system. They remain excluded from modern Natural Language Processing (NLP) advancements like Large Language Models (LLMs) due to their lack of textual data. Existing NLP research focuses primarily on high-resource or written low-resource languages, leaving spoken-only languages critically underexplored. As a popular NLP paradigm, LLMs have demonstrated strong few-shot and cross-lingual generalization abilities, making them a promising solution for understanding and translating spoken-only languages. In this paper, we investigate how LLMs can translate spoken-only languages into high-resource languages by leveraging international phonetic transcription as an intermediate representation. We propose UNILANG, a unified language understanding framework that learns to translate spoken-only languages via in-context learning. Through automatic dictionary construction and knowledge retrieval, UNILANG equips LLMs with more fine-grained knowledge for improving word-level semantic alignment. To support this study, we introduce the SOLAN dataset, which consists of Bai (a spoken-only language) and its corresponding translations in a high-resource language. A series of experiments demonstrates the effectiveness of UNILANG in translating spoken-only languages, potentially contributing to the preservation of linguistic and cultural diversity. Our dataset and code will be publicly released.
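The abstract describes the pipeline only at a high level, so the following is a minimal Python sketch of one way dictionary construction, knowledge retrieval, and in-context prompting could fit together. It is not the authors' implementation. The function names (`build_dictionary`, `retrieve_entries`, `build_prompt`), the co-occurrence heuristic, the prompt wording, and the placeholder tokens (`tok_a`, `tok_b`, ...) standing in for phonetic transcriptions are all assumptions made for illustration. English is used as the target language only for readability.

```python
from collections import Counter, defaultdict


def build_dictionary(pairs, top_k=2):
    """Naive co-occurrence dictionary (an illustrative assumption, not UNILANG).

    For each source token, keep the top_k target words that most often
    appear in the translations of sentences containing that token.
    """
    cooc = defaultdict(Counter)
    for src, tgt in pairs:
        # dict.fromkeys removes duplicates while keeping a stable order.
        tgt_words = list(dict.fromkeys(tgt.split()))
        for tok in dict.fromkeys(src.split()):
            cooc[tok].update(tgt_words)
    return {tok: [w for w, _ in c.most_common(top_k)] for tok, c in cooc.items()}


def retrieve_entries(sentence, dictionary):
    """Look up the dictionary entries for the tokens of one input sentence."""
    return {
        tok: dictionary[tok]
        for tok in dict.fromkeys(sentence.split())
        if tok in dictionary
    }


def build_prompt(sentence, dictionary, examples):
    """Assemble a few-shot prompt with retrieved word-level hints."""
    lines = ["Translate the phonetic transcription into English.", ""]
    for src, tgt in examples:
        lines += [f"Transcription: {src}", f"Translation: {tgt}", ""]
    hints = retrieve_entries(sentence, dictionary)
    if hints:
        lines.append("Dictionary hints:")
        lines += [f"- {tok}: {', '.join(words)}" for tok, words in hints.items()]
        lines.append("")
    lines += [f"Transcription: {sentence}", "Translation:"]
    return "\n".join(lines)


# Hypothetical parallel data: placeholder tokens, not real Bai transcriptions.
pairs = [
    ("tok_a tok_b", "drink water"),
    ("tok_a tok_c", "drink tea"),
    ("tok_d tok_b", "cold water"),
]
dictionary = build_dictionary(pairs)
prompt = build_prompt("tok_d tok_c", dictionary, examples=pairs[:2])
print(prompt)
```

The resulting prompt string would then be sent to an LLM of choice. A real system would likely need a stronger alignment method than raw co-occurrence counts, which cannot separate words that always appear together.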
Anthology ID:
2025.emnlp-main.1195
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
23431–23446
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1195/
Cite (ACL):
Jiale Chen, Xuelian Dong, Qihao Yang, Wenxiu Xie, and Tianyong Hao. 2025. Can Large Language Models Translate Spoken-Only Languages through International Phonetic Transcription? In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 23431–23446, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Can Large Language Models Translate Spoken-Only Languages through International Phonetic Transcription? (Chen et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1195.pdf
Checklist:
2025.emnlp-main.1195.checklist.pdf