Language Confusion and Multilingual Performance: A Case Study of Thai-Adapted Large Language Models

Pakhapoom Sarapat, Trapoom Ukarapol, Tatsunori Hashimoto


Abstract
This paper presents a comprehensive study of the multilingual adaptability of large language models (LLMs), with a focus on the interplay between training strategies and prompt design. Using Thai as a case study, we examine: (RQ1) the extent to which pre-trained models (Base) can adapt to another language through additional fine-tuning; (RQ2) how continual pre-training (CPT) compares to multilingual pre-training (MLLM) in terms of performance on downstream tasks; and (RQ3) how language variation within different components of a structured prompt (task instruction, context input, and output instruction) influences task performance in cross-lingual settings. Our findings reveal that CPT is a promising strategy for enhancing model performance in languages other than English, such as Thai, in monolingual settings, particularly for models that initially lack strong capabilities in the target language. Its effectiveness, however, is highly task-dependent and varies with the base model's initial proficiency. In cross-lingual scenarios, MLLMs exhibit superior robustness compared to Base and CPT models, which are more susceptible to context-output language mismatches. Given the high cost of training multilingual models from scratch, MLLMs remain a critical component for downstream tasks in multilingual settings due to their strong cross-lingual performance.
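
To make the RQ3 setup concrete, the sketch below shows one way a structured prompt could be assembled so that the language of each component is chosen independently. The component names, helper class, and template are illustrative assumptions for exposition; they are not the paper's exact prompt format.

# Illustrative sketch only: StructuredPrompt and its fields are assumptions,
# not the prompt format reported in the paper.
from dataclasses import dataclass

@dataclass
class StructuredPrompt:
    task_instruction: str    # what the model should do (may be English or Thai)
    context_input: str       # the passage or question the task operates on
    output_instruction: str  # constraints on the response, including its language

    def render(self) -> str:
        # Concatenate the three components; each can be written in a
        # different language, which is the variation studied in RQ3.
        return f"{self.task_instruction}\n\n{self.context_input}\n\n{self.output_instruction}"

# Example: English task instruction, Thai context, Thai output instruction.
prompt = StructuredPrompt(
    task_instruction="Summarize the following passage in one sentence.",
    context_input="<Thai passage here>",
    output_instruction="Answer in Thai.",
)
print(prompt.render())
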
Anthology ID:
2025.chomps-main.5
Volume:
Proceedings of the 1st Workshop on Confabulation, Hallucinations and Overgeneration in Multilingual and Practical Settings (CHOMPS 2025)
Month:
December
Year:
2025
Address:
Mumbai, India
Editors:
Aman Sinha, Raúl Vázquez, Timothee Mickus, Rohit Agarwal, Ioana Buhnila, Patrícia Schmidtová, Federica Gamba, Dilip K. Prasad, Jörg Tiedemann
Venues:
CHOMPS | WS
Publisher:
Association for Computational Linguistics
Pages:
49–59
URL:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.chomps-main.5/
Cite (ACL):
Pakhapoom Sarapat, Trapoom Ukarapol, and Tatsunori Hashimoto. 2025. Language Confusion and Multilingual Performance: A Case Study of Thai-Adapted Large Language Models. In Proceedings of the 1st Workshop on Confabulation, Hallucinations and Overgeneration in Multilingual and Practical Settings (CHOMPS 2025), pages 49–59, Mumbai, India. Association for Computational Linguistics.
Cite (Informal):
Language Confusion and Multilingual Performance: A Case Study of Thai-Adapted Large Language Models (Sarapat et al., CHOMPS 2025)
PDF:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.chomps-main.5.pdf