Pume Tuchinda
2025
WangchanThaiInstruct: An Instruction-Following Dataset for Culture-Aware, Multitask, and Multi-domain Evaluation in Thai
Peerat Limkonchotiwat | Pume Tuchinda | Lalita Lowphansirikul | Surapon Nonesung | Panuthep Tasawong | Alham Fikri Aji | Can Udomcharoenchaikit | Sarana Nutanong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models excel at instruction-following in English, but their performance in low-resource languages like Thai remains underexplored. Existing benchmarks often rely on translations, missing cultural and domain-specific nuances needed for real-world use. We present WangchanThaiInstruct, a human-authored Thai dataset for evaluation and instruction tuning, covering four professional domains and seven task types. Created through a multi-stage quality control process with annotators, domain experts, and AI researchers, WangchanThaiInstruct supports two studies: (1) a zero-shot evaluation showing performance gaps on culturally and professionally specific tasks, and (2) an instruction tuning study with ablations isolating the effect of native supervision. Models fine-tuned on WangchanThaiInstruct outperform those using translated data in both in-domain and out-of-domain benchmarks. These findings underscore the need for culturally and professionally grounded instruction data to improve LLM alignment in low-resource, linguistically diverse settings.
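To make the abstract's setup concrete, here is a minimal Python sketch of what an instruction-tuning record and the zero-shot prompt built from it might look like. The field names (instruction, input, output, domain, task_type) and the sample content are illustrative assumptions, not the dataset's actual schema.

```python
import json

# Hypothetical record in the spirit of an instruction-tuning dataset entry;
# the schema and content below are illustrative assumptions, not the actual
# WangchanThaiInstruct format.
record = {
    "instruction": "สรุปข้อความต่อไปนี้เป็นภาษาไทย",  # "Summarize the following text in Thai"
    "input": "(source document text)",
    "output": "(human-authored reference summary)",
    "domain": "legal",             # e.g. one of the four professional domains
    "task_type": "summarization",  # e.g. one of the seven task types
}

def build_zero_shot_prompt(rec: dict) -> str:
    """Join instruction and input into a single zero-shot prompt string."""
    return f"{rec['instruction']}\n\n{rec['input']}" if rec["input"] else rec["instruction"]

print(build_zero_shot_prompt(record))
print(json.dumps(record, ensure_ascii=False, indent=2))
```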
Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments
Patomporn Payoungkhamdee | Pume Tuchinda | Jinheon Baek | Samuel Cahyawijaya | Can Udomcharoenchaikit | Potsawee Manakul | Peerat Limkonchotiwat | Ekapol Chuangsuwanich | Sarana Nutanong
Findings of the Association for Computational Linguistics: ACL 2025
Multi-step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain-of-Thought (CoT) prompting improves reasoning, it struggles with non-English languages due to the entanglement of reasoning and execution. Program-of-Thought (PoT) prompting separates reasoning from execution, offering a promising alternative, but it shifts the challenge to generating programs from non-English questions. We propose a framework to evaluate PoT by separating multilingual reasoning from code execution, examining (i) the impact of fine-tuning on question-reasoning alignment and (ii) how reasoning quality affects answer correctness. Our findings demonstrate that PoT fine-tuning substantially enhances multilingual reasoning, outperforming CoT fine-tuned models. We further observe a strong correlation between reasoning quality (measured through code quality) and answer accuracy, highlighting its potential as a heuristic for improving test-time performance.
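To make the CoT/PoT distinction concrete, the following minimal Python sketch illustrates the general Program-of-Thought idea: the model's role ends at writing a program, and a separate interpreter executes it to produce the answer. The sample question, the generated program, and the `answer` variable convention are illustrative assumptions, not the paper's evaluation framework.

```python
# Minimal Program-of-Thought (PoT) sketch: reasoning (writing the program)
# is decoupled from execution (running it in an interpreter).

QUESTION = "A shop sells pens at 12 baht each. How much do 7 pens cost?"

# Stand-in for the model's generated program; a real system would obtain this
# string from an LLM prompted with QUESTION plus a PoT instruction such as
# "Write a Python program whose variable `answer` holds the result."
generated_program = """
price_per_pen = 12
num_pens = 7
answer = price_per_pen * num_pens
"""

def execute_program(program: str) -> object:
    """Run the generated program in a fresh namespace and return `answer`."""
    namespace: dict = {}
    exec(program, namespace)  # execution step, independent of the reasoning step
    return namespace.get("answer")

print(execute_program(generated_program))  # -> 84
```

Under this separation, the quality of the generated program can be inspected directly, which is the sense in which the abstract ties "code quality" to answer accuracy.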