Hoang D. Nguyen
2026
Disentangling Continued Pre-Training: Attention-Driven Routing and Semantic Hub Preservation in Language Adaptation
Khanh-Tung Tran | Vinh-Khanh Tran | Barry O’Sullivan | Hoang D. Nguyen
Findings of the Association for Computational Linguistics: ACL 2026
Continued Pre-Training (CPT) enables Large Language Models (LLMs) to acquire second-language capabilities, yet the underlying mechanisms remain poorly understood. In this work, we investigate how CPT adapts model representations across diverse language families and scripts, model sizes, and architectures. We find that second-language abilities emerge through a selective adaptation mechanism: task-solving capabilities are preserved in a “semantic hub”, while interface layers retarget to shifted token distributions. Layer-swapping experiments demonstrate that semantic understanding can be surgically transferred between base and CPT models with minimal loss (e.g., swapping 50% of model parameters reduces performance by only 0.3%). Furthermore, we establish that attention components route language adaptation: they undergo larger parameter changes than feedforward networks, correlate more strongly with language-specific neurons, and substantially degrade performance when surgically replaced. Overall, our work provides a mechanistic understanding of CPT, guiding future work on efficient strategies for language adaptation.
LaCoMSA: Language-Consistency Multilingual Self-Alignment with Latent Representation Rewarding
Khanh-Tung Tran | Barry O'Sullivan | Hoang D. Nguyen
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) have achieved impressive performance yet remain inconsistent across languages, often defaulting to high-resource outputs such as English. Existing multilingual alignment methods mitigate these issues through preference optimization but rely on external supervision, such as translation systems or English-biased signals. We propose Multilingual Self-Alignment (MSA), a targeted preference optimization framework that leverages an LLM’s own latent representations as intrinsic supervision signals, rewarding lower-resource language outputs based on their alignment with high-resource (English) counterparts in the "semantic hub". We further introduce Language-Consistency MSA (LaCoMSA), which augments MSA with a final-layer language-consistency factor to prevent off-target generation. Integrated with Direct Preference Optimization, LaCoMSA improves the multilingual win rates of a Llama 3 8B-based model by up to 6.8% absolute (55.0% relative) on X-AlpacaEval and achieves consistent gains across benchmarks and models. Our findings demonstrate that LaCoMSA can serve as an effective and scalable mechanism, opening a new avenue toward multilingual self-alignment.
2025
Disentangling Language Understanding and Reasoning Structures in Cross-lingual Chain-of-Thought Prompting
Khanh-Tung Tran | Nguyet-Hang Vu | Barry O’Sullivan | Hoang D. Nguyen
Findings of the Association for Computational Linguistics: EMNLP 2025
Cross-lingual chain-of-thought prompting techniques have proven effective for investigating diverse reasoning paths in Large Language Models (LLMs), especially for low-resource languages. Despite these empirical gains, the mechanisms underlying cross-lingual improvements remain unclear. This study therefore addresses whether the benefits of cross-lingual prompting arise from language-specific reasoning structures intrinsic to each language, or are simply a consequence of improved comprehension through cross-linguistic exposure. We employ neuron intervention and perturbation techniques to analyze and deactivate language-specific reasoning neurons during cross-lingual prompting, leading to performance disparities across languages of up to 27.4%. Our findings show that these neurons are essential for reasoning in their respective languages but have minimal effect on reasoning in other languages, providing evidence for the existence of language-specific local reasoning structures and guiding the development of more interpretable and effective multilingual AI systems.
2020
ReINTEL: A Multimodal Data Challenge for Responsible Information Identification on Social Network Sites
Duc-Trong Le | Xuan-Son Vu | Nhu-Dung To | Huu-Quang Nguyen | Thuy-Trinh Nguyen | Thi Khanh-Linh Le | Anh-Tuan Nguyen | Minh-Duc Hoang | Nghia Le | Huyen Nguyen | Hoang D. Nguyen
Proceedings of the 7th International Workshop on Vietnamese Language and Speech Processing