Hoang D. Nguyen
2026
VIVID: A Culturally Grounded Benchmark Exposing the Figurative Language Gap in Vietnamese NLP
Tu Tran Do | Nhat Ngoc Nguyen | Tung Khanh Tran | Hoang D. Nguyen | Tu Minh Phuong | Long Hoang Dang
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present VIVID (Vietnamese Idioms for Validation and Interpretation Depth), the first systematic benchmark for evaluating culturally grounded figurative language understanding in Vietnamese. VIVID comprises 1,636 idioms and proverbs annotated with five complexity traits (literal expressions, pragmatic nuances, Sino-Vietnamese terms, uncommon vocabulary, folk knowledge) and seven semantic themes. We establish an evaluation framework combining generative and discriminative tasks, proposing an LLM-as-a-Judge approach with aspect-based prompting validated against human judgment (Cohen’s κ = 0.792). Evaluating eight state-of-the-art models reveals critical gaps: Vietnamese-specialized models drastically underperform multilingual systems (VinaLLaMA-7B: 0.13 vs. GPT-4o: 2.46), and even top models achieve less than 50% of maximum scores. Notably, few-shot prompting does not universally improve performance, with GPT-4o exhibiting degradation due to stylistic overfitting. Our analysis exposes systematic failures including literal over-interpretation, lexical gaps, and pragmatic flattening, demonstrating that current models lack cultural competence for nuanced figurative interpretation. VIVID provides an essential tool for advancing figurative language understanding in culturally rich contexts.
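The abstract's agreement figure (Cohen's κ = 0.792) is a standard chance-corrected statistic for validating an LLM judge against human labels. As a minimal reference sketch (the toy labels below are invented for illustration, not VIVID data):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the two raters labelled independently.
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum(ca[l] * cb[l] for l in set(rater_a) | set(rater_b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example: an LLM judge versus human labels on six items.
human = ["apt", "apt", "off", "apt", "off", "off"]
judge = ["apt", "apt", "off", "off", "off", "off"]
print(round(cohens_kappa(human, judge), 3))  # 0.667
```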
Questionnaire Meets LLM: A Benchmark and Empirical Study of Structural Skills for Understanding Questions and Responses
Duc-Hai Nguyen | Vijayakumar Nanjappan | Barry O'Sullivan | Hoang D. Nguyen
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Millions of people take surveys every day, from market polls to medical questionnaires and customer feedback forms. These datasets capture valuable insights, but the ability of large language models (LLMs) to process questionnaire data, where lists of questions are crossed with hundreds of respondent rows, remains underexplored. Current survey analysis tools (e.g., Qualtrics, SPSS, REDCap) are designed for human operators, leaving practitioners without evidence-based guidance on how best to represent questionnaires for LLM consumption. We address this gap by introducing QASU (Questionnaire Analysis and Structural Understanding), a benchmark that probes six structural skills, including answer lookup, respondent count, and multi-hop inference, across six serialization formats and multiple prompt strategies. Experiments on five LLMs (GPT-5-mini, Gemini-2.5-Flash, Qwen3-32B, Llama3-70B, Amazon Nova Lite) show that format choice significantly impacts performance, yielding improvements of up to 9 percentage points over baseline formats, and reveal substantial gaps (10 to 30 percentage points) between proprietary and open-weight models. Self-augmented prompting yields model-dependent benefits, proving effective for proprietary models but unreliable for open-weight alternatives. By systematically isolating format and prompting effects, our open-source benchmark offers practical guidance for advancing both research and real-world practice in LLM-based questionnaire analysis.
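To illustrate the format-sensitivity question the benchmark isolates, here is a minimal sketch of serializing the same toy questionnaire two ways before prompting. The data, column names, and helpers are hypothetical, not QASU's actual formats:

```python
import json

# Hypothetical toy questionnaire: questions crossed with respondent rows.
rows = [{"id": "R1", "Q1_age": "25-34", "Q2_satisfied": "yes"},
        {"id": "R2", "Q1_age": "35-44", "Q2_satisfied": "no"}]

def to_json_lines(rows):
    # One JSON object per respondent (JSONL-style serialization).
    return "\n".join(json.dumps(r) for r in rows)

def to_markdown_table(rows):
    # Header from the column names, one table row per respondent.
    cols = list(rows[0])
    lines = ["| " + " | ".join(cols) + " |",
             "|" + "---|" * len(cols)]
    lines += ["| " + " | ".join(r[c] for c in cols) + " |" for r in rows]
    return "\n".join(lines)

task = "How many respondents answered 'yes' to Q2_satisfied?"
prompt = to_markdown_table(rows) + "\n\n" + task  # swap serializers to compare
```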
Irish-BLiMP: A Linguistic Benchmark for Evaluating Human and Language Model Performance in a Low-Resource Setting
Josh Mcgiff | Tung Khanh Tran | William Mulcahy | Dáibhidh Ó Luinín | Jake Dalzell | Róisín Ní Bhroin | Adam Burke | Barry O'Sullivan | Hoang D. Nguyen | Nikola S. Nikolov
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present Irish-BLiMP (Irish Benchmark of Linguistic Minimal Pairs), the first dataset and framework designed for fine-grained evaluation of linguistic competence in the Irish language, an endangered language. Drawing on a variety of linguistic literature and grammar reference works, a team of fluent Irish speakers manually constructed and reviewed 1,020 minimal pairs across a taxonomy of 11 linguistic features. We evaluate both existing Large Language Models (LLMs) and fluent human participants on their syntactic knowledge of Irish. Our findings show that humans outperform all models across all linguistic features, achieving 16.6% higher accuracy on average. Moreover, a substantial performance gap of 18.1% persists between open- and closed-source LLMs, with even the strongest model (gpt-5) reaching only 73.5% accuracy compared to 90.1% for humans. Interestingly, human participants and models struggle with different aspects of Irish grammar, highlighting a difference in the representations learned by the models. Overall, Irish-BLiMP provides the first systematic framework for evaluating the grammatical competence of LLMs in Irish and offers a valuable benchmark for advancing research on linguistic understanding in low-resource languages.
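For base language models, minimal-pair benchmarks in the BLiMP family are typically scored by comparing sentence log-probabilities. A minimal sketch of that classic protocol with Hugging Face transformers; the paper's own setup for instruction-tuned and closed models may differ (e.g., prompted forced choice), and the checkpoint name below is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder; substitute a model with Irish coverage
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

@torch.no_grad()
def sentence_logprob(text):
    # Sum of per-token log-probabilities under the causal LM.
    ids = tok(text, return_tensors="pt").input_ids
    logps = torch.log_softmax(model(ids).logits[:, :-1], dim=-1)
    return logps.gather(2, ids[:, 1:].unsqueeze(-1)).sum().item()

def correct_on_pair(grammatical, ungrammatical):
    # A model is credited when the grammatical sentence scores higher.
    return sentence_logprob(grammatical) > sentence_logprob(ungrammatical)
```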
LaCoMSA: Language-Consistency Multilingual Self-Alignment with Latent Representation Rewarding
Khanh-Tung Tran | Barry O'Sullivan | Hoang D. Nguyen
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) have achieved impressive performance yet remain inconsistent across languages, often defaulting to high-resource outputs such as English. Existing multilingual alignment methods mitigate these issues through preference optimization but rely on external supervision, such as translation systems or English-biased signals. We propose Multilingual Self-Alignment (MSA), a targeted preference optimization framework that leverages an LLM’s own latent representations as intrinsic supervision signals, rewarding lower-resource language outputs based on their alignment with high-resource (English) counterparts in the "semantic hub". We further introduce Language-Consistency MSA (LaCoMSA), which augments MSA with a final-layer language-consistency factor to prevent off-target generation. Integrated with Direct Preference Optimization, LaCoMSA improves the multilingual win rates of a Llama 3 8B-based model by up to 6.8% absolute (55.0% relative) on X-AlpacaEval and achieves consistent gains across benchmarks and models. Our findings demonstrate that LaCoMSA can serve as an effective and scalable mechanism, opening a new avenue toward multilingual self-alignment.
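A minimal sketch of the intrinsic-reward idea as described: score a lower-resource-language candidate by how closely its hidden representation aligns with the model's English counterpart, then use the scores to rank DPO preference pairs. The mean pooling and layer index here are assumptions for illustration, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def latent_reward(model, tok, candidate, english_ref, layer=-8):
    """Cosine alignment of mean-pooled hidden states at a mid/late layer."""
    def pooled(text):
        ids = tok(text, return_tensors="pt").input_ids
        hidden = model(ids, output_hidden_states=True).hidden_states[layer]
        return hidden.mean(dim=1).squeeze(0)  # crude "semantic hub" proxy
    return F.cosine_similarity(pooled(candidate), pooled(english_ref), dim=0).item()

# Rank two candidate answers in the target language against the model's own
# English answer; the higher-reward candidate becomes the "chosen" response
# and the other the "rejected" response in a DPO preference pair.
```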
2025
Disentangling Language Understanding and Reasoning Structures in Cross-lingual Chain-of-Thought Prompting
Khanh-Tung Tran | Nguyet-Hang Vu | Barry O’Sullivan | Hoang D. Nguyen
Findings of the Association for Computational Linguistics: EMNLP 2025
Cross-lingual chain-of-thought prompting techniques have proven effective for investigating diverse reasoning paths in Large Language Models (LLMs), especially for low-resource languages. Despite these empirical gains, the mechanisms underlying cross-lingual improvements remain poorly understood. This study therefore addresses whether the benefits of cross-lingual prompting arise from language-specific reasoning structures intrinsic to each language, or are simply a consequence of improved comprehension through cross-linguistic exposure. We employ neuron intervention and perturbation techniques to analyze and deactivate language-specific reasoning neurons during cross-lingual prompting, which leads to performance disparities of up to 27.4% across languages. Our findings show that these neurons are essential for reasoning in their respective languages but have minimal effect on reasoning in other languages, providing evidence for the existence of language-specific local reasoning structures and guiding the development of more interpretable and effective multilingual AI systems.
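Neuron-deactivation experiments of this kind can be approximated with a forward hook that zeroes selected activations. A minimal PyTorch sketch, assuming a Llama-style module layout; the module path and neuron indices are placeholders, not the paper's published settings:

```python
import torch

def ablate_neurons(model, layer_idx, neuron_ids):
    """Zero selected MLP activations in one layer via a forward hook."""
    def hook(module, inputs, output):
        output[..., neuron_ids] = 0.0  # deactivate the chosen neurons
        return output
    # Placeholder module path; adjust to the model actually under study.
    act = model.model.layers[layer_idx].mlp.act_fn
    return act.register_forward_hook(hook)

# handle = ablate_neurons(model, layer_idx=12, neuron_ids=[7, 42, 1013])
# ... rerun the cross-lingual chain-of-thought evaluation ...
# handle.remove()  # restore the original behaviour
```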
2020
ReINTEL: A Multimodal Data Challenge for Responsible Information Identification on Social Network Sites
Duc-Trong Le | Xuan-Son Vu | Nhu-Dung To | Huu-Quang Nguyen | Thuy-Trinh Nguyen | Thi Khanh-Linh Le | Anh-Tuan Nguyen | Minh-Duc Hoang | Nghia Le | Huyen Nguyen | Hoang D. Nguyen
Proceedings of the 7th International Workshop on Vietnamese Language and Speech Processing
Co-authors
- Barry O’Sullivan 4
- Tung Khanh Tran 2
- Khanh-Tung Tran 2
- Adam Burke 1
- Jake Dalzell 1
- Long Hoang Dang 1
- Tu Tran Do 1
- Minh-Duc Hoang 1
- Duc-Trong Le 1
- Thi Khanh-Linh Le 1
- Nghia Le 1
- Josh Mcgiff 1
- William Mulcahy 1
- Vijayakumar Nanjappan 1
- Nhat Ngoc Nguyen 1
- Duc-Hai Nguyen 1
- Huu-Quang Nguyen 1
- Thuy-Trinh Nguyen 1
- Anh-Tuan Nguyen 1
- Huyen Nguyen 1
- Nikola S. Nikolov 1
- Róisín Ní Bhroin 1
- Tu Minh Phuong 1
- Nhu-Dung To 1
- Nguyet-Hang Vu 1
- Xuan-Son Vu 1
- Dáibhidh Ó Luinín 1