Which Works Best for Vietnamese? A Practical Study of Information Retrieval Methods across Domains

Long S. T. Nguyen, Tho Quan


Abstract
Large Language Models (LLMs) have achieved remarkable progress, yet their reliance on parametric knowledge often leads to hallucinations. Retrieval-Augmented Generation (RAG) mitigates this issue by grounding outputs in external documents, where the quality of retrieval is critical. While retrieval methods have been widely benchmarked in English, it remains unclear which approaches are most effective for Vietnamese, a language characterized by informal queries, noisy documents, and limited resources. Prior studies are restricted to clean datasets or narrow domains, leaving fragmented insights. To the best of our knowledge, we present the first comprehensive benchmark of retrieval methods for Vietnamese across multiple real-world domains. We systematically compare lexical, dense, and hybrid methods on datasets spanning education, legal, healthcare, customer support, lifestyle, and Wikipedia, and introduce two new datasets capturing authentic educational counseling and customer service interactions. Beyond reporting benchmark numbers, we distill a set of empirical insights that clarify trade-offs, highlight domain-specific challenges, and provide practical guidance for building robust Vietnamese QA systems. Together, these contributions offer the first large-scale, practice-oriented perspective on Vietnamese retrieval and inform both academic research and real-world deployment in low-resource languages. All datasets and evaluation scripts are available at https://github.com/longstnguyen/ViRE.
Anthology ID:
2026.findings-eacl.110
Volume:
Findings of the Association for Computational Linguistics: EACL 2026
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Marquez
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2098–2119
URL:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.110/
Cite (ACL):
Long S. T. Nguyen and Tho Quan. 2026. Which Works Best for Vietnamese? A Practical Study of Information Retrieval Methods across Domains. In Findings of the Association for Computational Linguistics: EACL 2026, pages 2098–2119, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Which Works Best for Vietnamese? A Practical Study of Information Retrieval Methods across Domains (Nguyen & Quan, Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.110.pdf
Checklist:
 2026.findings-eacl.110.checklist.pdf