Which Works Best for Vietnamese? A Practical Study of Information Retrieval Methods across Domains

Long S. T. Nguyen, Tho T. Quan


Abstract
Large Language Models (LLMs) depend on retrieval for factual grounding in Retrieval-Augmented Generation (RAG), placing Information Retrieval (IR) at the core of modern Question Answering (QA) systems. While lexical, dense, and hybrid paradigms have been extensively benchmarked in English, their relative effectiveness for Vietnamese remains insufficiently characterized, especially under realistic multi-domain settings. Existing studies are typically confined to single domains or curated datasets, limiting cross-domain comparability and obscuring paradigm-level trade-offs. We introduce the first domain-normalized, multi-domain benchmark for Vietnamese IR under a unified and reproducible evaluation protocol, spanning six domains and ten datasets across education, legal, healthcare, customer support, lifestyle reviews, and open-domain knowledge. We evaluate lexical, neural-sparse, late-interaction, dense, and hybrid paradigms across diverse Vietnamese-specific and multilingual embedding backbones, and release two QA datasets, EduCoQA and CSConDa, constructed from authentic counseling and customer-service interactions. Beyond reporting benchmark performance, we derive systematic insights into lexical–semantic hybridization, specialization versus robustness trade-offs, and the limited predictive value of model scale for retrieval effectiveness. All datasets and evaluation scripts are publicly available at https://github.com/longstnguyen/ViRE.
Anthology ID:
2026.findings-eacl.110
Volume:
Findings of the Association for Computational Linguistics: EACL 2026
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Marquez
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2098–2119
URL:
https://preview.aclanthology.org/issues-pwc/2026.findings-eacl.110/
DOI:
10.18653/v1/2026.findings-eacl.110
Cite (ACL):
Long S. T. Nguyen and Tho T. Quan. 2026. Which Works Best for Vietnamese? A Practical Study of Information Retrieval Methods across Domains. In Findings of the Association for Computational Linguistics: EACL 2026, pages 2098–2119, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Which Works Best for Vietnamese? A Practical Study of Information Retrieval Methods across Domains (Nguyen & Quan, Findings 2026)
PDF:
https://preview.aclanthology.org/issues-pwc/2026.findings-eacl.110.pdf
Checklist:
2026.findings-eacl.110.checklist.pdf