@inproceedings{setiawan-etal-2026-context,
title = "Context Volume Drives Performance: Tackling Domain Shift in Extremely Low-Resource Translation via {RAG}",
author = "Setiawan, David Samuel and
Merx, Raphael and
Lau, Jey Han",
editor = "Ojha, Atul Kr. and
Liu, Chao-hong and
Vylomova, Ekaterina and
Pirinen, Flammie and
Washington, Jonathan and
Oco, Nathaniel and
Zhao, Xiaobing",
booktitle = "Proceedings for the Ninth Workshop on Technologies for Machine Translation of Low Resource Languages ({L}o{R}es{MT} 2026)",
month = mar,
year = "2026",
address = "Rabat, Morocco",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/manual-author-scripts/2026.loresmt-1.7/",
pages = "87--101",
ISBN = "979-8-89176-366-1",
abstract = "Neural Machine Translation (NMT) models for low-resource languages suffer significant performance degradation under domain shift. We quantify this challenge using **Dhao**, an indigenous language of Eastern Indonesia with no digital footprint beyond the New Testament (NT). When applied to the unseen Old Testament (OT), a standard NMT model fine-tuned on the NT drops from an in-domain score of 36.17 chrF++ to 27.11 chrF++. To recover this loss, we introduce a **hybrid framework** where a fine-tuned NMT model generates an initial draft, which is then refined by a Large Language Model (LLM) using Retrieval-Augmented Generation (RAG). The final system achieves 35.21 chrF++ ($+8.10$ recovery), effectively matching the original in-domain quality. Our analysis reveals that this performance is driven primarily by the **number of retrieved examples** rather than the choice of retrieval algorithm. Qualitative analysis confirms the LLM acts as a robust ``safety net,'' repairing severe failures in zero-shot domains."
}