Adrian Denzel Macayan


2026

Neural Machine Translation (NMT) performance degrades significantly in ultra-low-resource settings, particularly for endangered languages such as Tao (Yami) that lack extensive parallel corpora. This study investigates strategies for bootstrapping a Tao-Tagalog translation system using the NLLB-200 (600 million parameter) model under extremely limited supervision. We propose a multi-faceted approach combining domain-specific fine-tuning, synthetic data augmentation, and cross-lingual transfer learning. Specifically, we leverage the phylogenetic proximity of Ivatan, a related Batanic language, to pre-train the model, and use dictionary-based generation to construct synthetic conversational data. Our results demonstrate that transfer learning from Ivatan improves translation quality on in-domain religious texts, achieving a BLEU score of 34.85. Conversely, incorporating synthetic data enhances the model's ability to generalize to conversational contexts, mitigating the domain bias inherent in religious corpora. These findings highlight the effectiveness of exploiting linguistic typology and structured lexical resources to develop functional NMT systems for under-represented Austronesian languages.
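The dictionary-based generation of synthetic parallel data mentioned above could, in its simplest form, work by filling parallel sentence templates from a bilingual lexicon. The sketch below illustrates the idea only: the lexicon entries and templates are hypothetical placeholders, not actual Tao or Tagalog dictionary data, and the real pipeline in the study may differ.

```python
import itertools

# Hypothetical toy bilingual lexicon (source -> target). Real entries
# would come from a published Tao dictionary; these are placeholders.
lexicon = {
    "noun": [("srcN1", "tgtN1"), ("srcN2", "tgtN2")],
    "verb": [("srcV1", "tgtV1"), ("srcV2", "tgtV2")],
}

# Parallel sentence templates with part-of-speech slots. Both sides of
# one lexicon entry fill matching slots, keeping the pair parallel.
templates = [
    ("{verb_src} so {noun_src}", "{verb_tgt} ng {noun_tgt}"),
]

def generate_pairs(lexicon, templates):
    """Expand every template with every (noun, verb) combination,
    returning (source_sentence, target_sentence) pairs."""
    pairs = []
    for src_tpl, tgt_tpl in templates:
        for (n_src, n_tgt), (v_src, v_tgt) in itertools.product(
            lexicon["noun"], lexicon["verb"]
        ):
            pairs.append((
                src_tpl.format(verb_src=v_src, noun_src=n_src),
                tgt_tpl.format(verb_tgt=v_tgt, noun_tgt=n_tgt),
            ))
    return pairs

pairs = generate_pairs(lexicon, templates)
```

With two nouns, two verbs, and one template this yields four synthetic sentence pairs; in practice such output would be filtered for grammaticality before being mixed into the fine-tuning corpus.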