Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025)

Atul Kr. Ojha, Chao-hong Liu, Ekaterina Vylomova, Flammie Pirinen, Jonathan Washington, Nathaniel Oco, Xiaobing Zhao (Editors)


Anthology ID:
2025.loresmt-1
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico, U.S.A.
Venues:
LoResMT | WS
Publisher:
Association for Computational Linguistics
URL:
https://preview.aclanthology.org/landing_page/2025.loresmt-1/
ISBN:
979-8-89176-230-5
PDF:
https://preview.aclanthology.org/landing_page/2025.loresmt-1.pdf

pdf bib
Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025)
Atul Kr. Ojha | Chao-hong Liu | Ekaterina Vylomova | Flammie Pirinen | Jonathan Washington | Nathaniel Oco | Xiaobing Zhao

pdf bib
Comparative Evaluation of Machine Translation Models Using Human-Translated Social Media Posts as References: Human-Translated Datasets
Shareefa Ahmed Al Amer | Mark G. Lee | Phillip Smith

Machine translation (MT) of social media text presents unique challenges due to its informal nature, linguistic variation, and the rapid evolution of language trends. In this paper, we present a dataset of English social media posts human-translated into Arabic, Italian, and Spanish, and a dataset of Arabic posts human-translated into Modern Standard Arabic (MSA) and English. We also perform a comprehensive analysis of three publicly accessible MT models using the human translations as references. We investigate the impact of social media informality on translation quality by translating the MSA version of the text and comparing BLEU and METEOR scores with those of the direct translation of the original social media posts. Our findings reveal that MarianMT provides the translations closest to the human references for Italian and Spanish among the three models, with METEOR scores of 0.583 and 0.640, respectively, while Google Translate provides the closest translations for Arabic, with a METEOR score of 0.354. By comparing the translation of the original social media posts with that of the MSA version, we confirm that the informality of social media text significantly impacts translation quality, with the MSA version yielding an increase of 12 percentage points in METEOR score over the original posts. Additionally, we investigate inter-model alignment and the degree to which the outputs of these MT models align.
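As an illustration of the reference-based evaluation described above, here is a minimal sketch of scoring MT output against human translations with BLEU and METEOR using sacrebleu and NLTK; the data handling and metric configuration are assumptions, not the authors' exact setup (METEOR in NLTK additionally requires the WordNet data, via nltk.download('wordnet')).

```python
import sacrebleu
from nltk.translate.meteor_score import meteor_score

# Toy hypotheses/references; in practice, read one sentence per line.
hypotheses = ["this is the mt output", "another translated post"]
references = ["this is the human translation", "another translated post"]

# Corpus-level BLEU: sacrebleu expects a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])

# METEOR is sentence-level in NLTK, so average it over the corpus.
meteor = sum(
    meteor_score([ref.split()], hyp.split())
    for ref, hyp in zip(references, hypotheses)
) / len(hypotheses)

print(f"BLEU: {bleu.score:.2f}  METEOR: {meteor:.3f}")
```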

pdf bib
Enhanced Zero-Shot Machine Translation via Fixed Prefix Pair Bootstrapping
Van-Hien Tran | Masao Utiyama

Zero-shot in-context learning allows large language models (LLMs) to perform tasks using only provided instructions. However, pre-trained LLMs often face calibration issues in zero-shot scenarios, leading to challenges such as hallucinations and off-target translations that compromise output quality, particularly in machine translation (MT). This paper introduces a new method to improve zero-shot MT using fixed prefix pair bootstrapping. By initializing translations with an accurate bilingual prefix pair at the start of both source and target sentences, this approach effectively guides the model to generate precise target-language outputs. Extensive evaluations across four model architectures and multiple translation directions demonstrate significant and consistent improvements, showcasing the potential of this straightforward strategy to enhance zero-shot MT performance.
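A hedged sketch of how the fixed prefix pair might be applied in practice, as read from the abstract: a short, known-correct bilingual prefix is prepended to the source sentence, and its translation seeds the target side so the LLM continues in the correct language. The prompt wording, language pair, and prefix choice below are illustrative assumptions; the prefix would presumably be stripped from the output before evaluation.

```python
def build_prompt(source: str, src_lang: str, tgt_lang: str,
                 prefix_src: str, prefix_tgt: str) -> str:
    """Zero-shot MT prompt bootstrapped with a fixed bilingual prefix pair."""
    return (
        f"Translate the following {src_lang} sentence into {tgt_lang}.\n"
        f"{src_lang}: {prefix_src} {source}\n"
        f"{tgt_lang}: {prefix_tgt}"   # the model continues from here
    )

prompt = build_prompt(
    source="The weather will improve tomorrow.",
    src_lang="English", tgt_lang="German",
    prefix_src="According to the forecast,",  # fixed source-side prefix
    prefix_tgt="Laut der Vorhersage,",        # its translation seeds decoding
)
print(prompt)
```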

pdf bib
UTER: Capturing the Human Touch in Evaluating Morphologically Rich and Low-Resource Languages
Samy Ouzerrout

We introduce UTER, a novel automatic translation evaluation metric specifically designed for morphologically complex languages. Unlike traditional TER approaches, UTER incorporates a reordering algorithm and leverages the Sørensen-Dice similarity measure to better account for morphological variations. Tested on morphologically rich and low-resource languages from the WMT22 dataset, such as Finnish, Estonian, Kazakh, and Xhosa, UTER delivers results that align more closely with human direct assessments (DA) and outperforms benchmark metrics, including chrF and METEOR. Furthermore, its effectiveness has also been demonstrated on languages with complex writing systems, such as Chinese and Japanese, showcasing its versatility and robustness.
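For readers unfamiliar with the similarity measure UTER builds on, the sketch below computes Sørensen-Dice similarity over character bigrams, so that morphological variants of the same stem still score high; whether UTER applies it to bigrams or another unit is an assumption here.

```python
def char_bigrams(word: str) -> set[str]:
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice_similarity(a: str, b: str) -> float:
    """Sørensen-Dice coefficient: 2|X ∩ Y| / (|X| + |Y|)."""
    x, y = char_bigrams(a), char_bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))

# Finnish inflected forms of "talo" (house) remain similar under this measure:
print(dice_similarity("talossa", "talossani"))  # ~0.86
print(dice_similarity("talossa", "auto"))       # 0.0
```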

pdf bib
From Text to Multi-Modal: Advancing Low-Resource-Language Translation through Synthetic Data Generation and Cross-Modal Alignments
Bushi Xiao | Qian Shen | Daisy Zhe Wang

In this study, we propose a novel paradigm for generating multi-modal datasets for low-resource languages that eliminates the dependency on existing parallel multi-modal data. Leveraging advances in large image-generation models, we introduce a systematic pipeline that transforms text-only parallel corpora into rich multi-modal translation datasets, and we validate the generated content through human evaluation. We design and implement a new MMT model framework suited to the generated dataset; the framework includes a verification mechanism based on a large language model to ensure consistency between visual content and textual translations. Experimental results across four African low-resource languages with training corpora of fewer than 10k sentences demonstrate significant improvements over NLLB baselines, with average gains of up to 9.8% in BLEU score and 4.3% in METEOR score. Our method is particularly effective at correctly translating concrete objects and contextual elements, suggesting its potential for improving low-resource machine translation through visual grounding.
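A hypothetical sketch of the dataset-generation step (the model choice and prompting here are assumptions, not the authors' pipeline): each source sentence of a text-only parallel corpus is rendered into an image with an off-the-shelf diffusion model, yielding image-source-target triples that are then verified as described above.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

parallel_corpus = [
    ("A woman sells fish at the market.", "<target-language translation>"),
]

triples = []
for src, tgt in parallel_corpus:
    image = pipe(src).images[0]   # synthesize a grounding image from the source
    triples.append({"image": image, "source": src, "target": tgt})
    # The paper additionally checks image/text consistency with an LLM-based
    # verification mechanism and human evaluation before MMT training.
```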

pdf bib
Wenzhou Dialect Speech to Mandarin Text Conversion
Zhipeng Gao | Akihiro Tamura | Tsuneo Kato

The Wenzhou dialect is a Chinese dialect that is significantly distinct from Mandarin, the official language of China. It is among the most complex Chinese dialects and is nearly incomprehensible to people from regions such as Northern China, thereby creating substantial communication barriers. Therefore, the conversion between the Wenzhou dialect and Mandarin is essential to facilitate communication between Wenzhou dialect speakers and those from other Chinese regions. However, as a low-resource language, the Wenzhou dialect lacks publicly available datasets, and such conversion technologies have not been extensively researched. Thus, in this study, we create a parallel dataset containing Wenzhou dialect speech and the corresponding Mandarin text and build benchmark models for Wenzhou dialect speech-to-Mandarin text conversion. In particular, we fine-tune two self-supervised learning-based pretrained models, that is, TeleSpeech-ASR1.0 and Wav2Vec2-XLS-R, with our training dataset and report their performance on our test dataset as baselines for future research.
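A minimal sketch of how the Wav2Vec2-XLS-R baseline described above could be set up with Hugging Face Transformers, assuming a character-level Mandarin vocabulary (vocab.json) built from the training transcripts; the actual training configuration is the authors' choice.

```python
from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                          Wav2Vec2ForCTC, Wav2Vec2Processor)

tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                             padding_value=0.0, do_normalize=True,
                                             return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor,
                              tokenizer=tokenizer)

# Fresh CTC head over the Mandarin character vocabulary on top of the
# self-supervised XLS-R encoder; fine-tune on (Wenzhou speech, Mandarin text).
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
```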

pdf bib
Fostering Digital Inclusion for Low-Resource Nigerian Languages: A Case Study of Igbo and Nigerian Pidgin
Ebelechukwu Nwafor | Minh Phuc Nguyen

Current state-of-the-art large language models (LLMs) such as GPT-4 perform exceptionally well in translation tasks for high-resource languages such as English, but often fail to achieve high accuracy for low-resource African languages such as Igbo and Nigerian Pidgin, two native languages of Nigeria. This study addresses the need for linguistic diversity in Artificial Intelligence (AI) by creating benchmark datasets for Igbo-English and Nigerian Pidgin-English translation. The datasets are curated from reputable online sources and meticulously annotated by crowd-sourced native-speaking human annotators. Using these datasets, we evaluate the translation abilities of GPT-based models alongside other state-of-the-art translation models specifically designed for low-resource languages. Our results demonstrate that the latter outperform GPT-based models in these translation tasks. In addition, the datasets can significantly enhance LLM performance on these tasks, marking a step toward reducing linguistic bias and promoting more inclusive AI models.

pdf bib
Low-resource Machine Translation: what for? who for? An observational study on a dedicated Tetun language translation service
Raphael Merx | Adérito José Guterres Correia | Hanna Suominen | Ekaterina Vylomova

Low-resource machine translation (MT) presents a diversity of community needs and application challenges that remain poorly understood. To complement surveys and focus groups, which tend to rely on small samples of respondents, we propose an observational study on actual usage patterns of a specialized MT service for the Tetun language, the lingua franca of Timor-Leste. Our analysis of 100,000 translation requests reveals patterns that challenge assumptions based on existing corpora. We find that users, many of them students on mobile devices, typically translate text from a high-resource language into Tetun across diverse domains including science, healthcare, and daily life. This contrasts sharply with available Tetun corpora, which are dominated by news articles covering government and social issues. Our results suggest that MT systems for institutionalized minority languages like Tetun should prioritize accuracy on domains relevant to educational contexts, in the high-resource-to-low-resource direction. More broadly, this study demonstrates how observational analysis can inform low-resource language technology development by grounding research in practical community needs.

pdf bib
Jamo-Level Subword Tokenization in Low-Resource Korean Machine Translation
Junyoung Lee | Marco Cognetta | Sangwhan Moon | Naoaki Okazaki

Subword tokenization, where text is represented in an intermediate form between full words and characters, is ubiquitous in modern NLP due to its ability to represent any input sentence with a small vocabulary. However, for Korean, whose alphabet contains 11,172 base characters (*syllables*), it is difficult to have a vocabulary large enough to succinctly encode text while fitting within parameter-budget constraints. This motivates us to explore an alternative representation for Korean that relies on the decompositional nature of Korean syllables: a syllable can be uniquely decomposed into a sequence of two or three subcharacters (*jamo*), of which there are only 68. Using jamo as the basis for subword tokenization (e.g., byte-pair encoding) leads to shorter tokenized sequences with fewer vocabulary parameters, exposes the model to sub-syllable-level morphological information, and increases the amount of augmentation gained from subword regularization. We evaluate jamo-level subword tokenization on several Korean translation tasks and find that jamo-level subword models consistently outperform syllable- and byte-level models in low-resource and restricted-vocabulary settings.
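The decomposition the paper relies on is standard Unicode Hangul arithmetic (11,172 = 19 × 21 × 28 precomposed syllables). The sketch below decomposes syllables into jamo; how the resulting jamo sequence is then fed to BPE and subword regularization is the paper's contribution and is not shown here.

```python
LEADS = [chr(0x1100 + i) for i in range(19)]          # initial consonants
VOWELS = [chr(0x1161 + i) for i in range(21)]         # medial vowels
TAILS = [""] + [chr(0x11A8 + i) for i in range(27)]   # optional final consonants

def to_jamo(text: str) -> str:
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:                 # precomposed Hangul syllable
            lead, rest = divmod(code, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out.append(LEADS[lead] + VOWELS[vowel] + TAILS[tail])
        else:
            out.append(ch)                    # pass non-Hangul characters through
    return "".join(out)

print(to_jamo("한국어"))  # three syllables -> eight jamo characters
```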

pdf bib
Beyond English: The Impact of Prompt Translation Strategies across Languages and Tasks in Multilingual LLMs
Itai Mondshine | Tzuf Paz-Argaman | Reut Tsarfaty

Despite advances in the multilingual capabilities of Large Language Models (LLMs) across diverse tasks, English remains the dominant language for LLM research and development. As a result, when working in another language, a widespread practice is pre-translation, i.e., translating the task prompt into English before inference. Selective pre-translation, a more surgical approach, translates only specific prompt components. However, its current use is sporadic and lacks a systematic research foundation. Consequently, the optimal pre-translation strategy for various multilingual settings and tasks remains unclear. In this work, we aim to uncover the optimal setup for pre-translation by systematically assessing its use. Specifically, we view the prompt as a modular entity composed of four functional parts: instruction, context, examples, and output, each of which can either be translated or left in the source language. We evaluate pre-translation strategies across 35 languages, covering both low- and high-resource languages, on tasks including Question Answering (QA), Natural Language Inference (NLI), Named Entity Recognition (NER), and Abstractive Summarization. Our experiments show the impact of factors such as similarity to English, translation quality, and the amount of pre-training data on model performance with pre-translation. We suggest practical guidelines for choosing optimal strategies in various multilingual settings.
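A hedged sketch of selective pre-translation as described above: the prompt is treated as four parts, each of which can independently be pre-translated into English. The translate_to_english stub stands in for whatever MT system is used and is purely illustrative.

```python
from dataclasses import dataclass

def translate_to_english(text: str) -> str:
    """Identity stub; swap in a real MT call."""
    return text

@dataclass
class Prompt:
    instruction: str
    context: str
    examples: str
    output_prefix: str

def build_prompt(p: Prompt, pre_translate: dict[str, bool]) -> str:
    parts = []
    for name in ("instruction", "context", "examples", "output_prefix"):
        value = getattr(p, name)
        parts.append(translate_to_english(value)
                     if pre_translate.get(name, False) else value)
    return "\n".join(parts)

# Example strategy: pre-translate only the instruction, keep everything else
# in the source language.
prompt = build_prompt(
    Prompt(instruction="<instruction in source language>",
           context="<passage>", examples="<few-shot examples>",
           output_prefix="Answer:"),
    pre_translate={"instruction": True},
)
```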

pdf bib
ModeLing: A Novel Dataset for Testing Linguistic Reasoning in Language Models
Nathan Andrew Chi | Teodor Malchev | Riley Kong | Ryan Andrew Chi | Lucas Huang | Ethan A Chi | R. Thomas McCoy | Dragomir Radev

We introduce ModeLing, a novel benchmark of Linguistics Olympiad-style puzzles that tests few-shot reasoning in AI systems. Solving these puzzles requires inferring aspects of a language’s grammatical structure from a small number of examples. Such puzzles provide a natural testbed for language models, as they require compositional generalization and few-shot inductive reasoning. Consisting solely of new puzzles written specifically for this work, ModeLing has no risk of appearing in the training data of existing AI systems: this ameliorates the risk of data leakage, a potential confounder for many prior evaluations of reasoning. Evaluating several large open-source language models and GPT on our benchmark, we observe non-negligible accuracy, demonstrating few-shot emergent reasoning ability that cannot merely be attributed to shallow memorization. However, imperfect model performance suggests that ModeLing can be used to measure further progress in linguistic reasoning.

pdf bib
Multilingual State Space Models for Structured Question Answering in Indic Languages
Arpita Vats | Rahul Raja | Mrinal Mathur | Aman Chadha | Vinija Jain

The diversity and complexity of Indic languages present unique challenges for natural language processing (NLP) tasks, particularly in the domain of question answering (QA). To address these challenges, this paper explores the application of State Space Models (SSMs) to build efficient and contextually aware QA systems tailored for Indic languages. SSMs are particularly suited for this task due to their ability to model long-term and short-term dependencies in sequential data, making them well-equipped to handle the rich morphology, complex syntax, and contextual intricacies characteristic of Indian languages. We evaluated multiple SSM architectures across diverse datasets representing various Indic languages and conducted a comparative analysis of their performance. Our results demonstrate that these models effectively capture linguistic subtleties, leading to significant improvements in question interpretation, context alignment, and answer generation. This work represents the first application of SSMs to question answering tasks in Indic languages, establishing a foundational benchmark for future research in this domain. Furthermore, we propose enhancements to existing SSM frameworks, optimizing their applicability to low-resource settings and multilingual scenarios prevalent in Indic languages.

pdf bib
Parallel Corpora for Machine Translation in Low-Resource Indic Languages: A Comprehensive Review
Rahul Raja | Arpita Vats

Parallel corpora play an important role in training machine translation (MT) models, particularly for low-resource languages where high-quality bilingual data is scarce. This review provides a comprehensive overview of available parallel corpora for Indic languages, which span diverse linguistic families, scripts, and regional variations. We categorize these corpora into text-to-text, code-switched, and various categories of multimodal datasets, highlighting their significance in the development of robust multilingual MT systems. Beyond resource enumeration, we critically examine the challenges faced in corpus creation, including linguistic diversity, script variation, data scarcity, and the prevalence of informal textual content. We also discuss and evaluate these corpora along dimensions such as alignment quality and domain representativeness. Furthermore, we address open challenges such as data imbalance across Indic languages, the trade-off between quality and quantity, and the impact of noisy, informal, and dialectal data on MT performance. Finally, we outline future directions, including leveraging cross-lingual transfer learning, expanding multilingual datasets, and integrating multimodal resources to enhance translation quality. To the best of our knowledge, this paper presents the first comprehensive review of parallel corpora specifically tailored for low-resource Indic languages in the context of machine translation.

pdf bib
Low-Resource Transliteration for Roman-Urdu and Urdu Using Transformer-Based Models
Umer Butt | Stalin Varanasi | Günter Neumann

As the Information Retrieval (IR) field increasingly recognizes the importance of inclusivity, addressing the needs of low-resource languages remains a significant challenge. Transliteration between Urdu and its Romanized form, Roman Urdu, remains underexplored despite the widespread use of both scripts in South Asia. Prior work using RNNs on the Roman-Urdu-Parl dataset showed promising results but suffered from poor domain adaptability and limited evaluation. We propose a transformer-based approach using the m2m100 multilingual translation model, enhanced with masked language modeling (MLM) pretraining and fine-tuning on both Roman-Urdu-Parl and the domain-diverse Dakshina dataset. To address previous evaluation flaws, we introduce rigorous dataset splits and assess performance using BLEU, character-level BLEU, and chrF. Our model achieves strong transliteration performance, with Char-BLEU scores of 96.37 for Urdu→Roman-Urdu and 97.44 for Roman-Urdu→Urdu. These results outperform both RNN baselines and GPT-4o Mini, and demonstrate the effectiveness of multilingual transfer learning for low-resource transliteration tasks.
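A sketch of the inference interface for the m2m100 checkpoint the approach starts from (Hugging Face Transformers); reusing the "ur" language code for both scripts is an assumption here, and the MLM pretraining and fine-tuning stages described above are taken as already applied.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "ur"            # Roman-Urdu input, Urdu-script output
inputs = tokenizer("aap kaise hain", return_tensors="pt")
generated = model.generate(
    **inputs, forced_bos_token_id=tokenizer.get_lang_id("ur")
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```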

pdf bib
Building Data Infrastructure for Low-Resource Languages
Sarah K. K. Luger | Rafael Mosquera | Pedro Ortiz Suarez

The MLCommons Datasets Working Group presents a comprehensive initiative to advance the development and accessibility of artificial intelligence (AI) training and testing resources. This paper introduces three key projects aimed at addressing critical gaps in the AI data ecosystem: the Unsupervised People’s Speech Dataset, containing over 821,000 hours of speech across 89+ languages; a strategic collaboration with Common Crawl to enhance web crawling capabilities for low-resource languages; and a framework for knowledge graph extraction evaluation. By focusing on languages other than English (LOTE) and creating permissively licensed, high-quality datasets, these initiatives aim to democratize AI development and improve model performance across diverse linguistic contexts. This work represents a significant step toward more inclusive and capable AI systems that can serve global communities.

pdf bib
Encoder-Aware Sequence-Level Knowledge Distillation for Low-Resource Neural Machine Translation
Menan Velayuthan | Nisansa De Silva | Surangika Ranathunga

Domain adaptation in Neural Machine Translation (NMT) is commonly achieved through fine-tuning, but this approach becomes inefficient as the number of domains increases. Knowledge distillation (KD) provides a scalable alternative by training a compact model on distilled data from a larger model. However, we hypothesize that vanilla sequence-level KD primarily distills the decoder while neglecting encoder knowledge, leading to suboptimal knowledge transfer and limiting its effectiveness in low-resource settings, where both data and computational resources are constrained. To address this, we propose an improved sequence-level KD method that enhances encoder knowledge transfer through a cosine-based alignment loss. Our approach first trains a large model on a mixed-domain dataset and generates a Distilled Mixed Dataset (DMD). A small model is then trained on this dataset via sequence-level KD with encoder alignment. Experiments in a low-resource setting validate our hypothesis, demonstrating that our approach outperforms vanilla sequence-level KD, improves generalization to out-of-domain data, and facilitates efficient domain adaptation while reducing model size and computational cost.
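A sketch of one possible reading of the encoder-alignment objective (not the authors' code): the student is trained with the usual sequence-level KD cross-entropy on the Distilled Mixed Dataset, plus a cosine term pulling its encoder states toward the frozen teacher's. It assumes both models share a tokenizer; proj is a linear layer mapping the student's hidden size to the teacher's.

```python
import torch
import torch.nn.functional as F

def kd_step(student, teacher, proj, batch, alpha=1.0):
    # Vanilla sequence-level KD: cross-entropy against teacher-generated targets.
    out = student(input_ids=batch["input_ids"],
                  attention_mask=batch["attention_mask"],
                  labels=batch["labels"])
    ce_loss = out.loss

    with torch.no_grad():                     # teacher encoder stays frozen
        t_enc = teacher.get_encoder()(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
        ).last_hidden_state

    s_enc = proj(out.encoder_last_hidden_state)   # (batch, seq, d_teacher)

    # Cosine alignment over non-padding source positions.
    cos = F.cosine_similarity(s_enc, t_enc, dim=-1)        # (batch, seq)
    mask = batch["attention_mask"].float()
    align_loss = ((1.0 - cos) * mask).sum() / mask.sum()

    return ce_loss + alpha * align_loss
```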

pdf bib
PahGen: Generating Ancient Pahlavi Text via Grammar-guided Zero-shot Translation
Farhan Farsi | Parnian Fazel | Farzaneh Goshtasb | Nadia Hajipour | Sadra Sabouri | Ehsaneddin Asgari | Hossein Sameti

The Pahlavi language, also known as Middle Persian, is a critical part of Persian cultural and historical heritage, bridging Old Persian and Modern Persian (Farsi). However, due to its limited digital presence and the scarcity of comprehensive linguistic resources, Pahlavi is at risk of extinction. As an early attempt to preserve this language, this study introduces a framework to translate English text into Pahlavi. Our approach combines grammar-guided term extraction with zero-shot translation, leveraging large language models (LLMs) to generate syntactically and semantically accurate Pahlavi sentences. This framework aims to preserve the Pahlavi language and serves as a model for reviving other endangered languages with similar characteristics. Finally, using our framework, we generate a novel dataset of 360 expert-validated parallel English-Pahlavi texts.

pdf bib
Limitations of Religious Data and the Importance of the Target Domain: Towards Machine Translation for Guinea-Bissau Creole
Jacqueline Rowe | Edward Gow-Smith | Mark Hepple

We introduce a new dataset for machine translation of Guinea-Bissau Creole (Kiriol), comprising around 40 thousand sentences parallel with English and Portuguese. The dataset consists predominantly of religious data (from the Bible and texts from the Jehovah’s Witnesses), with a small amount of general-domain data (from a dictionary), mirroring the typical resource availability of many low-resource languages. We train a number of transformer-based models to investigate how to improve domain transfer from religious data to a more general domain. We find that adding as few as 300 target-domain sentences to the training data substantially improves translation performance, highlighting the importance of data collection for low-resource languages, even at a small scale. We additionally find that Portuguese-to-Kiriol translation models perform better on average than models for other source and target language pairs, and we investigate how this relates to the morphological complexity of the languages involved and the degree of lexical overlap between creoles and their lexifiers. Overall, we hope our work will stimulate research into Kiriol and into how machine translation can better support creole languages in general.