Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages
Ernesto Luis Estevanell-Valladares | Alicia Picazo-Izquierdo | Tharindu Ranasinghe | Besik Mikaberidze | Simon Ostermann | Daniil Gurgurov | Philipp Mueller | Claudia Borg | Marián Šimko
Bridging the Gap: Leveraging Cherokee to Improve Language Identification for Endangered Iroquoian Languages
Liam Enzo Eggleston | Michael P. Cacioli | Jatin Sarabu | Ivory Yang | Kevin Zhu
Language identification is a foundational task in natural language processing (NLP), yet many Indigenous languages remain entirely unsupported by commercial language identification systems. In this study, we assess the performance of Google LangID on a 5k-sentence Cherokee dataset and find that every sentence is classified as “undetermined”, indicating a complete failure to even misidentify Cherokee as another language. To further explore this issue, we manually constructed the first digitized Northern Iroquoian dataset, consisting of 120 sentences across five related languages: Onondaga, Cayuga, Mohawk, Seneca, and Oneida. Running these sentences through Google LangID, we examine patterns in its incorrect predictions. To address these limitations, we train a random forest classifier that successfully distinguishes between these languages, demonstrating its effectiveness for language identification. Our findings underscore the inadequacies of existing commercial language identification models for Indigenous languages and highlight concrete steps toward improving automated recognition of low-resource languages.
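A random forest over character n-grams is one common way to realize the kind of classifier this abstract describes. The sketch below is illustrative only: the French/English toy sentences stand in for the Iroquoian data, and the pipeline is a generic setup, not the authors' configuration.

```python
# Illustrative sketch (not the paper's setup): language identification
# with a random forest over character n-grams. Toy training sentences
# are invented placeholders for the actual low-resource data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

train_texts = ["bonjour tout le monde", "salut les amis",
               "hello there friend", "good morning everyone"]
train_langs = ["fr", "fr", "en", "en"]

# Character n-grams capture orthographic cues that can separate even
# closely related languages when little training data is available.
clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
clf.fit(train_texts, train_langs)
print(clf.predict(["hello everyone"])[0])
```

With real data, the training sentences would be the curated per-language corpus and the labels the five Iroquoian languages.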
Building a Lightweight Classifier to Distinguish Closely Related Language Varieties with Limited Supervision: The Case of Catalan vs Valencian
Raúl García-Cerdá | María Miró Maestre | Miquel Canal
Dialectal variation among closely related languages poses a major challenge in low-resource NLP, as their linguistic similarity increases confusability for automatic systems. We introduce the first supervised classifier to distinguish standard Catalan from its regional variety Valencian. Our lightweight approach fine-tunes a RoBERTa-base model on a manually curated corpus of 20,000 sentences—without any Valencian-specific tools—and achieves 98% accuracy on unseen test data. In a human evaluation of 90 mixed-variety items per reviewer, acceptance rates reached 96.7% for Valencian and 97.7% for Catalan (97.2% overall). We discuss limitations with out-of-distribution inputs and outline future work on confidence calibration and dialect-aware tokenization. Our findings demonstrate that high-impact dialect classification is feasible with minimal resources.
A Thresholding Method for Improving Translation Quality for the Indic MT Task
Sudhansu Bala Das | Leo Raphael Rodrigues | Tapas Kumar Mishra | Bidyut Ku Patra
The conversion of content from one language to another using a computer system is known as Machine Translation (MT). Various techniques have been used to ensure effective translations that retain the contextual and lexical interpretation of the source and target languages. One of these methods is end-to-end Neural Machine Translation (NMT), which is frequently used in real-world machine translation systems. NMT requires large parallel datasets for effective translation: an MT system needs such data during training to learn the linguistic patterns and structures of both languages. One such dataset is Samanantar, the largest publicly accessible parallel dataset for Indian languages (ILs). Since it has been gathered from various sources, it contains many incorrect or dissimilar translations, so MT systems built on it cannot perform to their full potential. This paper proposes an algorithm to remove dissimilar translations from the training dataset and evaluates the resulting model’s efficiency. Two Indic languages (ILs), Hindi (HIN) and Odia (ODI), were chosen for the experiment. A baseline NMT system is built for these languages, and the effect of different dataset sizes is investigated. Translation quality is evaluated using standard metrics. The results show that removing dissimilar translations from the training dataset improves translation quality. We also observe that, although the ILs-English and English-ILs systems are trained on the same dataset, ILs-English performs better across all evaluation metrics.
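The core filtering idea, dropping sentence pairs whose similarity falls below a threshold before NMT training, can be sketched as follows. This is a toy illustration, not the paper's algorithm: the sentence pairs and similarity scores are invented placeholders, and a real system would compute scores with, e.g., cross-lingual sentence embeddings.

```python
# Toy sketch of threshold-based filtering of a parallel corpus.
# Similarity scores here are invented placeholders; in practice they
# would come from a cross-lingual similarity model.
def filter_parallel(pairs, scores, threshold=0.5):
    """Keep only sentence pairs whose similarity score meets the threshold."""
    return [pair for pair, score in zip(pairs, scores) if score >= threshold]

pairs = [
    ("namaste duniya", "hello world"),
    ("yah ek kitaab hai", "the weather is nice today"),  # misaligned pair
    ("dhanyavaad", "thank you"),
]
scores = [0.92, 0.12, 0.88]
clean = filter_parallel(pairs, scores)
print(len(clean))  # the misaligned pair is dropped
```

The threshold itself would be tuned on held-out data, trading corpus size against pair quality.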
A Multi-Task Learning Approach to Dialectal Arabic Identification and Translation to Modern Standard Arabic
Abdullah Khered | Youcef Benkhedda | Riza Batista-Navarro
Translating Dialectal Arabic (DA) into Modern Standard Arabic (MSA) is a complex task due to the linguistic diversity and informal nature of dialects, particularly in social media texts. To improve translation quality, we propose a Multi-Task Learning (MTL) framework that combines DA-MSA translation as the primary task and dialect identification as an auxiliary task. Additionally, we introduce LahjaTube, a new corpus containing DA transcripts and corresponding MSA and English translations, covering four major Arabic dialects: Egyptian (EGY), Gulf (GLF), Levantine (LEV), and Maghrebi (MGR), collected from YouTube. We evaluate AraT5 and AraBART on the Dial2MSA-Verified dataset under Single-Task Learning (STL) and MTL setups. Our results show that adopting the MTL framework and incorporating LahjaTube into the training data improve the translation performance, leading to a BLEU score improvement of 2.65 points over baseline models.
Low-Resource Machine Translation for Moroccan Arabic
Alexei Rosca | Abderrahmane Issam | Gerasimos Spanakis
Neural Machine Translation (NMT) has achieved significant progress, especially for languages with large amounts of data (referred to as high-resource languages). However, most of the world’s languages lack sufficient data and are thus considered low-resource or endangered. Previous research explored various techniques for improving NMT performance on low-resource languages, with no guarantee that they will perform similarly on other languages. In this work, we explore various low-resource NMT techniques for improving performance on Moroccan Arabic (Darija), a dialect of Arabic that is considered a low-resource language. We experiment with three techniques that are prominent in low-resource Natural Language Processing (NLP), namely back-translation, paraphrasing, and transfer learning. Our results indicate that transfer learning, especially in combination with back-translation, is effective at improving translation performance on Moroccan Arabic, achieving a BLEU score of 26.79 on Darija-to-English and 9.98 on English-to-Darija.
Efficient Architectures For Low-Resource Machine Translation
Edoardo Signoroni | Pavel Rychly | Ruggero Signoroni
Low-resource Neural Machine Translation is highly sensitive to hyperparameters and needs careful tuning to achieve the best results with small amounts of training data. We focus on exploring the impact of changes in the Transformer architecture on downstream translation quality, and propose a metric to score the computational efficiency of such changes. Experimenting on English-Akkadian, German-Lower Sorbian, English-Italian, and English-Manipuri, we confirm previous findings in low-resource machine translation optimization, and show that smaller, more parameter-efficient models can achieve the same translation quality as larger, unwieldier ones at a fraction of the computational cost. Optimized models have around 95% fewer parameters, while dropping only up to 14.8% ChrF. We compile a list of optimal ranges for each hyperparameter.
IfGPT: A Dataset in Bulgarian for Large Language Models
Svetla Peneva Koeva | Ivelina Stoyanova | Jordan Konstantinov Kralev
The paper presents the large dataset IfGPT, which brings together available corpora and datasets for Bulgarian, and describes methods to continuously expand it with deduplicated and unbiased Bulgarian data. The samples in the dataset are annotated with metadata that enable effective extraction of domain- and application-oriented datasets for fine-tuning or Retrieval-Augmented Generation (RAG) with large language models (LLMs). The paper focuses on the description of the extended metadata of the IfGPT dataset and its management in a graph database.
Modular Training of Deep Neural Networks for Text Classification in Guarani
Jose Luis Vazquez | Carlos Ulises Valdez | Marvin Matías Agüero-Torales | Julio César Mello-Román | Jose Domingo Colbes | Sebastian Alberto Grillo
We present a modular training approach for deep text classification in Guarani, where networks are split into sectors trained independently and later combined. This sector-wise backpropagation improves stability, reduces training time, and adapts to standard architectures like CNNs, LSTMs, and Transformers. Evaluated on three Guarani datasets—emotion, humor, and offensive language—our method outperforms traditional Bayesian-optimized training in both accuracy and efficiency.
Roman Urdu as a Low-Resource Language: Building the First IR Dataset and Baseline
Muhammad Umer Tariq Butt | Stalin Varanasi | Guenter Neumann
The field of Information Retrieval (IR) increasingly recognizes the importance of inclusivity, yet addressing the needs of low-resource languages, especially those with informal variants, remains a significant challenge. This paper addresses a critical gap in effective IR systems for Roman Urdu, a romanized form of Urdu, a language with millions of speakers that is widely used in digital communication yet severely underrepresented in research and tooling. Roman Urdu presents unique complexities due to its informality, lack of standardized spelling conventions, and frequent code-switching with English. Crucially, prior to this work, there was a complete absence of any Roman Urdu IR dataset or dedicated retrieval work. To address this gap, we present the first large-scale IR dataset for Roman Urdu, translated from MS MARCO through a multi-hop pipeline involving English-to-Urdu translation followed by Urdu-to-Roman Urdu transliteration. Using this novel dataset, we train and evaluate a multilingual retrieval model, achieving substantial improvements over traditional lexical retrieval baselines (MRR@10: 0.19 vs. 0.08; Recall@10: 0.332 vs. 0.169). This work lays foundational benchmarks and methodologies for Roman Urdu IR, particularly with transformer-based models, significantly contributing to inclusive information access and setting the stage for future research in informal, romanized, and low-resource languages.
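For readers unfamiliar with the reported metric, MRR@10 averages, over queries, the reciprocal rank of the first relevant document within the top 10 results. A minimal sketch on invented toy rankings (not the paper's data):

```python
# Minimal MRR@10 computation on toy relevance judgements.
def mrr_at_10(ranked_relevance):
    """ranked_relevance: one list of 0/1 relevance flags per query,
    ordered by the retriever's ranking."""
    total = 0.0
    for flags in ranked_relevance:
        for rank, relevant in enumerate(flags[:10], start=1):
            if relevant:
                total += 1.0 / rank  # reciprocal rank of the first hit
                break  # only the first relevant document counts
    return total / len(ranked_relevance)

# Query 1: first relevant result at rank 2; query 2: at rank 5.
print(mrr_at_10([[0, 1, 0], [0, 0, 0, 0, 1]]))  # (1/2 + 1/5) / 2 = 0.35
```

A query with no relevant document in the top 10 contributes zero, which is what makes the metric sensitive to getting at least one good result near the top.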
The Brittle Compass: Navigating LLM Prompt Sensitivity in Slovak Migration Media Discourse
Jaroslav Kopčan | Samuel Harvan | Marek Suppa
In this work, we present a case study exploring various tasks centered around the topic of migration in Slovak, a low-resource language, such as topic-relevance and geographical-relevance classification, and migration source/destination location term extraction. Our results demonstrate that native (Slovak) prompts yield a modest, task-dependent gain, while large models show significant robustness to prompt variations compared to their smaller counterparts. Analysis reveals that instructions (system or task) emerge as the most critical prompt component, more so than the example sections, with task-specific performance benefits being more pronounced than overall language effects.
Explicit Edge Length Coding to Improve Long Sentence Parsing Performance
Khensa Daoudi | Mathieu Dehouck | Rayan Ziane | Natasha Romanova
Performance of syntactic parsers is reduced for longer sentences. While some of this reduction can be explained by the tendency of longer sentences to be more syntactically complex, as well as the increase in the number of candidate governors, some of it is due to longer sentences being more challenging to encode. This is especially relevant for low-resource scenarios such as parsing written sources in historical languages (e.g. medieval and early-modern European languages), in particular legal texts, where sentences can be very long while the amount of training material remains limited. In this paper, we present a new method for explicitly using arc length information to bias the scores produced by a graph-based parser. Through a series of experiments on Norman and Gascon data, in which we divide the test data by sentence length, we show that explicit length coding indeed helps retain parsing performance on longer sentences.
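In spirit, biasing a graph-based parser with arc length amounts to adding a length-indexed offset to the arc score matrix before decoding. Everything below is invented for illustration (random scores, made-up bias values, greedy head selection instead of tree decoding); it is not the authors' model.

```python
# Illustrative sketch only: add a length-dependent bias to a graph-based
# parser's arc scores. Scores and bias values are invented, not learned.
import numpy as np

n = 4  # number of tokens (index 0 acting as the root)
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))  # scores[h, d]: arc from head h to dependent d

# Hypothetical bias per arc length |h - d|; a real model would learn these.
length_bias = np.array([0.0, 0.5, 0.1, -0.3])
arc_length = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
biased = scores + length_bias[arc_length]

# Greedy head selection per dependent (a real parser decodes a full tree).
heads = biased.argmax(axis=0)
print(heads.tolist())
```

The point of the explicit term is that long arcs are no longer scored on token content alone, which is where encoders degrade on long sentences.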
Evaluating LLM Capabilities in Low-Resource Contexts: A Case Study of Persian Linguistic and Cultural Tasks
Jasmin Heierli | Rebecca Bahar Ganjineh | Elena Gavagnin
We evaluate four representative large language models, namely GPT-4o, Gemini, Llama, and DeepSeek, on a suite of linguistic and cultural tasks in Persian, covering grammar, paraphrasing, inference, translation, factual recall, analogical reasoning, and a Hofstede-based cultural probe under direct and role-based prompts. Our findings reveal consistent performance declines, alongside systematic misalignment with Iranian cultural norms. Role-based prompting yields modest improvements but does not fully restore cultural fidelity. We conclude that advancing truly multilingual models demands richer Persian resources, targeted adaptation, and evaluation frameworks that jointly assess fluency and cultural alignment.
A Benchmark for Evaluating Logical Reasoning in Georgian For Large Language Models
Irakli Koberidze | Archil Elizbarashvili | Magda Tsintsadze
Advancements in LLMs have largely overlooked low-resource languages (LRLs), creating a gap in evaluation benchmarks. To address this for Georgian, a Kartvelian language, we introduce GeoLogicQA, a novel, manually curated benchmark that assesses LLMs’ logical and inferential reasoning through 100 questions. Questions cover syllogistic deduction, inferential reading comprehension, common-sense reasoning, and arithmetic, adapted from challenging sources (the Kangaroo Mathematics Competition) and validated by native Georgian speakers for linguistic nuances. Initial evaluations of state-of-the-art LLMs (Gemini 2.5 Flash, DeepSeek-V3, Grok-3, GPT-4o) show average accuracies of 64% to 83%, significantly exceeding the human baseline of 47%. While the models demonstrate strong reasoning potential, error analysis reveals persistent challenges in multi-step combinatorial and highly constrained inferential tasks. GeoLogicQA is a public resource for tracking progress and diagnosing weaknesses in Georgian LLMs. We plan to expand the benchmark and establish a public leaderboard to foster continuous improvement.
Slur and Emoji Aware Models for Hate and Sentiment Detection in Roman Urdu Transgender Discourse
Muhammad Owais Raza | Aqsa Umar | Mehrub Awan
The rise of social media has amplified both the visibility and vulnerability of marginalized communities, particularly the transgender population in South Asia. While hate speech detection has seen considerable progress in high-resource languages like English, under-resourced and code-mixed languages such as Roman Urdu remain significantly understudied. This paper presents a novel Roman Urdu dataset derived from Instagram comments on transgender-related content, capturing the intricacies of multilingual, code-mixed, and emoji-laden social discourse. We introduce a transphobic slur lexicon specific to Roman Urdu and a semantic emoji taxonomy grounded in contextual usage. These resources are used to perform fine-grained classification of sentiment and hate speech with both traditional machine learning models and transformer-based architectures. The findings show that our custom-trained BERT-based models, Senti-RU-Bert and Hate-RU-Bert, achieve the best performance, with F1 scores of 80.39% for sentiment classification and 77.34% for hate speech classification. Ablation studies reveal consistent performance gains when slur and emoji features are included.
Automatic Fact-checking in English and Telugu
Ravi Kiran Chikkala | Tatiana Anikina | Natalia Skachkova | Ivan Vykopal | Rodrigo Agerri | Josef van Genabith
False information poses a significant global challenge, and manually verifying claims is a time-consuming and resource-intensive process. In this research paper, we experiment with different approaches to investigate the effectiveness of large language models (LLMs) in classifying factual claims by their veracity and generating justifications in English and Telugu. The key contributions of this work include the creation of a bilingual English-Telugu dataset and the benchmarking of different veracity classification approaches based on LLMs.
Synthetic Voice Data for Automatic Speech Recognition in African Languages
Brian DeRenzi | Anna Dixon | Mohamed Aymane Farhi | Christian Resch
Speech technology remains out of reach for most of the 2,300+ languages in Africa. We present the first systematic assessment of large-scale synthetic voice corpora for African ASR. We apply a three-step process: LLM-driven text creation, TTS voice synthesis, and ASR fine-tuning. Eight of the ten languages for which we created synthetic text achieved readability scores above 5 out of 7. We evaluated ASR improvement for three (Hausa, Dholuo, Chichewa) and created more than 2,500 hours of synthetic voice data at below 1% of the cost of real data. A W2v-BERT 2.0 speech encoder fine-tuned on 250h of real and 250h of synthetic data in Hausa matched a 500h real-data-only baseline, while 579h of real and 450h to 993h of synthetic data yielded the best performance. We also present a gender-disaggregated ASR performance evaluation. For very low-resource languages, gains varied: Chichewa WER improved by ~6.5% with a 1:2 real-to-synthetic ratio; a 1:1 ratio for Dholuo showed similar improvements on some evaluation data, but not on others. Investigating intercoder reliability, ASR errors, and evaluation datasets revealed the need for more robust reviewer protocols and more accurate evaluation data. All data and models are publicly released to invite further work on improving synthetic data for African languages.
ADOR: Dataset for Arabic Dialects in Hotel Reviews: A Human Benchmark for Sentiment Analysis
Maram I. Alharbi | Saad Ezzini | Hansi Hettiarachchi | Tharindu Ranasinghe | Ruslan Mitkov
Arabic machine translation remains a fundamentally challenging task, primarily due to the lack of comprehensive annotated resources. This study evaluates the performance of Meta’s NLLB-200 model in translating Modern Standard Arabic (MSA) into three regional dialects: Saudi, Maghribi, and Egyptian Arabic using a manually curated dataset of hotel reviews. We applied a multi-criteria human annotation framework to assess translation correctness, dialect accuracy, and sentiment and aspect preservation. Our analysis reveals significant variation in translation quality across dialects. While sentiment and aspect preservation were generally high, dialect accuracy and overall translation fidelity were inconsistent. For Saudi Arabic, over 95% of translations required human correction, highlighting systemic issues. Maghribi outputs demonstrated better dialectal authenticity, while Egyptian translations achieved the highest reliability with the lowest correction rate and fewest multi-criteria failures. These results underscore the limitations of current multilingual models in handling informal Arabic varieties and highlight the importance of dialect-sensitive evaluation.
Towards Creating a Bulgarian Readability Index
Dimitar Kazakov | Stefan Minkov | Ruslana Margova | Irina Temnikova | Ivo Emauilov
Readability assessment plays a crucial role in education and text accessibility. While numerous indices exist for English and have been extended to Romance and Slavic languages, Bulgarian remains underserved in this regard. This paper reviews established readability metrics across these language families, examining their underlying features and modelling methods. We then report the first attempt to develop a readability index for Bulgarian, using end-of-school-year assessment questions and literary works targeted at children of various ages. Key linguistic attributes, namely word length, sentence length, syllable count, and information content (based on word frequency), were extracted, and their first two statistical moments, mean and variance, were modelled against grade levels using linear and polynomial regression. Results suggest that polynomial models outperform linear ones by capturing non-linear relationships between textual features and perceived difficulty, but may be harder to interpret. This work provides an initial framework for building a reliable readability measure for Bulgarian, with applications in educational text design, adaptive learning, and corpus annotation.
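The modelling step this abstract describes, regressing textual features against grade level, can be illustrated with a minimal sketch. The feature values below are invented for illustration, not the Bulgarian corpus statistics.

```python
# Toy sketch: fit linear and quadratic models of one feature (mean word
# length) against grade level. All numbers are invented placeholders.
import numpy as np

grades = np.array([1, 3, 5, 7, 9, 11], dtype=float)
mean_word_length = np.array([3.8, 4.2, 4.9, 5.3, 5.9, 6.2])

# Linear and polynomial (quadratic) regressions, as in the study design.
linear = np.polyfit(mean_word_length, grades, deg=1)
quadratic = np.polyfit(mean_word_length, grades, deg=2)

# Predict a grade level for an unseen text with mean word length 5.0.
prediction = float(np.polyval(linear, 5.0))
print(round(prediction, 1))
```

A full index would stack several features (sentence length, syllable count, word-frequency-based information content) and their variances into a multivariate regression rather than this single-feature fit.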