Umar Baba Umar

2026

Anchoring the Judge: Curriculum-Based Adaptation and Reference-Anchored MQM for LLM-Based Machine Translation of an Unseen Low-Resource Language - A Case of Nupe
Umar Baba Umar | Sulaimon Adebayo Bashir | Abdulmalik Danlami Mohammed
Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)

Adapting large language models (LLMs) for machine translation has shown strong performance in low-resource languages; however, their effectiveness for unseen, extremely low-resource languages remains largely unexplored. We present NupeMT-QLoRA, a curriculum-based adaptation framework for the Nupe–English language pair. Our approach employs a two-stage QLoRA fine-tuning strategy: (i) initial training on 34k noisy parallel sentence pairs, followed by (ii) continued fine-tuning on a smaller, cleaner set of 12k bidirectional parallel sentences with explicit translation-direction tags. This staged curriculum stabilizes optimization and improves robustness under severe data scarcity. We further identify a reliability crisis in existing automatic evaluation metrics for unseen languages. Popular LLM-based judges such as GEMBA and xCOMET exhibit weak correlation with human judgments (Kendall’s 𝜏 ≈ 0.21) and low inter-rater reliability (Fleiss’ 𝜅 ≈ 0.27), largely due to fluency bias. To address this, we propose Ref-Anchor-MQM, a reference-anchored evaluation protocol that forces the judge to extract Key Semantic Units from a human reference before scoring. Experimental results show that NupeMT-QLoRA substantially outperforms NLLB-200, improving chrF++ from 22.73 to 41.10, while Ref-Anchor-MQM achieves significantly higher alignment with human evaluation (𝜏 = 0.71). Our framework provides a scalable pipeline for adapting and evaluating LLMs on languages with zero prior representation.


Thesis Proposal: Self-Adaptive and Epistemic Uncertainty-Guided ASR of Dense Intra-Sentential Code-Switched Speech for African Low-Resource Languages
Umar Baba Umar | Sulaimon Adebayo Bashir | Abdulmalik Danlami Mohammed | Amina Gogo Tafida
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Automatic Speech Recognition (ASR) has achieved strong performance for high-resource languages, but dense intra-sentential code-switched speech in African low-resource settings remains underexplored. Existing multilingual and pretrained ASR systems improve general recognition accuracy, yet they remain weak at switch regions, are sensitive to language imbalance during adaptation, and are typically evaluated with metrics that obscure switching-specific errors. This thesis proposes a self-adaptive and epistemic uncertainty-guided framework for African low-resource code-switched ASR, using Hausa–English (Engausa) and Hausa–Yorùbá as case studies. The work investigates three linked questions: (1) how to design a linguistically informed code-switched corpus with explicit switch-region annotation and labeled/unlabeled partitions for adaptive learning, (2) whether epistemic uncertainty is systematically elevated around switch regions and can guide pseudo-label selection in semi-supervised training, and (3) whether switch-aware adaptation with auxiliary language identification and boundary supervision can reduce recognition errors without increasing catastrophic forgetting. The long-term goal is to develop scalable and data-efficient ASR systems that model code-switching as a structured linguistic phenomenon rather than as noise in multilingual African speech.

