Jared Coleman

2026

Comparing LLM-Based Translation Approaches for Extremely Low-Resource Languages
Jared Coleman | Ruben Rosales | Kira Toal | Diego Cuadros | Nicholas Leeds | Bhaskar Krishnamachari | Khalil Iskarous
Proceedings for the Ninth Workshop on Technologies for Machine Translation of Low Resource Languages (LoResMT 2026)

We present a comprehensive evaluation and extension of the LLM-Assisted Rule-Based Machine Translation (LLM-RBMT) paradigm, an approach that combines the strengths of rule-based methods and Large Language Models (LLMs) to support translation in no-resource settings. We present a robust new implementation (the Pipeline Translator) that generalizes the LLM-RBMT approach and enables flexible adaptation to novel constructions. We benchmark it against four alternatives (Builder, Instructions, RAG, and Fine-tuned translators) on a curated dataset of 150 English sentences, and compare them across translation quality and runtime. The Pipeline Translator consistently achieves the best overall performance. The LLM-RBMT methods (Pipeline and Builder) also offer an important advantage: they naturally align with evaluation strategies that prioritize grammaticality and semantic fidelity over surface-form overlap, which is critical for endangered languages where mistranslation carries high risk.

Schema-Constrained Image Captioning for Five Low-Resource Indigenous Languages
Diego Cuadros | Nicholas Leeds | Amanda Avalos | Azul Alpizar-Velazquez | Jared Coleman | Faezeh Dehghan Tarzjani | Bhaskar Krishnamachari
Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)

We describe our submission to all five tracks of the AmericasNLP 2026 Shared Task on Cultural Image Captioning: Bribri, Guaraní, Yucatec Maya, Orizaba Nahuatl, and Wixárika. Our system is an LLM-assisted rule-based machine translation (LLM-RBMT) captioner. For each language, a coding agent reads the small development split and open-web linguistic references and writes a complete Pydantic grammar package with a closed vocabulary. At inference time, a vision–language model sees the image and the schema, emits a structured SentenceList under constrained decoding, and a deterministic Python renderer produces the surface string. The model never generates target-language tokens. The same architecture handles all five languages with no fine-tuning, no parallel corpora, and no human edits to the generated packages. On the official test set, the system placed first on human evaluation in Bribri and Orizaba Nahuatl, third on Yucatec Maya, and first on ChrF++ in Yucatec Maya. We suggest that a strength of the approach is that outputs are restricted to simple sentences that are grammatically correct by construction, modulo the correctness of the generated grammar itself.

RAN: Resource Abundance Notation for Languages in NLP
Jared Coleman | Tainã Coleman | Bhaskar Krishnamachari
Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)

The term "low-resource" is used pervasively in NLP but communicates almost nothing precise. We propose RAN (Resource Abundance Notation), a compact, multi-dimensional notation for quantifying a language’s NLP resource profile. A RAN score is written as S/M/L_1-B_1/L_2-B_2/..., where S = floor(log10(speakers)), M = floor(log10(monolingual sentences)), and each L_i-B_i pair records a bilingual partner and floor(log10(parallel sentences)). Values derive from canonical sources: Wikidata for speakers, OSCAR 23.01 for monolingual corpora, and (where available) OPUS for parallel corpora. We score 20 typologically diverse languages and correlate each profile against published benchmarks for three tasks: machine translation (MT, via NLLB-200 chrF++), named entity recognition (NER, via XTREME XLM-R WikiANN F1), and part-of-speech tagging (POS, via XTREME XLM-R UD accuracy). The RAN components carry complementary information: a linear model using all three explains 52% of MT variance, 76% of NER variance, and 72% of POS variance. Among single predictors, B_max (the largest bilingual corpus, regardless of partner) is strongest for the cross-lingual transfer tasks (NER, POS), while M and B_en are strongest for MT. RAN is designed first as a communication tool, not a predictive model.
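The notation described in this abstract can be illustrated with a minimal sketch. It assumes only what the abstract states: each component is floor(log10(count)), and components are joined as S/M/L_1-B_1/L_2-B_2/.... The function names, the use of short language codes for partners, the error raised for counts below 1, and the example counts are all hypothetical and are not taken from the paper.

```python
def order_of_magnitude(count: int) -> int:
    """Return floor(log10(count)) for a positive integer count."""
    if count < 1:
        # Assumption: the abstract does not say how a zero count is written,
        # so this sketch refuses it instead of guessing.
        raise ValueError("count must be at least 1")
    # The digit count minus one equals floor(log10(count)) for positive
    # integers and avoids floating-point rounding near powers of ten.
    return len(str(count)) - 1


def ran_score(speakers: int, monolingual_sentences: int,
              parallel_sentences: dict[str, int]) -> str:
    """Format a RAN-style string S/M/L_1-B_1/L_2-B_2/...

    parallel_sentences maps a partner-language label (hypothetical short
    codes here) to the number of parallel sentences with that partner.
    """
    parts = [
        str(order_of_magnitude(speakers)),
        str(order_of_magnitude(monolingual_sentences)),
    ]
    for partner, count in parallel_sentences.items():
        parts.append(f"{partner}-{order_of_magnitude(count)}")
    return "/".join(parts)


# Made-up counts for illustration only, not data from the paper.
example = ran_score(6_500_000, 1_200_000, {"en": 350_000, "es": 42_000})
print(example)  # 6/6/en-5/es-4
```

Read this way, a score such as 6/6/en-5/es-4 would mean millions of speakers, millions of monolingual sentences, hundreds of thousands of parallel sentences with the first partner, and tens of thousands with the second.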

2024

LLM-Assisted Rule Based Machine Translation for Low/No-Resource Languages
Jared Coleman | Bhaskar Krishnamachari | Ruben Rosales | Khalil Iskarous
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)

We propose a new paradigm for machine translation that is particularly useful for no-resource languages (those without any publicly available bilingual or monolingual corpora): LLM-RBMT (LLM-Assisted Rule Based Machine Translation). Using the LLM-RBMT paradigm, we design the first language education/revitalization-oriented machine translator for Owens Valley Paiute (OVP), a critically endangered Indigenous American language for which there is virtually no publicly available data. We present a detailed evaluation of the translator’s components: a rule-based sentence builder, an OVP to English translator, and an English to OVP translator. We also discuss the potential of the paradigm, its limitations, and the many avenues for future research that it opens up.

Co-authors

Azul Alpizar-Velazquez 1

Amanda Avalos 1

Tainã Coleman 1

Faezeh Dehghan Tarzjani 1

Bhaskar Krishnmachari 1

Kira Toal 1
