Juan Prieto
2026
MeSHClass-ES and AnatEM-ES: Open Resources for Spanish Biomedical NLP
Santiago Martinez Novoa | Lina Gomez Mesa | Juan Prieto | Ruben Manrique
BioNLP 2026
Despite Spanish being one of the most widely spoken languages in the world, biomedical NLP resources and systematic evaluations remain limited relative to English. We address this gap by constructing and releasing two Spanish biomedical corpora: (1) **MeSHClass-ES**, a 29,063-abstract bilingual corpus translated from PubMed with Opus-MT, and (2) **AnatEM-ES**, the AnatEM anatomical entity corpus translated with a chunk-level LLM-based pipeline that jointly preserves BIO annotations across 13,849 entity mentions. Both corpora achieve a mean COMET score of 0.73 despite using different translation systems. We benchmark nine encoder models spanning general-domain Spanish, domain-specific, and multilingual architectures for both tasks. RigoBERTa-2.0 leads both tasks (micro-F1 classification 0.69, tied with SciBETO-large; NER F1 0.66). Both domain pretraining and model capacity drive performance, with the gap slightly more pronounced for NER (4-point spread) than classification (3-point spread). XLM-RoBERTa-large emerges as a competitive multilingual baseline. A parallel evaluation of four open-weight decoders (7–9B) reveals a task-dependent encoder-decoder gap: QLoRA-adapted Gemma-2-9B reaches 88% of the best encoder on classification (micro-F1 0.61 vs. 0.69), but for NER every decoder configuration we tested stays at or below 40% of the best encoder F1. We release both corpora on the HuggingFace Hub, and the translation pipelines and evaluation code on GitHub.
2025
Findings of the AmericasNLP 2025 Shared Tasks on Machine Translation, Creation of Educational Material, and Translation Metrics for Indigenous Languages of the Americas
Ona de Gibert | Robert Pugh | Ali Marashian | Raúl Vázquez | Abteen Ebrahimi | Pavel Denisov | Enora Rice | Edward Gow-Smith | Juan Prieto | Melissa Robles | Rubén Manrique | Oscar Moreno | Angel Lino | Rolando Coto-Solano | Aldo Alvarez | Marvin Agüero-Torales | John E. Ortega | Luis Chiruzzo | Arturo Oncevay | Shruti Rijhwani | Katharina von der Wense | Manuel Mager
Proceedings of the Fifth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)
This paper presents the findings of the AmericasNLP 2025 Shared Tasks: (1) machine translation for truly low-resource languages, (2) morphological adaptation for generating educational examples, and (3) developing metrics for machine translation in Indigenous languages. The shared tasks cover 14 diverse Indigenous languages of the Americas. A total of 11 teams participated, submitting 26 systems across all tasks, languages, and models. We describe the shared tasks, introduce the datasets and evaluation metrics used, summarize the baselines and submitted systems, and report our findings.
2024
Translation systems for low-resource Colombian Indigenous languages, a first step towards cultural preservation
Juan Prieto | Cristian Martinez | Melissa Robles | Alberto Moreno | Sara Palacios | Rubén Manrique
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)
The use of machine learning and Natural Language Processing (NLP) technologies can assist in the preservation and revitalization of indigenous languages, particularly those classified as “low-resource.” Given the increasing digitization of information, the development of translation tools for these languages is of significant importance. These tools not only facilitate better access to digital resources for indigenous communities but also stimulate language preservation efforts and potentially foster more inclusive, equitable societies, as demonstrated by the AmericasNLP workshop since 2021. The focus of this paper is Colombia, a country home to 65 distinct indigenous languages, presenting a vast spectrum of linguistic characteristics. This cultural and linguistic diversity is an inherent pillar of the nation’s identity, and safeguarding it has been increasingly challenging given the dwindling number of native speakers and the communities’ inclination towards oral traditions. Considering this context, scattered initiatives exist to develop translation systems for these languages. However, these endeavors suffer from a lack of consolidated, comparable data. This paper consolidates a dataset of parallel data in four Colombian indigenous languages - Wayuunaiki, Arhuaco, Inga, and Nasa - gathered from existing digital resources. It also presents the creation of baseline models for future translation and comparison, ultimately serving as a catalyst for incorporating more digital resources progressively.
Co-authors
- Rubén Manrique 3
- Melissa Robles 2
- Marvin Agüero-Torales 1
- Aldo Alvarez 1
- Luis Chiruzzo 1
- Rolando Coto-Solano 1
- Pavel Denisov 1
- Abteen Ebrahimi 1
- Lina Gomez Mesa 1
- Edward Gow-Smith 1
- Angel Lino 1
- Manuel Mager 1
- Ali Marashian 1
- Cristian Martinez 1
- Santiago Martinez Novoa 1
- Alberto Moreno 1
- Oscar Moreno 1
- Arturo Oncevay 1
- John E. Ortega 1
- Sara Palacios 1
- Robert Pugh 1
- Enora Rice 1
- Shruti Rijhwani 1
- Raúl Vázquez 1
- Ona de Gibert 1
- Katharina von der Wense 1