Rubén Manrique

Also published as: Ruben Manrique

2026

MeSHClass-ES and AnatEM-ES: Open Resources for Spanish Biomedical NLP
Santiago Martinez Novoa | Lina Gomez Mesa | Juan Prieto | Ruben Manrique
BioNLP 2026

Despite Spanish being one of the most widely spoken languages in the world, biomedical NLP resources and systematic evaluations remain limited relative to English. We address this gap by constructing and releasing two Spanish biomedical corpora: (1) **MeSHClass-ES**, a 29,063 abstract bilingual corpus translated from PubMed with Opus-MT, and (2) **AnatEM-ES**, the AnatEM anatomical entity corpus translated with a chunk-level LLM-based pipeline that jointly preserves BIO annotations across 13,849 entity mentions. Both corpora achieve a mean COMET score of 0.73 despite using different translation systems. We benchmark nine encoder models spanning general-domain Spanish, domain-specific, and multilingual architectures for both tasks. RigoBERTa-2.0 leads both tasks (micro-F1 classification 0.69, tied with SciBETO-large; NER F1 0.66). Both domain pretraining and model capacity drive performance, with the gap slightly more pronounced for NER (4-point spread) than classification (3-point spread). XLM-RoBERTa-large emerges as a competitive multilingual baseline. A parallel evaluation of four open-weight decoders (7?9B) reveals a task-dependent encoder-decoder gap: QLoRA-adapted Gemma-2-9B reaches 88% of the best encoder on classification (micro-F1 .61 vs .69), but for NER every decoder configuration we tested stays at or below 40% of the best encoder F1. We release both corpora on the HuggingFace Hub1, translation pipelines, and evaluation code on GitHub.

pdf bib abs

Indigenous languages of the Americas face severe endangerment, and the scarcity of culturally grounded resources remains a critical barrier to revitalization efforts. We present the AmericasNLP 2026 Shared Task on Cultural Image Captioning for Indigenous Languages, the first shared task dedicated to generating captions for images depicting Indigenous cultures of the Americas, written in the Indigenous languages themselves. To support this, we introduce and publicly release a newly constructed dataset spanning five cultures and their dominant languages: Bribri, Guaraní, Yucatec Maya, Central Veracruz Nahuatl, and Wixárika. Evaluation follows a two-stage process, combining automatic evaluation using ChrF++ with human evaluation of the top-performing systems for each language. Eight teams participate, submitting 27 systems in total. Results indicate that the task remains largely unsolved: while the strongest systems produce understandable captions, they fall short on descriptive detail and, critically, cultural grounding.

2025

pdf bib abs

A Multi-Agent Framework with Diagnostic Feedback for Iterative Plain Language Summary Generation from Cochrane Medical Abstracts
Felipe Arias Russi | Carolina Salazar Lara | Ruben Manrique
Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025)

Plain Language Summaries PLS improve health literacy and enable informed healthcare decisions but writing them requires domain expertise and is time-consuming. Automated methods often prioritize efficiency over comprehension and medical documents unique simplification requirements challenge generic solutions. We present a multi-agent system for generating PLS using Cochrane PLS as proof of concept. The system uses specialized agents for information extraction writing diagnosis and evaluation integrating a medical glossary and statistical analyzer to guide revisions. We evaluated three architectural configurations on 100 Cochrane abstracts using six LLMs both proprietary and open-source. Results reveal model-dependent trade-offs between factuality and readability with the multi-agent approach showing improvements for smaller models and providing operational advantages in control and interpretability.

pdf bib abs

Uniandes at TSAR 2025 Shared Task Multi-Agent CEFR Text Simplification with Automated Quality Assessment and Iterative Refinement
Felipe Arias Russi | Kevin Cohen Solano | Ruben Manrique
Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025)

We present an agent-based system for the TSAR 2025 Shared Task on Readability-Controlled Text Simplification, which requires simplifying English paragraphs from B2+ levels to target A2 or B1 levels while preserving meaning. Our approach employs specialized agents for keyword extraction, text generation, and evaluation, coordinated through an iterative refinement loop. The system integrates a CEFR vocabulary classifier, pretrained evaluation models, and few-shot learning from trial data. Through iterative feedback between the evaluator and writer agents, our system automatically refines outputs until they meet both readability and semantic preservation constraints. This architecture achieved 4th position among participating teams, showing the effectiveness of combining specialized LLMs with automated quality control strategies for text simplification.

pdf bib abs

Historical Ink: Exploring Large Language Models for Irony Detection in 19th-Century Spanish
Kevin Cohen | Laura Manrique-Gómez | Ruben Manrique
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

This study explores the use of large language models (LLMs) to enhance datasets and improve irony detection in 19th-century Latin American newspapers. Two strategies were employed to evaluate the efficacy of BERT and GPT models in capturing the subtle nuances nature of irony, through both multi-class and binary classification tasks. First, we implemented dataset enhancements focused on enriching emotional and contextual cues; however, these showed limited impact on historical language analysis. The second strategy, a semi-automated annotation process, effectively addressed class imbalance and augmented the dataset with high-quality annotations. Despite the challenges posed by the complexity of irony, this work contributes to the advancement of sentiment analysis through two key contributions: introducing a new historical Spanish dataset tagged for sentiment analysis and irony detection, and proposing a semi-automated annotation methodology where human expertise is crucial for refining LLMs results, enriched by incorporating historical and cultural contexts as core features.

pdf bib abs

A Scalable Framework for Legal Text Understanding in Regulatory and Financial Contexts.
Santiago Martínez | Juan Manuel Castañeda | Ruben Manrique
Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal)

This study presents a comprehensive approach to developing a domain-specific large language model (LLM) for regulatory and financial text interpretation. A specialized corpus was constructed through large-scale scraping of financial and regulatory documents across domains such as compliance, licensing, and financial reporting. The data was preprocessed using GPT-4o-mini with prompt engineering to retain critical information and remove noise. We further pre-trained a LLaMA-3.1-8B model on the curated corpus and fine-tuned it using an instruction dataset covering nine tasks from the Coling 2025 Regulations Challenge, including acronym expansion, regulatory question-answering, and XBRL-based financial analytics, employing QLoRA to reduce memory requirements. The model exhibits a slight improvement from baseline answering complex regulatory questions (detailed QA) and expanding acronyms. This study demonstrates the potential of domain-specific LLMs in regulatory text interpretation and lays the groundwork for future research in specialized NLP evaluation methodologies.

pdf bib

Bridging the Gap in Health Literacy: Harnessing the Power of Large Language Models to Generate Plain Language Summaries from Biomedical Texts
Felipe Arias-Russi | Carolina Salazar-Lara | Rubén Manrique
Proceedings of the Second Workshop on Patient-Oriented Language Processing (CL4Health)

This paper presents the findings of the AmericasNLP 2025 Shared Tasks: (1) machine translation for truly low-resource languages, (2) morphological adaptation for generating educational examples, and (3) developing metrics for machine translation in Indigenous languages. The shared tasks cover 14 diverse Indigenous languages of the Americas. A total of 11 teams participated, submitting 26 systems across all tasks, languages, and models. We describe the shared tasks, introduce the datasets and evaluation metrics used, summarize the baselines and submitted systems, and report our findings.

2024

pdf bib abs

Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction
Laura Manrique-Gomez | Tony Montes | Arturo Rodriguez Herrera | Ruben Manrique
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

This paper presents two significant contributions: First, it introduces a novel dataset of 19th-century Latin American newspaper texts, addressing a critical gap in specialized corpora for historical and linguistic analysis in this region. Second, it develops a flexible framework that utilizes a Large Language Model for OCR error correction and linguistic surface form detection in digitized corpora. This semi-automated framework is adaptable to various contexts and datasets and is applied to the newly created dataset.

pdf bib

Historical Ink: Semantic Shift Detection for 19th Century Spanish
Tony Montes | Laura Manrique-Gómez | Rubén Manrique
Proceedings of the 5th Workshop on Computational Approaches to Historical Language Change

pdf bib abs

Translation systems for low-resource Colombian Indigenous languages, a first step towards cultural preservation
Juan Prieto | Cristian Martinez | Melissa Robles | Alberto Moreno | Sara Palacios | Rubén Manrique
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)

The use of machine learning and Natural Language Processing (NLP) technologies can assist in the preservation and revitalization of indigenous languages, particularly those classified as “low-resource.” Given the increasing digitization of information, the development of translation tools for these languages is of significant importance. These tools not only facilitate better access to digital resources for indigenous communities but also stimulate language preservation efforts and potentially foster more inclusive, equitable societies, as demonstrated by the AmericasNLP workshop since 2021. The focus of this paper is Colombia, a country home to 65 distinct indigenous languages, presenting a vast spectrum of linguistic characteristics. This cultural and linguistic diversity is an inherent pillar of the nation’s identity, and safeguarding it has been increasingly challenging given the dwindling number of native speakers and the communities’ inclination towards oral traditions. Considering this context, scattered initiatives exist to develop translation systems for these languages. However, these endeavors suffer from a lack of consolidated, comparable data. This paper consolidates a dataset of parallel data in four Colombian indigenous languages - Wayuunaiki, Arhuaco, Inga, and Nasa - gathered from existing digital resources. It also presents the creation of baseline models for future translation and comparison, ultimately serving as a catalyst for incorporating more digital resources progressively.