MeSHClass-ES and AnatEM-ES: Open Resources for Spanish Biomedical NLP

Santiago Martinez Novoa, Lina Gomez Mesa, Juan Prieto, Ruben Manrique


Abstract
Although Spanish is one of the most widely spoken languages in the world, biomedical NLP resources and systematic evaluations for it remain limited relative to English. We address this gap by constructing and releasing two Spanish biomedical corpora: (1) **MeSHClass-ES**, a bilingual corpus of 29,063 abstracts translated from PubMed with Opus-MT, and (2) **AnatEM-ES**, the AnatEM anatomical entity corpus translated with a chunk-level LLM-based pipeline that jointly preserves BIO annotations across 13,849 entity mentions. Both corpora achieve a mean COMET score of 0.73 despite using different translation systems. We benchmark nine encoder models spanning general-domain Spanish, domain-specific, and multilingual architectures on both tasks. RigoBERTa-2.0 leads both tasks (classification micro-F1 0.69, tied with SciBETO-large; NER F1 0.66). Both domain pretraining and model capacity drive performance, with the gap slightly more pronounced for NER (4-point spread) than for classification (3-point spread). XLM-RoBERTa-large emerges as a competitive multilingual baseline. A parallel evaluation of four open-weight decoders (7–9B) reveals a task-dependent encoder-decoder gap: QLoRA-adapted Gemma-2-9B reaches 88% of the best encoder on classification (micro-F1 0.61 vs. 0.69), but for NER every decoder configuration we tested stays at or below 40% of the best encoder F1. We release both corpora on the HuggingFace Hub, and the translation pipelines and evaluation code on GitHub.
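The relative-performance figures in the abstract follow from simple ratios of the reported scores. The sketch below reproduces that arithmetic from the rounded values quoted above; the helper `relative_score` and the constant names are illustrative and not taken from the authors' evaluation code, and because the inputs are rounded, the NER ceiling is only approximate.

```python
def relative_score(model_f1: float, reference_f1: float) -> float:
    """Return model_f1 expressed as a fraction of reference_f1."""
    if reference_f1 <= 0:
        raise ValueError("reference_f1 must be positive")
    return model_f1 / reference_f1


# Rounded figures quoted in the abstract.
BEST_ENCODER_CLS_F1 = 0.69  # best encoder, classification micro-F1
GEMMA_QLORA_CLS_F1 = 0.61   # QLoRA-adapted Gemma-2-9B, classification micro-F1
BEST_ENCODER_NER_F1 = 0.66  # best encoder, NER F1

# Decoder score as a share of the best encoder on classification.
cls_ratio = relative_score(GEMMA_QLORA_CLS_F1, BEST_ENCODER_CLS_F1)

# Approximate upper bound on decoder NER F1 implied by
# "at or below 40% of the best encoder F1".
ner_decoder_ceiling = 0.40 * BEST_ENCODER_NER_F1

print(f"classification: {cls_ratio:.0%} of best encoder")
print(f"NER decoder F1 ceiling: {ner_decoder_ceiling:.3f}")
```

Read this way, the 40% bound corresponds to a decoder NER F1 of roughly 0.26 or lower, against 0.66 for the best encoder.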
Anthology ID:
2026.bionlp-1.49
Volume:
BioNLP 2026
Month:
July
Year:
2026
Address:
San Diego, California
Editors:
Dina Demner-Fushman, Sophia Ananiadou, Kirk Roberts, Junichi Tsujii
Venues:
BioNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
617–629
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-1.49/
Cite (ACL):
Santiago Martinez Novoa, Lina Gomez Mesa, Juan Prieto, and Ruben Manrique. 2026. MeSHClass-ES and AnatEM-ES: Open Resources for Spanish Biomedical NLP. In BioNLP 2026, pages 617–629, San Diego, California. Association for Computational Linguistics.
Cite (Informal):
MeSHClass-ES and AnatEM-ES: Open Resources for Spanish Biomedical NLP (Martinez Novoa et al., BioNLP 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-1.49.pdf