PictoEduca: Building a Dataset for Spanish Text-to-Pictogram Generation

Alfonso Manuel Paredes Umeres, Marco Antonio Sobrevilla Cabezudo


Abstract
We present PictoEduca, the first large-scale Spanish text-to-pictogram dataset for augmentative and alternative communication (AAC), derived from primary educational materials and grounded in the ARASAAC pictogram repository. The dataset is released with a reproducible pipeline that combines automatic annotation with targeted expert correction, supporting scalable and high-quality corpus construction. We benchmark a rule-based system (ARAWORD) and neural models (T5, LLaMA) under direct text-to-pictogram and two-stage text-to-concept-to-pictogram settings. Results show that the rule-based system remains a strong baseline, while neural models benefit from explicit semantic abstraction, with the two-stage approach improving semantic coherence and reducing ambiguity. We further explore data selection strategies, demonstrating that combining domain similarity with a quality signal yields higher-quality silver data, reduces annotation effort, and improves model performance in low-resource regimes. PictoEduca enables reproducible evaluation and advances Spanish text-to-pictogram research.
Anthology ID:
2026.findings-acl.1738
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
34816–34828
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.findings-acl.1738/
Cite (ACL):
Alfonso Manuel Paredes Umeres and Marco Antonio Sobrevilla Cabezudo. 2026. PictoEduca: Building a Dataset for Spanish Text-to-Pictogram Generation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 34816–34828, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
PictoEduca: Building a Dataset for Spanish Text-to-Pictogram Generation (Umeres & Cabezudo, Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.findings-acl.1738.pdf
Checklist:
2026.findings-acl.1738.checklist.pdf