PictoEduca: Building a Dataset for Spanish Text-to-Pictogram Generation
Alfonso Manuel Paredes Umeres, Marco Antonio Sobrevilla Cabezudo
Abstract
We present PictoEduca, the first large-scale Spanish text-to-pictogram dataset for augmentative and alternative communication (AAC), derived from primary educational materials and grounded in the ARASAAC pictogram repository. The dataset is released with a reproducible pipeline that combines automatic annotation with targeted expert correction, supporting scalable and high-quality corpus construction. We benchmark a rule-based system (ARAWORD) and neural models (T5, LLaMA) under direct text-to-pictogram and two-stage text-to-concept-to-pictogram settings. Results show that the rule-based system remains a strong baseline, while neural models benefit from explicit semantic abstraction, with the two-stage approach improving semantic coherence and reducing ambiguity. We further explore data selection strategies, demonstrating that combining domain similarity with a quality signal yields higher-quality silver data, reduces annotation effort, and improves model performance in low-resource regimes. PictoEduca enables reproducible evaluation and advances Spanish text-to-pictogram research.
- Anthology ID:
- 2026.findings-acl.1738
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 34816–34828
- URL:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.findings-acl.1738/
- Cite (ACL):
- Alfonso Manuel Paredes Umeres and Marco Antonio Sobrevilla Cabezudo. 2026. PictoEduca: Building a Dataset for Spanish Text-to-Pictogram Generation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 34816–34828, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- PictoEduca: Building a Dataset for Spanish Text-to-Pictogram Generation (Umeres & Cabezudo, Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.findings-acl.1738.pdf