From Sparse to Sense-Grounded: Wikipedia Training for Ukrainian Visual-WSD

Yurii Laba, Rostyslav O. Hryniv


Abstract
Visual Word Sense Disambiguation (Visual-WSD) requires ranking the correct image for an ambiguous word given a short trigger phrase. For low-resource languages, it is bottlenecked by scarce sense-level benchmarks and limited sense-aligned multimodal supervision. We study Ukrainian and (i) extend the Ukrainian Visual-WSD benchmark from 87 to 381 instances and benchmark multilingual CLIP checkpoints and multimodal large models, and (ii) introduce two scalable Wikipedia-derived dataset construction methods. Using compute-efficient adaptation, we fine-tune a multilingual CLIP backbone and show that sense-grounded supervision drives the improvements: combining our two Wikipedia-derived datasets improves HIT@1 from 37.00% to 43.05%.
Anthology ID:
2026.conll-main.29
Volume:
Proceedings of the 30th Conference on Computational Natural Language Learning
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Claire Bonial, Yevgeni Berzak
Venues:
CoNLL | WS
Publisher:
Association for Computational Linguistics
Pages:
501–514
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.conll-main.29/
Cite (ACL):
Yurii Laba and Rostyslav O. Hryniv. 2026. From Sparse to Sense-Grounded: Wikipedia Training for Ukrainian Visual-WSD. In Proceedings of the 30th Conference on Computational Natural Language Learning, pages 501–514, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
From Sparse to Sense-Grounded: Wikipedia Training for Ukrainian Visual-WSD (Laba & Hryniv, CoNLL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.conll-main.29.pdf