FTibSuite: A Comprehensive Resource Suite for Tibetan Vision–Language Modeling

Guixian Xu; Yide Liang; Zeli Su; Xuexian Song; Ziyin Zhang; Yushuang Dong; Ting Zhang; Xu Han

Abstract

Vision–language models (VLMs) have progressed rapidly, but Tibetan remains largely underserved due to the lack of infrastructure for reproducible training and evaluation. To help address this gap, we introduce FTibSuite, a resource-centric foundation for Tibetan VLM research that provides an end-to-end training-and-evaluation workflow and includes human-verified multimodal annotations, partially filling a long-standing shortage of Tibetan multimodal resources. FTibSuite comprises FTibData, FTibBench, and a reproducible baseline model, FTibVLM, built on Qwen3-VL-8B-Instruct. FTibVLM adopts a three-stage adaptation pipeline consisting of Tibetan continual pretraining, image–text alignment, and multimodal instruction tuning. For systematic evaluation, FTibBench adapts five established multimodal benchmarks to Tibetan and offers a reproducible evaluation protocol to support consistent comparisons across models. Specifically, FTibBench includes Tibetan versions of MMBench, MME, POPE, BinaryVQA, and COREVQA. Experiments on FTibBench demonstrate that FTibVLM consistently improves Tibetan multimodal performance. For instance, FTibVLM attains 76.01 accuracy on BinaryVQA, indicating that Tibetan performance can be competitive with high-resource settings on this diagnostic task. We also observe substantial gains on other benchmarks, including an improvement on MMBench (dev) from 42.97 to 67.78 and an increase in POPE-random accuracy from 47.53 to 80.56, underscoring the practical value of the proposed workflow and resources.

Anthology ID: 2026.findings-acl.903
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 18143–18159
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.903/
Cite (ACL): Guixian Xu, Yide Liang, Zeli Su, Xuexian Song, Ziyin Zhang, Yushuang Dong, Ting Zhang, and Xu Han. 2026. FTibSuite: A Comprehensive Resource Suite for Tibetan Vision–Language Modeling. In Findings of the Association for Computational Linguistics: ACL 2026, pages 18143–18159, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): FTibSuite: A Comprehensive Resource Suite for Tibetan Vision–Language Modeling (Xu et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.903.pdf
Checklist: 2026.findings-acl.903.checklist.pdf
