Xuexian Song


2026

Vision–language models (VLMs) have progressed rapidly, but Tibetan remains largely underserved due to the lack of infrastructure for reproducible training and evaluation. To help address this gap, we introduce FTibSuite, a resource-centric foundation for Tibetan VLM research that provides an end-to-end training-and-evaluation workflow and includes human-verified multimodal annotations, partially filling a long-standing shortage of Tibetan multimodal resources. FTibSuite comprises FTibData, FTibBench, and a reproducible baseline model, FTibVLM, built on Qwen3-VL-8B-Instruct. FTibVLM adopts a three-stage adaptation pipeline consisting of Tibetan continual pretraining, image–text alignment, and multimodal instruction tuning. For systematic evaluation, FTibBench adapts five established multimodal benchmarks to Tibetan and offers a reproducible evaluation protocol to support consistent comparisons across models. Specifically, FTibBench includes Tibetan versions of MMBench, MME, POPE, BinaryVQA, and COREVQA. Experiments on FTibBench demonstrate that FTibVLM consistently improves Tibetan multimodal performance. For instance, FTibVLM attains 76.01 accuracy on BinaryVQA, indicating that Tibetan performance can be competitive with high-resource settings on this diagnostic task. We also observe substantial gains on other benchmarks, including an improvement on MMBench (dev) from 42.97 to 67.78 and an increase in POPE-random accuracy from 47.53 to 80.56, underscoring the practical value of the proposed workflow and resources.
Extending large language models (LLMs) to low-resource languages often incurs an “alignment tax”: improvements in the target language come at the cost of catastrophic forgetting in general capabilities. We argue that this trade-off arises from the rigidity of supervised fine-tuning (SFT), which enforces token-level surface imitation on narrow and biased data distributions. To address this limitation, we propose a semantic-space alignment paradigm powered by Group Relative Policy Optimization (GRPO), where the model is optimized using embedding-level semantic rewards rather than likelihood maximization. This objective encourages meaning preservation through flexible realizations, enabling controlled updates that reduce destructive interference with pretrained knowledge. We evaluate our approach on Tibetan–Chinese machine translation and Tibetan headline generation. Experiments show that our method acquires low-resource capabilities while markedly mitigating alignment tax, preserving general competence more effectively than SFT. Despite producing less rigid surface overlap, semantic RL yields higher semantic quality and preference in open-ended generation, and few-shot transfer results indicate that it learns more transferable and robust representations under limited supervision. Overall, our study demonstrates that reinforcement learning with semantic rewards provides a safer and more reliable pathway for inclusive low-resource language expansion.
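
As a rough illustration of how an embedding-level reward can feed a group-relative update, the sketch below scores a group of sampled outputs by cosine similarity to a reference embedding and then normalizes the rewards within the group, as GRPO does when computing advantages. The abstract does not specify the reward function or the embedding model, so the cosine-similarity reward, the toy two-dimensional vectors, and the function names `cosine_similarity`, `semantic_rewards`, and `group_relative_advantages` are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_rewards(
    candidate_embeddings: Sequence[Sequence[float]],
    reference_embedding: Sequence[float],
) -> List[float]:
    """Assumed reward: similarity of each sampled output to the reference, in embedding space."""
    return [cosine_similarity(c, reference_embedding) for c in candidate_embeddings]


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy example: hypothetical 2-D embeddings for three sampled outputs of one prompt.
reference = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

rewards = semantic_rewards(candidates, reference)
advantages = group_relative_advantages(rewards)
```

Because the reward depends only on where an output lands in embedding space, two differently worded outputs with the same meaning receive similar rewards, which is the flexibility the abstract contrasts with token-level imitation under SFT. In a real setup the toy vectors would come from a sentence-embedding model, and the advantages would weight the policy-gradient term for each sampled output.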