Ya-Ting Pai
2026
VisTW: Benchmarking Vision-Language Models for Taiwanese Mandarin in Taiwan
Zhi Rui Tam | Yung-Yu Shih | Yen-Wei Lee | Ya-Ting Pai | Wen Yu Chang | Yun-Nung Chen
Findings of the Association for Computational Linguistics: ACL 2026
Vision-Language Models (VLMs) often struggle in Taiwanese Mandarin environments due to region-specific orthographic and cultural context. We introduce VisTW, a comprehensive benchmark featuring (i) multiple-choice questions (3,795 academic questions) and (ii) free-form generation evaluation (141 Taiwanese-context free-form pairs). Beyond standard accuracy, we investigate character mixing, the unintended production of Simplified Chinese characters under Taiwanese-Mandarin-style prompts, and propose a human-grounded purity penalty derived from perceptual thresholds measured from users. Our evaluation reveals substantial character contamination (3%–19%) across state-of-the-art VLMs. We find that Gemini-3-Pro significantly outperforms the strongest open-weight baseline, Qwen3 235B MoE, by up to 22 percentage points on dialogue tasks once the purity penalty is applied. These results highlight orthographic consistency as a vital, yet overlooked, dimension for localized multimodal evaluation and deployment.
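To make the notion of character contamination concrete, the following is a minimal sketch of how a contamination rate and a threshold-based penalty might be computed. It is not the VisTW implementation: the abstract does not specify the penalty's form, so the character list `SIMPLIFIED_ONLY`, the functions `contamination_rate` and `penalized_score`, and the 5% threshold are all hypothetical choices made for illustration.

```python
# Illustrative sketch only; not the VisTW metric.
# SIMPLIFIED_ONLY is a tiny hand-picked sample of characters that are
# Simplified forms not used in Traditional Chinese text. A real checker
# would need a full Simplified/Traditional mapping table.
SIMPLIFIED_ONLY = set("们这说国学为对发")


def is_cjk(ch: str) -> bool:
    """Return True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def contamination_rate(text: str) -> float:
    """Fraction of CJK characters that belong to the Simplified-only sample."""
    cjk_chars = [ch for ch in text if is_cjk(ch)]
    if not cjk_chars:
        return 0.0
    mixed = sum(1 for ch in cjk_chars if ch in SIMPLIFIED_ONLY)
    return mixed / len(cjk_chars)


def penalized_score(score: float, rate: float, threshold: float = 0.05) -> float:
    """Scale the score down once contamination exceeds a threshold.

    The threshold and the linear scaling are assumptions standing in for
    the human-grounded perceptual thresholds mentioned in the abstract.
    """
    if rate <= threshold:
        return score
    return score * (1.0 - rate)


if __name__ == "__main__":
    mixed_output = "我們这里说國語"   # two Simplified characters slipped in
    clean_output = "我們這裡說國語"   # fully Traditional
    for sample in (mixed_output, clean_output):
        rate = contamination_rate(sample)
        print(sample, round(rate, 3), round(penalized_score(8.0, rate), 3))
```

Under these assumptions, a response is left unpenalized while its contamination stays at or below the threshold and is discounted proportionally above it; the actual penalty in the paper may take a different form.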