Tengfei Cheng


2026

Understanding cultural heritage artifacts such as ancient Greek pottery requires expert-level reasoning that remains challenging for current MLLMs due to limited domain-specific data. We introduce VaseVQA, a benchmark for ancient Greek pottery, primarily vases, consisting of 31,773 images and 67,614 question–answer pairs across seven expert-defined categories, enabling systematic evaluation of expert-level cultural heritage understanding. Using this dataset, we explore effective training strategies for domain-specific reasoning. While supervised fine-tuning improves adaptation to domain knowledge, it struggles with deeper reasoning tasks. We propose VaseVL, which augments SFT with reinforcement learning using verifiable rewards. Experiments show that VaseVL consistently outperforms supervised baselines, especially on reasoning-intensive questions, highlighting the value of targeted reinforcement learning for cultural heritage visual question answering.