Fuyu Xing
2026
MEUR: A Benchmark for Evaluating Vision-Language Models on Multimodal Event Understanding and Reasoning
Zimu Wang | Yuqi Wang | Tong Chen | Changyu Zeng | Hongbin Na | Nijia Han | Fuyu Xing | Qi Chen | Qiufeng Wang | Anh Nguyen | Shuihua Wang | Ling Chen | Jionglong Su | Haiyang Zhang | Wei Wang
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Event understanding and reasoning play critical roles in thoroughly evaluating the capabilities of Vision-Language Models (VLMs); however, existing Visual Question Answering (VQA) datasets predominantly focus on entity-centric questions, while event- or action-related questions are limited in scale and suffer from significant shortcut issues. We introduce MEUR, the first Multimodal Event Understanding and Reasoning dataset, consisting of 1,200 images and 4,217 questions. Answering these questions requires a diverse range of multimodal understanding and reasoning capabilities, from basic event recognition to more complex tasks such as counting and comparison. To streamline the annotation process, we propose a novel semi-automated pipeline that combines advanced VLMs with human annotators, achieving high quality and efficiency. We conduct extensive experiments on state-of-the-art non-thinking and thinking VLMs to demonstrate their capabilities and limitations in multimodal event understanding and reasoning. Furthermore, we provide a detailed error analysis that points out promising directions for future research.
2025
Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents
Fuyu Xing | Zimu Wang | Wei Wang | Haiyang Zhang
Proceedings of the 18th International Natural Language Generation Conference
The proliferation of multimedia content necessitates the development of effective Multimedia Event Extraction (M²E²) systems. Though Large Vision-Language Models (LVLMs) have shown strong cross-modal capabilities, their utility in the M²E² task remains underexplored. In this paper, we present the first systematic evaluation of representative LVLMs, including DeepSeek-VL2 and the Qwen-VL series, on the M²E² dataset. Our evaluations cover text-only, image-only, and cross-media subtasks, assessed under both few-shot prompting and fine-tuning settings. Our key findings highlight the following valuable insights: (1) Few-shot LVLMs perform notably better on visual tasks but struggle significantly with textual tasks; (2) Fine-tuning LVLMs with LoRA substantially enhances model performance; and (3) LVLMs exhibit strong synergy when combining modalities, achieving superior performance in cross-modal settings. We further provide a detailed error analysis to reveal persistent challenges in areas such as semantic precision, localization, and cross-modal grounding, which remain critical obstacles for advancing M²E² capabilities.
Unveiling the Linguistic Acceptability Judgments of Large Language Models in Multilingual Contexts
Fuyu Xing | Haoyu Huang | Dawei Mo | Xinzhuo Yang | Zixuan Gao | Wei Wang | Zimu Wang | Haiyang Zhang
Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)
Linguistic acceptability judgments are essential for evaluating how language models internalize human-like grammatical knowledge. Though some studies have evaluated large language models (LLMs) in this context, existing research lacks systematic exploration of diverse learning paradigms in a multilingual setting. In this paper, we present the first multilingual evaluation of LLMs across four languages (English, Chinese, Japanese, and Russian) in the field of linguistic acceptability. Our evaluation spans both general-purpose (i.e., GPT-4o, GPT-4o mini, DeepSeek-V3, GLM-4-32B, and the Qwen series) and reasoning-oriented (QwQ-32B-Preview and DeepSeek-R1-32B) models under zero-shot and monolingual, cross-lingual and multilingual fine-tuning settings, with comparisons to pre-trained language model (PLM) baselines. Our analysis highlights the strong generalizability of large-scale LLMs through zero-shot prompting, the challenges of fine-tuning small-sized LLMs with skewed training data, the effectiveness of multilingual fine-tuning for low-resource languages, the scaling law exhibited on the task, and the limitation of reasoning-oriented models on the task, even when “aha moments” occur during the reasoning process.