Wenmeng Yu
2026
PlotGen-Bench: Evaluating VLMs on Generating Visualization Code from Diverse Plots across Multiple Libraries
Yi Zhao | Zhen Yang | Shuaiqi Duan | Wenmeng Yu | Zhe Su | Jibing Gong | Jie Tang
Findings of the Association for Computational Linguistics: ACL 2026
Yi Zhao | Zhen Yang | Shuaiqi Duan | Wenmeng Yu | Zhe Su | Jibing Gong | Jie Tang
Findings of the Association for Computational Linguistics: ACL 2026
Recent advances in vision–language models (VLMs) have expanded their multimodal code generation capabilities, yet their ability to generate executable visualization code from plots, especially for complex 3D, animated, plot-to-plot transformations, or multi-library scenarios, remains underexplored. To address this gap, we introduce PlotGen-Bench, a comprehensive benchmark for evaluating plot-to-code generation under realistic and complex visualization scenarios. The benchmark spans 9 major categories, 30 subcategories, and 3 core tasks—plot replication, plot transformation, and multi-library generation, covering both 2D, 3D and animated plots across 5 widely used visualization libraries. Through systematic evaluation of state-of-the-art open- and closed-source VLMs, we find that open-source models still lag considerably behind in visual fidelity and semantic consistency, despite achieving comparable code executability. Moreover, all models exhibit substantial degradation on reasoning-intensive tasks such as chart type conversion and animation generation. PlotGen-Bench establishes a rigorous foundation for advancing research toward more capable and reliable VLMs for visualization authoring and code synthesis, with all data and code available at https://plotgen.github.io.
2025
AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models
Yuhang Wu | Wenmeng Yu | Yean Cheng | Yan Wang | Xiaohan Zhang | Jiazheng Xu | Ming Ding | Yuxiao Dong
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yuhang Wu | Wenmeng Yu | Yean Cheng | Yan Wang | Xiaohan Zhang | Jiazheng Xu | Ming Ding | Yuxiao Dong
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Evaluating the alignment capabilities of large Vision-Language Models (VLMs) is essential for determining their effectiveness as helpful assistants. However, existing benchmarks primarily focus on basic abilities using nonverbal methods, such as yes-no and multiple-choice questions. In this paper, we address this gap by introducing AlignMMBench, which provides more nuanced evaluations of alignment capabilities and is the first benchmark specifically designed for Chinese visual contexts. This benchmark is meticulously curated from real-world scenarios and internet sources, encompassing thirteen specific tasks across three categories, and includes both single-turn and multi-turn dialogue scenarios. Incorporating a prompt rewrite strategy, AlignMMBench encompasses 1,054 images and 4,978 question-answer pairs. To facilitate the evaluation pipeline, we develop CritiqueVLM, a rule-calibrated evaluator that exceeds GPT-4’s evaluation ability. Additionally, we measure the “alignment score”, a quantitative metric designed to assess the robustness and stability of models across diverse prompts. Finally, we evaluate the performance of representative VLMs on AlignMMBench, offering insights into the capabilities and limitations of different VLM architectures. The evaluation code and data are available at https://github.com/THUDM/AlignMMBench.
2022
M-SENA: An Integrated Platform for Multimodal Sentiment Analysis
Huisheng Mao | Ziqi Yuan | Hua Xu | Wenmeng Yu | Yihe Liu | Kai Gao
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations
Huisheng Mao | Ziqi Yuan | Hua Xu | Wenmeng Yu | Yihe Liu | Kai Gao
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations
M-SENA is an open-sourced platform for Multimodal Sentiment Analysis. It aims to facilitate advanced research by providing flexible toolkits, reliable benchmarks, and intuitive demonstrations. The platform features a fully modular video sentiment analysis framework consisting of data management, feature extraction, model training, and result analysis modules. In this paper, we first illustrate the overall architecture of the M-SENA platform and then introduce features of the core modules. Reliable baseline results of different modality features and MSA benchmarks are also reported. Moreover, we use model evaluation and analysis tools provided by M-SENA to present intermediate representation visualization, on-the-fly instance test, and generalization ability test results. The source code of the platform is publicly available at https://github.com/thuiar/M-SENA.
2020
CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotation of Modality
Wenmeng Yu | Hua Xu | Fanyang Meng | Yilin Zhu | Yixiao Ma | Jiele Wu | Jiyun Zou | Kaicheng Yang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Wenmeng Yu | Hua Xu | Fanyang Meng | Yilin Zhu | Yixiao Ma | Jiele Wu | Jiyun Zou | Kaicheng Yang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Previous studies in multimodal sentiment analysis have used limited datasets, which only contain unified multimodal annotations. However, the unified annotations do not always reflect the independent sentiment of single modalities and limit the model to capture the difference between modalities. In this paper, we introduce a Chinese single- and multi-modal sentiment analysis dataset, CH-SIMS, which contains 2,281 refined video segments in the wild with both multimodal and independent unimodal annotations. It allows researchers to study the interaction between modalities or use independent unimodal annotations for unimodal sentiment analysis. Furthermore, we propose a multi-task learning framework based on late fusion as the baseline. Extensive experiments on the CH-SIMS show that our methods achieve state-of-the-art performance and learn more distinctive unimodal representations. The full dataset and codes are available for use at https://github.com/thuiar/MMSA.